Generate Code-Based Yara Rules using Machine Learning.
I wrote clava for an industry project during my studies at Hochschule Luzern. This project researches how to automatically generate code-based Yara rules for a given malware sample using machine learning. We've kept the machine learning part intentionally rudimentary to demonstrate how much can be achieved with basic methods. The research is documented in a paper (German only). Contact me if you are interested in the paper.
TL;DR: clava creates n-grams of mnemonics (e.g., XOR or PUSH) from goodware and malware and trains a logistic regression classifier on the n-grams' term frequency weights. We drop the operands, as they are subject to change, and keep only the instruction's operation part to improve the robustness of the rules. Dropping the operands requires wildcarding the Yara rules; we use mkYARA (kudos!) for that task.
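The feature extraction and wildcarding described in the TL;DR can be illustrated with a minimal sketch. This is not clava's or mkYARA's actual code: the function names, the hand-written mnemonic list, and the (opcode bytes, operand length) representation are assumptions made for illustration, and real x86 instruction encoding is more involved than this split suggests.

```python
from collections import Counter


def mnemonic_ngrams(mnemonics, n):
    """Return all contiguous n-grams of a mnemonic sequence as tuples."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def term_frequencies(ngrams):
    """Map each n-gram to its relative frequency within one sample."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def wildcard_hex_string(instructions):
    """Build a YARA hex string that keeps opcode bytes and wildcards operands.

    `instructions` is a list of (opcode_bytes, operand_length) pairs.
    This split is a simplification (assumption) of real instruction encoding.
    """
    tokens = []
    for opcode_bytes, operand_length in instructions:
        tokens.extend(f"{byte:02X}" for byte in opcode_bytes)
        tokens.extend("??" for _ in range(operand_length))
    return "{ " + " ".join(tokens) + " }"


# Hypothetical mnemonic sequence, as a disassembler might produce it.
mnemonics = ["push", "mov", "push", "call"]
bigrams = mnemonic_ngrams(mnemonics, 2)
weights = term_frequencies(bigrams)

# Hypothetical byte-level view of "push imm32; call rel32":
# one opcode byte each, followed by four operand bytes.
instructions = [(b"\x68", 4), (b"\xe8", 4)]
pattern = wildcard_hex_string(instructions)
print(pattern)
```

The `??` tokens are standard YARA hex-string wildcards, so the resulting pattern matches the instruction sequence regardless of the concrete operand values.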
We've kept the methodology deliberately simple to demonstrate what can be achieved and also due to the project's time constraints. Using n-grams of mnemonics (where n is small) is simple but lacks semantic meaning - a malware analyst would have a hard time figuring out the context of the output sequence. Semantic meaning can be achieved by increasing the n-gram size (see also KiloGrams: Very Large N-Grams for Malware Classification) or by using semantically meaningful features in the first place, such as function bodies of the disassembled binaries. Further, one could explore more elaborate models, such as sequence models like RNNs.
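To give a feel for how basic the modelling step can be, here is a toy, pure-Python logistic regression trained with batch gradient descent on term frequency vectors. It is only a sketch under stated assumptions: the three feature columns, the four samples, and the hyperparameters are invented for illustration, and clava's actual training code may look entirely different.

```python
import math


def predict_proba(weights, bias, features):
    """Return the modelled probability that a sample is malware."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


def train_logistic_regression(samples, labels, epochs=500, learning_rate=0.5):
    """Fit weights and bias with plain batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(samples, labels):
            error = predict_proba(weights, bias, features) - label
            for j, value in enumerate(features):
                grad_w[j] += error * value
            grad_b += error
        # Average the gradient over the batch before taking a step.
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


# Invented term frequencies for three hypothetical n-grams:
# ("xor", "xor"), ("push", "call"), ("mov", "mov")
samples = [
    [0.7, 0.2, 0.1],  # malware
    [0.6, 0.2, 0.2],  # malware
    [0.1, 0.4, 0.5],  # goodware
    [0.0, 0.5, 0.5],  # goodware
]
labels = [1, 1, 0, 0]

weights, bias = train_logistic_regression(samples, labels)
malware_score = predict_proba(weights, bias, [0.65, 0.2, 0.15])
goodware_score = predict_proba(weights, bias, [0.05, 0.45, 0.5])
```

In such a linear model, n-grams with large positive weights are the ones that push a sample towards the malware class, which is what makes them natural candidates for rule strings.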
The trained models are not public. However, you can train a model on your own dataset. Instructions will follow.
To install clava, clone this repository and run (preferably in a virtualenv):
$ pip install -r requirements.txt
$ python setup.py install
clava offers a simple CLI to interact. To list all available options, run:
$ clava -h
To generate a yara rule based on a sample:
$ clava yara <path/to/sample>
Use the official Yara binaries to apply the generated rule to your sample and/or corpus of samples. The binaries can be downloaded from here.
For example:
# Create yara rule 'detect-evil.yar' for evil.exe:
$ clava yara evil.exe -o detect-evil.yar
# Check if any file in a corpus matches the generated rule:
$ yara detect-evil.yar my-malware-corpus/
# Tip: If you have a large corpus, you can compile the yara rule to
# increase the performance:
$ yarac detect-evil.yar detect-evil-compiled
$ yara -C detect-evil-compiled my-malware-corpus/
Important: Rules created with clava should not be used directly in production, but they can assist during rule development. This project is heavily inspired by yarGen, so see also Florian Roth's blog post "How to post-process YARA rules generated by yarGen".
During development, install clava in editable mode:
$ pip install -e .[dev]
clava uses pytest. To run the test suite with a set of predefined settings, run:
$ make tests
Alternatively, you can run pytest against the tests/ directory with your own settings.
Contributions are welcome! If you plan major changes, please create an issue first to discuss the changes.
Good datasets are essential, but there are not many public datasets of goodware and malware executables. You can assemble your own dataset using projects like:
- VirusShare offers access to large amounts of malware (registration required).
- MalwareBazaar offers daily collections of malware: https://mb-api.abuse.ch/downloads/
- APTMalware Github repo
- Windows Sysinternals is a collection of Windows system utilities that often cause security tooling to generate false positives, which makes it a great dataset to test your rules against.
Public goodware datasets are rare - PRs are welcome :-)
Tools:
- Capstone.js for interactive disassembling, useful during development.
clava was heavily inspired by these projects:
I would also like to thank these projects: