clava 🔍

Generate Code-Based Yara Rules using Machine Learning.

About

I wrote clava for an industry project during my studies at Hochschule Luzern. This project researches how to automatically generate code-based Yara rules for a given malware sample using machine learning. We've kept the machine learning part intentionally rudimentary to demonstrate how much can be achieved with basic methods. The research is documented in a paper (German only). Contact me if you are interested in the paper.

TL;DR: clava creates n-grams of mnemonics (e.g., XOR or PUSH) from goodware and malware and trains a logistic regression classifier on the n-grams' term-frequency weights. We drop the operands, as they are subject to change, and keep only each instruction's operation part to improve the robustness of the rules. Dropping the operands requires wildcarding the Yara rules. We use mkYARA (kudos!) for that task.
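To make the feature extraction concrete, here is a minimal sketch of mnemonic n-grams and their term frequencies. It is not clava's actual code: the function names, the sample mnemonic sequence, the choice of n = 3, and the normalization by total n-gram count are all assumptions made for illustration.

```python
from collections import Counter


def mnemonic_ngrams(mnemonics, n=3):
    """Return all contiguous n-grams of a mnemonic sequence as tuples."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]


def term_frequencies(ngrams):
    """Map each n-gram to its relative frequency within one sample.

    Assumption: term frequency is the n-gram count divided by the total
    number of n-grams in the sample.
    """
    counts = Counter(ngrams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical disassembly of one sample, with the operands already dropped.
sample = ["push", "mov", "xor", "push", "mov", "xor", "call"]

grams = mnemonic_ngrams(sample, n=3)
tf = term_frequencies(grams)

for gram, weight in sorted(tf.items(), key=lambda item: -item[1]):
    print(" ".join(gram), round(weight, 2))
```

A classifier such as logistic regression would then be trained on these per-sample weights, with one feature per n-gram observed in the training corpus.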

We've kept the methodology simple to demonstrate what can be achieved with basic methods, and also because of the project's time constraints. Using n-grams of mnemonics (where n is small) is simple but lacks semantic meaning: a malware analyst would have a hard time figuring out the context of the output sequence. Semantic meaning can be achieved by increasing the n-gram size (see also KiloGrams: Very Large N-Grams for Malware Classification) or by using semantically meaningful features in the first place, such as function bodies of the disassembled binaries. Further, one could explore more elaborate models, such as sequence models like RNNs.

The trained models are not public. However, you can train a model on your own dataset. Instructions will follow.

Getting Started

To install clava, clone this repository and run (preferably in a virtualenv):

$ pip install -r requirements.txt
$ python setup.py install

clava offers a simple CLI. To list all available options, run:

$ clava -h

To generate a yara rule based on a sample:

$ clava yara <path/to/sample>

Use the official Yara binaries to apply the generated rule to your sample and/or a corpus of samples. The binaries are available for download from the Yara project.

For example:

# Create yara rule 'detect-evil.yar' for evil.exe:
$ clava yara evil.exe -o detect-evil.yar

# Check if any file in a corpus matches the generated rule:
$ yara detect-evil.yar my-malware-corpus/

# Tip: If you have a large corpus, you can compile the yara rule to
# increase the performance:
$ yarac detect-evil.yar detect-evil-compiled
$ yara -C detect-evil-compiled my-malware-corpus/
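
If you want to script the scan, the commands above can be wrapped in a few lines of Python. This is only a sketch: the helper names are hypothetical, it assumes the yara binary is on your PATH, and it assumes yara's default output of one "<rule name> <file path>" pair per line.

```python
import subprocess


def build_yara_command(rule_path, target, compiled=False):
    """Build the yara command line; -C marks the rule file as compiled."""
    cmd = ["yara"]
    if compiled:
        cmd.append("-C")
    cmd += [rule_path, target]
    return cmd


def parse_matches(output):
    """Parse yara output lines into (rule_name, file_path) tuples.

    Assumption: each non-empty line has the form '<rule name> <file path>'.
    """
    matches = []
    for line in output.splitlines():
        if not line.strip():
            continue
        rule, _, path = line.partition(" ")
        matches.append((rule, path))
    return matches


def scan(rule_path, target, compiled=False):
    """Run yara against a file or directory and return the parsed matches."""
    result = subprocess.run(
        build_yara_command(rule_path, target, compiled),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_matches(result.stdout)
```

For example, scan("detect-evil.yar", "my-malware-corpus/") would return a list of matches for the rule generated above.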

Important: Rules created with clava should not be used directly in production, but they can assist during rule development. This project is heavily inspired by yarGen, so see also Florian Roth's blog post "How to post-process YARA rules generated by yarGen".

Development

During development, install clava in editable mode:

$ pip install -e .[dev]

Running the tests

clava uses pytest. To run the test suite with a set of predefined settings, run:

$ make tests

Alternatively, you can run pytest against the tests/ directory with your own settings.

Contribute

Contributions are welcome! If you plan major changes, please create an issue first to discuss the changes.

Resources

Good datasets are essential; however, there are few public datasets of goodware and malware executables. You can assemble your own dataset using projects like:

VirusShare offers access to large amounts of malware (registration required).
MalwareBazaar offers daily collections of malware: https://mb-api.abuse.ch/downloads/
APTMalware GitHub repo
Windows Sysinternals is a collection of Windows system utilities; they often cause security tooling to generate false positives, which makes them a great dataset to test your rules against.

Public goodware datasets are rare - PRs are welcome :-)

Tools:

Capstone.js for interactive disassembling, useful during development.

Credits

clava was heavily inspired by these projects:

I would also like to thank these projects:
