A library for mining path-based representations of code (and more), supported by the Machine Learning Methods for Software Engineering group at JetBrains Research.
Supported input languages:
| | Java | Python | C/C++ | JavaScript | PHP |
|---|---|---|---|---|---|
| ANTLR | ✅ | ✅ | | ✅ | ✅ |
| GumTree | ✅ (JDT and srcML) | ✅ | | | |
| Fuzzy | | | ✅ | | |
| JavaParser | ✅ | | | | |
| TreeSitter | ✅ | | | | |
| JavaLang | ✅ | | | | |
`astminer` lets you create an end-to-end pipeline to process code for machine learning models.
Currently, it supports the extraction of:
- Path-based representations of files/methods
- Raw ASTs of files/methods
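To make "path-based representation" concrete, here is a minimal sketch in Java. It assumes a toy AST and builds one path context per pair of leaves: the start token, the node types on the way up to the lowest common ancestor and down to the other leaf, and the end token. The `Node` class, the `^`/`_` separators, and all method names are hypothetical and chosen for illustration. They are not `astminer`'s API, and the library's own path definition and limits on path length or width may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: every class and method here is hypothetical
// and does not mirror astminer's actual API.
class PathContextSketch {

    /** A toy AST node: a type label, a token (leaves only), and child nodes. */
    static final class Node {
        final String type;
        final String token; // null for inner nodes
        final List<Node> children = new ArrayList<>();
        Node parent;

        Node(String type, String token) {
            this.type = type;
            this.token = token;
        }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Collects leaves in left-to-right order. */
    static void collectLeaves(Node node, List<Node> out) {
        if (node.children.isEmpty()) {
            out.add(node);
            return;
        }
        for (Node child : node.children) {
            collectLeaves(child, out);
        }
    }

    /** Returns the chain [node, parent, ..., root]. */
    static List<Node> chainToRoot(Node node) {
        List<Node> chain = new ArrayList<>();
        for (Node n = node; n != null; n = n.parent) {
            chain.add(n);
        }
        return chain;
    }

    /** Node types from start up to the lowest common ancestor ('^'), then down to end ('_'). */
    static String pathBetween(Node start, Node end) {
        List<Node> up = chainToRoot(start);
        List<Node> down = chainToRoot(end);
        int i = up.size() - 1;
        int j = down.size() - 1;
        // Both chains end at the root; move towards the leaves while they still share nodes.
        while (i > 0 && j > 0 && up.get(i - 1) == down.get(j - 1)) {
            i--;
            j--;
        }
        StringBuilder path = new StringBuilder();
        for (int k = 0; k <= i; k++) {
            if (k > 0) path.append('^');
            path.append(up.get(k).type);
        }
        for (int k = j - 1; k >= 0; k--) {
            path.append('_').append(down.get(k).type);
        }
        return path.toString();
    }

    /** One "startToken,path,endToken" string per ordered pair of leaves. */
    static List<String> extractPathContexts(Node root) {
        List<Node> leaves = new ArrayList<>();
        collectLeaves(root, leaves);
        List<String> contexts = new ArrayList<>();
        for (int a = 0; a < leaves.size(); a++) {
            for (int b = a + 1; b < leaves.size(); b++) {
                Node s = leaves.get(a);
                Node e = leaves.get(b);
                contexts.add(s.token + "," + pathBetween(s, e) + "," + e.token);
            }
        }
        return contexts;
    }

    /** Toy tree for the statement `x = y + 1`. */
    static Node sampleTree() {
        Node plus = new Node("BinaryPlus", null)
                .add(new Node("Name", "y"))
                .add(new Node("IntLiteral", "1"));
        return new Node("Assign", null)
                .add(new Node("Name", "x"))
                .add(plus);
    }

    public static void main(String[] args) {
        extractPathContexts(sampleTree()).forEach(System.out::println);
    }
}
```

For the toy tree of `x = y + 1`, this prints three contexts: `x,Name^Assign_BinaryPlus_Name,y`, `x,Name^Assign_BinaryPlus_IntLiteral,1`, and `y,Name^BinaryPlus_IntLiteral,1`.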
`astminer` was first implemented as part of the pipeline in the code style extraction project and was later converted into a reusable tool.
It is designed to be easily extensible to new languages.
`astminer` allows you to convert source code cloned from VCSs into formats suitable for model training.
To achieve that, `astminer` incorporates the following processing modules:
- Filters to remove redundant samples from data.
- Label extractors to create a label for each tree.
- Storages to define the storage format.
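The way these three kinds of modules fit together can be sketched as a small Java pipeline. Everything below is an assumption made for illustration: `Sample`, `Storage`, `InMemoryStorage`, and `run` are hypothetical names, not `astminer`'s interfaces. The sketch only shows the flow: filters drop samples, a label extractor names each survivor, and a storage decides the output format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch: these types only illustrate the filter -> label -> storage
// flow and are not astminer's actual interfaces.
class MiningPipelineSketch {

    /** A stand-in for a parsed method: its name and the size of its tree. */
    static final class Sample {
        final String methodName;
        final int treeSize;

        Sample(String methodName, int treeSize) {
            this.methodName = methodName;
            this.treeSize = treeSize;
        }
    }

    /** Decides how labeled samples are written out. */
    interface Storage {
        void store(String label, Sample sample);
    }

    /** Keeps one "label treeSize" line per stored sample in memory. */
    static final class InMemoryStorage implements Storage {
        final List<String> lines = new ArrayList<>();

        @Override
        public void store(String label, Sample sample) {
            lines.add(label + " " + sample.treeSize);
        }
    }

    static void run(List<Sample> samples,
                    List<Predicate<Sample>> filters,
                    Function<Sample, String> labelExtractor,
                    Storage storage) {
        for (Sample sample : samples) {
            // A sample survives only if every filter accepts it.
            boolean keep = filters.stream().allMatch(f -> f.test(sample));
            if (keep) {
                storage.store(labelExtractor.apply(sample), sample);
            }
        }
    }

    static List<Sample> sampleInput() {
        return Arrays.asList(
                new Sample("getName", 12),
                new Sample("", 30),             // no usable name, so no label
                new Sample("processAll", 5000), // tree too large
                new Sample("toString", 40));
    }

    static List<Predicate<Sample>> defaultFilters() {
        List<Predicate<Sample>> filters = new ArrayList<>();
        filters.add(s -> !s.methodName.isEmpty());
        filters.add(s -> s.treeSize <= 1000);
        return filters;
    }

    public static void main(String[] args) {
        InMemoryStorage storage = new InMemoryStorage();
        // The label extractor here simply uses the method name as the label.
        run(sampleInput(), defaultFilters(), s -> s.methodName, storage);
        storage.lines.forEach(System.out::println);
    }
}
```

With the sample input, only `getName` and `toString` pass both filters, so the storage ends up with the lines `getName 12` and `toString 40`. Keeping the three roles separate means a different output format only requires another `Storage` implementation.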
There are two ways to use `astminer`:
- As a standalone CLI tool with a pre-implemented logic for common processing and mining tasks.
- Integrated into your Kotlin/Java mining pipelines as a Gradle dependency.
- Build the CLI from the sources.
- Prepare your inputs and configure pipeline options. For config examples, see the configs directory.
- To run the CLI, pass the config to the shell script:

      ./cli.sh <path-to-YAML-config>

Alternatively, you can run the tool inside the Docker image.
`astminer` is available in the JetBrains Space package repository. You can add the dependency in your `build.gradle` file:

    repositories {
        maven {
            url "https://packages.jetbrains.team/maven/p/astminer/astminer"
        }
    }
    dependencies {
        implementation 'io.github.vovak:astminer:<VERSION>'
    }

If you use `build.gradle.kts`:

    repositories {
        maven(url = uri("https://packages.jetbrains.team/maven/p/astminer/astminer"))
    }
    dependencies {
        implementation("io.github.vovak:astminer:<VERSION>")
    }

To use a specific version of the library, switch to the required branch and build a local version of `astminer`:

    ./gradlew publishToMavenLocal

After that, add `mavenLocal()` to the `repositories` section of your Gradle configuration.
If you want to use `astminer` as a library in your Java/Kotlin-based data mining tool, check the following usage examples:
- Simple standalone example scripts in Java and Kotlin that call different `astminer` APIs.
- psiminer, a mining tool that uses `astminer` to extract paths from PSI trees. See the [code2seq storage implementation](https://github.com/JetBrains-Research/psiminer/blob/master/psiminer-core/src/main/kotlin/storage/paths/Code2SeqStorage.kt).
Please consider trying Kotlin for your data mining pipelines: in our experience, it is much better suited to data collection and transformation tools than Java.
We believe that `astminer` can find use beyond our own mining tasks.
Please help make `astminer` easier to use by sharing your use cases. Pull requests are welcome as well.
Support for other languages and documentation are the key areas for improvement.
A paper dedicated to `astminer` (more precisely, to its older version, PathMiner) was presented at MSR'19.
If you use `astminer` in your academic work, please cite it.

    @inproceedings{kovalenko2019pathminer,
      title={PathMiner: a library for mining of path-based representations of code},
      author={Kovalenko, Vladimir and Bogomolov, Egor and Bryksin, Timofey and Bacchelli, Alberto},
      booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
      pages={13--17},
      year={2019},
      organization={IEEE Press}
    }