Static source code metrics and static analysis warnings for fine-grained just-in-time defect prediction
This replication kit contains the Jupyter notebooks used to aggregate the results of the mining process. We describe all steps needed for a reproduction. However, due to the nature and size of the data, not everything can be included in this replication kit.
The raw data used in this study was collected with the SmartSHARK infrastructure over a multi-year period. For every commit and every file, static source code metrics and static analysis warnings were collected, among other features. A dump of the resulting MongoDB database can be found here. The SmartSHARK documentation contains the information needed to create this dataset; however, a production deployment with a connection to an HPC cluster is advisable due to the computation time required.
The SmartSHARK database contains, in GridFS, a snapshot of the Git repositories of the projects at the time the data was mined via SmartSHARK. After extracting the repositories from GridFS to a local path, the extraction can be started.
For fine-grained just-in-time defect prediction we accumulate a large amount of data per file over its full history in the commit graph. Therefore, a special tool is needed to extract this information correctly. The tool can be found here. It is a constantly evolving research prototype that uses breadth-first search, path slicing, and a stack per path to traverse every possible path in a commit graph while keeping a separate state for each path.
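The traversal idea can be illustrated with a minimal sketch (this is not the actual tool, only an illustration of breadth-first search with a separate state per path over a toy commit graph):

```python
from collections import deque

def all_paths(children, root):
    """Enumerate every root-to-head path in a commit DAG via BFS.

    `children` maps a commit hash to the list of its child commits;
    each queue entry carries its own path state, mirroring the
    stack-per-path idea of the mining tool.
    """
    paths = []
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        succs = children.get(path[-1], [])
        if not succs:  # head of a branch: this path is complete
            paths.append(path)
        for child in succs:
            queue.append(path + [child])
    return paths

# Toy commit graph: c0 branches into c1 and c2, which merge at c3.
graph = {"c0": ["c1", "c2"], "c1": ["c3"], "c2": ["c3"]}
print(all_paths(graph, "c0"))  # → [['c0', 'c1', 'c3'], ['c0', 'c2', 'c3']]
```

The real tool additionally slices paths and keeps per-path metric states, but the path enumeration follows this pattern.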
This tool was used to create the raw mined data that can be downloaded separately.
This download contains the raw mined data used to generate the evaluation data already contained in this replication kit. You do not need to download this if you only want to re-create the plots and tables.
```bash
cd data
wget https://user.informatik.uni-goettingen.de/~trautsch2/icsme2020_mining_data.zip
unzip icsme2020_mining_data.zip
```

```bash
python -m venv .
source bin/activate
pip install -r requirements.txt
```

```bash
source bin/activate
cd notebooks
jupyter lab
```
JupyterLab is used to aggregate the results and create the plots. To aggregate the raw mining data, run the TrainTest.ipynb and Interval.ipynb notebooks. Both aggregate the jit_sn_*.csv files into the aggregated results of the model evaluation.
The plots are generated by TrainTestEval.ipynb and IntervalEval.ipynb.
Overview.ipynb and Correlation.ipynb provide an overview of the data and the top 10 features.
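The aggregation step performed by the notebooks can be sketched as follows (a minimal stand-in, not the notebook code: the file names imitate the jit_sn_*.csv pattern, and only the `project` and `rf_f1` columns from the tables below are used, with made-up values):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two stand-in result files in place of the real jit_sn_*.csv.
tmp = tempfile.mkdtemp()
pd.DataFrame({"project": ["ant-ivy", "ant-ivy"], "rf_f1": [0.41, 0.45]}).to_csv(
    os.path.join(tmp, "jit_sn_1.csv"), index=False)
pd.DataFrame({"project": ["calcite", "calcite"], "rf_f1": [0.38, 0.40]}).to_csv(
    os.path.join(tmp, "jit_sn_2.csv"), index=False)

# Concatenate all per-run files and aggregate the performance per project.
frames = [pd.read_csv(f) for f in sorted(glob.glob(os.path.join(tmp, "jit_sn_*.csv")))]
results = pd.concat(frames, ignore_index=True)
summary = results.groupby("project")["rf_f1"].mean().round(3)
print(summary)
```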
This replication kit contains the following folders:
- notebooks contains the Jupyter notebooks
- data contains the train/test model evaluation as well as the interval approach model evaluation; the raw mining data (if downloaded) is also placed here
- figures is the output directory for the figures of the paper generated by the notebooks
Three types of CSV files are used or generated by this replication kit. The first two contain the model evaluation for the train/test split and the interval approach. The last one is the mined data.
This file contains the model evaluation for the train/test split. It contains the name of the project, label, feature set as well as the model performance metrics and upper and lower bounds for the cost model.
Field | Description |
---|---|
project | Name of the project |
label | Dependent variable, one of: pascarella_commit (same as the Pascarella replication kit, not used in the paper), adhoc_label (ad-hoc SZZ), bug_label (ITS SZZ) |
metric_set | Feature set used in this evaluation, one of: jit (only just-in-time change metrics), static (only static source code metrics), pmd (only pmd warnings), jit_static_pmd (combined feature set) |
rf_f1 | Random Forest model performance metric F-measure |
rf_roc_auc | Random Forest model performance metric AUC |
lr_f1 | Elastic Net Logistic Regression model performance metric F-measure |
lr_roc_auc | Elastic Net Logistic Regression model performance metric AUC |
rf_ub | Random Forest cost model upper bound |
rf_lb | Random Forest cost model lower bound |
lr_ub | Elastic Net Logistic Regression cost model upper bound |
lr_lb | Elastic Net Logistic Regression cost model lower bound |
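A minimal sketch of how this file can be queried, e.g. to compare feature sets per project (the rows and values here are invented; only the column names follow the table above):

```python
import pandas as pd

# Invented example rows using the columns described above.
df = pd.DataFrame({
    "project": ["ant-ivy", "ant-ivy", "calcite", "calcite"],
    "label": ["bug_label"] * 4,
    "metric_set": ["jit", "jit_static_pmd", "jit", "jit_static_pmd"],
    "rf_roc_auc": [0.70, 0.74, 0.66, 0.69],
})

# One row per project, one column per feature set.
pivot = df.pivot(index="project", columns="metric_set", values="rf_roc_auc")

# For how many projects does the combined feature set beat jit alone?
improved = int((pivot["jit_static_pmd"] > pivot["jit"]).sum())
print(improved)
```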
This file contains the model evaluation for the interval approach. There is one CSV file for each project.
Field | Description |
---|---|
project | Name of the project |
train_size | Number of instances in the training data |
train_pos | Number of true positives, i.e., bug-inducing instances in the training data |
test_size | Number of instances in the test data |
test_pos | Number of true positives, i.e., bug-inducing instances in the test data |
train_start | Start date of the training data |
train_end | End date of the training data |
test_start | Start date of the test data |
test_end | End date of the test data |
features | Feature set used in this evaluation, one of: jit (only just-in-time change metrics), static (only static source code metrics), pmd (only pmd warnings), jit_static_pmd (combined feature set) |
label | Dependent variable, one of: label_adhoc (ad-hoc SZZ), label_bug (ITS SZZ) |
ignore_dates | Unused, always true |
use_smote | If SMOTE sampling was used, always true |
metric_set | Duplicate of features |
rf_f1 | Random Forest model performance metric F-measure |
rf_roc_auc | Random Forest model performance metric AUC |
lr_f1 | Elastic Net Logistic Regression model performance metric F-measure |
lr_roc_auc | Elastic Net Logistic Regression model performance metric AUC |
rf_ub | Random Forest cost model upper bound |
rf_lb | Random Forest cost model lower bound |
lr_ub | Elastic Net Logistic Regression cost model upper bound |
lr_lb | Elastic Net Logistic Regression cost model lower bound |
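The interval approach evaluated in these files trains on one time window and tests on the following one, which is what the train_start/train_end and test_start/test_end fields record. A minimal sketch of such a chronological split (dates, labels, and window length are illustrative, not taken from the study):

```python
from datetime import datetime, timedelta

# Illustrative dated instances: (commit timestamp, bug-inducing label).
instances = [
    (datetime(2019, 1, 5), 0),
    (datetime(2019, 2, 10), 1),
    (datetime(2019, 4, 1), 0),
    (datetime(2019, 5, 20), 1),
]

def interval_split(data, start, window):
    """Split into one training window and the immediately following test window."""
    train_end = start + window
    test_end = train_end + window
    train = [x for x in data if start <= x[0] < train_end]
    test = [x for x in data if train_end <= x[0] < test_end]
    return train, test

train, test = interval_split(instances, datetime(2019, 1, 1), timedelta(days=90))
# train_size/test_size and train_pos/test_pos as in the table above:
print(len(train), sum(x[1] for x in train), len(test), sum(x[1] for x in test))
```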
Contained in the optional ZIP file. There is one CSV for every project. The file contains the raw mined data including all features and meta information.
Field | Description |
---|---|
commit | Revision hash of the commit |
committer_date | Datetime of the commit (UTC) |
file | Name of the file that was changed |
comm, adev, ddev, add, del, own, minor, sctr, nd, entropy, la, ld, cexp, rexp, sexp, nuc, age, oexp, exp, nsctr, ncomm, nadev, nddev, lt, fix_bug | JIT Features |
parent_*, current_*, delta_* [PDA, LOC, CLOC, PUA, McCC, LLOC, LDC, NOS, MISM, CCL, TNOS, TLLOC, NLE, CI, HPL, MI, HPV, CD, NOI, NUMPAR, MISEI, CC, LLDC, NII, CCO, CLC, TCD, NL, TLOC, CLLC, TCLOC, MIMS, HDIF, DLOC, NLM, DIT, NPA, TNLPM, TNLA, NLA, AD, TNLPA, NM, TNG, NLPM, TNM, NOC, NOD, NOP, NLS, NG, TNLG, CBOI, RFC, NLG, TNLS, TNA, NLPA, NOA, WMC, NPM, TNPM, TNS, NA, LCOM5, NS, CBO, TNLM, TNPA] | Static Features |
parent_*, current_*, delta_* [PMD_ABSALIL, PMD_ADLIBDC, PMD_AMUO, PMD_ATG, PMD_AUHCIP, PMD_AUOV, PMD_BII, PMD_BI, PMD_BNC, PMD_CRS, PMD_CSR, PMD_CCEWTA, PMD_CIS, PMD_DCTR, PMD_DUFTFLI, PMD_DCL, PMD_ECB, PMD_EFB, PMD_EIS, PMD_EmSB, PMD_ESNIL, PMD_ESI, PMD_ESS, PMD_ESB, PMD_ETB, PMD_EWS, PMD_EO, PMD_FLSBWL, PMD_JI, PMD_MNC, PMD_OBEAH, PMD_RFFB, PMD_UIS, PMD_UCT, PMD_UNCIE, PMD_UOOI, PMD_UOM, PMD_FLMUB, PMD_IESMUB, PMD_ISMUB, PMD_WLMUB, PMD_CTCNSE, PMD_PCI, PMD_AIO, PMD_AAA, PMD_APMP, PMD_AUNC, PMD_DP, PMD_DNCGCE, PMD_DIS, PMD_ODPL, PMD_SOE, PMD_UC, PMD_ACWAM, PMD_AbCWAM, PMD_ATNFS, PMD_ACI, PMD_AICICC, PMD_APFIFC, PMD_APMIFCNE, PMD_ARP, PMD_ASAML, PMD_BC, PMD_CWOPCSBF, PMD_ClR, PMD_CCOM, PMD_DLNLISS, PMD_EMIACSBA, PMD_EN, PMD_FDSBASOC, PMD_FFCBS, PMD_IO, PMD_IF, PMD_ITGC, PMD_LI, PMD_MBIS, PMD_MSMINIC, PMD_NCLISS, PMD_NSI, PMD_NTSS, PMD_OTAC, PMD_PLFICIC, PMD_PLFIC, PMD_PST, PMD_REARTN, PMD_SDFNL, PMD_SBE, PMD_SBR, PMD_SC, PMD_SF, PMD_SSSHD, PMD_TFBFASS, PMD_UEC, PMD_UEM, PMD_ULBR, PMD_USDF, PMD_UCIE, PMD_ULWCC, PMD_UNAION, PMD_UV, PMD_ACF, PMD_EF, PMD_FDNCSF, PMD_FOCSF, PMD_FO, PMD_FSBP, PMD_DIJL, PMD_DI, PMD_IFSP, PMD_TMSI, PMD_UFQN, PMD_DNCSE, PMD_LHNC, PMD_LISNC, PMD_MDBASBNC, PMD_RINC, PMD_RSINC, PMD_SEJBFSBF, PMD_JUASIM, PMD_JUS, PMD_JUSS, PMD_JUTCTMA, PMD_JUTSIA, PMD_SBA, PMD_TCWTC, PMD_UBA, PMD_UAEIOAT, PMD_UANIOAT, PMD_UASIOAT, PMD_UATIOAE, PMD_GDL, PMD_GLS, PMD_PL, PMD_UCEL, PMD_APST, PMD_GLSJU, PMD_LINSF, PMD_MTOL, PMD_SP, PMD_MSVUID, PMD_ADS, PMD_AFNMMN, PMD_AFNMTN, PMD_BGMN, PMD_CNC, PMD_GN, PMD_MeNC, PMD_MWSNAEC, PMD_NP, PMD_PC, PMD_SCN, PMD_SMN, PMD_SCFN, PMD_SEMN, PMD_SHMN, PMD_VNC, PMD_AES, PMD_AAL, PMD_RFI, PMD_UWOC, PMD_UALIOV, PMD_UAAL, PMD_USBFSA, PMD_AISD, PMD_MRIA, PMD_ACGE, PMD_ACNPE, PMD_ACT, PMD_ALEI, PMD_ARE, PMD_ATNIOSE, PMD_ATNPE, PMD_ATRET, PMD_DNEJLE, PMD_DNTEIF, PMD_EAFC, PMD_ADL, PMD_ASBF, PMD_CASR, PMD_CLA, PMD_ISB, PMD_SBIWC, PMD_StI, PMD_STS, PMD_UCC, PMD_UETCS, PMD_ClMMIC, PMD_LoC, PMD_SiDTE, PMD_UnI, PMD_ULV, PMD_UPF, PMD_UPM] | PMD Features (Rules) |
system_WD (System/WD), file_system_sum_WD (File/System/WD), author_delta_sum_WD (Author/Delta/WD) | PMD Features (Warning density based) |
previous_inducing | Not used in the paper, marks files that were previously bug-inducing |
pascarella_commit | Pascarella commit label (from the Pascarella replication kit) |
pascarella_file | Pascarella file label (from the Pascarella replication kit) |
label_adhoc | Ad-hoc SZZ |
label_bug | ITS SZZ |
adhoc_**, JIRAKEY*_* | Bug matrix containing ad-hoc and ITS SZZ results; the value is 1 if the file was inducing for the bug. The field name contains the revision hash and datetime of the bug-fixing commit and, in the case of Jira (ITS SZZ), the issue id. |
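How the bug-matrix columns relate to the label fields can be sketched like this (the file names, column names, and Jira key below are invented examples of the adhoc_* / JIRAKEY*_* patterns):

```python
import pandas as pd

# Invented example rows; the real column names encode the revision hash,
# datetime of the bug-fixing commit, and (for ITS SZZ) the Jira issue id.
df = pd.DataFrame({
    "file": ["src/A.java", "src/B.java", "src/C.java"],
    "adhoc_ab12_2019-01-02": [1, 0, 0],
    "IVY-123_cd34_2019-02-03": [1, 1, 0],
})

adhoc_cols = [c for c in df.columns if c.startswith("adhoc_")]
its_cols = [c for c in df.columns if not c.startswith(("adhoc_", "file"))]

# A file is labelled bug-inducing if it induced at least one bug.
df["label_adhoc"] = (df[adhoc_cols].sum(axis=1) > 0).astype(int)
df["label_bug"] = (df[its_cols].sum(axis=1) > 0).astype(int)
print(df[["file", "label_adhoc", "label_bug"]])
```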