Replication kit for Static source code metrics and static analysis warnings for fine-grained just-in-time defect prediction

atrautsch/icsme2020_replication


Static source code metrics and static analysis warnings for fine-grained just-in-time defect prediction

This replication kit contains the Jupyter notebooks used to aggregate the results of the mining process. We describe all steps needed for a reproduction. However, due to the nature and size of the data, not everything can be contained in this replication kit.

SmartSHARK Data

The raw data used in this study was collected with the SmartSHARK infrastructure over a multi-year period. For every commit and every file, static source code metrics and static analysis warnings were collected, among other features. A dump of the resulting MongoDB can be found here. The SmartSHARK documentation has the information needed to create this dataset; however, a production deployment with a connection to an HPC cluster is advisable due to the computation time required.

Fine-grained just-in-time defect prediction data mining

The SmartSHARK database stores, in GridFS, a snapshot of the Git repositories of the projects as they were at the time the data was mined via SmartSHARK. After extracting the repositories from GridFS to a local path, the data extraction for defect prediction can be started.

For fine-grained just-in-time defect prediction, we accumulate a lot of data per file over its full history in the commit graph. Therefore, a special tool is needed to extract this information correctly. The tool can be found here. It is a constantly changing research prototype that uses breadth-first search, path slicing, and a stack per path to traverse every possible path in a commit graph while keeping state for each path.
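The idea of following every path while keeping separate state per path can be illustrated with a small sketch. This is not the actual tool: it is a minimal breadth-first traversal over a toy commit graph, it omits path slicing and the per-path stack, and the names `walk_paths`, `add_lines`, `CHILDREN` and `LINES_ADDED` are hypothetical.

```python
from collections import deque

# Toy commit graph (hypothetical): A branches into B and C, both lead to D.
CHILDREN = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
# Made-up number of lines added to one file in each commit.
LINES_ADDED = {"A": 1, "B": 2, "C": 5, "D": 1}


def add_lines(state, commit):
    """Update the per-path state with the lines added in `commit`."""
    state["la"] = state.get("la", 0) + LINES_ADDED[commit]
    return state


def walk_paths(children, root, update):
    """Follow every path from `root` breadth-first.

    Each queue entry carries its own copy of the state, so paths that
    share commits (here: A and D) do not share accumulated values.
    """
    finished = []
    queue = deque([(root, [root], update({}, root))])
    while queue:
        commit, path, state = queue.popleft()
        successors = children.get(commit, [])
        if not successors:
            # End of a path: record the path and its final state.
            finished.append((path, state))
            continue
        for child in successors:
            # Copy the state before extending the path to `child`.
            queue.append((child, path + [child], update(dict(state), child)))
    return finished


results = walk_paths(CHILDREN, "A", add_lines)
for path, state in results:
    print(" -> ".join(path), state)
```

Commit D is reached twice, once per path, and ends up with a different accumulated value each time. The number of paths can grow quickly with the number of branches and merges, which is presumably why the real tool needs more machinery than this sketch.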

This tool was used to create the raw mined data that can be downloaded separately.

Download data

This download contains the raw mined data used to generate the evaluation data already contained in this replication kit. You do not need to download this if you only want to re-create the plots and tables.

cd data
wget https://user.informatik.uni-goettingen.de/~trautsch2/icsme2020_mining_data.zip
unzip icsme2020_mining_data.zip

Install dependencies for jupyter lab

python -m venv .
source bin/activate
pip install -r requirements.txt

Run jupyter lab

source bin/activate
cd notebooks
jupyter lab

Jupyter lab is used to aggregate the results and create the plots. To aggregate the raw mining data, run the TrainTest.ipynb and Interval.ipynb notebooks. Both read the jit_sn_*.csv files and aggregate them into the model evaluation results.

The plots are generated by TrainTestEval.ipynb and IntervalEval.ipynb.

Overview.ipynb and Correlation.ipynb provide an overview of the data and the top 10 features.

Folder structure

This replication kit contains the following folders:

  • notebooks contains the Jupyter notebooks
  • data contains the train/test model evaluation as well as the interval approach model evaluation; the raw mining data (if needed) is also put here
  • figures is the output directory for the figures of the paper generated by the notebooks

Data structure

Three types of CSV files are used / generated by this replication kit. The first two contain the model evaluation for train/test split and interval approach. The last one is the mined data.

train_test_all.csv

This file contains the model evaluation for the train/test split. It contains the name of the project, label, feature set as well as the model performance metrics and upper and lower bounds for the cost model.

Field Description
project Name of the project
label Dependent variable, one of: pascarella_commit (same as pascarella replication kit, not used in the paper), adhoc_label (ad-hoc SZZ), bug_label (ITS SZZ)
metric_set Feature set used in this evaluation, one of: jit (only just-in-time change metrics), static (only static source code metrics), pmd (only pmd warnings), jit_static_pmd (combined feature set)
rf_f1 Random Forest model performance metric F-measure
rf_roc_auc Random Forest model performance metric AUC
lr_f1 Elastic Net Logistic Regression model performance metric F-measure
lr_roc_auc Elastic Net Logistic Regression model performance metric AUC
rf_ub Random Forest cost model upper bound
rf_lb Random Forest cost model lower bound
lr_ub Elastic Net Logistic Regression cost model upper bound
lr_lb Elastic Net Logistic Regression cost model lower bound
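As a rough illustration of how these fields fit together, the following sketch averages one performance metric per feature set for a single label, using only the Python standard library. The sample rows are made up, and it is an assumption that the real file is comma-separated with a header row matching the field names above; `mean_by_metric_set` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows that only mimic the documented layout (subset of columns).
SAMPLE = """project,label,metric_set,rf_f1,rf_roc_auc
projA,bug_label,jit,0.40,0.70
projB,bug_label,jit,0.50,0.80
projA,bug_label,jit_static_pmd,0.60,0.90
projA,adhoc_label,jit,0.30,0.60
"""


def mean_by_metric_set(rows, label, column):
    """Average `column` over all projects, grouped by metric_set, for one label."""
    groups = defaultdict(list)
    for row in rows:
        if row["label"] == label:
            groups[row["metric_set"]].append(float(row[column]))
    return {name: mean(values) for name, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = mean_by_metric_set(rows, "bug_label", "rf_f1")
print(summary)
```

To try this on the real data, the `io.StringIO(SAMPLE)` argument could be replaced with an opened file handle for train_test_all.csv in the data folder.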

interval_mean_*.csv

This file contains the model evaluation for the interval approach. There is one CSV file for each project.

Field Description
project Name of the project
train_size Number of instances in the training data
train_pos Number of positive (bug-inducing) instances in the training data
test_size Number of instances in the test data
test_pos Number of positive (bug-inducing) instances in the test data
train_start Start date of the training data
train_end End date of the training data
test_start Start date of the test data
test_end End date of the test data
features Feature set used in this evaluation, one of: jit (only just-in-time change metrics), static (only static source code metrics), pmd (only pmd warnings), jit_static_pmd (combined feature set)
label Dependent variable, one of: label_adhoc (ad-hoc SZZ), label_bug (ITS SZZ)
ignore_dates Unused, always true
use_smote If SMOTE sampling was used, always true
metric_set Duplicate of features
rf_f1 Random Forest model performance metric F-measure
rf_roc_auc Random Forest model performance metric AUC
lr_f1 Elastic Net Logistic Regression model performance metric F-measure
lr_roc_auc Elastic Net Logistic Regression model performance metric AUC
rf_ub Random Forest cost model upper bound
rf_lb Random Forest cost model lower bound
lr_ub Elastic Net Logistic Regression cost model upper bound
lr_lb Elastic Net Logistic Regression cost model lower bound
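The size and positive-count fields can be combined to see how imbalanced each interval is. The sketch below computes the share of bug-inducing instances in the training and test data per row. The sample rows are made up, the comma-separated layout with a header row is an assumption, and `positive_ratios` is a hypothetical helper. Values are parsed with `float` in case the file stores them as decimals.

```python
import csv
import io

# Made-up rows that only mimic the documented layout (subset of columns).
SAMPLE = """project,train_size,train_pos,test_size,test_pos,features,label
projA,1000,100,200,10,jit,label_bug
projA,1200,150,200,30,jit,label_bug
"""


def positive_ratios(row):
    """Return (train ratio, test ratio) of bug-inducing instances for one row."""
    train = float(row["train_pos"]) / float(row["train_size"])
    test = float(row["test_pos"]) / float(row["test_size"])
    return train, test


ratios = [positive_ratios(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
for train, test in ratios:
    print(f"train: {train:.3f}  test: {test:.3f}")
```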

jit_sn_*.csv

Contained in the optional ZIP file. There is one CSV for every project. The file contains the raw mined data including all features and meta information.

Field Description
commit Revision hash of the commit
committer_date Datetime of the commit (UTC)
file Name of the file that was changed
comm, adev, ddev, add, del, own, minor, sctr, nd, entropy, la, ld, cexp, rexp, sexp, nuc, age, oexp, exp, nsctr, ncomm, nadev, nddev, lt, fix_bug JIT Features
parent_*, current_*, delta_* [PDA, LOC, CLOC, PUA, McCC, LLOC, LDC, NOS, MISM, CCL, TNOS, TLLOC, NLE, CI, HPL, MI, HPV, CD, NOI, NUMPAR, MISEI, CC, LLDC, NII, CCO, CLC, TCD, NL, TLOC, CLLC, TCLOC, MIMS, HDIF, DLOC, NLM, DIT, NPA, TNLPM, TNLA, NLA, AD, TNLPA, NM, TNG, NLPM, TNM, NOC, NOD, NOP, NLS, NG, TNLG, CBOI, RFC, NLG, TNLS, TNA, NLPA, NOA, WMC, NPM, TNPM, TNS, NA, LCOM5, NS, CBO, TNLM, TNPA] Static Features
parent_*, current_*, delta_* [PMD_ABSALIL, PMD_ADLIBDC, PMD_AMUO, PMD_ATG, PMD_AUHCIP, PMD_AUOV, PMD_BII, PMD_BI, PMD_BNC, PMD_CRS, PMD_CSR, PMD_CCEWTA, PMD_CIS, PMD_DCTR, PMD_DUFTFLI, PMD_DCL, PMD_ECB, PMD_EFB, PMD_EIS, PMD_EmSB, PMD_ESNIL, PMD_ESI, PMD_ESS, PMD_ESB, PMD_ETB, PMD_EWS, PMD_EO, PMD_FLSBWL, PMD_JI, PMD_MNC, PMD_OBEAH, PMD_RFFB, PMD_UIS, PMD_UCT, PMD_UNCIE, PMD_UOOI, PMD_UOM, PMD_FLMUB, PMD_IESMUB, PMD_ISMUB, PMD_WLMUB, PMD_CTCNSE, PMD_PCI, PMD_AIO, PMD_AAA, PMD_APMP, PMD_AUNC, PMD_DP, PMD_DNCGCE, PMD_DIS, PMD_ODPL, PMD_SOE, PMD_UC, PMD_ACWAM, PMD_AbCWAM, PMD_ATNFS, PMD_ACI, PMD_AICICC, PMD_APFIFC, PMD_APMIFCNE, PMD_ARP, PMD_ASAML, PMD_BC, PMD_CWOPCSBF, PMD_ClR, PMD_CCOM, PMD_DLNLISS, PMD_EMIACSBA, PMD_EN, PMD_FDSBASOC, PMD_FFCBS, PMD_IO, PMD_IF, PMD_ITGC, PMD_LI, PMD_MBIS, PMD_MSMINIC, PMD_NCLISS, PMD_NSI, PMD_NTSS, PMD_OTAC, PMD_PLFICIC, PMD_PLFIC, PMD_PST, PMD_REARTN, PMD_SDFNL, PMD_SBE, PMD_SBR, PMD_SC, PMD_SF, PMD_SSSHD, PMD_TFBFASS, PMD_UEC, PMD_UEM, PMD_ULBR, PMD_USDF, PMD_UCIE, PMD_ULWCC, PMD_UNAION, PMD_UV, PMD_ACF, PMD_EF, PMD_FDNCSF, PMD_FOCSF, PMD_FO, PMD_FSBP, PMD_DIJL, PMD_DI, PMD_IFSP, PMD_TMSI, PMD_UFQN, PMD_DNCSE, PMD_LHNC, PMD_LISNC, PMD_MDBASBNC, PMD_RINC, PMD_RSINC, PMD_SEJBFSBF, PMD_JUASIM, PMD_JUS, PMD_JUSS, PMD_JUTCTMA, PMD_JUTSIA, PMD_SBA, PMD_TCWTC, PMD_UBA, PMD_UAEIOAT, PMD_UANIOAT, PMD_UASIOAT, PMD_UATIOAE, PMD_GDL, PMD_GLS, PMD_PL, PMD_UCEL, PMD_APST, PMD_GLSJU, PMD_LINSF, PMD_MTOL, PMD_SP, PMD_MSVUID, PMD_ADS, PMD_AFNMMN, PMD_AFNMTN, PMD_BGMN, PMD_CNC, PMD_GN, PMD_MeNC, PMD_MWSNAEC, PMD_NP, PMD_PC, PMD_SCN, PMD_SMN, PMD_SCFN, PMD_SEMN, PMD_SHMN, PMD_VNC, PMD_AES, PMD_AAL, PMD_RFI, PMD_UWOC, PMD_UALIOV, PMD_UAAL, PMD_USBFSA, PMD_AISD, PMD_MRIA, PMD_ACGE, PMD_ACNPE, PMD_ACT, PMD_ALEI, PMD_ARE, PMD_ATNIOSE, PMD_ATNPE, PMD_ATRET, PMD_DNEJLE, PMD_DNTEIF, PMD_EAFC, PMD_ADL, PMD_ASBF, PMD_CASR, PMD_CLA, PMD_ISB, PMD_SBIWC, PMD_StI, PMD_STS, PMD_UCC, PMD_UETCS, PMD_ClMMIC, PMD_LoC, PMD_SiDTE, PMD_UnI, PMD_ULV, PMD_UPF, PMD_UPM] PMD Features (Rules)
system_WD (System/WD), file_system_sum_WD (File/System/WD), author_delta_sum_WD (Author/Delta/WD) PMD Features (Warning density based)
previous_inducing Not used in the paper, marks files that were previously bug-inducing
pascarella_commit Pascarella label (from pascarella replication kit)
pascarella_file Pascarella label (from pascarella replication kit)
label_adhoc Ad-hoc SZZ
label_bug ITS SZZ
adhoc_**, JIRAKEY*_* Bug Matrix, contains ad-hoc and ITS SZZ, is 1 if the file was inducing for the bug. The field name contains the revision_hash and datetime of the bug-fixing commit and the issue id in the case of Jira (ITS SZZ).
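Because one file holds several feature families, it can help to split the column names into groups before building a feature set. The sketch below does this by prefix. It assumes that the parent_*, current_* and delta_* columns are named by joining the prefix and the metric name (for example current_LOC or delta_PMD_ABSALIL), which is inferred from the notation above and not verified against the files. `group_columns` and the example header are hypothetical.

```python
VERSION_PREFIXES = ("parent_", "current_", "delta_")


def group_columns(columns):
    """Split column names into PMD rule features, static features, and the rest."""
    groups = {"pmd_rules": [], "static": [], "other": []}
    for name in columns:
        for prefix in VERSION_PREFIXES:
            if name.startswith(prefix):
                metric = name[len(prefix):]
                # PMD rule columns carry the PMD_ marker after the prefix.
                key = "pmd_rules" if metric.startswith("PMD_") else "static"
                groups[key].append(name)
                break
        else:
            # No version prefix: JIT features, labels, meta information, ...
            groups["other"].append(name)
    return groups


# Hypothetical, heavily shortened header for illustration.
header = ["commit", "file", "la", "parent_LOC", "current_LOC", "delta_LOC",
          "delta_PMD_ABSALIL", "system_WD", "label_bug"]
groups = group_columns(header)
print(groups)
```

In practice the header would come from the first row of a jit_sn_*.csv file, for example via `csv.reader`.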
