Static source code metrics and static analysis warnings for fine-grained just-in-time defect prediction
This replication kit contains the Jupyter notebooks used to aggregate the results of the mining process. We describe all steps needed for a reproduction. However, due to the nature and size of the data, not everything can be included in this replication kit.
The raw data used in this study was collected with the SmartSHARK infrastructure over a multi-year period. For every commit and every file, static source code metrics and static analysis warnings were collected, among other features. A dump of the resulting MongoDB database can be found here. The SmartSHARK documentation contains the information needed to create this dataset; however, a production deployment with a connection to an HPC cluster is advisable due to the computation time required.
The SmartSHARK database contains, in GridFS, a snapshot of the Git repositories of the projects at the time the data was mined via SmartSHARK. After extracting the repositories from GridFS to a local path, the extraction can be started.
For fine-grained just-in-time defect prediction we accumulate a large amount of data per file over its full history in the commit graph. Therefore, a special tool is needed to extract this information correctly. The tool can be found here. It is a constantly evolving research prototype that uses breadth-first search, path slicing, and a stack per path to traverse every possible path in a commit graph while keeping a separate state for each path.
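The traversal idea can be illustrated with a minimal sketch (this is not the actual tool, only an illustration of breadth-first search with a separate state per path over a toy commit graph):

```python
from collections import deque

def all_paths(children, root):
    """Enumerate every root-to-head path in a commit DAG via BFS.

    `children` maps a commit hash to the list of its child commits;
    each queue entry carries its own path state, mirroring the
    stack-per-path idea of the mining tool.
    """
    paths = []
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        succs = children.get(path[-1], [])
        if not succs:  # head of a branch: this path is complete
            paths.append(path)
        for child in succs:
            queue.append(path + [child])
    return paths

# Toy commit graph: c0 branches into c1 and c2, which merge at c3.
graph = {"c0": ["c1", "c2"], "c1": ["c3"], "c2": ["c3"]}
print(all_paths(graph, "c0"))  # → [['c0', 'c1', 'c3'], ['c0', 'c2', 'c3']]
```

The real tool additionally slices paths and keeps per-path metric states, but the path enumeration follows this pattern.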
This tool was used to create the raw mined data that can be downloaded separately.
This download contains the raw mined data used to generate the evaluation data already contained in this replication kit. You do not need to download this if you only want to re-create the plots and tables.
```bash
cd data
wget https://user.informatik.uni-goettingen.de/~trautsch2/icsme2020_mining_data.zip
unzip icsme2020_mining_data.zip
```

```bash
python -m venv .
source bin/activate
pip install -r requirements.txt
```

```bash
source bin/activate
cd notebooks
jupyter lab
```
JupyterLab is used to aggregate the results and create the plots. To aggregate the raw mining data, run the TrainTest.ipynb and Interval.ipynb notebooks. Both aggregate the jit_sn_*.csv files into the aggregated results of the model evaluation.
The plots are generated by TrainTestEval.ipynb and IntervalEval.ipynb.
Overview.ipynb and Correlation.ipynb provide an overview of the data and the top 10 features.
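The aggregation step performed by the notebooks can be sketched as follows (a minimal stand-in, not the notebook code: the file names imitate the jit_sn_*.csv pattern, and only the `project` and `rf_f1` columns from the tables below are used, with made-up values):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two stand-in result files in place of the real jit_sn_*.csv.
tmp = tempfile.mkdtemp()
pd.DataFrame({"project": ["ant-ivy", "ant-ivy"], "rf_f1": [0.41, 0.45]}).to_csv(
    os.path.join(tmp, "jit_sn_1.csv"), index=False)
pd.DataFrame({"project": ["calcite", "calcite"], "rf_f1": [0.38, 0.40]}).to_csv(
    os.path.join(tmp, "jit_sn_2.csv"), index=False)

# Concatenate all per-run files and aggregate the performance per project.
frames = [pd.read_csv(f) for f in sorted(glob.glob(os.path.join(tmp, "jit_sn_*.csv")))]
results = pd.concat(frames, ignore_index=True)
summary = results.groupby("project")["rf_f1"].mean().round(3)
print(summary)
```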
This replication kit contains the following folders:
- notebooks contains the Jupyter notebooks
- data contains the train/test model evaluation as well as the interval approach model evaluation; the raw mining data (if downloaded) is also placed here
- figures is the output directory for the figures of the paper generated by the notebooks
Three types of CSV files are used or generated by this replication kit. The first two contain the model evaluation for the train/test split and the interval approach. The last one is the mined data.
This file contains the model evaluation for the train/test split. It contains the name of the project, label, feature set as well as the model performance metrics and upper and lower bounds for the cost model.
Field | Description |
---|---|
project | Name of the project |
label | Dependent variable, one of: pascarella_commit (same as the Pascarella replication kit, not used in the paper), adhoc_label (ad-hoc SZZ), bug_label (ITS SZZ) |
metric_set | Feature set used in this evaluation, one of: jit (only just-in-time change metrics), static (only static source code metrics), pmd (only pmd warnings), jit_static_pmd (combined feature set) |
rf_f1 | Random Forest model performance metric F-measure |
rf_roc_auc | Random Forest model performance metric AUC |
lr_f1 | Elastic Net Logistic Regression model performance metric F-measure |
lr_roc_auc | Elastic Net Logistic Regression model performance metric AUC |
rf_ub | Random Forest cost model upper bound |
rf_lb | Random Forest cost model lower bound |
lr_ub | Elastic Net Logistic Regression cost model upper bound |
lr_lb | Elastic Net Logistic Regression cost model lower bound |
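A minimal sketch of how this file can be queried, e.g. to compare feature sets per project (the rows and values here are invented; only the column names follow the table above):

```python
import pandas as pd

# Invented example rows using the columns described above.
df = pd.DataFrame({
    "project": ["ant-ivy", "ant-ivy", "calcite", "calcite"],
    "label": ["bug_label"] * 4,
    "metric_set": ["jit", "jit_static_pmd", "jit", "jit_static_pmd"],
    "rf_roc_auc": [0.70, 0.74, 0.66, 0.69],
})

# One row per project, one column per feature set.
pivot = df.pivot(index="project", columns="metric_set", values="rf_roc_auc")

# For how many projects does the combined feature set beat jit alone?
improved = int((pivot["jit_static_pmd"] > pivot["jit"]).sum())
print(improved)
```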
This file contains the model evaluation for the interval approach. There is one CSV file for each project.
Field | Description |
---|---|
project | Name of the project |
train_size | Number of instances in the training data |
train_pos | Number of true positives, i.e., bug-inducing instances in the training data |
test_size | Number of instances in the test data |
test_pos | Number of true positives, i.e., bug-inducing instances in the test data |
train_start | Start date of the training data |
train_end | End date of the training data |
test_start | Start date of the test data |
test_end | End date of the test data |
features | Feature set used in this evaluation, one of: jit (only just-in-time change metrics), static (only static source code metrics), pmd (only pmd warnings), jit_static_pmd (combined feature set) |
label | Dependent variable, one of: label_adhoc (ad-hoc SZZ), label_bug (ITS SZZ) |
ignore_dates | Unused, always true |
use_smote | If SMOTE sampling was used, always true |
metric_set | Duplicate of features |
rf_f1 | Random Forest model performance metric F-measure |
rf_roc_auc | Random Forest model performance metric AUC |
lr_f1 | Elastic Net Logistic Regression model performance metric F-measure |
lr_roc_auc | Elastic Net Logistic Regression model performance metric AUC |
rf_ub | Random Forest cost model upper bound |
rf_lb | Random Forest cost model lower bound |
lr_ub | Elastic Net Logistic Regression cost model upper bound |
lr_lb | Elastic Net Logistic Regression cost model lower bound |
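The interval approach evaluated in these files trains on one time window and tests on the following one, which is what the train_start/train_end and test_start/test_end fields record. A minimal sketch of such a chronological split (dates, labels, and window length are illustrative, not taken from the study):

```python
from datetime import datetime, timedelta

# Illustrative dated instances: (commit timestamp, bug-inducing label).
instances = [
    (datetime(2019, 1, 5), 0),
    (datetime(2019, 2, 10), 1),
    (datetime(2019, 4, 1), 0),
    (datetime(2019, 5, 20), 1),
]

def interval_split(data, start, window):
    """Split into one training window and the immediately following test window."""
    train_end = start + window
    test_end = train_end + window
    train = [x for x in data if start <= x[0] < train_end]
    test = [x for x in data if train_end <= x[0] < test_end]
    return train, test

train, test = interval_split(instances, datetime(2019, 1, 1), timedelta(days=90))
# train_size/test_size and train_pos/test_pos as in the table above:
print(len(train), sum(x[1] for x in train), len(test), sum(x[1] for x in test))
```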
Contained in the optional ZIP file. There is one CSV for every project. The file contains the raw mined data including all features and meta information.
Field | Description |
---|---|
commit | Revision hash of the commit |
committer_date | Datetime of the commit (UTC) |
file | Name of the file that was changed |
comm, adev, ddev, add, del, own, minor, sctr, nd, entropy, la, ld, cexp, rexp, sexp, nuc, age, oexp, exp, nsctr, ncomm, nadev, nddev, lt, fix_bug | JIT Features |
parent_*, current_*, delta_* [PDA, LOC, CLOC, PUA, McCC, LLOC, LDC, NOS, MISM, CCL, TNOS, TLLOC, NLE, CI, HPL, MI, HPV, CD, NOI, NUMPAR, MISEI, CC, LLDC, NII, CCO, CLC, TCD, NL, TLOC, CLLC, TCLOC, MIMS, HDIF, DLOC, NLM, DIT, NPA, TNLPM, TNLA, NLA, AD, TNLPA, NM, TNG, NLPM, TNM, NOC, NOD, NOP, NLS, NG, TNLG, CBOI, RFC, NLG, TNLS, TNA, NLPA, NOA, WMC, NPM, TNPM, TNS, NA, LCOM5, NS, CBO, TNLM, TNPA] | Static Features |
parent_*, current_*, delta_* [PMD_ABSALIL, PMD_ADLIBDC, PMD_AMUO, PMD_ATG, PMD_AUHCIP, PMD_AUOV, PMD_BII, PMD_BI, PMD_BNC, PMD_CRS, PMD_CSR, PMD_CCEWTA, PMD_CIS, PMD_DCTR, PMD_DUFTFLI, PMD_DCL, PMD_ECB, PMD_EFB, PMD_EIS, PMD_EmSB, PMD_ESNIL, PMD_ESI, PMD_ESS, PMD_ESB, PMD_ETB, PMD_EWS, PMD_EO, PMD_FLSBWL, PMD_JI, PMD_MNC, PMD_OBEAH, PMD_RFFB, PMD_UIS, PMD_UCT, PMD_UNCIE, PMD_UOOI, PMD_UOM, PMD_FLMUB, PMD_IESMUB, PMD_ISMUB, PMD_WLMUB, PMD_CTCNSE, PMD_PCI, PMD_AIO, PMD_AAA, PMD_APMP, PMD_AUNC, PMD_DP, PMD_DNCGCE, PMD_DIS, PMD_ODPL, PMD_SOE, PMD_UC, PMD_ACWAM, PMD_AbCWAM, PMD_ATNFS, PMD_ACI, PMD_AICICC, PMD_APFIFC, PMD_APMIFCNE, PMD_ARP, PMD_ASAML, PMD_BC, PMD_CWOPCSBF, PMD_ClR, PMD_CCOM, PMD_DLNLISS, PMD_EMIACSBA, PMD_EN, PMD_FDSBASOC, PMD_FFCBS, PMD_IO, PMD_IF, PMD_ITGC, PMD_LI, PMD_MBIS, PMD_MSMINIC, PMD_NCLISS, PMD_NSI, PMD_NTSS, PMD_OTAC, PMD_PLFICIC, PMD_PLFIC, PMD_PST, PMD_REARTN, PMD_SDFNL, PMD_SBE, PMD_SBR, PMD_SC, PMD_SF, PMD_SSSHD, PMD_TFBFASS, PMD_UEC, PMD_UEM, PMD_ULBR, PMD_USDF, PMD_UCIE, PMD_ULWCC, PMD_UNAION, PMD_UV, PMD_ACF, PMD_EF, PMD_FDNCSF, PMD_FOCSF, PMD_FO, PMD_FSBP, PMD_DIJL, PMD_DI, PMD_IFSP, PMD_TMSI, PMD_UFQN, PMD_DNCSE, PMD_LHNC, PMD_LISNC, PMD_MDBASBNC, PMD_RINC, PMD_RSINC, PMD_SEJBFSBF, PMD_JUASIM, PMD_JUS, PMD_JUSS, PMD_JUTCTMA, PMD_JUTSIA, PMD_SBA, PMD_TCWTC, PMD_UBA, PMD_UAEIOAT, PMD_UANIOAT, PMD_UASIOAT, PMD_UATIOAE, PMD_GDL, PMD_GLS, PMD_PL, PMD_UCEL, PMD_APST, PMD_GLSJU, PMD_LINSF, PMD_MTOL, PMD_SP, PMD_MSVUID, PMD_ADS, PMD_AFNMMN, PMD_AFNMTN, PMD_BGMN, PMD_CNC, PMD_GN, PMD_MeNC, PMD_MWSNAEC, PMD_NP, PMD_PC, PMD_SCN, PMD_SMN, PMD_SCFN, PMD_SEMN, PMD_SHMN, PMD_VNC, PMD_AES, PMD_AAL, PMD_RFI, PMD_UWOC, PMD_UALIOV, PMD_UAAL, PMD_USBFSA, PMD_AISD, PMD_MRIA, PMD_ACGE, PMD_ACNPE, PMD_ACT, PMD_ALEI, PMD_ARE, PMD_ATNIOSE, PMD_ATNPE, PMD_ATRET, PMD_DNEJLE, PMD_DNTEIF, PMD_EAFC, PMD_ADL, PMD_ASBF, PMD_CASR, PMD_CLA, PMD_ISB, PMD_SBIWC, PMD_StI, PMD_STS, PMD_UCC, PMD_UETCS, PMD_ClMMIC, PMD_LoC, PMD_SiDTE, PMD_UnI, PMD_ULV, PMD_UPF, PMD_UPM] | PMD Features (Rules) |
system_WD (System/WD), file_system_sum_WD (File/System/WD), author_delta_sum_WD (Author/Delta/WD) | PMD Features (Warning density based) |
previous_inducing | Not used in the paper, marks files that were previously bug-inducing |
pascarella_commit | Pascarella commit label (from the Pascarella replication kit) |
pascarella_file | Pascarella file label (from the Pascarella replication kit) |
label_adhoc | Ad-hoc SZZ |
label_bug | ITS SZZ |
adhoc_**, JIRAKEY*_* | Bug matrix containing ad-hoc and ITS SZZ results; the value is 1 if the file was inducing for the bug. The field name contains the revision hash and datetime of the bug-fixing commit and, in the case of Jira (ITS SZZ), the issue id. |
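How the bug-matrix columns relate to the label fields can be sketched like this (the file names, column names, and Jira key below are invented examples of the adhoc_* / JIRAKEY*_* patterns):

```python
import pandas as pd

# Invented example rows; the real column names encode the revision hash,
# datetime of the bug-fixing commit, and (for ITS SZZ) the Jira issue id.
df = pd.DataFrame({
    "file": ["src/A.java", "src/B.java", "src/C.java"],
    "adhoc_ab12_2019-01-02": [1, 0, 0],
    "IVY-123_cd34_2019-02-03": [1, 1, 0],
})

adhoc_cols = [c for c in df.columns if c.startswith("adhoc_")]
its_cols = [c for c in df.columns if not c.startswith(("adhoc_", "file"))]

# A file is labelled bug-inducing if it induced at least one bug.
df["label_adhoc"] = (df[adhoc_cols].sum(axis=1) > 0).astype(int)
df["label_bug"] = (df[its_cols].sum(axis=1) > 0).astype(int)
print(df[["file", "label_adhoc", "label_bug"]])
```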