Future Work

Thomas Schweizer edited this page Oct 18, 2023 · 4 revisions

There are many opportunities for the future of this project. We list a few ideas below.

Research Questions

  • (medium) Synthetic commits:
    • Are untangling tools' performance on synthetic and real commits similar?
    • What are synthetic datasets made of? Calculate tangledness at file, hunk, and line granularity.


  • (low) Measure performance on different granularities
    • Iteratively change parameters to produce a smaller number of groups, or modify the SmartCommit and Flexeme implementations to create 2-3 groups.
    • This also applies to the file-based implementation: if there are too many files, coarsen the granularity to folder or package.
    • Take Flexeme and coarsen it to hunk granularity with the same process as SmartCommit? How much does it help?
    • Refactor the implementation to separate coarsening from clustering so we can evaluate on hunks or lines using the same process for each tool.
  • (low) Help identify the importance of each part by replacing the graph clustering algorithm in SmartCommit and Flexeme with a different one.
  • (medium) Try using the original Flexeme data to calculate the line-based Rand index. This would be great data to have, and we already have the scripts to calculate it. The remaining steps are:
    1. Unzip the PDGs from Flexeme's data
    2. Update the script that translates a PDG into CSV to also export a column with the true label (it’s another attribute of the node)
    3. Run the script that translates a PDG into CSV.
    4. Calculate the rand score between the true label and the grouping.
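
As a rough illustration of step 4, the sketch below computes the (unadjusted) Rand index between the true labels and a tool's grouping, read from a CSV with one row per changed line. The column names `true_label` and `group` and the function names are assumptions made for this example; they are not the names used by the project's scripts. The Rand index is the fraction of line pairs on which both partitions agree (both put the pair together, or both keep it apart).

```python
import csv
import io
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two partitions agree."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        # With fewer than two items there is nothing to disagree on.
        return 1.0
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def rand_index_from_csv(csv_file, true_column="true_label", group_column="group"):
    """Read one row per changed line; column names are hypothetical."""
    rows = list(csv.DictReader(csv_file))
    return rand_index(
        [row[true_column] for row in rows],
        [row[group_column] for row in rows],
    )


# Tiny made-up example: four changed lines, two true concerns, two predicted groups.
example = io.StringIO(
    "line,true_label,group\n"
    "1,a,0\n"
    "2,a,0\n"
    "3,b,0\n"
    "4,b,1\n"
)
score = rand_index_from_csv(example)
print(score)
```

The pairwise loop is quadratic in the number of changed lines, which should be acceptable for individual commits. If scikit-learn is available, `sklearn.metrics.rand_score` computes the same quantity.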


  • (high) There are a few offshoots of Flexeme that should be easier to integrate into the pipeline.
  • (high) There are also a few more recent untangling tools.
    • CoRA (from WangLZX2019, ASE 2019)
    • UTango (2022). No tool is publicly available. We reached out to the authors.
    • ComUnt (2022). No tool is publicly available. We reached out to the authors.
  • (low) Build our own tool: use past changes to predict whether two changes are similar.
  • (low) Add ChatGPT as a tool in the evaluation


  • (high) Add the dataset from "A fine-grained data set and analysis of tangling in bug fixing commits", Herbold et al., 2022. (DONE)
  • (medium) Add the CoRA dataset from "CoRA: Decomposing and Describing Tangled Code Changes for Reviewer", Wang et al., 2019. It contains only 50 commits.