
Filters by file size #201

Merged
merged 4 commits into src-d:master on Feb 19, 2019

Conversation

@smacker (Contributor) commented Feb 15, 2019

  1. Remove empty files from hashing

On the real test dataset, empty files produce 1500 duplicates, which is not valuable to a user.

  2. Exclude very small files from similarity hashing

On the real dataset, very small files produce too many false positives.
Also, very small files reported as duplicates aren't very valuable: it doesn't
make sense to abstract common code from them. (A rough sketch of both filters follows below.)
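A minimal sketch of the two pre-filters, written in Python purely for illustration; the helper names and the threshold value are assumptions, not taken from the gemini codebase:

MIN_SIMILARITY_SIZE = 50  # hypothetical cut-off for "very small" files, in bytes

def files_for_duplicate_hashing(files):
    # files: iterable of (path, content) pairs; drop empty files entirely
    return [(path, content) for path, content in files if len(content) > 0]

def files_for_similarity_hashing(files):
    # additionally drop very small files to cut down on false positives
    return [(path, content) for path, content in files
            if len(content) >= MIN_SIMILARITY_SIZE]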

@se7entyse7en

Also, very small files reported as duplicates aren't very valuable: it doesn't
make sense to abstract common code from them.

I was thinking about this. Actually, even in the case of small files there could be high potential for abstracting common parts. I'm thinking of an artificial example such as:

utils_1.py:

def hello_world():
    return "Hello world!"

utils_2.py:

def hello_world(who):
    return f"Hello world ${who}!"

Or what if there's a big file containing, let's say, func_1, ..., func_n and a small file containing exactly func_1?

I agree that for small files there could be many false positives. But I guess there's some sort of similarity threshold being used somewhere, and maybe this threshold could be dynamic as a function of the size of the files being compared. Again, I don't know what kind of algorithm it uses, and I don't even know whether the function used to compute the similarity score is commutative or whatever else; I'm just letting ideas flow out 😂.
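One possible shape of that "dynamic threshold" idea, sketched in Python for illustration only; the numbers and function names are made-up assumptions and this is not how gemini actually computes similarity:

def similarity_threshold(size_bytes, base=0.8, strictest=0.95, small=200):
    # below `small` bytes, interpolate from `strictest` (at 0 bytes) down to `base`
    if size_bytes >= small:
        return base
    return base + (strictest - base) * (1.0 - size_bytes / small)

def report_as_similar(score, size_a, size_b):
    # use the threshold of the smaller file, i.e. the stricter requirement
    return score >= similarity_threshold(min(size_a, size_b))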

@carlosms (Contributor)

I agree with @se7entyse7en: I wouldn't assume that similarity reports on small files are not valuable; that will depend on the contents.

If the current code produces too many false positives, then let's remove the small files as a short-term fix.
But I would prefer to open an issue to remove this limitation (improving the similarity detection for smaller files), or maybe to make it a configuration flag.

@carlosms (Contributor)

It looks like this size limit makes the tests fail.

@smacker (Contributor, Author) commented Feb 15, 2019

@se7entyse7en that's a good point. But currently it isn't feasible to cover all the cases, and I'm not sure we will ever have time for that. This solution improves the most common case to fix the current problem with false positives.

I totally agree that the way we calculate similarities should depend on the size of the file and also the size of the dataset. For example, right now on 600 small repos it produces 9 MB of text with results. At such a scale, small improvements like the ones you suggested above don't make much sense. But on a smaller dataset they most probably do.

Another note: looking at your artificial example, I think function-level similarity should help there (a rough sketch follows below).

In my opinion we should improve gemini to the state where it provides reasonable results for our most important use case (one mid-size organization), and only after that should we start working on edge cases.

What do you think?
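To illustrate the function-level point (not gemini's actual approach): hashing each function separately means a small file containing exactly func_1 would still match the same func_1 inside a big file. A toy Python sketch using exact AST hashes, where real similarity hashing would be approximate rather than exact:

import ast
import hashlib

def function_hashes(source):
    # map each function name to a hash of its AST (positions are not included)
    hashes = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            digest = hashlib.sha1(ast.dump(node).encode("utf-8")).hexdigest()
            hashes[node.name] = digest
    return hashes

# identical functions hash the same whether they live in a big or a small file
big_file = "def func_1():\n    return 1\n\ndef func_2():\n    return 2\n"
small_file = "def func_1():\n    return 1\n"
assert function_hashes(big_file)["func_1"] == function_hashes(small_file)["func_1"]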

@smacker (Contributor, Author) commented Feb 15, 2019

@carlosms thanks for the heads up about CI! I missed it.

@se7entyse7en

@smacker

In my opinion we should improve gemini to the state where it provides reasonable results for our most important use case (one mid-size organization), and only after that should we start working on edge cases.

Sure! As @carlosms said, I wouldn't block this PR from merging, but I would open an issue so that we're aware of this limitation and that its removal is a nice-to-have (NTH).

@smacker (Contributor, Author) commented Feb 19, 2019

Sorry, I was not clear enough. I don't think it makes sense to open such an issue, because in the current state it would have the status "won't fix".
Solving the problem of small/big files and making it adaptive would require lots of work from the ML team, either by improving the algorithm or at least by searching for a range of hyper-params depending on some features (like file size).

I'm not aware of any plans to do that in any reasonable future.
An issue that might make sense should be opened in the "feature-idea" repository, but it should be a much more general issue about improving gemini, not just about solving the problem with small files.

@carlosms (Contributor)

👍 but if it's not going to change, I think the file size threshold should be documented.

@smacker (Contributor, Author) commented Feb 19, 2019

I have created an issue about configuration: #203

Commit messages:

On real test dataset it produces 1500 duplicates and is not valuable for a user.

Signed-off-by: Maxim Sukharev <max@smacker.ru>

On real dataset too small files produce too much false positives. Also very small files as duplicates aren't very valuable, it doesn't make sense to abstract common code for them.

Signed-off-by: Maxim Sukharev <max@smacker.ru>

Signed-off-by: Maxim Sukharev <max@smacker.ru>

without too small files gemini is able to find one more similar file. I validated content of all results manually.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
@smacker (Contributor, Author) commented Feb 19, 2019

Tests fixed:

  • I was applying the limit in function mode too, which was unintentional (see the sketch below).
  • After excluding small files, gemini is able to find one more bigger similar file in our test dataset (results validated manually).
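A rough sketch of that first fix, in Python for illustration only; the mode names, function name, and threshold are assumptions about the behaviour described here, not the actual gemini code:

MIN_FILE_SIZE = 50  # hypothetical cut-off, same idea as in the PR description

def files_to_hash(files, mode):
    # always drop empty files
    non_empty = [(path, content) for path, content in files if len(content) > 0]
    if mode == "function":
        # the size limit should not apply in function mode:
        # small files may still contain functions worth comparing
        return non_empty
    # file mode: additionally drop very small files
    return [(path, content) for path, content in non_empty
            if len(content) >= MIN_FILE_SIZE]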

@smacker smacker merged commit 2871ddc into src-d:master Feb 19, 2019