
Filters by file size #201

Merged
merged 4 commits into src-d:master on Feb 19, 2019

Conversation

@smacker (Contributor) commented Feb 15, 2019

  1. Remove empty files from hashing

On the real test dataset, empty files produce 1500 duplicates, which is not valuable to a user.

  2. Exclude very small files from similarity hashing

On the real dataset, very small files produce too many false positives.
Also, very small files reported as duplicates aren't very valuable: it doesn't
make sense to abstract common code from them. (A rough sketch of both filters follows below.)
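A minimal sketch of the two pre-filters, written in Python purely for illustration; the helper names and the threshold value are assumptions, not taken from the gemini codebase:

MIN_SIMILARITY_SIZE = 50  # hypothetical cut-off for "very small" files, in bytes

def files_for_duplicate_hashing(files):
    # files: iterable of (path, content) pairs; drop empty files entirely
    return [(path, content) for path, content in files if len(content) > 0]

def files_for_similarity_hashing(files):
    # additionally drop very small files to cut down on false positives
    return [(path, content) for path, content in files
            if len(content) >= MIN_SIMILARITY_SIZE]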

@se7entyse7en

Also, very small files reported as duplicates aren't very valuable: it doesn't
make sense to abstract common code from them.

I was thinking about this. Actually, even in the case of small files there could be high potential for abstracting common parts. I'm thinking of an artificial example such as:

utils_1.py:

def hello_world():
    return "Hello world!"

utils_2.py:

def hello_world(who):
    return f"Hello world ${who}!"

Or what if there's a big file containing, let's say, func_1, ..., func_n and a small file containing exactly func_1?

I agree that for small files there could be many false positives. But I guess there's some sort of similarity threshold being used somewhere, and maybe this threshold could be dynamic as a function of the size of the files being compared. Again, I don't know what kind of algorithm it uses, and I don't even know whether the function used to compute the similarity score is commutative or whatever else; I'm just letting ideas flow out 😂.
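One possible shape of that "dynamic threshold" idea, sketched in Python for illustration only; the numbers and function names are made-up assumptions and this is not how gemini actually computes similarity:

def similarity_threshold(size_bytes, base=0.8, strictest=0.95, small=200):
    # below `small` bytes, interpolate from `strictest` (at 0 bytes) down to `base`
    if size_bytes >= small:
        return base
    return base + (strictest - base) * (1.0 - size_bytes / small)

def report_as_similar(score, size_a, size_b):
    # use the threshold of the smaller file, i.e. the stricter requirement
    return score >= similarity_threshold(min(size_a, size_b))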

@carlosms (Contributor)

I agree with @se7entyse7en: I wouldn't assume that similarity reports on small files are not valuable; that will depend on the contents.

If the current code produces too many false positives, then let's remove the small files as a short-term fix.
But I would prefer to open an issue to remove this limitation (improving the similarity detection for smaller files), or maybe to make it a configuration flag.

@carlosms (Contributor)

It looks like this size limit makes the tests fail.

@smacker (Contributor, Author) commented Feb 15, 2019

@se7entyse7en that's a good point. But currently it isn't feasible to cover all the cases, and I'm not sure we will ever have time for that. This solution improves the most common case to fix the current problem with false positives.

I totally agree that the way we calculate similarities should depend on the size of the file and also the size of the dataset. For example, right now on 600 small repos it produces 9 MB of text with results. At such a scale, small improvements like the ones you suggested above don't make much sense. But on a smaller dataset they most probably do.

Another note: looking at your artificial example, I think function-level similarity should help there (a rough sketch follows below).

In my opinion we should improve gemini to the state where it provides reasonable results for our most important use case (one mid-size organization), and only after that should we start working on edge cases.

What do you think?
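To illustrate the function-level point (not gemini's actual approach): hashing each function separately means a small file containing exactly func_1 would still match the same func_1 inside a big file. A toy Python sketch using exact AST hashes, where real similarity hashing would be approximate rather than exact:

import ast
import hashlib

def function_hashes(source):
    # map each function name to a hash of its AST (positions are not included)
    hashes = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            digest = hashlib.sha1(ast.dump(node).encode("utf-8")).hexdigest()
            hashes[node.name] = digest
    return hashes

# identical functions hash the same whether they live in a big or a small file
big_file = "def func_1():\n    return 1\n\ndef func_2():\n    return 2\n"
small_file = "def func_1():\n    return 1\n"
assert function_hashes(big_file)["func_1"] == function_hashes(small_file)["func_1"]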

@smacker (Contributor, Author) commented Feb 15, 2019

@carlosms thanks for the heads up about CI! I missed it.

@se7entyse7en

@smacker

In my opinion we should improve gemini to the state where it provides reasonable results for our most important use case (one mid-size organization), and only after that should we start working on edge cases.

Sure! As @carlosms said, I wouldn't block this PR from merging, but I would open an issue so that we're aware of this limitation and that its removal is a nice-to-have (NTH).

@smacker (Contributor, Author) commented Feb 19, 2019

Sorry, I was not clear enough. I don't think it makes sense to open such an issue, because in the current state it would have the status "won't fix".
Solving the problem of small/big files and making it adaptive would require lots of work from the ML team, either by improving the algorithm or at least by searching for a range of hyper-params depending on some features (like file size).

I'm not aware of any plans to do that in any reasonable future.
An issue that might make sense should be opened in the "feature-idea" repository, but it should be a much more general issue about improving gemini, not just about solving the problem with small files.

@carlosms (Contributor)

👍 but if it's not going to change, I think the file size threshold should be documented.

@smacker (Contributor, Author) commented Feb 19, 2019

I have created an issue about configuration: #203

Commit messages:

On real test dataset it produces 1500 duplicates and is not valuable for a user.

Signed-off-by: Maxim Sukharev <max@smacker.ru>

On real dataset too small files produce too much false positives. Also very small files as duplicates aren't very valuable, it doesn't make sense to abstract common code for them.

Signed-off-by: Maxim Sukharev <max@smacker.ru>

Signed-off-by: Maxim Sukharev <max@smacker.ru>

without too small files gemini is able to find one more similar file. I validated content of all results manually.

Signed-off-by: Maxim Sukharev <max@smacker.ru>
@smacker (Contributor, Author) commented Feb 19, 2019

Tests fixed:

  • I was applying the limit in function mode too, which was unintentional (see the sketch below).
  • After excluding small files, gemini is able to find one more bigger similar file in our test dataset (results validated manually).
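A rough sketch of that first fix, in Python for illustration only; the mode names, function name, and threshold are assumptions about the behaviour described here, not the actual gemini code:

MIN_FILE_SIZE = 50  # hypothetical cut-off, same idea as in the PR description

def files_to_hash(files, mode):
    # always drop empty files
    non_empty = [(path, content) for path, content in files if len(content) > 0]
    if mode == "function":
        # the size limit should not apply in function mode:
        # small files may still contain functions worth comparing
        return non_empty
    # file mode: additionally drop very small files
    return [(path, content) for path, content in non_empty
            if len(content) >= MIN_FILE_SIZE]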

@smacker smacker merged commit 2871ddc into src-d:master Feb 19, 2019