Filters by file size #201
Conversation
I was thinking about this. Actually, even in the case of small files there could be high potential for abstracting common parts. I'm thinking about an artificial example such as:
Or what about if there's a big file containing, let's say, … I agree that for small files there could be many false positives. But I guess there's some sort of similarity threshold being used somewhere, and maybe this threshold could be dynamic as a function of the size of the file being compared. Again, I don't know what kind of algorithm it uses, and I don't even know whether the function used to compute the similarity score is commutative or whatever else; I'm just letting ideas flow out 😂.
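Purely to illustrate that idea, a size-dependent threshold could look something like the sketch below. The breakpoints, the `dynamicThreshold` name, and measuring size in lines are all assumptions made up for this example; this is not how gemini actually computes similarity.

```go
package main

import "fmt"

// dynamicThreshold sketches the idea from the discussion: require a higher
// similarity score before reporting small files as duplicates, and relax the
// requirement as files grow. All numbers here are invented for illustration.
func dynamicThreshold(fileSizeLines int) float64 {
	switch {
	case fileSizeLines < 20:
		return 0.95 // tiny files must be near-identical to be worth reporting
	case fileSizeLines < 100:
		return 0.85
	default:
		return 0.75 // large files can share less and still be interesting
	}
}

func main() {
	for _, size := range []int{10, 50, 500} {
		fmt.Printf("%4d lines -> threshold %.2f\n", size, dynamicThreshold(size))
	}
}
```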
I agree with @se7entyse7en, I wouldn't assume that similarity reports on small files are not valuable; that will depend on the contents. If the current code produces too many false positives, then let's remove the small files as a short-term fix.
It looks like this size limit makes the tests fail.
@se7entyse7en it's a good note. But currently it isn't feasible to cover all the cases, and I'm not sure we will ever have time for that. This solution improves the most common case to fix the current problem with false positives. I totally agree that the way we calculate similarities should depend on the size of the file and also on the size of the dataset. For example, right now on 600 small repos it produces 9 MB of text with results. At that scale, small improvements like the one you suggested above don't make much sense, but on a smaller dataset they most probably do. Another note: looking into your artificial example, I think function-level similarity should help there. In my opinion we should improve gemini to the state where it provides reasonable results for our most important use case (one medium-sized organization), and only after that should we start working on edge cases. What do you think?
@carlosms thanks for the heads up about CI! I missed it.
Sure! As @carlosms said, I wouldn't block this PR from merging, but let's open an issue so that we're aware of this limitation and that its removal is a nice-to-have.
Sorry, I was not clear enough. I think it doesn't make sense to open such an issue because in its current state it has the status "won't fix". I'm not aware of any plans to do that in any reasonable future.
👍 but if it's not going to change, I think the file size threshold should be documented.
I have created an issue about configuration: #203 |
On the real test dataset it produces 1500 duplicates and is not valuable for a user. Signed-off-by: Maxim Sukharev <max@smacker.ru>
On the real dataset, too-small files produce too many false positives. Also, very small files as duplicates aren't very valuable; it doesn't make sense to abstract common code for them. Signed-off-by: Maxim Sukharev <max@smacker.ru>
Signed-off-by: Maxim Sukharev <max@smacker.ru>
Without too-small files gemini is able to find one more similar file. I validated the content of all results manually. Signed-off-by: Maxim Sukharev <max@smacker.ru>
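Taken together, the commits above boil down to skipping files below a minimum size before they ever reach the similarity comparison. A minimal sketch of that kind of filter in Go follows; the `minFileSizeBytes` cutoff, the `file` struct, and the function name are assumptions for illustration, not the actual code changed in this PR.

```go
package main

import "fmt"

// minFileSizeBytes is an assumed cutoff; the real threshold used by the PR
// may differ.
const minFileSizeBytes = 500

type file struct {
	path string
	size int // size in bytes
}

// filterSmallFiles drops files below the cutoff so they never reach the
// similarity comparison, avoiding the flood of low-value duplicate reports
// described in the commit messages.
func filterSmallFiles(files []file) []file {
	kept := files[:0]
	for _, f := range files {
		if f.size >= minFileSizeBytes {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	files := []file{
		{"LICENSE", 120},
		{"main.go", 2048},
		{"empty_test.go", 60},
	}
	fmt.Println(filterSmallFiles(files)) // only main.go survives the filter
}
```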
Tests fixed: