Precision over 100% reported if ground truth contains pairs of identical ids #20
Some more data. Regarding pyJedAI/src/pyjedai/clustering.py, line 366 in 2e41af4, I printed eval_obj.__dict__ to inspect the evaluation counters.
Ah, I got it. We have matching pairs of the same id in our ground truth, i.e. rows like "id1|id1" in the CSV file. Thinking about it, this is not strictly incorrect: an entity obviously is identical to itself, but I also see that the ground truth is not as clean as it should be. I will clean up the ground truth, but an additional approach might be to check for identity of the two ids in pyJedAI/src/pyjedai/clustering.py, line 362 in 2e41af4, and in that case not increase true_positives, to make the evaluation more robust (see the sketch below). Of course, one would also need to verify for the clean-clean ER case, and for the other steps' evaluations, that the calculations remain correct and consistent.
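For intuition on why this can push precision above 100%: if self-pairs in the ground truth end up counted as true positives while the denominator only reflects the non-trivial pairs actually produced by the clustering, true_positives can exceed the number of predictions, and precision = TP / predictions goes above 1. Below is a minimal sketch of the kind of guard suggested above, with purely illustrative names rather than the actual pyJedAI code:

```python
def count_true_positives(predicted_pairs, ground_truth_pairs):
    """Illustrative sketch only: count true positives while skipping
    degenerate self-pairs such as ("id1", "id1") in the ground truth."""
    # Normalise pair order so (a, b) and (b, a) compare equal,
    # and drop self-pairs from the ground truth up front.
    gt = {tuple(sorted(pair)) for pair in ground_truth_pairs if pair[0] != pair[1]}

    true_positives = 0
    for id1, id2 in predicted_pairs:
        if id1 == id2:
            # An entity is trivially identical to itself; not counting it
            # keeps the numerator consistent with the real matches.
            continue
        if tuple(sorted((id1, id2))) in gt:
            true_positives += 1
    return true_positives
```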
We hadn't considered this scenario before. I fully agree that it should be addressed, given the prevalence of errors in data. We will address this by adding a validation check. Thanks for the detailed trace and feedback!
We added a drop_duplicates call when we parse the ground-truth file, here: pyJedAI/src/pyjedai/datamodel.py, line 159 in c19399a (a rough sketch of the idea is shown below). I think this will work better. Cheers,
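A rough sketch of what that deduplication could look like with pandas; the file name, separator, and column names here are assumptions for illustration, not the exact code in datamodel.py:

```python
import pandas as pd

# Assumed ground-truth layout: one "id1|id1"-style pair per row.
ground_truth = pd.read_csv("gt.csv", sep="|", header=None, names=["D1", "D2"])

# Drop rows that appear more than once in the file.
ground_truth = ground_truth.drop_duplicates(ignore_index=True)

# Optionally also drop self-pairs, following the suggestion above.
ground_truth = ground_truth[ground_truth["D1"] != ground_truth["D2"]].reset_index(drop=True)
```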
We have a dirty ER workflow where the EntityMatching graph is generated with similarity_threshold=0.0 (to keep all compared edges) and the clustering is then optimized for the best similarity_threshold using Optuna (a sketch of this tuning loop is below). We encountered the following: at the top end, where the threshold approaches 1.0 and the clustering therefore produces very few matches, the reported precision goes beyond 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is only a bug in edge cases where the number of matches is low.
best
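For reference, a minimal sketch of the tuning loop described above; cluster_and_evaluate is a hypothetical placeholder for the pyJedAI clustering and evaluation steps, not part of its actual API:

```python
import optuna

def cluster_and_evaluate(threshold):
    """Placeholder for the pyJedAI clustering + evaluation steps: cluster the
    precomputed similarity graph at `threshold` and return, e.g., the F1 score."""
    raise NotImplementedError

def objective(trial):
    # The similarity graph was built once with similarity_threshold=0.0,
    # so all compared edges are available; only the clustering threshold varies.
    threshold = trial.suggest_float("similarity_threshold", 0.0, 1.0)
    return cluster_and_evaluate(threshold)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("best similarity_threshold:", study.best_params["similarity_threshold"])
```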