Precision over 100% reported if ground truth contains pairs of identical ids #20

Closed · mrckzgl opened this issue Apr 16, 2024 · 4 comments
Labels: bug (Something isn't working)

mrckzgl commented Apr 16, 2024

We have a dirty ER workflow in which the EntityMatching graph is generated with similarity_threshold=0.0 (to keep all compared edges) and the optimal clustering similarity_threshold is then searched with optuna. We encountered this:
[Figure_1: precision exceeds 100% as the similarity threshold approaches 1.0]

On the top end, where the threshold approaches 1.0 and the clustering therefore produces very few matches, the reported precision goes beyond 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is only a bug in edge cases where the number of matches is low.
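For context, the threshold search is essentially the following (a minimal sketch; cluster_and_evaluate is a hypothetical stand-in for our pyJedAI clustering and evaluation calls, while the optuna part is the standard API):

```python
import optuna

def cluster_and_evaluate(threshold: float) -> float:
    # Hypothetical stand-in: re-cluster the pre-computed EntityMatching graph
    # (built with similarity_threshold=0.0) at `threshold`, run the evaluation,
    # and return the resulting F1.
    raise NotImplementedError

def objective(trial: optuna.Trial) -> float:
    # Only the clustering threshold is tuned; the similarity graph is reused.
    threshold = trial.suggest_float("similarity_threshold", 0.0, 1.0)
    return cluster_and_evaluate(threshold)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```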

best

mrckzgl (Author) commented Apr 16, 2024

Some more data. After calling

eval_obj.calculate_scores(true_positives=true_positives)

I printed eval_obj.__dict__:

{'total_matching_pairs': 76.0, 'data': <pyjedai.datamodel.Data object at 0x7e11d1839db0>, 'true_positives': 102, 'true_negatives': 185456764.0, 'false_positives': -26.0, 'false_negatives': 553360, 'all_gt_ids': {0, 1, 2, [...], 19316}, 'num_of_true_duplicates': 553462, 'precision': 1.3421052631578947, 'recall': 0.00018429449537637633, 'f1': 0.00036853838399531744}

So total_matching_pairs is smaller than true_positives.
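For reference, plugging the numbers from the dump into the usual definitions (my assumption of how the scores are computed, but it reproduces the reported values exactly) shows where the >100% comes from:

```python
# Values taken from eval_obj.__dict__ above.
total_matching_pairs = 76    # pairs produced by the clustering
true_positives = 102         # ground-truth pairs counted as found

# Assuming precision = TP / total_matching_pairs and
# FP = total_matching_pairs - TP:
false_positives = total_matching_pairs - true_positives  # -26, as reported
precision = true_positives / total_matching_pairs        # 1.3421..., as reported

# TP can only exceed the number of produced pairs if some ground-truth
# entries (e.g. self-pairs) are counted without a corresponding emitted match.
print(false_positives, precision)
```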

mrckzgl (Author) commented Apr 16, 2024

Ah, I got it. We have matching pairs with the same id in our ground truth, i.e. something like "id1|id1" as a row in the CSV file. Thinking about it, this is not incorrect: an entity obviously is identical to itself, but I also see that the ground truth is not as clean as it should be. I will clean up the GT, but an additional approach might be to check for identity of the ids here:

if id1 in entity_index and \

and, in that case, not increment true_positives, to make the evaluation more robust. But of course, one would also need to verify for the clean-clean ER case and the other steps' evaluations that the calculations remain correct and consistent.
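Roughly what I have in mind, as a sketch only (the function, entity_index, and the pair iteration are simplified stand-ins, not the actual pyJedAI evaluation code):

```python
def count_true_positives(gt_pairs, entity_index):
    # entity_index maps an entity id to the cluster it was assigned to.
    true_positives = 0
    for id1, id2 in gt_pairs:
        if id1 == id2:
            # Skip self-pairs like "id1|id1": they are trivially correct but
            # inflate TP beyond the number of emitted matching pairs.
            continue
        if id1 in entity_index and id2 in entity_index \
                and entity_index[id1] == entity_index[id2]:
            true_positives += 1
    return true_positives
```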

mrckzgl changed the title from "Precision over 100% reported in some edge cases" to "Precision over 100% reported if ground truth contains pairs of identical ids" on Apr 16, 2024
Nikoletos-K (Member) commented
We hadn't considered this scenario before. Given how common such errors are in real data, I fully agree it should be handled, and we will do so by adding a validation check.

Thanks for the detailed trace and feedback!

Nikoletos-K self-assigned this on Apr 16, 2024
Nikoletos-K added the "bug" label on Apr 16, 2024
Nikoletos-K (Member) commented
We added a drop_duplicates call when parsing the GT file. Here:

self.ground_truth.drop_duplicates(inplace=True)

I think this will work better.
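In standalone form, that clean-up amounts to something like the following (file name, separator, and column names are assumptions, and the final self-pair filter is an optional extra in the spirit of the suggestion above, not part of the fix itself):

```python
import pandas as pd

# Hypothetical ground-truth file with one "id1|id2" pair per row.
ground_truth = pd.read_csv("gt.csv", sep="|", header=None, names=["D1", "D2"])

# The fix described above: drop rows that appear more than once.
ground_truth.drop_duplicates(inplace=True)

# Optional extra step: also drop self-pairs such as "id1|id1", which
# drop_duplicates alone does not remove.
ground_truth = ground_truth[ground_truth["D1"] != ground_truth["D2"]]
```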

Cheers,
Konstantinos
