Issue with cluster_id in gazetteer_example.py #134

Open
jmiller558 opened this issue May 26, 2023 · 0 comments

jmiller558 commented May 26, 2023

There seems to be an issue with the clusters in gazetteer_example.py.

When assigning the cluster_ids, the for loop uses enumerate to generate them, but the loop body also increments the same variable with a += 1 counter. This results in strange behavior where multiple groups of matches end up sharing the same cluster_id.
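To illustrate the interaction, here is a minimal sketch, not the actual code from gazetteer_example.py. It assumes the results are shaped like `(messy_id, [(canon_id, score), ...])` and that the `+= 1` sits inside the inner loop; the function name and the sample IDs are hypothetical.

```python
def assign_cluster_ids_buggy(results):
    """Mimic the described pattern: enumerate plus a manual += 1 counter."""
    membership = {}
    for cluster_id, (messy_id, matches) in enumerate(results):
        for canon_id, _score in matches:
            membership[messy_id] = {"Cluster ID": cluster_id}
            membership[canon_id] = {"Cluster ID": cluster_id}
            # Manual increment; enumerate rebinds cluster_id on the next
            # outer iteration, so earlier increments are discarded.
            cluster_id += 1
    return membership


# Hypothetical sample data: messy_1 has two candidate matches, messy_2 has one.
results = [
    ("messy_1", [("canon_1", 0.9), ("canon_2", 0.8)]),
    ("messy_2", [("canon_3", 0.7)]),
]

membership = assign_cluster_ids_buggy(results)
for record_id, info in membership.items():
    print(record_id, info)
```

With this sample input, messy_2 and canon_3 end up with the same cluster_id as messy_1 and canon_2, even though the two groups are unrelated.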

In addition, the gazetteer.search function can return the same entry from the canonical dataset for multiple entries in the messy dataset, but the clustering code enforces one cluster_id for each entry in the canonical dataset. As a result, the canonical entry's cluster_id gets overwritten each time another messy entry matches it.

For example, if Messy_Entry_1 matches with Canonical_Entry_1, they will both be assigned to cluster_id = 1.

Then, if Messy_Entry_2 also matches with Canonical_Entry_1, they will both be assigned to cluster_id = 2.

The result will be:
{Messy_Entry_1: {'Cluster ID': 1} }
{Messy_Entry_2: {'Cluster ID': 2} }
{Canonical_Entry_1: {'Cluster ID': 2}}
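One possible way to avoid the overwrite is to key the cluster_id on the canonical entry, so that every messy entry matching the same canonical entry lands in the same cluster. The sketch below is an illustration under assumptions, not necessarily what the pull request does: it assumes results shaped like `(messy_id, [(canon_id, score), ...])`, keeps only the highest-scoring match per messy entry, and uses a hypothetical function name.

```python
def assign_cluster_ids_by_canonical(results):
    """Give each canonical entry one cluster_id and reuse it for all its matches."""
    canon_to_cluster = {}  # canonical id -> cluster id, assigned on first sight
    membership = {}
    for messy_id, matches in results:
        if not matches:
            continue  # no match found; leave the messy entry unclustered
        # Assumption: keep only the best-scoring canonical match.
        canon_id, score = max(matches, key=lambda match: match[1])
        cluster_id = canon_to_cluster.setdefault(canon_id, len(canon_to_cluster) + 1)
        membership[messy_id] = {"Cluster ID": cluster_id, "Link Score": score}
        membership[canon_id] = {"Cluster ID": cluster_id}
    return membership


# The scenario from the example above, with made-up scores.
results = [
    ("Messy_Entry_1", [("Canonical_Entry_1", 0.9)]),
    ("Messy_Entry_2", [("Canonical_Entry_1", 0.8)]),
]

membership = assign_cluster_ids_by_canonical(results)
for record_id, info in membership.items():
    print(record_id, info)
```

Here both messy entries and the canonical entry share a single cluster_id, and the link score is stored only on the messy side because a canonical entry can have several different scores.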

Pull request #135 has been created to resolve this issue.

jmiller558 changed the title from "Typo in gazetteer_example.py" to "Issue with cluster_id in gazetteer_example.py" on May 26, 2023