Resolving cluster_id issue in gazetteer_example.py #135

Open
wants to merge 2 commits into
base: main

jmiller558

Currently, gazetteer_example.py has an issue with the cluster_id assignment (see #134).

This pull request resolves that issue by assigning a unique cluster_id to each entry in the messy dataset and then assigning that same cluster_id to all of its matches from the canonical dataset. An entry in the canonical dataset can therefore have multiple cluster_ids. The script then outputs a CSV that can be sorted by cluster_id to see each entry in the messy dataset alongside all of its corresponding matches from the canonical dataset.

Fixing type in gazetteer_example.py

The for loop should either use enumerate to generate the cluster_id, or initialize cluster_id to 0 and increment it on each iteration, but not both.

The current code does both, which results in incorrect cluster_ids.
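
To illustrate why mixing the two approaches is a problem, here is a minimal sketch. It is not the actual code from gazetteer_example.py: the function names, the shape of `results`, and the placement of the manual increment are all assumptions chosen to show one way the conflict can produce wrong IDs.

```python
# Hypothetical match results: (messy_id, [canonical_ids matched to it]).
results = [("m0", ["c0", "c1"]), ("m1", ["c1"])]


def assign_mixed(results):
    """Pattern to avoid: enumerate and a manual increment fight over cluster_id."""
    membership = []
    cluster_id = 0
    for cluster_id, (messy_id, canon_ids) in enumerate(results):
        membership.append((messy_id, cluster_id))
        for canon_id in canon_ids:
            # Manual bump; enumerate rebinds cluster_id on the next outer pass,
            # so IDs drift within a cluster and collide across clusters.
            cluster_id += 1
            membership.append((canon_id, cluster_id))
    return membership


def assign_enumerate(results):
    """Use enumerate alone: one cluster_id per messy record and its matches."""
    membership = []
    for cluster_id, (messy_id, canon_ids) in enumerate(results):
        membership.append((messy_id, cluster_id))
        for canon_id in canon_ids:
            membership.append((canon_id, cluster_id))
    return membership


print(assign_mixed(results))
print(assign_enumerate(results))
```

In the mixed version of this sketch, "m1" ends up sharing an ID with "c0", which is not one of its matches. In the enumerate-only version, every record in a cluster carries the same ID.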
Have updated the code to resolve the cluster_id issue.

Now each entry from the messy dataset is assigned a unique cluster_id, and matching entries from the canonical dataset are assigned that same cluster_id.

The relationship is one-to-many, and entries from the canonical dataset can belong to multiple cluster_ids.
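
The one-to-many output described above could look roughly like the following sketch. The column names, the `source` field, the sample scores, and the helper names are illustrative assumptions, not the columns written by the actual script.

```python
import csv
import io

# Hypothetical results: (messy_id, [(canonical_id, score), ...]).
results = [
    ("m0", [("c0", 0.9), ("c1", 0.7)]),
    ("m1", [("c1", 0.8)]),
]

FIELDNAMES = ["cluster_id", "source", "record_id", "score"]


def cluster_rows(results):
    """Emit one row per messy record plus one row per canonical match."""
    rows = []
    for cluster_id, (messy_id, matches) in enumerate(results):
        rows.append({"cluster_id": cluster_id, "source": "messy",
                     "record_id": messy_id, "score": ""})
        for canon_id, score in matches:
            # A canonical record gets a row in every cluster it matches.
            rows.append({"cluster_id": cluster_id, "source": "canonical",
                         "record_id": canon_id, "score": score})
    return rows


def write_clusters(rows, handle):
    """Write rows sorted by cluster_id so each cluster reads as one group."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(sorted(rows, key=lambda row: row["cluster_id"]))


buffer = io.StringIO()
write_clusters(cluster_rows(results), buffer)
print(buffer.getvalue())
```

Emitting one row per (cluster, record) pair, rather than one row per record, is what lets a canonical entry such as "c1" appear under more than one cluster_id.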