Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Bug in similarity calculation in EntityMatching and incorrect documentation for dirtyER #21

Closed
mrckzgl opened this issue Apr 23, 2024 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@mrckzgl
Copy link

mrckzgl commented Apr 23, 2024

Incorrect Docs

At the top of https://pyjedai.readthedocs.io/en/latest/tutorials/DirtyER.html
an attribute list is used for the data attr = ['Entity Id','author', 'title'] (by the way IMHO it does not make sense to include the Entity Id as it always will be different for each entity, as such it will just reduce the similarity score of identical entities, so I would suggest to remove 'Entity Id' from the attr list).
Later entity matching is instantiated without specifying an attribute list:

em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0
)

This, however will result in all attributes of the entities to be compared, as EntityMatching is not falling back to using the attributes specified in the Data, see:

self.attributes: list = attributes

The constructor uses the provided attributes or none. I would suggest to either update the tutorial:

em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0,
    attributes=attr
)

or even better, fallback to the use the data.attributes in the em.predict method if self.attributes is None.

Issues regarding similarity calculation

As I understand the _similarity method, attributes can be either a dict, a list or None. For reflecting the dict use case self.attributes should be allowed to be a dict, by changing its type to any here:

self.attributes: list = attributes

More severe is that currently calculation of similarity is only correct if no attributes are specified.
For dict case if should be elif here:

if isinstance(self.attributes, list):

Currently, last else case will overwrite calculated dict similarity.

For list case denominator should be outside the loop, not inside. So this line:

similarity /= len(self.attributes)

should be deindented one step, otherwise sum will be divided by len(self.attributes)^2.

best

@Nikoletos-K Nikoletos-K self-assigned this Apr 24, 2024
@Nikoletos-K Nikoletos-K added the bug Something isn't working label Apr 24, 2024
Nikoletos-K added a commit that referenced this issue Apr 24, 2024
@Nikoletos-K
Copy link
Member

Hi,
well the attributes selection, is not a bug, as we thought that you may do blocking and matching with totally different set of params. For sure this increases the complexity and I think that we may re-consider this.

As far as Issues regarding similarity calculation you're totally right, it is fixed in the new release 0.1.7.

Cheers,
Konstantinos

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants