Bug in similarity calculation in EntityMatching and incorrect documentation for dirtyER #21

mrckzgl · 2024-04-23T13:35:42Z

Incorrect Docs

At the top of https://pyjedai.readthedocs.io/en/latest/tutorials/DirtyER.html
an attribute list is used for the data: attr = ['Entity Id', 'author', 'title']. (By the way, IMHO it does not make sense to include 'Entity Id': it is always different for each entity, so it only reduces the similarity score of otherwise identical entities. I would suggest removing 'Entity Id' from the attr list.)
Later, EntityMatching is instantiated without specifying an attribute list:

em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0
)

This, however, results in all attributes of the entities being compared, because EntityMatching does not fall back to the attributes specified in the Data object; see:

pyJedAI/src/pyjedai/matching.py

Line 355 in 4b0a621

self.attributes: list = attributes

The constructor uses the provided attributes or None. I would suggest either updating the tutorial:

em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0,
    attributes=attr
)

or, even better, falling back to data.attributes in the em.predict method if self.attributes is None.
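To illustrate the fallback I have in mind, here is a minimal sketch. The classes Data and Matcher and the method resolve_attributes are hypothetical stand-ins written for this example, not pyJedAI's actual classes; the only thing assumed from the library is that the data object exposes an attributes list.

```python
from typing import List, Optional


class Data:
    """Hypothetical stand-in for the dataset object; only `attributes` matters here."""

    def __init__(self, attributes: List[str]):
        self.attributes = attributes


class Matcher:
    """Hypothetical matcher, not pyJedAI's EntityMatching."""

    def __init__(self, attributes: Optional[List[str]] = None):
        self.attributes = attributes

    def resolve_attributes(self, data: Data) -> List[str]:
        # Prefer attributes passed to the matcher; otherwise reuse the data's list.
        if self.attributes is not None:
            return self.attributes
        return data.attributes
```

With this, a matcher created without attributes would compare on the same columns the data was loaded with, while an explicit list still takes precedence.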

Issues regarding similarity calculation

As I understand the _similarity method, attributes can be a dict, a list, or None. To reflect the dict use case, self.attributes should be allowed to be a dict, by changing its type annotation to Any here:

pyJedAI/src/pyjedai/matching.py

Line 355 in 4b0a621

self.attributes: list = attributes

More severe is that the similarity calculation is currently only correct if no attributes are specified.
For the dict case, the if should be an elif here:

pyJedAI/src/pyjedai/matching.py

Line 507 in 4b0a621

if isinstance(self.attributes, list):

Currently, the final else branch also runs for the dict case and overwrites the similarity calculated there.
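The overwrite can be reproduced outside the library with a small sketch. Everything here is an assumption made for illustration: the function names, the idea that a dict maps attribute names to weights, and the precomputed per-attribute scores. It is not the code from matching.py, only the same branching shape.

```python
def combine_buggy(attributes, per_attribute, whole_record):
    similarity = 0.0
    if isinstance(attributes, dict):
        # Assumed meaning of the dict: attribute name -> weight.
        similarity = sum(w * per_attribute[a] for a, w in attributes.items())
    if isinstance(attributes, list):  # separate `if`, so the `else` below pairs with it
        similarity = sum(per_attribute[a] for a in attributes) / len(attributes)
    else:
        # Runs for None AND for dict, discarding the dict result.
        similarity = whole_record
    return similarity


def combine_fixed(attributes, per_attribute, whole_record):
    if isinstance(attributes, dict):
        return sum(w * per_attribute[a] for a, w in attributes.items())
    elif isinstance(attributes, list):
        return sum(per_attribute[a] for a in attributes) / len(attributes)
    else:
        return whole_record
```

For a dict such as {'title': 1.0} the first version returns the whole-record score, while the second returns the weighted score.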

For the list case, the division should happen outside the loop, not inside it. So this line:

pyJedAI/src/pyjedai/matching.py

Line 515 in 4b0a621

similarity /= len(self.attributes)

should be dedented one level; otherwise the running sum is divided by len(self.attributes) on every iteration instead of once at the end.
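A small standalone sketch of the effect, using made-up per-attribute scores rather than the library's metric (the function names are hypothetical, and I am assuming the loop accumulates with +=):

```python
def mean_similarity_buggy(scores):
    similarity = 0.0
    for s in scores:
        similarity += s
        similarity /= len(scores)  # division inside the loop: applied every iteration
    return similarity


def mean_similarity_fixed(scores):
    similarity = 0.0
    for s in scores:
        similarity += s
    # Division once, after all attribute scores have been summed.
    return similarity / len(scores)
```

For two attributes that both match perfectly, scores [1.0, 1.0], the fixed version gives 1.0, while the version with the division inside the loop gives 0.75.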

best


Nikoletos-K · 2024-04-24T08:25:09Z

Hi,
The attribute selection is not a bug: we assumed that you might do blocking and matching with entirely different sets of parameters. This certainly increases complexity, and I think we may reconsider it.

As for the issues regarding the similarity calculation, you're totally right; they are fixed in the new release 0.1.7.

Cheers,
Konstantinos

Nikoletos-K self-assigned this Apr 24, 2024

Nikoletos-K added the bug label Apr 24, 2024

Nikoletos-K added a commit that referenced this issue Apr 24, 2024

Fixed issue #21

0dd5fcf

Nikoletos-K closed this as completed Apr 24, 2024
