At the top of https://pyjedai.readthedocs.io/en/latest/tutorials/DirtyER.html
an attribute list is used for the data: attr = ['Entity Id', 'author', 'title']. (By the way, IMHO it does not make sense to include 'Entity Id': it is always different for each entity, so it only reduces the similarity score of otherwise identical entities. I would suggest removing 'Entity Id' from the attr list.)
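To illustrate the point about 'Entity Id', here is a minimal sketch. It assumes Jaccard similarity over whitespace tokens of the concatenated attribute values; the helper names and the sample records are made up for illustration, and pyJedAI's actual tokenization may differ.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def concat(entity: dict, attributes: list) -> str:
    """Join the selected attribute values into one string (hypothetical helper)."""
    return " ".join(entity[name] for name in attributes)


# Two records describing the same publication, differing only in their id.
e1 = {"Entity Id": "1", "author": "j smith", "title": "entity resolution"}
e2 = {"Entity Id": "2", "author": "j smith", "title": "entity resolution"}

with_id = jaccard(concat(e1, ["Entity Id", "author", "title"]),
                  concat(e2, ["Entity Id", "author", "title"]))
without_id = jaccard(concat(e1, ["author", "title"]),
                     concat(e2, ["author", "title"]))

print(with_id, without_id)
```

With the id included, the two token sets share 4 of 6 distinct tokens; without it, they are identical.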
Later entity matching is instantiated without specifying an attribute list:
em = EntityMatching(
metric='jaccard',
similarity_threshold=0.0
)
This, however, will result in all attributes of the entities being compared, because EntityMatching does not fall back to the attributes specified in the Data; see pyJedAI/src/pyjedai/matching.py, line 355 in 4b0a621.
The constructor uses the provided attributes, or None if none are given. I would suggest either updating the tutorial:
em = EntityMatching(
metric='jaccard',
similarity_threshold=0.0,
attributes=attr
)
or, even better, falling back to data.attributes in the em.predict method if self.attributes is None.
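The suggested fallback could look roughly like the sketch below. ToyData and ToyMatcher are hypothetical stand-ins, not pyJedAI classes; only the idea (use the matcher's own attributes if given, otherwise the ones stored on the data) is taken from the suggestion above.

```python
from typing import List, Optional


class ToyData:
    """Hypothetical stand-in for a dataset that remembers its attribute list."""

    def __init__(self, attributes: List[str]):
        self.attributes = list(attributes)


class ToyMatcher:
    """Hypothetical stand-in for a matcher with an optional attribute list."""

    def __init__(self, attributes: Optional[List[str]] = None):
        self.attributes = attributes

    def resolve_attributes(self, data: ToyData) -> List[str]:
        # Explicitly configured attributes win; otherwise fall back to the data's.
        if self.attributes is not None:
            return list(self.attributes)
        return list(data.attributes)


data = ToyData(["author", "title"])
print(ToyMatcher().resolve_attributes(data))                      # falls back to data
print(ToyMatcher(attributes=["title"]).resolve_attributes(data))  # explicit list wins
```

This keeps the option of matching on a different attribute set than the one used for loading or blocking, while making the tutorial's behaviour the default.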
Issues regarding similarity calculation
As I understand the _similarity method, attributes can be either a dict, a list or None. To reflect the dict use case, self.attributes should be allowed to be a dict, by changing its type to any here: pyJedAI/src/pyjedai/matching.py, line 355 in 4b0a621.
Hi,
Well, the attribute selection is not a bug: we thought that you might want to do blocking and matching with totally different sets of parameters. This certainly increases the complexity, and I think we may reconsider it.
As for the issues regarding similarity calculation, you're totally right; they are fixed in the new release, 0.1.7.
Incorrect Docs
At the top of https://pyjedai.readthedocs.io/en/latest/tutorials/DirtyER.html
an attribute list is used for the data
attr = ['Entity Id','author', 'title']
(By the way, IMHO it does not make sense to include 'Entity Id': it is always different for each entity, so it only reduces the similarity score of otherwise identical entities. I would suggest removing 'Entity Id' from the attr list.)
Later, entity matching is instantiated without specifying an attribute list.
This, however, will result in all attributes of the entities being compared, because EntityMatching does not fall back to the attributes specified in the Data; see:
pyJedAI/src/pyjedai/matching.py
Line 355 in 4b0a621
The constructor uses the provided attributes, or None if none are given. I would suggest either updating the tutorial to pass attributes=attr to EntityMatching
or, even better, falling back to data.attributes in the em.predict method if self.attributes is None.
Issues regarding similarity calculation
As I understand the _similarity method, attributes can be either a dict, a list or None. To reflect the dict use case, self.attributes should be allowed to be a dict, by changing its type to any here:
pyJedAI/src/pyjedai/matching.py
Line 355 in 4b0a621
More severe is that the similarity calculation is currently correct only if no attributes are specified.
For the dict case, if should be elif here:
pyJedAI/src/pyjedai/matching.py
Line 507 in 4b0a621
Currently, the final else case overwrites the calculated dict similarity.
For the list case, the denominator should be outside the loop, not inside. So this line:
pyJedAI/src/pyjedai/matching.py
Line 515 in 4b0a621
should be dedented one level; otherwise the sum will be divided by len(self.attributes)^2.
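For reference, here is a minimal sketch of how the three cases could be structured once both fixes are applied. This is not the code from matching.py: the function names are made up, the dict is assumed to map attribute names to weights, and Jaccard over whitespace tokens is used only as a placeholder metric.

```python
from typing import Dict, List, Optional, Union


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def similarity(
    e1: Dict[str, str],
    e2: Dict[str, str],
    attributes: Optional[Union[Dict[str, float], List[str]]] = None,
) -> float:
    if isinstance(attributes, dict):
        # Assumption: the dict maps attribute name -> weight.
        score = sum(w * jaccard(e1[n], e2[n]) for n, w in attributes.items())
    elif isinstance(attributes, list) and attributes:
        # elif (not a second if), so the else branch cannot overwrite the dict result.
        total = 0.0
        for name in attributes:
            total += jaccard(e1[name], e2[name])
        score = total / len(attributes)  # divide once, after the loop
    else:
        # No attributes given: compare all values concatenated.
        score = jaccard(" ".join(e1.values()), " ".join(e2.values()))
    return score


e1 = {"author": "a b", "title": "x y"}
e2 = {"author": "a b", "title": "x z"}
print(similarity(e1, e2, {"author": 0.5, "title": 0.5}))
print(similarity(e1, e2, ["author", "title"]))
print(similarity(e1, e2))
```

With equal weights of 0.5, the dict and list cases give the same score here, which is a quick sanity check that neither branch is being overwritten or divided twice.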
best