
pgsql_big_dedupe_example fails #129

Open

wilko77 opened this issue Jul 11, 2022 · 3 comments

Comments

wilko77 commented Jul 11, 2022

I ran the Postgres example as-is against a Postgres 14.2 database, with dedupe version 2.0.17.
After training and clustering, it eventually fails during 'writing results' with the following error:

writing results
WARNING:dedupe.clustering:A component contained 656982 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.8445158759995937
Traceback (most recent call last):
  File "/Users/******/Code/dedupe-examples/pgsql_big_dedupe_example/pgsql_big_dedupe_example.py", line 304, in <module>
    write_cur.copy_expert('COPY entity_map FROM STDIN WITH CSV',
psycopg2.errors.QueryCanceled: COPY from stdin failed: error in .read() call: ValueError Iteration of zero-sized operands is not enabled
CONTEXT:  COPY entity_map, line 1
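
For context, the ValueError at the bottom is raised by numpy itself: nditer refuses zero-sized operands unless told otherwise, so the failure looks like dedupe handing an empty edgelist to union_find after the re-filtering step. A minimal sketch outside of dedupe (the empty array below is only a stand-in for that edgelist):

import numpy

# Any zero-sized operand reproduces the error from the traceback above.
empty = numpy.empty(0)

try:
    numpy.nditer(empty, ["external_loop"])
except ValueError as err:
    print(err)  # -> Iteration of zero-sized operands is not enabled
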
@evanmuller

I'm experiencing a similar issue with the mysql_example:

creating entity_map database
A component contained 56250 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.9027395568206275
Traceback (most recent call last):
  File "mysql_example.py", line 277, in <module>
    write_cur.executemany('INSERT INTO entity_map VALUES (%s, %s, %s)',
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 230, in executemany
    return self._do_execute_many(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 258, in _do_execute_many
    for arg in args:
  File "mysql_example.py", line 50, in cluster_ids
    for cluster, scores in clustered_dupes:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/api.py", line 341, in cluster
    yield from clustering.cluster(scores, threshold)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 51, in connected_components
    yield from _connected_components(edgelist, max_components)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 99, in _connected_components
    for sub_graph in _connected_components(filtered_sub_graph, max_components):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 59, in _connected_components
    component_stops = union_find(edgelist)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 114, in union_find
    it = numpy.nditer(edgelist, ["external_loop"])
ValueError: Iteration of zero-sized operands is not enabled

@evanmuller

I got the mysql example to work by adding the "zerosize_ok" flag to the numpy.nditer call in clustering.py. I imagine this would also resolve the OP's Postgres example. I'm not a Python developer, so I don't want to open a PR for this until I have a better understanding of what's going on. In the union_find function in clustering.py, I changed...

it = numpy.nditer(edgelist, ["external_loop"])

to...

it = numpy.nditer(edgelist, ["external_loop", "zerosize_ok"])
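
To sanity-check the patched call on an empty edgelist, a quick sketch (a plain array stands in for dedupe's actual edgelist here; this is only an illustration, not dedupe code):

import numpy

empty = numpy.empty(0)  # stand-in for an empty edgelist

# With "zerosize_ok" the iterator is constructed successfully and is
# simply empty, so union_find sees zero edges to merge.
it = numpy.nditer(empty, ["external_loop", "zerosize_ok"])
print(it.itersize)  # -> 0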


twright8 commented Feb 7, 2023

This still doesn't work for me, even with the fix above. Any new solutions?
