Duplicate Rows May Remain After dropDuplicateRows Due to Early Return in isDuplicate #1248

Open
vlevy-pci opened this issue Feb 13, 2024 · 3 comments

@vlevy-pci

Description:
When using dropDuplicateRows to eliminate duplicate entries from a table, I observed that duplicates were still present in the output. Upon investigation, the root cause was identified in the isDuplicate function. This function is designed to iterate over rows that share a hash with the row being evaluated to determine if it is a duplicate. However, it incorrectly returns false (indicating the row is unique) during the first iteration if the first checked row does not match, without examining the remaining rows.

Expected Behavior:
The isDuplicate function should only return false after all rows with the matching hash have been checked and none are found to be identical to the row being evaluated. This ensures that a row is only considered unique if it has been verified against all potential duplicates.

Actual Behavior:
The function returns false prematurely after comparing with the first row that shares a hash, potentially leaving unexamined duplicates in the table.
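A minimal sketch of the early-return pattern described above, for illustration only. The class name, the method name `isDuplicateBuggy`, and the representation of rows as `int[]` are assumptions made for this example; this is not the library's actual source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reported pattern; not the library's real code.
class EarlyReturnSketch {

    // Buggy variant: the decision is made after inspecting only the first
    // row in the list of rows that share the candidate's hash.
    static boolean isDuplicateBuggy(int[] candidate, List<int[]> sameHashRows) {
        for (int[] seen : sameHashRows) {
            if (Arrays.equals(candidate, seen)) {
                return true;
            } else {
                // Early return: the remaining rows are never compared.
                return false;
            }
        }
        return false; // reached only when no rows share the hash
    }
}
```

With same-hash rows `{1, 4}` and `{2, 3}`, a candidate `{2, 3}` is reported as unique by this sketch, because the comparison stops at `{1, 4}`.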

Resolution:
The issue was resolved by modifying isDuplicate to complete its iteration over all rows with a matching hash before deciding that the row is not a duplicate. This change ensured that dropDuplicateRows correctly removed all duplicates from the table.
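A sketch of the corrected logic, under the same caveats: the names `DedupSketch` and `weakHash`, and the `int[]` row representation, are hypothetical and chosen only to make the example self-contained. The deliberately weak hash (sum of cells) forces distinct rows to share a hash so that the full iteration matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the fix; not the library's real code.
class DedupSketch {

    // Returns false only after every row sharing the hash has been compared.
    static boolean isDuplicate(int[] candidate, List<int[]> sameHashRows) {
        for (int[] seen : sameHashRows) {
            if (Arrays.equals(candidate, seen)) {
                return true;
            }
        }
        return false;
    }

    // Deliberately weak hash (sum of cells) so that distinct rows collide.
    static int weakHash(int[] row) {
        int sum = 0;
        for (int value : row) {
            sum += value;
        }
        return sum;
    }

    // Keeps the first occurrence of each row, preserving input order.
    static List<int[]> dropDuplicateRows(List<int[]> rows) {
        Map<Integer, List<int[]>> rowsByHash = new HashMap<>();
        List<int[]> unique = new ArrayList<>();
        for (int[] row : rows) {
            List<int[]> sameHashRows =
                rowsByHash.computeIfAbsent(weakHash(row), k -> new ArrayList<>());
            if (!isDuplicate(row, sameHashRows)) {
                sameHashRows.add(row);
                unique.add(row);
            }
        }
        return unique;
    }
}
```

For the input rows `{1, 4}`, `{2, 3}`, `{2, 3}`, `{1, 4}`, `{5, 0}`, which all hash to 5 here, this sketch keeps three rows. A version that returned `false` after the first mismatch would keep the second `{2, 3}` as well.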

@frankwondon

frankwondon commented Feb 13, 2024 via email

@frankzengjj

Has this issue been taken? If not, I would like to work on it.

@vlevy-pci
Author

vlevy-pci commented Jun 6, 2024 via email
