Duplicate Rows May Remain After dropDuplicateRows Due to Early Return in isDuplicate #1248

vlevy-pci · 2024-02-13T23:25:22Z

Description:
When using dropDuplicateRows to eliminate duplicate entries from a table, I observed that duplicates were still present in the output. Upon investigation, the root cause was identified in the isDuplicate function. This function is designed to iterate over rows that share a hash with the row being evaluated to determine if it is a duplicate. However, it incorrectly returns false (indicating the row is unique) during the first iteration if the first checked row does not match, without examining the remaining rows.

Expected Behavior:
The isDuplicate function should only return false after all rows with the matching hash have been checked and none are found to be identical to the row being evaluated. This ensures that a row is only considered unique if it has been verified against all potential duplicates.

Actual Behavior:
The function returns false prematurely after comparing with the first row that shares a hash, potentially leaving unexamined duplicates in the table.

Resolution:
The issue was resolved by modifying isDuplicate to complete its iteration over all rows with a matching hash before deciding that the row is not a duplicate. This change ensured that dropDuplicateRows correctly removed all duplicates from the table.
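
For anyone picking this up, here is a minimal sketch of the pattern described above. It is not the actual tablesaw source: the class `DedupSketch`, the `int[]` row representation, the `useFixed` flag, and the deliberately weak `collidingHash` are all hypothetical, chosen only to force hash collisions and make the early return visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the dedup logic; names do not mirror tablesaw internals.
final class DedupSketch {

    // Buggy variant: decides after looking at only the first row in the bucket.
    static boolean isDuplicateBuggy(int[] row, List<int[]> bucket) {
        for (int[] candidate : bucket) {
            if (Arrays.equals(row, candidate)) {
                return true;
            } else {
                return false; // early return: remaining candidates are never checked
            }
        }
        return false; // empty bucket: nothing to compare against
    }

    // Fixed variant: returns false only after every row in the bucket has been compared.
    static boolean isDuplicateFixed(int[] row, List<int[]> bucket) {
        for (int[] candidate : bucket) {
            if (Arrays.equals(row, candidate)) {
                return true;
            }
        }
        return false;
    }

    // Weak hash (sum of values) so that different rows such as {1,2} and {2,1} collide.
    static int collidingHash(int[] row) {
        int sum = 0;
        for (int value : row) {
            sum += value;
        }
        return sum;
    }

    // Keeps the first occurrence of each row, grouping previously kept rows by hash.
    static List<int[]> dropDuplicates(List<int[]> rows, boolean useFixed) {
        Map<Integer, List<int[]>> buckets = new HashMap<>();
        List<int[]> kept = new ArrayList<>();
        for (int[] row : rows) {
            List<int[]> bucket =
                buckets.computeIfAbsent(collidingHash(row), k -> new ArrayList<>());
            boolean duplicate = useFixed
                ? isDuplicateFixed(row, bucket)
                : isDuplicateBuggy(row, bucket);
            if (!duplicate) {
                bucket.add(row);
                kept.add(row);
            }
        }
        return kept;
    }
}
```

With the input rows `{1,2}`, `{2,1}`, `{2,1}`, all three land in the same bucket. The buggy variant compares the third row only against `{1,2}` and keeps it, while the fixed variant also reaches `{2,1}` and drops it. The bug only shows up when two different rows share a hash, which may be why it is easy to miss in small tests.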

frankwondon · 2024-02-13T23:25:54Z

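This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.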

frankzengjj · 2024-06-06T03:45:29Z

Has this issue been taken? If not, I would like to work on it.

vlevy-pci · 2024-06-06T21:08:26Z

Hi Frank, I wrote a fix for my project but I have not submitted a PR for it. Please feel free to take it over. Hopefully it will be straightforward to work from my description, but if you want my version as a reference, you are welcome to it. Best regards, Vic
