rustdedup

Deduplicate files at high speed! Written in Rust.

Memory optimizations made:

  • Well, Rust.
  • Input lines are streamed directly to the processing threads without collecting them all first.
  • The hash space is partitioned to reduce lock contention.
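
The last two points can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual rustdedup code: the names `shard_index` and `dedup_lines`, the `Mutex<HashSet<String>>` shards, and the channel-based hand-off are all assumptions made for the example. Each line is sent to a worker as soon as it is read, and a worker locks only the shard that the line hashes to, so threads working on different shards do not block each other.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::io::BufRead;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical helper: choose a shard for a line from its hash.
/// `shard_count` must be greater than zero.
fn shard_index(line: &str, shard_count: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    (hasher.finish() as usize) % shard_count
}

/// Hypothetical sketch: stream lines from `input` to `workers` threads and
/// return the unique lines. Output order is not guaranteed.
fn dedup_lines<R: BufRead>(input: R, shard_count: usize, workers: usize) -> Vec<String> {
    // One independently locked set per partition of the hash space.
    let shards: Arc<Vec<Mutex<HashSet<String>>>> =
        Arc::new((0..shard_count).map(|_| Mutex::new(HashSet::new())).collect());

    let (line_tx, line_rx) = mpsc::channel::<String>();
    let line_rx = Arc::new(Mutex::new(line_rx));
    let (out_tx, out_rx) = mpsc::channel::<String>();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let shards = Arc::clone(&shards);
        let line_rx = Arc::clone(&line_rx);
        let out_tx = out_tx.clone();
        handles.push(thread::spawn(move || loop {
            // The receiver lock is released at the end of this statement.
            let next = line_rx.lock().unwrap().recv();
            let line = match next {
                Ok(line) => line,
                Err(_) => break, // sender dropped: input is exhausted
            };
            let idx = shard_index(&line, shards.len());
            // Only the shard owning this hash range is locked.
            let is_new = shards[idx].lock().unwrap().insert(line.clone());
            if is_new {
                out_tx.send(line).unwrap();
            }
        }));
    }
    drop(out_tx); // keep only the workers' clones alive

    // Stream lines to the workers as they are read; nothing is collected first.
    for line in input.lines() {
        line_tx.send(line.expect("failed to read line")).unwrap();
    }
    drop(line_tx); // signal end of input

    for handle in handles {
        handle.join().unwrap();
    }
    out_rx.iter().collect()
}
```

Calling `dedup_lines(std::io::stdin().lock(), 64, 4)` would deduplicate standard input across four workers and 64 shards; both numbers are arbitrary here. More shards generally means less chance that two threads want the same lock at once, at the cost of a few more small sets.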

Some stats

In the test below we use a small 75 MB file with 1 595 966 lines of data (anything larger means waiting too long for hyperfine). [image]

When we up the ante a little and move to a large 2.3 GB file, we see some improvements. [image]

When we compare with the likes of duplicut (https://github.com/nil0x42/duplicut), some significant improvements can be seen; however, I'm not sure whether this comes down to using Rust rather than C. [image]

Usage

cat file.txt | rustdedup

rustdedup -i /diska9.txtextra.csvmodded.csv -o output2.txt