
To make dedup[0] fast, I use a tree keyed on device id, size, first byte, last byte, and finally SHA-256. Each of those stages is only consulted if there is a collision at the previous one, to avoid as many reads as possible. dedup doesn't do a full file compare, because if you've found a file with the same size, first and last bytes, and SHA-256, you've also probably won the lottery several times over and can afford data recovery.
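
The staged comparison could be sketched roughly like this in Python (a minimal illustration of the idea, not dedup's actual implementation; the function names and structure here are my own):

```python
import hashlib
import os
from collections import defaultdict

def sha256_of(path):
    """Full-file SHA-256; only reached by files that collide on all cheaper keys."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.digest()

def first_last_byte(path, size):
    """Read exactly one byte from each end of the file."""
    with open(path, "rb") as f:
        first = f.read(1)
        f.seek(max(size - 1, 0))
        last = f.read(1)
    return first, last

def find_duplicates(paths):
    """Group by (device, size) first; read bytes and hash only for
    groups that still collide after each cheaper stage."""
    by_size = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        by_size[(st.st_dev, st.st_size)].append(p)

    dupes = []
    for (dev, size), group in by_size.items():
        if len(group) < 2:
            continue  # unique (device, size): no file reads needed at all
        by_bytes = defaultdict(list)
        for p in group:
            by_bytes[first_last_byte(p, size)].append(p)
        for subgroup in by_bytes.values():
            if len(subgroup) < 2:
                continue  # first/last byte already ruled it out
            by_hash = defaultdict(list)
            for p in subgroup:
                by_hash[sha256_of(p)].append(p)
            dupes.extend(g for g in by_hash.values() if len(g) > 1)
    return dupes
```

Most non-duplicate pairs fall out at the size stage with zero reads; the single-byte probes then kill most accidental size collisions before any hashing happens.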

This is the default for ZFS deduplication, and git does something similar with size and the far weaker SHA-1. I would add a test for SHA-256 collisions, but no one seems to have found a working example yet.

0 - https://github.com/ttkb-oss/dedup



How much time is saved by not comparing full file contents? Given that this is a tool some people will only run occasionally, having it take 30 seconds instead of 15 is a small price to pay for ensuring it doesn't treat two differing files as equal.


Same size, same first and last bytes, and same SHA-256.

…and you’re not worried about shark attacks, are you?


I just don't understand why one would intentionally throw chance into a tool that one wants to be 100% robust. It's baffling.


FWIW, when I wrote a tool like this I used same size plus some hash function; not MD5, maybe SHA-1, I don't remember. First and last bytes is a good idea, I didn't think of that.


Reading just the first byte probably still costs a read of the whole block.

Hashing the whole file after that is wasteful. You need to read (and hash) only as much as needed to demonstrate uniqueness of the file in the set.

The tree concept can be extended to every byte in the file:

https://github.com/kornelski/dupe-krill?tab=readme-ov-file#n...
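
That lazy, read-only-until-unique refinement could be sketched like this in Python (my own simplified take on the idea behind dupe-krill's approach, not its actual code; it assumes the input paths have already been grouped by identical size):

```python
import hashlib
from collections import defaultdict

CHUNK = 1 << 16  # 64 KiB per round

def find_dupes_lazy(paths):
    """Refine candidate groups one chunk at a time: each round hashes the
    next CHUNK bytes of every file still in a multi-file group, so a file
    stops being read as soon as it diverges from all its peers."""
    handles = {p: open(p, "rb") for p in paths}
    groups = [list(paths)]
    done = []
    try:
        while groups:
            next_groups = []
            for group in groups:
                continuing = defaultdict(list)  # more bytes left to read
                finished = defaultdict(list)    # hit EOF this round
                for p in group:
                    chunk = handles[p].read(CHUNK)
                    digest = hashlib.sha256(chunk).digest()
                    if len(chunk) < CHUNK:
                        finished[digest].append(p)
                    else:
                        continuing[digest].append(p)
                # Groups that shrink to one file are dropped: unique,
                # and never read again.
                done.extend(g for g in finished.values() if len(g) > 1)
                next_groups.extend(g for g in continuing.values() if len(g) > 1)
            groups = next_groups
        return done
    finally:
        for f in handles.values():
            f.close()
```

For files that differ early, this reads only the first chunk or two; the full-file hashing cost is paid only by actual duplicates, which have to be read in full no matter what.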


Yeah, there is definitely some merit to more efficient hashing. Trees with a lot of duplicates require a lot of hashing, but hashing the entire file would be required regardless of whether partial hashes are done or not.

I have one data set where `dedup` was 40% faster than `dupe-krill` and another where `dupe-krill` was 45% faster than `dedup`.

`dupe-krill` uses BLAKE3, which, last I checked, was not hardware accelerated on M-series processors. What's interesting is that because of hardware acceleration, `dedup` is mostly CPU-idle, waiting on the hash calculation, while `dupe-krill` is maxing out 3 cores.

Thanks for the link!



