Something else that might work is content-dependent deduplication: chunk boundaries are variable and are chosen by a sliding Rabin-Karp-style (or XOR) fingerprint over the content, and a second, cryptographic hash is calculated only for chunks where the cheap fingerprint has a match. It's naive, in the sense that it knows nothing about the message structure, so it can find matches across headers, body and attachment parts.
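To make the idea concrete, here is a rough sketch in Python, not a tuned implementation. The names (`split_chunks`, `DedupStore`) and all the constants are made up for illustration. It also assumes CRC-32 as the cheap per-chunk fingerprint and SHA-256 as the cryptographic hash, which are just convenient stand-ins. A polynomial rolling hash over the last `WINDOW` bytes picks the chunk boundaries, so a boundary depends only on nearby content and not on the byte offset.

```python
import hashlib
import zlib

# All constants are illustrative assumptions, not recommended values.
WINDOW = 16            # bytes covered by the sliding fingerprint
BASE = 257             # polynomial base for the rolling hash
MOD = (1 << 31) - 1    # modulus for the rolling hash
MASK = (1 << 6) - 1    # cut when the low 6 bits of the hash are all set
MIN_CHUNK = 16         # never cut a chunk shorter than this
MAX_CHUNK = 1024       # always cut once a chunk reaches this size


def split_chunks(data: bytes):
    """Yield variable-size chunks whose boundaries depend on the content."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        if (size >= MIN_CHUNK and (h & MASK) == MASK) or size >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing partial chunk


class DedupStore:
    """Hypothetical in-memory chunk store, for illustration only."""

    def __init__(self):
        self.chunks = []    # unique chunk payloads
        self.by_weak = {}   # cheap fingerprint -> indices into self.chunks
        self.strong = {}    # chunk index -> SHA-256 digest, filled lazily

    def _digest(self, idx: int) -> bytes:
        if idx not in self.strong:
            self.strong[idx] = hashlib.sha256(self.chunks[idx]).digest()
        return self.strong[idx]

    def add(self, data: bytes) -> list:
        """Store data and return a recipe: the list of chunk indices."""
        recipe = []
        for chunk in split_chunks(data):
            weak = zlib.crc32(chunk)  # cheap fingerprint (an assumption)
            candidates = self.by_weak.setdefault(weak, [])
            idx = None
            if candidates:
                # Cheap fingerprint matched: only now pay for the strong hash.
                digest = hashlib.sha256(chunk).digest()
                for c in candidates:
                    if self._digest(c) == digest:
                        idx = c
                        break
            if idx is None:
                idx = len(self.chunks)
                self.chunks.append(chunk)
                candidates.append(idx)
            recipe.append(idx)
        return recipe

    def get(self, recipe: list) -> bytes:
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[i] for i in recipe)
```

Feeding whole raw messages through `DedupStore.add` means two mails with different headers but the same attachment should end up sharing most of their chunks. The boundaries re-synchronise a short way into the common data, because each cut depends only on the preceding `WINDOW` bytes. A real store would also want a faster rolling hash and persistent indexes.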