Almost a decade ago I wrote a tool to search forum dumps from various EVE Online alliances. The content was acquired by spies and often watermarked.
The first barrier was that homoglyphs would inhibit text search, so I had to build an automated homoglyph detection and substitution layer.
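For anyone curious what a detection and substitution layer like that can look like, here is a minimal sketch in Python. It is not the original tool's code: the `CONFUSABLES` table, `normalize_homoglyphs`, and `find_homoglyphs` are hypothetical names made up for illustration, and the table covers only a handful of Cyrillic and Greek lookalikes. A real layer would need a much larger mapping, such as the Unicode confusables data.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (an assumption for this
# sketch, not the original tool's data). Maps non-Latin lookalikes to ASCII.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
    "\u0443": "y",  # Cyrillic small u
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small omicron
}
_TABLE = str.maketrans(CONFUSABLES)


def normalize_homoglyphs(text):
    """Return text with known lookalike characters replaced by ASCII."""
    # NFKC folds compatibility forms (fullwidth letters, ligatures) first,
    # then the table handles cross-script lookalikes NFKC leaves alone.
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(_TABLE)


def find_homoglyphs(text):
    """Return (index, char) for every character normalization would change."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if normalize_homoglyphs(ch) != ch
    ]
```

Running `find_homoglyphs` over a page before normalizing it gives the positions of the suspicious characters, which is also roughly the information a homoglyph watermark encodes.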
Once the homoglyphs were stripped, the next challenge was fitting the entire search corpus into memory. I compressed each page with LZMA, kept the compressed pages in memory, and decompressed them on the fly when searching. Probably not optimal, but still way faster than loading from disk.
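The compress-in-memory approach can be sketched in a few lines with Python's standard `lzma` module. Again, this is an illustration under my own assumptions rather than the original implementation: the `CompressedCorpus` class, its method names, and the plain case-insensitive substring match are all hypothetical.

```python
import lzma


class CompressedCorpus:
    """Keeps every page LZMA-compressed in RAM and decompresses per search."""

    def __init__(self):
        # page_id -> LZMA-compressed UTF-8 bytes
        self._pages = {}

    def add(self, page_id, text):
        # Compress once at load time; only the compressed bytes stay resident.
        self._pages[page_id] = lzma.compress(text.encode("utf-8"))

    def search(self, needle):
        """Yield the ids of pages containing needle (case-insensitive)."""
        needle = needle.lower()
        for page_id, blob in self._pages.items():
            # Decompress on the fly; the plain text is discarded after the check.
            text = lzma.decompress(blob).decode("utf-8")
            if needle in text.lower():
                yield page_id

    def compressed_size(self):
        """Total bytes held in memory for page bodies."""
        return sum(len(blob) for blob in self._pages.values())
```

The trade-off is CPU for RAM: every query pays the decompression cost for every page, but nothing touches the disk, and only one page's worth of plain text exists at a time.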
I always wanted to try reverse engineering some of the watermarking systems so we could modify the watermarks on certain material, subtly leak it, and effectively frame adversaries while protecting our own spies in the process. Fortunately or unfortunately I never got around to that.