Show HN: Efficiently searching compressed text files (github.com/mattgodbolt)
125 points by mattgodbolt on May 2, 2015 | 35 comments


The article's method is nice: it lets you decompress portions of the file. I thought this would be about the FM-index. The FM-index allows you to search what is essentially a bzip'd archive for a substring far more efficiently than you would be able to search the uncompressed string. Pretty mind-blowing if you think about it.

It works because the Burrows-Wheeler transform used in compression is closely related to the suffix array, a data structure used as a substring index. By adding a small amount of additional metadata to the compressed Burrows-Wheeler-transformed string, you can search for any substring. Its practical performance is slower than a suffix array's, but the memory usage can be 100x lower.
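
If you want to see the mechanics, here's a toy Python sketch of backward search (it builds the BWT naively and does linear-scan rank counting where a real FM-index would use wavelet trees or similar; just for intuition, not for performance):

    # Build the BWT by sorting rotations; '$' is a sentinel smaller
    # than any other character.
    def bwt(s):
        s += "$"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    # Count occurrences of `pattern` by backward search over the BWT,
    # never reconstructing the original text.
    def fm_count(b, pattern):
        C, total = {}, 0      # C[c] = # characters in text smaller than c
        for ch in sorted(set(b)):
            C[ch] = total
            total += b.count(ch)
        occ = lambda ch, i: b[:i].count(ch)   # rank, the slow-but-simple way
        lo, hi = 0, len(b)
        for ch in reversed(pattern):
            if ch not in C:
                return 0
            lo, hi = C[ch] + occ(ch, lo), C[ch] + occ(ch, hi)
            if lo >= hi:
                return 0
        return hi - lo

    b = bwt("abracadabra")
    print(b, fm_count(b, "abra"))   # ard$rcaaaabb 2

The count falls straight out of the index; actually locating the matches is where the "small amount of additional metadata" (sampled suffix-array positions) comes in.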


I have been wanting this for a long time, although my primary use case would be indexing a particular field.

It is nice that you were able to make it work with ordinary gzip files. In bioinformatics, one approach to this problem has been to create a gzip-backwards-compatible format (i.e., it can be read by zcat but isn't written by gzip) called BGZF [1], which acts like ordinary gzip except that the data is split into blocks of bounded size, each of which records its own compressed length in its gzip header, which eases indexing and partial decompression.
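
Here's a rough Python sketch of why that eases indexing: each block's gzip extra field carries a BSIZE value (the total block length minus one, per my reading of the SAM/BAM spec), so you can hop from block to block without inflating anything:

    import struct

    def bgzf_block_offsets(path):
        # The fixed 18-byte BGZF header ends with the 2-byte BSIZE
        # subfield value at offset 16 (BSIZE = total block length - 1).
        offsets, pos = [], 0
        with open(path, "rb") as f:
            while True:
                header = f.read(18)
                if len(header) < 18:
                    break
                offsets.append(pos)
                pos += struct.unpack_from("<H", header, 16)[0] + 1
                f.seek(pos)
        return offsets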

I would be interested to know how you solved the problem of finding the right block and offset when each compressed block can be an arbitrary size as it is in ordinary gzip format.

[1] http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-b...


Fantastic! I had hoped others would have the same problem as me. I didn't have the luxury of being able to change my many input gzip files (unlike BGZF).

The trick is to store the gzip internal state as a checkpoint every few hundred MB. Then, to decode data at a particular offset, you find the last checkpoint that's before your offset, reinitialize the decompression state from it, and scan forward until you reach the offset you need.

The zindex file that's generated contains both these checkpoints and the index that tells you where to scan to.
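
If you want to play with the idea, here's a rough Python analogue (zindex itself is C++ and stores the inflate state directly; zlib's Decompress.copy() stands in for that here, and the helper names and spacing are made up for illustration):

    import zlib

    CHECKPOINT_EVERY = 64 * 1024 * 1024   # illustrative; zindex's spacing differs

    def build_checkpoints(path):
        # One sequential pass, snapshotting the decompressor as we go.
        d = zlib.decompressobj(wbits=31)   # wbits=31 accepts the gzip wrapper
        checkpoints = [(0, 0, d.copy())]   # (uncompressed off, compressed off, state)
        out_off = in_off = 0
        with open(path, "rb") as f:
            while chunk := f.read(65536):
                out_off += len(d.decompress(chunk))
                in_off += len(chunk)
                if out_off >= checkpoints[-1][0] + CHECKPOINT_EVERY:
                    checkpoints.append((out_off, in_off, d.copy()))
        return checkpoints

    def read_at(path, offset, length, checkpoints):
        # Resume from the last checkpoint at or before `offset`, inflate forward.
        out_off, in_off, state = max(c for c in checkpoints if c[0] <= offset)
        d, buf = state.copy(), b""
        with open(path, "rb") as f:
            f.seek(in_off)
            while len(buf) < (offset - out_off) + length:
                chunk = f.read(65536)
                if not chunk:
                    break
                buf += d.decompress(chunk)
        return buf[offset - out_off : offset - out_off + length]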

There's an open issue I'd like to resolve to add support for indexing a particular field: would arguments like those of the "cut" UNIX utility work for your use case?


Smart. I will see for myself in a minute how much space such an approach takes; I would imagine the saved decompressor state gets nontrivially large on big files.

Well, my exact use case, which I imagine I'm not alone in, is indexing a tab-delimited file with a single-line header containing the column names. So indexing by column index would be adequate, but it would present two very minor problems: 1) I'd have to figure out a way to exclude the header row from output (or alternately, to always output it), and 2) it would be slightly more convenient to refer to the columns to be indexed by name rather than position, as some of these have thousands of columns.

Can you have multiple indexes on the same GZIP file? And if so, does it create multiple .zindex files or re-use the same one?

I don't know if SQLite supports custom data retrieval backends like Postgres, but what would be REALLY interesting is to write one such that the indexing information is in SQLite, and you can use SQL to query row data directly from the GZIP index (I know this is beyond the scope of your project, I'm just thinking out loud now).


Short reply as I'm on a phone, but thanks! I can add an option to pick the nth field, and to skip the first X lines of the input; I think this'd cover your use case!

The code and file format support multiple indices in the same zindex file but I haven't got a decent way to pass command line arguments to define them yet!

Watch this space: also feel free to ping me on email for more comments!


As of just now, zindex and zq support field-based indexing. Hope this helps your use case!


Yep, this is a neat idea. I first saw it as part of Patrick Collison's wikipedia-iphone project (2007-ish), which could pull Wikipedia articles out of a compressed bz2 archive, by indexing from article name into bzip2 block and then only decompressing the needed blocks for the article you asked for.

We adapted his code at One Laptop Per Child to preload Wikipedia snapshots on every laptop:

http://blog.printf.net/articles/2008/06/02/wikipedia-on-xo/

http://blog.printf.net/articles/2010/05/09/peru-olpc-and-wik...


Something people don't know about gzip is that you can write a file as multiple concatenated compressed members, and gzip will decompress them one after another as a single stream. Thus you can get the benefit of being restartable the way bzip2 is, but it's optional and flexible, so you can break only on record boundaries.
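
A quick Python demonstration (the gzip module, like zcat, handles concatenated members transparently):

    import gzip

    records = [b"record one\n", b"record two\n", b"record three\n"]
    with open("multi.gz", "wb") as f:
        for rec in records:
            f.write(gzip.compress(rec))   # one complete gzip member per record

    with gzip.open("multi.gz", "rb") as f:
        assert f.read() == b"".join(records)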


Indeed! Restarting at record boundaries is one option. Then an external index can point to all the ragged restart offsets, or some applications may be able to just scan for the distinct byte-patterns of likely restart offsets.

Restarting at every N-byte boundary – for example, every 128K – is another option, though that requires a little more careful thought about padding.

Another way to skip the need for an external index using any of these techniques, but still know the raw-data line/byte-offset when jumping to a deep position, is to encode those tallies (either chunk-by-chunk or from-start) as GZIP header "extra field" extensions. (Here's an example extension that just records per-chunk compressed- and uncompressed- byte-lengths: http://archive-access.sourceforge.net/warc/warc_file_format-...)
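
For anyone wanting to experiment, here's a hand-rolled sketch of a member carrying such lengths, built per RFC 1952 since Python's gzip module won't write extra fields (the b"sl" subfield id is invented for illustration; the WARC spec linked above defines its own):

    import gzip, struct, zlib

    def member_with_lengths(chunk):
        deflated = zlib.compress(chunk, 9)[2:-4]   # strip zlib wrapper -> raw deflate
        # Extra subfield: 2-byte id, 2-byte length, then the two tallies.
        sub = struct.pack("<2sHII", b"sl", 8, len(deflated), len(chunk))
        header = (b"\x1f\x8b\x08\x04"              # magic, deflate, FEXTRA set
                  + struct.pack("<IBBH", 0, 0, 255, len(sub)))  # mtime, xfl, os, xlen
        trailer = struct.pack("<II", zlib.crc32(chunk), len(chunk) & 0xFFFFFFFF)
        return header + sub + deflated + trailer

    blob = member_with_lengths(b"hello world\n")
    assert gzip.decompress(blob) == b"hello world\n"   # readers just skip the field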


Also relevant to how gzip's --rsyncable flag works (although in Debian it was broken for most of 2013 [#708423]; the `pigz` version works well).


Is it possible to write a gzip file like that from the command line?


Take a look at "Succinct: Enabling Queries on Compressed Data", NSDI 2015 [1].

[1] https://www.usenix.org/conference/nsdi15/technical-sessions/...



Why not store the log files in an FM-index? (https://en.wikipedia.org/wiki/FM-index)


Interesting point, thanks for the link!

For my use case I wanted to put stuff together without writing too much novel code; zindex uses a SQLite file to store indices and checkpoints. I'm hoping to add JSON support to zq/zindex, which means a general text search isn't that useful.

Additionally the original log files I run against in my use case are generated upstream of me and are many many multi-gigabyte gzipped log files. I didn't want to add too much extra storage for my logs: the "key" I index on is a tiny part of my log, so the storage space for the index is considerably less than the original text (even accounting for the gzip checkpoints).


I only wish that the matching could be 'fuzzy' in easily defined ways.

My use case is MAC addresses. So, if I search for 'dead.beef.cafe', this would be perfect if it could know that I wanted to see entries indexed by 'DE:AD:BE:EF:CA:FE' also.

Is that possible? This tool is so amazing. If it did this, it would be able to hugely improve my daily workflow.

edit: Actually, it is going to be able to hugely improve my workflow anyway, because I often want to see data about IP version 4 addresses, too, and it handles those fine. But I'd still like to be able to use it for MACs.


Great idea. zq uses the SQLite index built by zindex; so in theory I can modify it to use a 'WHERE index LIKE xxx' instead of 'WHERE index=xxx' -- then SQLish wildcards would be supported.
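
Something along these lines, roughly (the table and column names here are invented for illustration rather than being zindex's actual schema):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idx (key TEXT, line_offset INTEGER)")
    db.executemany("INSERT INTO idx VALUES (?, ?)",
                   [("deadbeefcafe", 100), ("deadbeef0000", 200)])

    # Exact match today vs. SQL wildcards with LIKE:
    print(db.execute("SELECT line_offset FROM idx WHERE key = ?",
                     ("deadbeefcafe",)).fetchall())    # [(100,)]
    print(db.execute("SELECT line_offset FROM idx WHERE key LIKE ?",
                     ("deadbeef%",)).fetchall())       # [(100,), (200,)]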

I've raised an issue to capture this use case; feel free to chime in there! :) https://github.com/mattgodbolt/zindex/issues/12


[deleted]


edit: The deleted comment suggested to store the index in all lowercase, with no separators, like 'deadbeefcafe,' and then apply the same transform to my query before running it.

I like that idea! But the way to provide the index criterion seems to be via regex, and I don't think I can use just a regex to apply transformations to the index before storing it.

If I were better with C-type languages, I would likely submit a pull request.


Sorry, deleted after realizing that zindex just wanted a regex match. If we could make zindex allow specifying a substitution regex when deciding what string to index under, that would work; your regex could be one that matches a MAC address and converts it to lowercase/strips punctuation, and then on the query side you could do the same with e.g. sed before piping to zq.
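
Something like this on both sides, in Python terms (the regex and normalize_mac are just illustrative, not anything zindex ships):

    import re

    MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}[:.-]?){5}[0-9A-Fa-f]{2}")

    def normalize_mac(s):
        # Lowercase and strip separators: 'DE:AD:BE:EF:CA:FE' -> 'deadbeefcafe'
        return re.sub(r"[^0-9a-f]", "", s.lower())

    line = "link up on DE:AD:BE:EF:CA:FE port 3"
    print(normalize_mac(MAC_RE.search(line).group(0)))   # index under this key...
    print(normalize_mac("dead.beef.cafe"))               # ...and query the same way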


About how long does it take to create the index, maybe compared to a zgrep? Could it be parallelized with something like `parallel --pipe`?

I haven't tried yet, but this seems like a great tool.


Thanks! For my logfiles it runs through them at around 3MB/sec (that's 3MB of compressed file per second). Most of it is probably I/O; my giant logs are on NFS.

I run the indexer in parallel on multiple log files: in general there's no way to parallelize the indexing of a single file as decompressing a gzipped file requires all the bytes up to that point. Unless you already have a gzip checkpoint around somewhere, you need to sequentially scan: this is part of what zindex does to allow quick random access.

Thanks for the feedback :)


See "succinct data structures" and "wavelet trees" for a similar / more general idea.

news.ycombinator.com/item?id=7079427

I also must thank Edward Kmett for introducing me to these.


It suddenly strikes me that with a compressed file you don't have to decompress and then grep: you can grep the Huffman table and know exactly where the occurrences are.


The table doesn't contain the locations in the file, only a mapping of bit sequences to original sequences. LZ77 also doesn't contain information on where sequences are.

However, you could construct the compressed version of your search string and search for it (or in the case of LZ77, generate it as you continue to search, decompressing only small portions as needed).

Either way, you still have to go through the file at least once.

EDIT: Actually, if your search string falls inside a compression block, it may be more difficult without uncompressing (and then discarding).


The search patterns don't usually align on byte boundaries.


I understand that the log files are generated upstream of you, but how much does compressing them actually win?

I guess for most of us it's better to just search plain text files, if possible?


As a rule of thumb, English in ASCII compresses 8:1. I would guess that log files compress even better because they tend to be repetitive.

Edit: well, the best case is 8:1 if you put a lot of time into it. Gzip with default settings is more like 4:1, which is still significant.


Yes, log files definitely do better. Doing a few spot checks, my log files compress anywhere from 10:1 to 50:1 with default gzip settings, depending on the kind of log.


How can you optimize for better compression ratios with gzip?


gzip's default is -6, but -9 gives better compression. If you can sort your data or otherwise organize it so that similar data is grouped together, it has a better chance of finding the redundancy. Some implementations get better compression than the GNU implementation. The best one I know of is from the 7zip project.

Other compression formats can give better ratios, if that's an option for your project. bzip2, xz, 7z.
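
A quick way to see both effects with Python's zlib (the same deflate as gzip; the log data here is synthetic):

    import random, zlib

    random.seed(1)
    lines = ["host%d GET /api/item/%d 200\n"
             % (random.randrange(20), random.randrange(1000))
             for _ in range(50000)]
    data = "".join(lines).encode()

    for level in (6, 9):   # gzip's default -6 vs. -9
        print("level %d: %.1f:1" % (level, len(data) / len(zlib.compress(data, level))))

    # Grouping similar lines gives the 32KB deflate window more nearby
    # redundancy to exploit:
    grouped = "".join(sorted(lines)).encode()
    print("sorted, -9: %.1f:1" % (len(grouped) / len(zlib.compress(grouped, 9))))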


Columnar databases all do this. It's handy having a library to play with.


Isn't there a company doing this called Pied Piper...


Check out the technology section at http://www.piedpiper.com

Btw. welcome to Silicon Valley.


DTF


Strigi desktop search (2006) had a very powerful implementation (http://en.wikipedia.org/wiki/Strigi). It was possible to search through a compressed archive inside a compressed archive in a memory- and space-efficient way ("streaming"). Strigi uses the Jstream C++ library, which allows for deep indexing of files. See these slides from 2006: http://www.vandenoever.info/software/strigi/akademy2006.pdf . CLucene (the C++ port of Java Lucene) also uses the Jstream library (http://sourceforge.net/projects/clucene/). Jstream works similarly to Java's "InputStream".

Given the discontinued status of Strigi and CLucene, it would be great if someone maintained the Jstream C++ library, or took its ideas and implemented them in zindex.





