If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).
No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2 byte type during compute. This is backed up by checking the size of from the download which comes out as 171,363,973 bytes for the model file.
> and was probably trained on the test data
This is likely a safe assumption (enwik8 is the default training set for RWKV, and no mention of using other data was given); however:
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
The ts_zip+enwik9 size comes out to less than the 197,368,568 bytes for xz+enwik9 listed in the Large Text Compression Benchmark, despite the large model file. Coming in 20,929,618 total bytes smaller while keeping good runtime speed is not bad, and puts it decently high on the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry, at 107,261,318 total bytes in the table, is nncp by the same author (neural-net based but not an LLM), so it makes sense to keep an open mind as to why they thought this was worth publishing.
I wouldn't be surprised if my math were wrong, but I can't quite follow yours: ts_zip (171 MB, you say) + llm-enwik9 (135 MB) = 306 MB is still larger than xz (0.3 MB) + xz-enwik9 (213 MB) ≈ 213 MB.
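Spelling that comparison out with the figures quoted in this thread (the round-MB values are approximations, not measured byte counts):

```python
# Total = decompressor + compressed data, per the Large Text Compression
# Benchmark convention. Round figures are approximations from the thread.
ts_zip_model = 171_363_973   # model file size quoted above
llm_enwik9   = 135_000_000   # ~135 MB ts_zip output per the page
xz_binary    =     300_000   # ~0.3 MB for the xz executable (rough figure)
xz_enwik9    = 213_000_000   # ~213 MB xz output

ts_zip_total = ts_zip_model + llm_enwik9   # ~306 MB
xz_total     = xz_binary + xz_enwik9       # ~213 MB
print(ts_zip_total - xz_total)             # ts_zip side larger under these figures
```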
I done did went and copied the enwik8 value for ts_zip when doing that compare. Good catch!
I guess that leaves the question of how well the LLM's predictions work for data we're certain wasn't in the training set. If it's truly just the prebuilt RWKV, then it was only trained on enwik8 and enwik9 is already a generalization, but there's nothing really guaranteeing that assumption. On the other hand... I can't think of GB-class open datasets of plain English to test with that aren't already in use on the page.
Of the two, nncp uses transformers but isn't an LLM, while ts_zip doesn't use transformers but is an LLM. Remember, LLM just means large language model; it doesn't make any assumptions about how the model is built. Similarly, transformers just relate tokens according to attention; they don't assume those tokens represent natural language.
I.e., anything you can tokenize can be wrangled using a transformer, not just language. Thankfully, the same author also has a handy example of this: transformer-based audio compression at https://bellard.org/tsac/
If you’re compressing 100 or 100k such datasets, presuming the model is not custom-tuned for this corpus, then wouldn’t you still save much more than you spend?
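A quick back-of-envelope sketch of that amortization. The model size is the figure from this thread; the per-file sizes are hypothetical enwik9-like numbers, not measurements:

```python
# One shared model amortized over n files vs. a model-free compressor.
MODEL_SIZE = 171_000_000    # shared model, paid once (figure from the thread)
LLM_PER_FILE = 135_000_000  # hypothetical LLM-compressed size per dataset
XZ_PER_FILE = 213_000_000   # hypothetical xz-compressed size per dataset

def totals(n_files: int) -> tuple[int, int]:
    """Total bytes shipped for n_files under each scheme: (llm, xz)."""
    return MODEL_SIZE + n_files * LLM_PER_FILE, n_files * XZ_PER_FILE

# The one-time model cost is overtaken once cumulative per-file savings exceed it.
for n in (1, 3, 100):
    llm_total, xz_total = totals(n)
    print(n, llm_total < xz_total)
```

Under these made-up numbers the LLM side loses on a single file but wins once a few files share the model.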
I'm not saying the result is completely useless, I am comparing it to the age-old technique of using a dictionary. Does this new LLM-powered technique improve upon the old dictionary technique?
Dictionaries also don't require a GPU or this amount of RAM.
Where I assume LLMs would shine is lossy compression.
Ah ok, I think we made different assumptions about whether the model is specific to the particular dataset, so that each dataset would need a new model. A dictionary is specific to the particular dataset being compressed, right? I was thinking the LLM would be a general-purpose text compression model.
AIUI, a dictionary is built during compression to capture the statistics of a particular dataset and belongs to that specific dataset only. For example, it could be a ranking of the 10 most frequent symbols in the compressed file. That will be different for every input file.
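As a toy illustration of that per-file idea (the input string is made up):

```python
from collections import Counter

# Per-file "dictionary": rank the most frequent symbols in this input.
data = b"abracadabra abracadabra"
top10 = [sym for sym, _ in Counter(data).most_common(10)]
print(top10)  # a different input would yield a different ranking
```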
That could be different for every input file, but it doesn't have to be. It could also be a fixed dictionary. For example, ZLIB allows for a user-defined dictionary [1].
In this case, I'd consider the LLM to be a fixed dictionary of sorts: a very large fixed dictionary with probabilistic return values.
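For concreteness, here's a minimal round-trip with a preset dictionary using Python's zlib bindings (the dictionary content is arbitrary; both sides must agree on it out of band):

```python
import zlib

# A fixed dictionary shared by compressor and decompressor ahead of time.
zdict = b"the quick brown fox jumps over the lazy dog"
data = b"the quick brown fox " * 50

co = zlib.compressobj(zdict=zdict)
compressed = co.compress(data) + co.flush()

do = zlib.decompressobj(zdict=zdict)
assert do.decompress(compressed) == data
# Decompressing without the same dictionary fails with zlib.error.
```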
Admittedly, I don’t think it is common, but I think there was a project a few years ago (Google?) that tried to compress HTML using at least a partially fixed dictionary.
Nowadays, though, it’s apparently still something being tried. Chrome now supports shared dictionaries for Zstd and Brotli. One idea being, you would likely benefit from having a shared dictionary used to decompress multiple artifacts for a site. But you may not want everything compressed all together, so this way you get the compression benefit while still splitting those artifacts into different files.
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
Please let me know if I misunderstand.