If I read this correctly, the largest test reported on this page is the "enwik9" dataset, which compresses to 213 MB with xz and only 135 MB with this method, a 78 MB difference... using a model that is 340 MB (and was probably trained on the test data).
No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
"The model is quantized to 8 bits per parameter and evaluated using BF16 floating point numbers" means the model is stored as 1 byte per parameter even though it's using a 2 byte type during compute. This is backed up by checking the size of from the download which comes out as 171,363,973 bytes for the model file.
> and was probably trained on the test data
This is likely a safe assumption (enwik8 is the default training set for RWKV, and no mention of using other data was given); however:
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
The ts_zip+enwik9 size comes out to less than the 197,368,568 bytes for xz+enwik9 listed in the Large Text Compression Benchmark, despite the large model file. Coming in 20,929,618 total bytes smaller while keeping good runtime speed is not bad, and puts it decently high on the list (even when sorted by total size) despite the difference in approach. Keep in mind the top entry, at 107,261,318 total bytes in the table, is nncp by the same author (neural-net based but not an LLM), so it makes sense to keep an open mind as to why they thought this was worth publishing.
I wouldn't be surprised if my math were wrong, but I can't quite follow yours: ts_zip (171 MB, you say) + llm-enwik9 (135 MB) = 306 MB is still larger than xz (0.3 MB) + xz-enwik9 (213 MB) ≈ 213 MB.
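Spelling that comparison out with the figures quoted in this thread (the round-MB values are approximations, not measured byte counts):

```python
# Total = decompressor + compressed data, per the Large Text Compression
# Benchmark convention. Round figures are approximations from the thread.
ts_zip_model = 171_363_973   # model file size quoted above
llm_enwik9   = 135_000_000   # ~135 MB ts_zip output per the page
xz_binary    =     300_000   # ~0.3 MB for the xz executable (rough figure)
xz_enwik9    = 213_000_000   # ~213 MB xz output

ts_zip_total = ts_zip_model + llm_enwik9   # ~306 MB
xz_total     = xz_binary + xz_enwik9       # ~213 MB
print(ts_zip_total - xz_total)             # ts_zip side larger under these figures
```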
I done did went and copied the enwik8 value for ts_zip when doing that compare. Good catch!
I guess that leaves the question of how well the LLM's predictions work for data we're certain wasn't in the training set. If it's truly just the prebuilt RWKV, then it was only trained on enwik8 and enwik9 is already a generalization, but there's nothing really guaranteeing that assumption. On the other hand... I can't think of GB-class open datasets of plain English to test with that aren't already in use on the page.
Of the two, nncp uses transformers but isn't an LLM, while ts_zip doesn't use transformers but is an LLM. Remember, LLM just means large language model; it doesn't make any assumptions about how the model is built. Similarly, transformers just relate tokens according to attention; they don't assume those tokens represent natural language.
I.e., anything you can tokenize can be wrangled using a transformer, not just language. Thankfully, the same author also has a handy example of this: transformer-based audio compression at https://bellard.org/tsac/
If you’re compressing 100 or 100k such datasets, presuming the model is not custom-tuned for this corpus, then wouldn’t you still save much more than you spend?
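A quick back-of-envelope sketch of that amortization. The model size is the figure from this thread; the per-file sizes are hypothetical enwik9-like numbers, not measurements:

```python
# One shared model amortized over n files vs. a model-free compressor.
MODEL_SIZE = 171_000_000    # shared model, paid once (figure from the thread)
LLM_PER_FILE = 135_000_000  # hypothetical LLM-compressed size per dataset
XZ_PER_FILE = 213_000_000   # hypothetical xz-compressed size per dataset

def totals(n_files: int) -> tuple[int, int]:
    """Total bytes shipped for n_files under each scheme: (llm, xz)."""
    return MODEL_SIZE + n_files * LLM_PER_FILE, n_files * XZ_PER_FILE

# The one-time model cost is overtaken once cumulative per-file savings exceed it.
for n in (1, 3, 100):
    llm_total, xz_total = totals(n)
    print(n, llm_total < xz_total)
```

Under these made-up numbers the LLM side loses on a single file but wins once a few files share the model.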
I'm not saying the result is completely useless, I am comparing it to the age-old technique of using a dictionary. Does this new LLM-powered technique improve upon the old dictionary technique?
Dictionaries also don't require a GPU or this amount of RAM.
Where I assume LLMs would shine is lossy compression.
Ah ok, I think we made different assumptions about whether the model is specific to the particular dataset, so that each dataset would need a new model. A dictionary is specific to the particular dataset being compressed, right? I was thinking the LLM would be a general-purpose text compression model.
AIUI, a dictionary is built during compression to capture the statistics of a particular dataset and belongs to that specific dataset only. For example, it could be a ranking of the 10 most frequent symbols in the compressed file. That will be different for every input file.
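As a toy illustration of that per-file idea (the input string is made up):

```python
from collections import Counter

# Per-file "dictionary": rank the most frequent symbols in this input.
data = b"abracadabra abracadabra"
top10 = [sym for sym, _ in Counter(data).most_common(10)]
print(top10)  # a different input would yield a different ranking
```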
That could be different for every input file, but it doesn't have to be. It could also be a fixed dictionary. For example, ZLIB allows for a user-defined dictionary [1].
In this case, I'd consider the LLM to be a fixed dictionary of sorts: a very large fixed dictionary with probabilistic return values.
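For concreteness, here's a minimal round-trip with a preset dictionary using Python's zlib bindings (the dictionary content is arbitrary; both sides must agree on it out of band):

```python
import zlib

# A fixed dictionary shared by compressor and decompressor ahead of time.
zdict = b"the quick brown fox jumps over the lazy dog"
data = b"the quick brown fox " * 50

co = zlib.compressobj(zdict=zdict)
compressed = co.compress(data) + co.flush()

do = zlib.decompressobj(zdict=zdict)
assert do.decompress(compressed) == data
# Decompressing without the same dictionary fails with zlib.error.
```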
Admittedly, I don’t think it is common, but I think there was a project a few years ago (Google?) that tried to compress HTML using at least a partially fixed dictionary.
Nowadays, though, it’s apparently still something being tried. Chrome now supports shared dictionaries for Zstd and Brotli. One idea being, you would likely benefit from having a shared dictionary used to decompress multiple artifacts for a site. But you may not want everything compressed all together, so this way you get the compression benefit while still splitting those artifacts into different files.
> No one would be impressed with saving 78 MB on compression using a 340 MB dictionary so I am not sure why this is good?
Please let me know if I misunderstand.