
FYI "run-length encoded" means compressing repetitions. "123 repeats of the character 'x'" would be RLE. You seem to be talking about keeping a tree-of-strings instead of tree-of-characters, to decrease the overhead.

https://en.wikipedia.org/wiki/Run-length_encoding



It’s not “123 repeats of the character x”, and it’s not a tree of characters either. The sibling comment is right.

The data is crdt metadata. If you type a series of characters, each character has some fields. In my case, this is an id (autoincrementing integer), and origin-left and origin-right (the id of the items immediately to the left and right when the character was typed).

If a user types a continuous run of 100 characters, we don’t need to actually store all 100 IDs - since they’re just, say, the numbers 1000-1099 in order. The origin-right for all items will be the same, and we don’t need to store the origin-left for each one. The first item in the run has an arbitrary left origin that can’t be derived from the run (say, 600). But each of the rest has an origin-left equal to the id of the previous item - ie, 1000, 1001, 1002, and so on up to 1098.

Naively you’d need 300 integers to store the metadata for 100 contiguous items. But if you’re clever, a run of 100 items only needs to store a start id, length (or end id), the origin-left of the first item in the run and the common origin-right for all the items in the run. That’s just 4 integers, which I store in a single fixed sized entry in the btree. Because of how humans actually type, this compression makes a big difference in practice.
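A minimal sketch of what such a fixed-size run entry could look like - field and type names here are my own guesses, not the actual implementation:

```rust
// One fixed-size entry describing a contiguous run of typed characters.
// Names are illustrative, not from the real codebase.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Run {
    id: u32,           // id of the first item in the run
    len: u32,          // number of items in the run
    origin_left: u32,  // origin-left of the *first* item only
    origin_right: u32, // shared origin-right of every item in the run
}

impl Run {
    // Recover the full (id, origin_left, origin_right) metadata for the
    // i-th item in the run - nothing per-item is actually stored.
    fn item(&self, i: u32) -> (u32, u32, u32) {
        assert!(i < self.len);
        let id = self.id + i;
        // Every item after the first has origin-left = previous item's id.
        let left = if i == 0 { self.origin_left } else { id - 1 };
        (id, left, self.origin_right)
    }
}

fn main() {
    // 100 items typed in sequence: ids 1000..=1099, stored as 4 integers.
    let run = Run { id: 1000, len: 100, origin_left: 600, origin_right: 2000 };
    assert_eq!(run.item(0), (1000, 600, 2000));
    assert_eq!(run.item(1), (1001, 1000, 2000));
    assert_eq!(run.item(99), (1099, 1098, 2000));
}
```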

In code, I have a pair of Rust traits, Splitable and Mergable, which define functions like len(), can_append(), append(), truncate(), split() and so on. Then I have a generic order statistic tree implementation which stores Splitable & Mergable items.
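The shape of that interface might look roughly like this (a hedged sketch - the trait name, method signatures, and the IdRun type are mine, not the author's):

```rust
// Sketch of the Splitable/Mergable idea: a generic tree stores items
// implementing this trait, and splits/joins them without knowing what
// the items mean. Names here are illustrative.
trait RunItem: Sized {
    fn len(&self) -> usize;
    fn can_append(&self, other: &Self) -> bool;
    fn append(&mut self, other: Self);
    fn truncate(&mut self, at: usize) -> Self; // splits off and returns the tail
}

#[derive(Debug, PartialEq, Clone, Copy)]
struct IdRun { start: u32, len: u32 }

impl RunItem for IdRun {
    fn len(&self) -> usize { self.len as usize }
    fn can_append(&self, other: &Self) -> bool {
        // Mergable only when the ids are contiguous.
        self.start + self.len == other.start
    }
    fn append(&mut self, other: Self) { self.len += other.len; }
    fn truncate(&mut self, at: usize) -> Self {
        let tail = IdRun { start: self.start + at as u32, len: self.len - at as u32 };
        self.len = at as u32;
        tail
    }
}

fn main() {
    let mut a = IdRun { start: 1000, len: 50 };
    let b = IdRun { start: 1050, len: 50 };
    assert!(a.can_append(&b));
    a.append(b);
    assert_eq!(a.len(), 100);
    let tail = a.truncate(60);
    assert_eq!((a.len, tail.start, tail.len), (60, 1060, 40));
}
```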

Run-length encoding is the best term I’ve seen for this kind of packing. It’s definitely not a tree of strings - the document text itself is stored in a completely separate data structure. (In my case, a skip list).

(Credit for the run length encoding goes to Kevin Jahns, author of Yjs.)


I would recommend not using a well-established term for something that doesn't actually match the well-established term.


I’m encoding a set of runs. Each run has a value component and a length. The only difference from traditional RLE is that, when expanded, the values are computed as base + offset rather than repeated. Eg, the IDs expand to 1000, 1001, 1002, … instead of just repeating: 1000, 1000, 1000, …
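Concretely, the two expansion rules side by side (a sketch with hypothetical function names):

```rust
// Plain RLE expands a (value, length) pair by repetition; the scheme
// described above expands by counting up from a base instead.
fn expand_rle(value: u32, len: u32) -> Vec<u32> {
    std::iter::repeat(value).take(len as usize).collect()
}

fn expand_sequential(base: u32, len: u32) -> Vec<u32> {
    (base..base + len).collect()
}

fn main() {
    assert_eq!(expand_rle(1000, 3), vec![1000, 1000, 1000]);
    assert_eq!(expand_sequential(1000, 3), vec![1000, 1001, 1002]);
}
```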

What would you call that?


I can't tell from your writing if this is really a compression of an output format or an inherent part of your data structure. What determines when the repeating or sequential IDs are skipped? Is it part of the shape of the data structure, or is it a pure byte manipulation at encode time?

If it's purely in the encoder, "The encoded runs omit repetitive or sequential values of IDs.", and describe how the decoder knows what was done (e.g. naive RLE of "hello" might have to say 1h1e2l1o, repeating lengths everywhere).

If it's part of the data structure, say something like "Within a <chunk>, only origin-left of the first item and the common origin-right need to be stored, as the individual item origin-left and origin-right values can be derived from those."


> I can't tell from your writing if this is really a compression of an output format or an inherent part of your data structure.

It’s both. It’s part of the internal in-memory data structure, and part of the encoded output format.

At runtime, the data structure also needs to support split and join operations (when possible). For example, a user types a paragraph of text, then moves their cursor into the middle of the paragraph and types more text. Initially we create (and extend) a single run representing the paragraph as it’s typed. Inserting in the middle of that run splits it in two, with the new text between the halves. So we end up with 3 entries in our order statistic tree.
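That split-on-insert step could be sketched like this (types and names are hypothetical, not the author's):

```rust
// Sketch: inserting new text in the middle of an existing run splits it
// into (head, new run, tail) - three entries in the tree.
#[derive(Debug, PartialEq, Clone, Copy)]
struct Run { start_id: u32, len: u32 }

fn split_insert(existing: Run, at: u32, inserted: Run) -> [Run; 3] {
    let head = Run { start_id: existing.start_id, len: at };
    let tail = Run { start_id: existing.start_id + at, len: existing.len - at };
    [head, inserted, tail]
}

fn main() {
    // A paragraph typed as one run (ids 1000..=1099), then 10 characters
    // typed at offset 40 with fresh ids 2000..=2009.
    let para = Run { start_id: 1000, len: 100 };
    let fresh = Run { start_id: 2000, len: 10 };
    let [head, mid, tail] = split_insert(para, 40, fresh);
    assert_eq!(head, Run { start_id: 1000, len: 40 });
    assert_eq!(mid, Run { start_id: 2000, len: 10 });
    assert_eq!(tail, Run { start_id: 1040, len: 60 });
}
```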

I like your suggested text - but we’re talking about this in the context of order statistic trees. I have a generic order statistic tree implementation that splits and, when possible, joins the stored items. I’ve been calling that “internal run length encoding”, but you’re right that that’s not the common use of the “rle” term. Still, I’d like a term for it that doesn’t require explaining my actual data. My order statistic tree implementation doesn’t understand origin-left and origin-right values - those are an implementation detail of how my particular data type implements its splitting and joining operations.


That sounds like how the Zed editor's SumTree is parameterized with Summary, which is whatever means of summarizing information makes sense for that use.

https://zed.dev/blog/zed-decoded-rope-sumtree


Yes they’re quite similar. And I don’t think that’s a coincidence.

But they have some important differences too:

- Sumtree at its heart is a tree of characters, with a summary per node in the tree. I haven’t read that link in detail, but I assume summaries are derived from the stored text itself. I’m storing a tree of metadata entries, not text, and the metadata is not derived from the text. A string of 100 characters in the document may be summarised by a single metadata entry, or it may need up to 100 entries - one per character. Each leaf node in my tree stores up to 32 metadata entries. These entries are split when needed, and opportunistically joined. Entries describe characters which were at some point inserted. (We also need to keep the metadata around even after the text has been deleted.)

- My metadata entries also have IDs, and my data structure also needs to be able to look up any item in log(n) time via its ID.

The text itself is stored in a separate data structure. I assume any text editor which integrates a crdt will, like zed, have its own data structure for this purpose already.

You’re right that the data structures are similar, but the implementations are not interchangeable.


you're rle-compressing the first differences of the ids (x[n] - x[n-1]) rather than the ids themselves. though maybe only in the special case of a run of 1s


Yep, only in the case of runs of 1s - since that’s the predominant case when typing.

Runs with other difference values show up when typing in multi-cursor mode in vscode and other editors, but I haven’t optimised my data structure for that use case.


yeah, that seems quite reasonable

reversibly transforming a data sequence into a more easily compressible one of the same size (sometimes called 'whitening' or 'predicting') is a very general technique for data compression, and first differences are a very commonly used transformation here. others include the burrows–wheeler transform, paeth prediction, and linear predictive coding
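A sketch of why first differences work so well here - after the transform, a run of sequential ids becomes a run of identical 1s, which naive RLE handles trivially (function names are illustrative):

```rust
// Whitening via first differences: a sequential run becomes a constant
// run, which then compresses to a single (value, count) pair.
fn first_differences(xs: &[i64]) -> Vec<i64> {
    xs.windows(2).map(|w| w[1] - w[0]).collect()
}

fn rle(xs: &[i64]) -> Vec<(i64, usize)> {
    let mut out: Vec<(i64, usize)> = Vec::new();
    for &x in xs {
        match out.last_mut() {
            Some((v, n)) if *v == x => *n += 1,
            _ => out.push((x, 1)),
        }
    }
    out
}

fn main() {
    let ids = [1000, 1001, 1002, 1003, 1004];
    let diffs = first_differences(&ids);
    assert_eq!(diffs, vec![1, 1, 1, 1]);
    // The whole run compresses to a single (difference, count) pair.
    assert_eq!(rle(&diffs), vec![(1, 4)]);
}
```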


I didn't know that had a name! Thanks - time to do some reading.


It's almost like a RangeMap, of which there are at least 3 notable variations:

* The basic RangeMap, where a range of adjacent keys all map to the same value

* The computed RangeMap, where a range of adjacent keys all map to adjacent values (common for e.g. case-conversion of Unicode, can also come up for IP addresses)

* The DenseMap, where a range of adjacent keys map to arbitrary values, but use an array for memory optimization.

The first two actually can be made to work for string-like keys. The last is only possible for integer-like keys; use a carefully chosen trie design for string-like keys.
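The second variant - the computed RangeMap - might look something like this for integer keys (a sketch; the entry layout and the ASCII case-mapping example are mine):

```rust
// Sketch of a "computed RangeMap": each entry maps a range of adjacent
// keys to a range of adjacent values at a fixed offset, as in Unicode
// case-conversion tables.
struct RangeEntry { start: u32, len: u32, offset: i32 }

fn lookup(map: &[RangeEntry], key: u32) -> Option<u32> {
    for e in map {
        if key >= e.start && key < e.start + e.len {
            return Some((key as i32 + e.offset) as u32);
        }
    }
    None
}

fn main() {
    // ASCII uppercase -> lowercase: the 26 keys 'A'..='Z' map to key + 32.
    let map = [RangeEntry { start: 'A' as u32, len: 26, offset: 32 }];
    assert_eq!(lookup(&map, 'A' as u32), Some('a' as u32));
    assert_eq!(lookup(&map, 'Z' as u32), Some('z' as u32));
    assert_eq!(lookup(&map, '0' as u32), None);
}
```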



