
> A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.

I hope you’re saying that because of indexes? Otherwise you may want to revisit how compression works to fix your intuition. Text+compression will almost always be larger and slower than equivalent binary+compression, assuming the text and the binary represent the same contents. Why? Binary is less compressible as a percentage, but it starts off so much smaller in absolute terms that it usually ends up smaller in absolute terms too. A way to think about it is information theory: binary generally represents the data more compactly already, because the structure lives in the code. Compression works by replacing redundant structure with shorter encodings, and it works better when there’s a lot of redundancy. But while text has a lot of redundant structure, that’s actually bad for the compressor: it has to discover that structure and process more data to do it. And it’s using generic mathematical techniques to remove that structure, which are generically good but not as good as removing the structure by hand via a binary format.
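A toy illustration of that argument (nothing to do with the SQLite format specifically, just the same values encoded two ways): text compresses by a larger percentage, but the binary encoding started so much smaller that it wins in absolute terms.

```python
import gzip
import random
import struct

# The same 100k random 31-bit integers, once as decimal text lines and
# once packed as raw little-endian 4-byte values.
random.seed(0)
values = [random.getrandbits(31) for _ in range(100_000)]

as_text = "\n".join(str(v) for v in values).encode()   # roughly 1 MB
as_binary = struct.pack(f"<{len(values)}I", *values)   # exactly 400 KB

text_gz = gzip.compress(as_text)      # big percentage win...
binary_gz = gzip.compress(as_binary)  # ...barely shrinks, but already small

print(len(as_text), "->", len(text_gz))
print(len(as_binary), "->", len(binary_gz))
```

The text shrinks dramatically (the compressor spends its effort rediscovering that each line is just ten decimal digits), while the near-random binary hardly compresses at all, yet the binary output is still the smaller file.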

There’s some nuance here because the text represents slightly different things than the raw binary SQLite (how to restore the data in the db vs. the precise relationships + data structures for allowing insertion/retrieval). But I’d still expect the binary to end up smaller compressed for non-trivial databases.



Below I'm discussing compressed size rather than how "fast" it is to copy databases.

Yeah, there are indexes. And even without indexes there is an entire b-tree sitting above the data. So we're weighing the benefits of domain-dependent compression (the binary format) against dropping all of the derived data. I'm not sure how that will go, but let's try one.

Here is a SQLite file containing metadata for Apple's Photos application:

    767979520 May  1 07:28 Photos.sqlite
Doing a VACUUM INTO:

    719785984 May  1 08:56 photos.sqlite
gzip -k photos.sqlite (this took 20 seconds):

    303360460 May  1 08:56 photos.sqlite.gz
sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds):

    1277903237 May  1 09:01 photos.dump
gzip -k photos.dump (21 seconds):

    285086642 May  1 09:01 photos.dump.gz
About 6% smaller for the compressed dump vs the compressed binary (but there are a bunch of indexes in this one). For me, that small space saving wouldn't be worth the extra time spent doing the dump.
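The same comparison can be run in miniature from Python's stdlib (`VACUUM INTO` needs SQLite >= 3.27; the table contents and paths here are made up, just a stand-in for Photos.sqlite):

```python
import gzip
import os
import sqlite3
import tempfile

# Build a small throwaway database with one index, standing in for the real thing.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, path TEXT, taken REAL)")
src.executemany(
    "INSERT INTO photos (path, taken) VALUES (?, ?)",
    ((f"/library/IMG_{i:06d}.jpg", 1700000000.0 + i) for i in range(50_000)),
)
src.execute("CREATE INDEX idx_taken ON photos (taken)")
src.commit()

# Binary route: VACUUM INTO writes a compact copy, then compress that.
dst = os.path.join(tempfile.mkdtemp(), "photos_copy.sqlite")
src.execute(f"VACUUM INTO '{dst}'")
with open(dst, "rb") as f:
    binary = f.read()
binary_gz = gzip.compress(binary)

# Text route: the SQL dump (iterdump is Python's analogue of `.dump`), then compress.
dump = "\n".join(src.iterdump()).encode()
dump_gz = gzip.compress(dump)

print(f"binary {len(binary)} -> {len(binary_gz)}")
print(f"dump   {len(dump)} -> {len(dump_gz)}")
```

Which route wins on a toy database like this won't necessarily match the Photos.sqlite numbers above; the point is that the whole experiment is a few lines to reproduce on your own data.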

With indexes dropped and vacuumed, the compressed binary is 8% smaller than compressed text (despite btree overhead):

    566177792 May  1 09:09 photos_noindex.sqlite
    262067325 May  1 09:09 photos_noindex.sqlite.gz
About 13.5% smaller than compressed binary with indices. And one could re-add the indices on the other side.
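"Re-add the indices on the other side" is easy to script, since `sqlite_master` keeps the original `CREATE INDEX` DDL. A sketch (function names are mine; `sql IS NULL` filters out the implicit indexes SQLite creates for things like PRIMARY KEY, which can't and needn't be dropped):

```python
import os
import sqlite3
import tempfile

def strip_indexes(db_path):
    """Save user-created index DDL, then drop those indexes
    (e.g. before VACUUM INTO + compress + ship)."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND sql IS NOT NULL"
    ).fetchall()
    for name, _ in rows:
        con.execute(f'DROP INDEX "{name}"')
    con.commit()
    con.close()
    return [sql for _, sql in rows]

def rebuild_indexes(db_path, ddl):
    """Replay the saved CREATE INDEX statements on the receiving side."""
    con = sqlite3.connect(db_path)
    for stmt in ddl:
        con.execute(stmt)
    con.commit()
    con.close()

# Demo on a throwaway database with one user-created index.
path = os.path.join(tempfile.mkdtemp(), "t.sqlite")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.execute("CREATE INDEX idx_a ON t (a)")
con.commit()
con.close()

saved = strip_indexes(path)    # ship the slimmer file, keep `saved` alongside it
rebuild_indexes(path, saved)   # replay on the receiving side
```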


Yup, these results are pretty consistent with what I'd expect (& why I noted the impact of indices), because even string data carries a lot of superfluous information when expressed as SQL statements ("INSERT INTO foo ...") - I'd expect all of that to exceed any bookkeeping within the b-tree. And non-string values like blobs or numbers are stored more efficiently in the binary format than in the dump, which text-encodes them (even hex for blobs), blowing things up further.
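The blob blow-up is easy to see concretely: a dump renders a BLOB as an X'...' hex literal, two characters per byte, so it doubles before compression even enters the picture (Python's `iterdump` mirrors the CLI's `.dump` here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (data BLOB)")
blob = bytes(range(256)) * 4  # 1024 bytes of binary data
con.execute("INSERT INTO t VALUES (?)", (blob,))
con.commit()

dump = "\n".join(con.iterdump())

# The blob appears in the dump as a hex literal: X'000102...'
start = dump.index("X'") + 2
end = dump.index("'", start)
print(end - start)  # 2048: the 1024-byte blob doubled, before quoting overhead
```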


Some more anecdata - from this it looks like you could `VACUUM INTO` + `zstd --long -9` using 19.1s and get 109% of the size you'd get from `.dump` + `zstd --long -12` using 59.3s. Saves 40.2s at the cost of 76M. YMMV, obvs.

sqlite3 3.49.1, zstd 64bit 1.5.7, gzip (Apple) 457.120.3

Original file (3.3G)

    3264290816 Photos.sqlite
VACUUM INTO (10.3s, 3.1G, 94.3%)

    3078881280 test.sqlite
gzip -k (76s, 1.1G, 33.1%)

    1080119337 test.sqlite.gz
zstd --long (3.2s, 987M, 30.2%)

     986252298 test.sqlite.zst
zstd --long -9 (8.8s, 903M, 27.6%)

     902282663 test.sqlite.9.zst
zstd --long -12 (21.5s, 885M, 27.1%)

     884863443 test.sqlite.12.zst
.dump (27.6s, 4.7G)

    4693437307 photos.dump
gzip -k (72s, 942M, 28.8%)

     941018021 photos.dump.gz
zstd --long (5.2s, 860M, 26.3%)

     859204016 photos.dump.zst
zstd --long -12 (31.7s, 827M, 25.3%)

     826776415 photos.dump.12.zst
(edited to fix a typo in a size and a conclusion that came from that)


Brilliant. >60% savings. 700mb? wow.


Is that really necessary?


Depending on the bandwidth at the target site, which may be pretty remote, and not exposing a public internet service.


Ah no, I meant “is the snark necessary?” to the parent comment. I enjoyed the read!



