> A text file may be inefficient as is, but it's perfectly compressible, even wi...

dunham · 2025-05-01T16:15:19 1746116119

Below I'm discussing compressed size here rather than how "fast" it is to copy databases.

Yeah there are indexes. And even without indexes there is an entire b-tree sitting above the data. So we're weighing the benefits of having a domain dependent compression (binary format) vs dropping all of the derived data. I'm not sure how that will go, but lets try one.

Here is sqlite file containing metadata for apple's photo's application:

    767979520 May  1 07:28 Photos.sqlite

Doing a VACUUM INTO:

    719785984 May  1 08:56 photos.sqlite

gzip -k photos.sqlite (this took 20 seconds):

    303360460 May  1 08:56 photos.sqlite.gz

sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds):

    1277903237 May  1 09:01 photos.dump

gzip -k photos.dump (21 seconds):

    285086642 May  1 09:01 photos.dump.gz

About 6% smaller for dump vs the original binary (but there are a bunch of indexes in this one). For me, I don't think it'd be worth the small space savings to spend the extra time doing the dump.

With indexes dropped and vacuumed, the compressed binary is 8% smaller than compressed text (despite btree overhead):

    566177792 May  1 09:09 photos_noindex.sqlite
    262067325 May  1 09:09 photos_noindex.sqlite.gz

About 13.5% smaller than compressed binary with indices. And one could re-add the indices on the other side.

vlovich123 · 2025-05-01T20:15:08 1746130508

Yup, these results are pretty consistent with what I'd expect (& why I noted the impact of indices) cause even string data has a lot of superfluous information when expressed in the DDL ("INSERT INTO foo ...") - I would expect all of that to exceed any bookkeeping within the btree. And non-string values like blobs or numbers are going to be stored more efficiently than in the dump which is a text encoding (or even hex for blobs) which is going to blow things up further.

zimpenfish · 2025-05-03T08:51:54 1746262314

Some more anecdata - from this it looks like you could `VACUUM INTO` + `zstd --long -12` using 19.1s and get 109% of the size you'd get from `dump` + `zstd --long -5` using 32.8s. Saves 13.7s at the cost of 76M. YMMV, obvs.

sqlite3 3.49.1, zstd 64bit 1.5.7, gzip (Apple) 457.120.3

Original file (3.3G)

    3264290816 Photos.sqlite

VACUUM INTO (10.3s, 3.1G, 94.3%)

    3078881280 test.sqlite

gzip -k (76s, 1.1G, 33.1%)

    1080119337 test.sqlite.gz

zstd --long (3.2s, 987M, 30.2%)

     986252298 test.sqlite.zst

zstd --long -9 (8.8s, 903M, 27.6%)

     902282663 test.sqlite.9.zst

zstd --long -12 (21.5s, 885M, 27.1%)

     884863443 test.sqlite.12.zst

.dump (27.6s, 4.7G)

    4693437307 photos.dump

gzip -k (72s, 942M, 28.8%)

     941018021 photos.dump.gz

zstd --long (5.2s, 860M, 26.3%)

     859204016 photos.dump.zst

zstd --long -12 (31.7s, 827M, 25.3%)

     826776415 photos.dump.12.zst

(edited to fix a typo in a size and a conclusion that came from that)

ForOldHack · 2025-05-01T18:35:39 1746124539

Brilliant. >60% savings. 700mb? wow.

cnewey · 2025-05-01T18:49:13 1746125353

Is that really necessary?

nine_k · 2025-05-01T18:59:24 1746125964

Depending on the bandwidth at the target site, which may be pretty remote, and not exposing a public internet service.

cnewey · 2025-05-01T19:12:03 1746126723

Ah no, I meant “is the snark necessary?” to the parent comment. I enjoyed the read!