How do you manage HA?

andersmurphy · 2026-04-12T13:38:35 1776001115

Backups, litestream gives you streaming replication to the second.

Deployment, caddy holds open incoming connections whilst your app drains the current request queue and restarts. This is all sub-second and imperceptible. You can do fancier things than this with two versions of the app running on the same box if that's your thing. In my case I can also hot patch the running app as it's the JVM.

If the server's hard drive fails, etc., you have a few options:

1. Spin up a new server/VPS and litestream the backup (the application automatically does this on start).

2. If your data is truly colossal, have a warm backup VPS with a snapshot of the data so litestream has to stream less data.

Pretty easy to have 3 to 4 9s of availability this way (which is more than github, anthropic etc).
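
For reference, the downtime budget implied by "N nines" can be worked out with a quick sketch. The function name is only illustrative, and a 365-day year is assumed:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year, in seconds, for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines (99.9%) -> 0.001
    return SECONDS_PER_YEAR * unavailability


for n in (3, 4):
    hours = downtime_budget_seconds(n) / 3600
    print(f"{n} nines: about {hours:.2f} hours of downtime per year")
```

So 3 nines leaves roughly 8.76 hours of downtime a year and 4 nines roughly 53 minutes, which is the budget that restarts and restores have to fit inside.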

rienbdj · 2026-04-12T14:17:21 1776003441

My understanding is litestream can lose data if a crash occurs before the backup is replicated to object storage. Doesn't this make it an unfair comparison to Postgres in RDS, for example?

andersmurphy · 2026-04-12T14:30:40 1776004240

Last I checked RDS uploads transaction logs for DB instances to Amazon S3 every five minutes. Litestream by default does it every second (you can go sub second with litestream if you want).

rienbdj · 2026-04-12T20:45:56 1776026756

Yes but there is still a (small) window where confirmed writes can be lost

andersmurphy · 2026-04-12T21:01:30 1776027690

Right and that window is bigger for RDS by the looks of it.

rienbdj · 2026-04-15T13:09:34 1776258574

Interesting - I had not looked deep into this before.

I suppose the difference is RDS has high 9s, whereas in the Litestream case the frequency of crashes is tied to your application code and deployment process. In practice this will take more work to reach the same uptime?

sudodevnull · 2026-04-12T14:41:40 1776004900

your understanding is very wrong. please read the docs or better yet the actual code.

rienbdj · 2026-04-12T20:30:52 1776025852

Please can you link to the relevant guarantees? I did read the documentation just today so clearly misunderstood something!

locknitpicker · 2026-04-12T14:43:27 1776005007

> Backups, litestream gives you streaming replication to the second.

You seem terribly confused. Backups don't buy you high availability. At best, they buy you disaster recovery. If your node goes down in flames, your users don't continue to get service because you have an external HD with last week's db snapshots.

andersmurphy · 2026-04-12T14:57:37 1776005857

If anything backups are the key to high availability.

Streaming replication lets you spin up new nodes quickly with sub-second data loss in the event of anything happening to your server. It makes having a warm standby/failover trivial (if your dataset is large enough to warrant it).

If your backups are week-old snapshots, you have bigger problems to worry about than HA.

locknitpicker · 2026-04-14T05:02:13 1776142933

> If anything backups are the key to high availability.

Not really. Backups are complementary in disaster recovery. They play no role in high availability. Putting your data in cold storage plays no role in keeping your system up and handling traffic.

> Streaming replication lets you spin up new nodes (...)

You seem to be confused. Replication and backups are two entirely separate things. Replication is used to preserve consistency across a distributed system and improve fault tolerance, whereas backups just means you are able to recover the state of your system at each checkpoint. Either you're using a word while giving it a new personal meaning, or you're confusing concepts.

andersmurphy · 2026-04-14T05:42:45 1776145365

Depends how you do your backups. If you do them by replicating, they are both. See litestream [1].

With SQLite this is even more obvious as a database is just a file (or three in the case of WAL). Which means you can replicate to not just another machine (or any file system) but much more resilient object storage like S3 (most cloud providers offer S3 compatible object storage).

- [1] https://litestream.io/how-it-works/

I think you might need to rethink your idée fixe.

rovr138 · 2026-04-12T13:30:19 1776000619

No offense, you wait. Like everyone's been doing for years on the internet, and still does.

- When AWS/GCP goes down, how do most handle HA?

- When a database server goes down, how do most handle HA?

- When Cloudflare goes down, how do most handle HA?

The down time here is the server crashed, routing failed or some other issue with the host. You wait.

One may run pingdom or something to alert you.

locknitpicker · 2026-04-12T14:41:42 1776004902

> When AWS/GCP goes down, how do most handle HA?

This is a disingenuous scenario. SQLite doesn't buy you uptime if you deploy your app to AWS/GCP, and you can just as easily deploy a proper RDBMS such as postgres to a small provider/self-host.

Do you actually have any concrete scenario that supports your belief?

runako · 2026-04-12T15:32:53 1776007973

> SQLite doesn't buy you uptime if you deploy your app to AWS/GCP

This is...not true of many hyperscaler outages? Frequently, outages will leave individual VMs running but affect only higher-order services typically used in more complex architectures. Folks running SQLite on EC2 often will not be affected.

And obviously, don't use us-east-1. This One Simple Trick can improve your HA story.

locknitpicker · 2026-04-14T05:07:58 1776143278

> This is...not true of many hyperscaler outages? Frequently, outages will leave individual VMs running but affect only higher-order services typically used in more complex architectures. Folks running SQLite on EC2 often will not be affected.

You're trying too hard to move goalposts. Look at your comment: you're trying to argue that SQLite is immune to outages in AWS even when AWS is out, and your whole logic lies in asserting the hypothetical outage will be surgically designed to somehow not affect your deployment because it may or may not consume a service that was affected.

In the meantime, the last major AWS outage was Iran blowing up a datacenter. They should have just used SQLite to avoid that, is it?

rovr138 · 2026-04-12T21:15:12 1776028512

All I'm saying is that people mention HA, when there isn't a need for it or when most people are fine with some downtime. For example,

> When AWS/GCP goes down, how do most handle HA?

When they go down, what do most do? Honestly, people still go about their day and are okay. Look how many systems do go down. What ends up happening? An article goes out that X cloud took out large parts of the internet... and that's it.

Even when there are ways of doing it, they just go down and we accept it. I never said this doesn't go down or can't go down, it's just that it's okay and totally fine if it does.

locknitpicker · 2026-04-14T05:15:20 1776143720

> All I'm saying is that people mention HA, when there isn't a need for it or when most people are fine with some downtime.

I don't think it's smart to just cherry-pick the design constraints you feel don't apply to you, and proceed to argue others should also ignore them.

Just because you are ok with letting your pet project crash and be out for long periods of time, why do you assume it's ok for everyone to do the same?

Think about it for a second: what would be the impact of a storefront crashing during a Black Friday-type event? Do you think people don't get fired for dropping the ball in these circumstances? Heck, you have papers that document how a few extra milliseconds of latency on a store page correlate with measurable drops in revenue, and here you are claiming that having businesses crash is no biggie.