Distributed SQLite for Go applications

atombender · on April 6, 2018

It's fascinating how Raft has democratized distributed computing -- writing a consistent, distributed state machine now can be done in a few lines of code, assuming you have a Raft implementation lying around, like Hashicorp's excellent Go library.

What this project (and similar projects such as Rqlite) doesn't address is the hard problem -- sharding. Raft makes it quite trivial to write a master/slave system where every node is guaranteed to be identical, but it won't scale writes and won't distribute storage across your cluster.

You can build sharding on top of Raft, but in an SQL setting this would require distributed transactions, something that's difficult enough that relatively few projects have tackled it so far, the notable ones being CockroachDB and TiDB. (Google Spanner, being proprietary, is not relevant in this context.)

jxub · on April 6, 2018

An excellent interactive explanation of Raft can be found here: http://thesecretlivesofdata.com/raft/. It would be interesting as hell to build a small implementation in the language of choice (elixir?, pony?).

_sveq · on April 6, 2018

That was good to see. Thank you.

otoolep · on April 6, 2018

rqlite author here. I am considering adding distributed transaction support to rqlite (it already has a form of transaction support), if I can come up with a solid implementation, while keeping rqlite simple to deploy and operate (which is a primary goal of rqlite):

https://groups.google.com/forum/#!topic/raft-dev/BaS5Z2NkDxA

https://github.com/rqlite/rqlite/issues/266

dqlite is very interesting. If its patches to SQLite are accepted upstream, dqlite could become the storage engine for rqlite. But right now rqlite is deliberately built on vanilla-SQLite, so it can offer the same correctness guarantees as SQLite.

eis · on April 6, 2018

  > It's fascinating how Raft has democratized distributed computing

First I thought you probably ment commoditized but then I realized how democratized actually applies perfectly well to consensus algorithms because by definition they apply decisions via voting by all participants :)

But anyways, I agree very much with you that Raft has transformed the industry and we can't be thankful enough for it. But like you said, it's just one of the core building blocks of a distributed system that also shards data. Sharding is more use-case specific and so it can be valid for different databases to have different sharding logics.

atombender · on April 6, 2018

It's the second meaning of democratize, meaning "to make something accessible to everyone", that I intended, but I suppose Raft also has commoditized the technology, since there are now off-the-shelf libraries you can pick and choose from somewhat interchangeably.

FWIW, the CockroachDB people developed a modified version they call MultiRaft [1] that deals with the scaling challenges they have around shards.

[1] https://www.cockroachlabs.com/blog/scaling-raft/

free-ekanayaka · on April 6, 2018

If you require SQL, sharding and concurrent writes, yes do use CockroachDB or other equivalent solution.

If you don't have those requirements, dqlite or rqlite might get the job done with less moving parts and operational overhead.

It's pretty much the same argument of using SQLite vs (say) PostgreSQL, but translated into distributed systems. See:

https://www.sqlite.org/whentouse.html

shalabhc · on April 6, 2018

Check out http://www.actordb.com/ for a massively sharded sqlite system.

atombender · on April 6, 2018

ActorDB is very interesting. It uses two-phase commit, similar to CockroachDB, TiDB and Spanner, but as I understand it, relies on locking to update the participating shards. I'm not sure how well ActorDB performs in a large cluster with lots of distributed transactions, but I've never tried it.

shalabhc · on April 6, 2018

IIUC these are micro shards - e.g. one sqlite instance per user. Depending on the key, distributed transactions may be rare.

atombender · on April 6, 2018

The size of the shards depends on the data. ActorDB wants you to design your application around explicit containment boundaries (the actors). For example, a social networking site might collect all of a single user profile's data in one actor. Or there might be one actor for each grouping of a user's stuff (one actor for all of user X's photos, one actor for all of their tweets, etc.). So the size of those actors might not be "micro". You could decide to shard it further, of course, e.g. shard a user's photos by date. Since ActorDB supports cross-actor queries (though not joins), you can still query multiple shards at the same time.

But ActorDB does seem to promote inter-actor transactions, and a pattern I've seen encouraged is where you use shared actors (especially with the key-value store functionality) to keep some kind of central lookup table that your app then can use to find actors. For example, if you're modeling HN with ActorDB, you'd have a global list of stories, but each story/comment tree could be a separate actor, and each user would be a separate actor. Posting a story only needs to access the story actor, but to post a comment transactionally, you have to do a transaction that inserts the comment and updates the user's comment history together.

shalabhc · on April 6, 2018

> The size of the shards depends on the data.

You're right. What I meant was that a 'shard' in actordb isn't meant to be 'all data on one machine/node' - which is how many other databases view sharding, but is expected to be much more fine-grained. Thanks for the other explanations.

Another interesting implementation aspect is its use of LMDB for storing the SQLite pages.

otoolep · on April 6, 2018

> What this project (and similar projects such as rqlite) doesn't address is the hard problem -- sharding.

Yes, this is technically correct. However to meet its goals rqlite (and dqlites) does not require sharding. The point of rqlite is to provide fault-tolerance and high-availability. Again, you are technically correct that rqlite doesn't support sharding, but sharding simply isn't required for rqlite to do what it wants to do. Sharding is a solution to a different type of problem.

jknoepfler · on April 6, 2018

is raft a fancy name for distinguished leader multipaxos?

vertex-four · on April 6, 2018

Raft does the things that Paxos does, using a mechanism that is much simpler for humans to understand and therefore implement correctly.

jknoepfler · on April 7, 2018

if it does the things paxos does, it's provably isomorphic to paxos. what I saw in the demo was a flavor of paxos that's been around since the early 90's, I'm pretty sure that's where the credit is due.

vertex-four · on April 7, 2018

You can't think of any case, for anything, in which you can do something in two (or more!) different ways, but one of them is simpler to explain, understand, and implement?

zzzcpan · on April 6, 2018

> writing a consistent, distributed state machine now can be done in a few lines of code

Well, if you run it on a very fast and reliable local network without much load, not multi dc deployments over public internet, than sure, it can sort of work. Not without problems though once a fault occurs.

otoolep · on April 6, 2018

I don't agree. The whole point of putting a solid Raft implementation at the center of your system is deal with the faults. I myself built a simple distributed state machine which deals with faults perfectly well. Of course I built on top of a good Raft implementation, which certainly is not a few lines of code.

https://github.com/otoolep/hraftd

zzzcpan · on April 6, 2018

No, Raft doesn't deal with the faults, it deals with consensus and can help tolerate certain faults. You then deal with the faults by other means. And even on a fast and reliable local network where Raft can work well, once you have a significant amount of data replacing failing nodes, healing broken records, resyncing, rebalancing all without affecting operations is not trivial and unlikely to be done properly.

otoolep · on April 6, 2018

> Raft doesn't deal with the faults

I'm not sure what this means. The Raft paper explicitly states that Raft is a fault-tolerant system -- and by definition this means it deals with faults. To quote the paper:

"Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems."

Raft is a type of replicated state machine. To say that "Raft doesn't deal with faults" is not correct. Perhaps you mean that there is other work to be done to take the fault tolerance offered by Raft and build a functioning application, and a system that will stay up in the real world -- and deal with an even wider range of faults. That I definitely agree with.

otoolep · on April 6, 2018

Rereading this, I see what you mean in the sense that if dealing == "fixing the fault", but tolerating == "keep going in the face of faults". So yes, you are right -- Raft doesn't fix the faults. One must still fix the network, replace the node etc. When I wrote "deal" I meant "tolerate the fault", but I think "deal" meant "fix the fault" to you.

heavenlyblue · on April 6, 2018

Well, if you’re being smart - do please tell us how Raft solves the problem of online faulty node replacement: how how does it re-sync the current, up-to-date state to the new node, how does it make sure that when one of the nodes fails - that all of the clients do not overwhelm the rest of the system.

As far as Google went with their paper; they were incredibly correct by saying that actually implementing a distributed system is hard because of the failure modes; rather than the success modes of the running system.

free-ekanayaka · on April 6, 2018

If you read the raft paper, or just watch presentations of it, you'll see that all the issues you mentioned are covered:

- faulty node replacement: you can take any node off (or any node can crash) at any time. As long as there are enough nodes left to reach a quorum, your system will be available. If there are not enough nodes left, your system will be unavailable (but keep consistency)

- re-syncs, snapshots and the rest are all covered in the raft paper

- clients wanting to perform writes always talk to the node that is currently the leader, that node fails, clients will look for next leader (which will be eventually elected as long as there's a quorum of surviving nodes)

- overwhelming a system is a different concern, raft writes are serialized so the goal is usually not throughput (for that you might look at AP/AC storage solutions in the CAP spectrum, raft is CP). Designs that need high write throughput might still use raft as internal building block for coordination.

dis-sys · on April 6, 2018

> how does it re-sync the current, up-to-date state to the new node

it is a core feature of the protocol.

> how does it make sure that when one of the nodes fails - that all of the clients do not overwhelm the rest of the system.

for write access (proposals), in a typical implementation (defined below [1]), one failed nodes actually slightly speed up the system with the cost of reduced reliability. Consider a typical setup with 3 nodes, the leader normally replicate state to 2 followers, the replication cost get cut in half once your cluster loses one member.

given that you can implement linearizable read by going through the write procedure, one can argue the above "speed up" can obviously be achieved on reads as well.

[1] independent replication to followers, for N followers, the same entry will be serialised and sent N times by the leader.

again - just talking about typical implementation and the described "speed up" comes with degraded reliability.

otoolep · on April 6, 2018

>Well, if you’re being smart - do please tell us how Raft solves the problem of online faulty node replacement:

I never said that it did. That is what I mean by my statement that more work is required to build a real system in the real world. Much of the code of rqlite is about cluster management, built on a Raft substrate.

>how how does it re-sync the current, up-to-date state to the new node

I suggest you read the Raft paper, and study the Hashicorp Go implementation (https://github.com/hashicorp/raft) to see how this is done. It's all there.

>As far as Google went with their paper; they were incredibly correct by saying that actually implementing a distributed system is hard because of the failure modes; rather than the success modes of the running system.

I couldn't agree more.

hardwaresofton · on April 6, 2018

Damn it, I was going to work on exactly this -- my goals were slightly different but basically the same paxos/raft/gossip + sqlite and see how far I could take it.

Note for those who are interested in cool-stuff-with-SQLite, there's a similar project called RQLite (mentioned in the README):

https://github.com/rqlite/rqlite

The main differences are laid out in the README, but IMO it really all boils down to the fact that since it's single-writer quorum'd there's only one node actually doing writes, and they've just chosen to replicate right-before (right after?) the WAL commit (I think this is what they're calling frames, basically one chunk of the WAL content), not at the point of receiving a query. This is more like having a read secondary more than anything (I also assume writes are redirected to master).

It says in the readme that it should be expected to be slower than RQlite, I'd love to see some numbers.

I do really like this though -- can't wait to pour through the code and see if there's anything I can learn.

free-ekanayaka · on April 6, 2018

Hello, author of dqlite here.

Things are still in flux, but yes, I plan to publish benchmarks before making a 1.0 release, as well as improving documentations and introduce some more abstractions to make it easier to use.

The reason it might be slower is some cases is that RQlite replicates statements and dqlite replicates WAL frames (where "frames" roughly means "disk pages"), which are typically bigger.

I didn't run benchmarks against RQlite yet, but I'd expect performance differences to be negligible for most use cases.

hardwaresofton · on April 6, 2018

Hey thanks for putting this out there, it's a pretty cool project.

I see what you meant by it being slower, but that's only replication speed right? maybe that should be pointed out in the documentation when you find time to update it.

Does it seem like the replication patch is going to land in Sqlite? I sure would like it to...

free-ekanayaka · on April 6, 2018

"slower" means this:

when you perform a SQLite write, SQLite writes a new page on disk to the its write-ahead log. dqlite needs to replicate that page write across all nodes. A page is typically 4kb, so when you do a write on the leader node you need to transfer 4kb across the wire to all other follower nodes and wait for a quorum of them to come back and say "I got it" (which includes writing the page to disk). On the contrary rqlite just needs to transfer and store the SQL text, which is typically less than 4kb.

In practice it's still pretty performant, but I'll publish benchmarks later on.

I plan to submit the patch to SQLite upstream too, yes.

xenadu02 · on April 6, 2018

The WAL frames are by definition the result of the executed transactions so replicating the frames is a pretty smart way of ensuring you correctly distribute the results.

You're correct: only the master accepts writes. During a failover clients can get an error and are expected to retry against the new master.

hardwaresofton · on April 6, 2018

Oh this is what I figured it was, it's the only thing left! I just have never come acros the teminology of a WAL "frame" before, so I wasn't completley sure. since it's a log I would have expected it to be called a record.

Loud and clear on the failure modes, the README was pretty clear about it

bluejekyll · on April 6, 2018

Just because somethings already been done, doesn’t mean you can do it too, or even better.

hardwaresofton · on April 6, 2018

Yeah, but it's probably not smart to re-invent the wheel too many times. Especially when the projects that came before don't look badly written or unreasonable.

I think I do have something to offer -- mostly trying gossip and making use of aggressive automatic sharding -- but I'm also more interested in what I wanted to build on top of distributed SQLite as well. No need to get down in the weeds if someone's already done the hard work for me :)

bluejekyll · on April 6, 2018

All great points. I was just responding to what seemed like a defeated perspective.

By all means contribute to this project. Everything starts somewhere!

free-ekanayaka · on April 6, 2018

I implemented dqlite in Go for convenience (a production-ready raft library was available, github.com/hashicorp/raft). It'd be interesting to implement something similar in C or event batter in rust, so you can link the library from languages than just Go. But not sure there'd be good use cases for that.

twic · on April 6, 2018

There's a Raft implementation in Rust to help get started:

https://github.com/pingcap/raft-rs

hardwaresofton · on April 6, 2018

Rust is exactly what I wanted to implement mine in -- I figure any performance savings can go towards making SQLite look more performant :)

I've been looking at https://github.com/pingcap/raft-rs

thinkersilver · on April 6, 2018

I'm in the same boat mate. Kudos for getting this far this quickly

otoolep · on April 6, 2018

rqlite is also a full application, whereas dqlite is more like the SQLite library -- you must wrap an application layer around it.

In principle rqlite could use dqlite.

bbayer · on April 6, 2018

sqlite is fantastic piece of software. I am using it in every single side projects of mine. I can not describe how I satisfied with results even in relatively high load (300K pages/day on $5 Linode VPS). I found that most of time using a full featured RDBMS is not necessary depending on use case.(I am doing mostly reads)

riku_iki · on April 6, 2018

> using a full featured RDBMS is not necessary

But what kind of overhead postgress adds comparing to sqlite?

Advantages are clear:

- more advanced functionality is available when you need it.

- pgsql likely has better performance comparing to sqlite

majewsky · on April 6, 2018

> 300K pages/day

Does that qualify as "high load"? That's just below 4 pages/second. Given how beefy server CPUs and SSDs are, I would consider something like 100+ requests/second "high load".

bbayer · on April 6, 2018

You are right. That is why I said "relatively". It is relatively high for my setup.($5 linode with 2gb ram that runs 4 instance gunicorn server.)

monsieurbanana · on April 6, 2018

If only it worked like that, we would avoid so many problems.

Sadly you cannot extrapolate at all the load of a server with their numbers of pages seen per day, depending on the use you can easily have peaks of more than 10x that.

cbluth · on April 6, 2018

Try accounting for peak load times.

nudpiedo · on April 6, 2018

I wonder how this kind of project can be tested in order to ensure the correctness of the use cases and error cases.

does anyone know a resource to learn more about testing in this kind of situation? I hope there is an alternative to just try all cases in different real hardware setups (which is the mantra at my company)

lifty · on April 6, 2018

One very good resource are the writings of Aphyr where he tests various distributed databases: https://aphyr.com/tags/jepsen . Jepsen is the framework he wrote to test them. Next would be formal verification but you get very far by understanding the distributed db semantics and applying practical tests.

free-ekanayaka · on April 6, 2018

Unit and fuzzy-based test coverage that includes all possible edge cases is being put in place. I'd like to write a Jepsen test suite too.

Hardware does not matter that much, since SQLite is pretty hardened for handling any possible hardware failure (out of memory, out of disk space, memory/disk corruption) and propagate that to client code (including dqlite). Regarding network failures (hardware-related or not), raft guards you against them, and it's formally proved.

notamy · on April 6, 2018

After looking through the repo, I'm still not sure exactly why you would want / need this. What exactly is the use-case here?

free-ekanayaka · on April 6, 2018

If your application is:

1) Distributed (e.g. n equal processes running on n machines) 2) Needs some shared state across nodes 3) Would like that state to have SQL semantics (relations, transactions, etc.) 4) Not to heavy on writes to this shared state (I might provide some ballpark numbers at some point) 5) Wants to avoid the operational overhead of an external storage system (mysql, postgresql) 6) Wants to be fault-tolerant and have transparent failovers

then dqlite might be a good choice.

notamy · on April 6, 2018

Interesting. Thank you for the explanation!

hamandcheese · on April 6, 2018

I would imagine it would be used similarly to etcd, zookeeper, consul, etc. as a distributed configuration store, with the advantage of being SQL and the schema goodness that comes along with that.

PirateBay · on April 6, 2018

Projects like this(sparsely documented, tiny) always seem to be just internal tools that the company just threw out into the world just because they can. Not that I'm ungrateful, but without any sort of explanation as to why I'd use this over a clustered MySQL or Postgres it's a little hard to make heads or tails about whether I should care.

slashdev · on April 6, 2018

It's work someone else did and shared with you for free. It strikes me as lazy to complain about how you don't know what to do with it and want everything spoon-fed to you. It's go+sqlite+raft. That means it's an embedded database for Go that can be used across a cluster of Go processes while maintaining ACID goodness. I got that out of it in 2 minutes of looking at the readme, I think you could manage the same if you take the time it took you here to make a new account and comment.

PirateBay · on April 7, 2018

I don't ever remember complaining, just saying I have no idea what it's even useful for. This is usually where a good, comprehensive, thoughtful readme helps people figure out why they should care about a project over others that do the same thing and what sets it apart. But I guess expecting explanations for libraries is spoonfeeding. Oh well!

free-ekanayaka · on April 6, 2018

I'm working towards a 1.0 release. At that point documentation will be more complete and the plumbing needed to use it in an app should be also reduced.

I think the best explanation of why you'd use it over PG or MySQL is the same as why you'd use SQLite over them:

https://www.sqlite.org/whentouse.html

With dqlite you also get HA, fault-tolerance and transparent failover, if you're willing to trade it with a bit of performance.

strken · on April 6, 2018

It looks like a dependency of lxd, the successor to lxc, which is being developed by Canonical (of Ubuntu fame). You can see how they're using it in https://github.com/lxc/lxd/tree/master/lxd/cluster.

Dqlite and rqlite are not in the same space as postgres and mysql, they're more comparable to tools like Zookeeper, etcd, or Consul[0] - low throughput CP data stores that are generally used for coordinating distributed systems. In this case, it saves users from having to manage an external data store to run lxd clustered.

[0] in fact, dqlite uses the same raft implementation as Consul, which is cool

contingencies · on April 6, 2018

Just what the world needs, yet another would-you-like-my-version-of-clustering-with-that virtual environments tool...

strken · on April 6, 2018

Hopefully someone more acquainted with the ecosystem can respond, but docker is built on lxc, and it appears that the intention of lxd is to do something similar - provide a low-level tool for docker and related tools to use, with some extra tools and security guarantees.

thotaway · on April 6, 2018

You would probably integrate this into a self contained application binary.

throwaway5752 · on April 6, 2018

How does it compare to RocksDB?

edit: I'm sorry, I was thinking of BedrockDB.

free-ekanayaka · on April 6, 2018

BedrockDB requires to operate a separate process, whereas you can embed dqlite in your Go application (pretty much in the SQLite philosophy). Also, afaik BedrockDB patches upstream SQLite with some more intrusive changes than dqlite (e.g. for supporting concurrent writes). The SQLite patch that dqlite requires is pretty minimal and just adds hooks to internal WAL events.

anentropic · on April 6, 2018

so the use case is: you're building a distributed Go application that needs some shared state in the form of an SQL db?

free-ekanayaka · on April 6, 2018

Yes, dqlite is currently used for clustering support in LXD 3.0: https://github.com/lxc/lxd

gregwebs · on April 6, 2018

RocksDB is a storage engine (a non-distriuted ordered Key-Value store with transactions). SQLite is a non-distributed SQL database that uses a storage engine (see SQLite4). The linked tool here makes SQLite distributed.

c17r · on April 6, 2018

Reminds me of the last Stripe CTF. I miss those.