It's fascinating how Raft has democratized distributed computing -- writing a consistent, distributed state machine now can be done in a few lines of code, assuming you have a Raft implementation lying around, like Hashicorp's excellent Go library.
What this project (and similar projects such as Rqlite) doesn't address is the hard problem -- sharding. Raft makes it quite trivial to write a master/slave system where every node is guaranteed to be identical, but it won't scale writes and won't distribute storage across your cluster.
You can build sharding on top of Raft, but in an SQL setting this would require distributed transactions, something that's difficult enough that relatively few projects have tackled it so far, the notable ones being CockroachDB and TiDB. (Google Spanner, being proprietary, is not relevant in this context.)
An excellent interactive explanation of Raft can be found here: http://thesecretlivesofdata.com/raft/. It would be interesting as hell to build a small implementation in the language of choice (elixir?, pony?).
rqlite author here. I am considering adding distributed transaction support to rqlite (it already has a form of transaction support), if I can come up with a solid implementation, while keeping rqlite simple to deploy and operate (which is a primary goal of rqlite):
dqlite is very interesting. If its patches to SQLite are accepted upstream, dqlite could become the storage engine for rqlite. But right now rqlite is deliberately built on vanilla-SQLite, so it can offer the same correctness guarantees as SQLite.
> It's fascinating how Raft has democratized distributed computing
First I thought you probably ment commoditized but then I realized how democratized actually applies perfectly well to consensus algorithms because by definition they apply decisions via voting by all participants :)
But anyways, I agree very much with you that Raft has transformed the industry and we can't be thankful enough for it. But like you said, it's just one of the core building blocks of a distributed system that also shards data. Sharding is more use-case specific and so it can be valid for different databases to have different sharding logics.
It's the second meaning of democratize, meaning "to make something accessible to everyone", that I intended, but I suppose Raft also has commoditized the technology, since there are now off-the-shelf libraries you can pick and choose from somewhat interchangeably.
FWIW, the CockroachDB people developed a modified version they call MultiRaft [1] that deals with the scaling challenges they have around shards.
ActorDB is very interesting. It uses two-phase commit, similar to CockroachDB, TiDB and Spanner, but as I understand it, relies on locking to update the participating shards. I'm not sure how well ActorDB performs in a large cluster with lots of distributed transactions, but I've never tried it.
The size of the shards depends on the data. ActorDB wants you to design your application around explicit containment boundaries (the actors). For example, a social networking site might collect all of a single user profile's data in one actor. Or there might be one actor for each grouping of a user's stuff (one actor for all of user X's photos, one actor for all of their tweets, etc.). So the size of those actors might not be "micro". You could decide to shard it further, of course, e.g. shard a user's photos by date. Since ActorDB supports cross-actor queries (though not joins), you can still query multiple shards at the same time.
But ActorDB does seem to promote inter-actor transactions, and a pattern I've seen encouraged is where you use shared actors (especially with the key-value store functionality) to keep some kind of central lookup table that your app then can use to find actors. For example, if you're modeling HN with ActorDB, you'd have a global list of stories, but each story/comment tree could be a separate actor, and each user would be a separate actor. Posting a story only needs to access the story actor, but to post a comment transactionally, you have to do a transaction that inserts the comment and updates the user's comment history together.
You're right. What I meant was that a 'shard' in actordb isn't meant to be 'all data on one machine/node' - which is how many other databases view sharding, but is expected to be much more fine-grained. Thanks for the other explanations.
Another interesting implementation aspect is its use of LMDB for storing the SQLite pages.
> What this project (and similar projects such as rqlite) doesn't address is the hard problem -- sharding.
Yes, this is technically correct. However to meet its goals rqlite (and dqlites) does not require sharding. The point of rqlite is to provide fault-tolerance and high-availability. Again, you are technically correct that rqlite doesn't support sharding, but sharding simply isn't required for rqlite to do what it wants to do. Sharding is a solution to a different type of problem.
if it does the things paxos does, it's provably isomorphic to paxos. what I saw in the demo was a flavor of paxos that's been around since the early 90's, I'm pretty sure that's where the credit is due.
You can't think of any case, for anything, in which you can do something in two (or more!) different ways, but one of them is simpler to explain, understand, and implement?
> writing a consistent, distributed state machine now can be done in a few lines of code
Well, if you run it on a very fast and reliable local network without much load, not multi dc deployments over public internet, than sure, it can sort of work. Not without problems though once a fault occurs.
I don't agree. The whole point of putting a solid Raft implementation at the center of your system is deal with the faults. I myself built a simple distributed state machine which deals with faults perfectly well. Of course I built on top of a good Raft implementation, which certainly is not a few lines of code.
No, Raft doesn't deal with the faults, it deals with consensus and can help tolerate certain faults. You then deal with the faults by other means. And even on a fast and reliable local network where Raft can work well, once you have a significant amount of data replacing failing nodes, healing broken records, resyncing, rebalancing all without affecting operations is not trivial and unlikely to be done properly.
I'm not sure what this means. The Raft paper explicitly states that Raft is a fault-tolerant system -- and by definition this means it deals with faults. To quote the paper:
"Replicated state machines are used to solve a variety of fault tolerance problems in distributed systems."
Raft is a type of replicated state machine. To say that "Raft doesn't deal with faults" is not correct. Perhaps you mean that there is other work to be done to take the fault tolerance offered by Raft and build a functioning application, and a system that will stay up in the real world -- and deal with an even wider range of faults. That I definitely agree with.
Rereading this, I see what you mean in the sense that if dealing == "fixing the fault", but tolerating == "keep going in the face of faults". So yes, you are right -- Raft doesn't fix the faults. One must still fix the network, replace the node etc. When I wrote "deal" I meant "tolerate the fault", but I think "deal" meant "fix the fault" to you.
Well, if you’re being smart - do please tell us how Raft solves the problem of online faulty node replacement: how how does it re-sync the current, up-to-date state to the new node, how does it make sure that when one of the nodes fails - that all of the clients do not overwhelm the rest of the system.
As far as Google went with their paper; they were incredibly correct by saying that actually implementing a distributed system is hard because of the failure modes; rather than the success modes of the running system.
If you read the raft paper, or just watch presentations of it, you'll see that all the issues you mentioned are covered:
- faulty node replacement: you can take any node off (or any node can crash) at any time. As long as there are enough nodes left to reach a quorum, your system will be available. If there are not enough nodes left, your system will be unavailable (but keep consistency)
- re-syncs, snapshots and the rest are all covered in the raft paper
- clients wanting to perform writes always talk to the node that is currently the leader, that node fails, clients will look for next leader (which will be eventually elected as long as there's a quorum of surviving nodes)
- overwhelming a system is a different concern, raft writes are serialized so the goal is usually not throughput (for that you might look at AP/AC storage solutions in the CAP spectrum, raft is CP). Designs that need high write throughput might still use raft as internal building block for coordination.
> how does it re-sync the current, up-to-date state to the new node
it is a core feature of the protocol.
> how does it make sure that when one of the nodes fails - that all of the clients do not overwhelm the rest of the system.
for write access (proposals), in a typical implementation (defined below [1]), one failed nodes actually slightly speed up the system with the cost of reduced reliability. Consider a typical setup with 3 nodes, the leader normally replicate state to 2 followers, the replication cost get cut in half once your cluster loses one member.
given that you can implement linearizable read by going through the write procedure, one can argue the above "speed up" can obviously be achieved on reads as well.
[1] independent replication to followers, for N followers, the same entry will be serialised and sent N times by the leader.
again - just talking about typical implementation and the described "speed up" comes with degraded reliability.
>Well, if you’re being smart - do please tell us how Raft solves the problem of online faulty node replacement:
I never said that it did. That is what I mean by my statement that more work is required to build a real system in the real world. Much of the code of rqlite is about cluster management, built on a Raft substrate.
>how how does it re-sync the current, up-to-date state to the new node
I suggest you read the Raft paper, and study the Hashicorp Go implementation (https://github.com/hashicorp/raft) to see how this is done. It's all there.
>As far as Google went with their paper; they were incredibly correct by saying that actually implementing a distributed system is hard because of the failure modes; rather than the success modes of the running system.
Damn it, I was going to work on exactly this -- my goals were slightly different but basically the same paxos/raft/gossip + sqlite and see how far I could take it.
Note for those who are interested in cool-stuff-with-SQLite, there's a similar project called RQLite (mentioned in the README):
The main differences are laid out in the README, but IMO it really all boils down to the fact that since it's single-writer quorum'd there's only one node actually doing writes, and they've just chosen to replicate right-before (right after?) the WAL commit (I think this is what they're calling frames, basically one chunk of the WAL content), not at the point of receiving a query. This is more like having a read secondary more than anything (I also assume writes are redirected to master).
It says in the readme that it should be expected to be slower than RQlite, I'd love to see some numbers.
I do really like this though -- can't wait to pour through the code and see if there's anything I can learn.
Things are still in flux, but yes, I plan to publish benchmarks before making a 1.0 release, as well as improving documentations and introduce some more abstractions to make it easier to use.
The reason it might be slower is some cases is that RQlite replicates statements and dqlite replicates WAL frames (where "frames" roughly means "disk pages"), which are typically bigger.
I didn't run benchmarks against RQlite yet, but I'd expect performance differences to be negligible for most use cases.
Hey thanks for putting this out there, it's a pretty cool project.
I see what you meant by it being slower, but that's only replication speed right? maybe that should be pointed out in the documentation when you find time to update it.
Does it seem like the replication patch is going to land in Sqlite? I sure would like it to...
when you perform a SQLite write, SQLite writes a new page on disk to the its write-ahead log. dqlite needs to replicate that page write across all nodes. A page is typically 4kb, so when you do a write on the leader node you need to transfer 4kb across the wire to all other follower nodes and wait for a quorum of them to come back and say "I got it" (which includes writing the page to disk). On the contrary rqlite just needs to transfer and store the SQL text, which is typically less than 4kb.
In practice it's still pretty performant, but I'll publish benchmarks later on.
I plan to submit the patch to SQLite upstream too, yes.
The WAL frames are by definition the result of the executed transactions so replicating the frames is a pretty smart way of ensuring you correctly distribute the results.
You're correct: only the master accepts writes. During a failover clients can get an error and are expected to retry against the new master.
Oh this is what I figured it was, it's the only thing left! I just have never come acros the teminology of a WAL "frame" before, so I wasn't completley sure. since it's a log I would have expected it to be called a record.
Loud and clear on the failure modes, the README was pretty clear about it
Yeah, but it's probably not smart to re-invent the wheel too many times. Especially when the projects that came before don't look badly written or unreasonable.
I think I do have something to offer -- mostly trying gossip and making use of aggressive automatic sharding -- but I'm also more interested in what I wanted to build on top of distributed SQLite as well. No need to get down in the weeds if someone's already done the hard work for me :)
I implemented dqlite in Go for convenience (a production-ready raft library was available, github.com/hashicorp/raft). It'd be interesting to implement something similar in C or event batter in rust, so you can link the library from languages than just Go. But not sure there'd be good use cases for that.
sqlite is fantastic piece of software. I am using it in every single side projects of mine. I can not describe how I satisfied with results even in relatively high load (300K pages/day on $5 Linode VPS). I found that most of time using a full featured RDBMS is not necessary depending on use case.(I am doing mostly reads)
Does that qualify as "high load"? That's just below 4 pages/second. Given how beefy server CPUs and SSDs are, I would consider something like 100+ requests/second "high load".
If only it worked like that, we would avoid so many problems.
Sadly you cannot extrapolate at all the load of a server with their numbers of pages seen per day, depending on the use you can easily have peaks of more than 10x that.
I wonder how this kind of project can be tested in order to ensure the correctness of the use cases and error cases.
does anyone know a resource to learn more about testing in this kind of situation? I hope there is an alternative to just try all cases in different real hardware setups (which is the mantra at my company)
One very good resource are the writings of Aphyr where he tests various distributed databases: https://aphyr.com/tags/jepsen . Jepsen is the framework he wrote to test them. Next would be formal verification but you get very far by understanding the distributed db semantics and applying practical tests.
Unit and fuzzy-based test coverage that includes all possible edge cases is being put in place. I'd like to write a Jepsen test suite too.
Hardware does not matter that much, since SQLite is pretty hardened for handling any possible hardware failure (out of memory, out of disk space, memory/disk corruption) and propagate that to client code (including dqlite). Regarding network failures (hardware-related or not), raft guards you against them, and it's formally proved.
1) Distributed (e.g. n equal processes running on n machines)
2) Needs some shared state across nodes
3) Would like that state to have SQL semantics (relations, transactions, etc.)
4) Not to heavy on writes to this shared state (I might provide some ballpark numbers at some point)
5) Wants to avoid the operational overhead of an external storage system (mysql, postgresql)
6) Wants to be fault-tolerant and have transparent failovers
I would imagine it would be used similarly to etcd, zookeeper, consul, etc. as a distributed configuration store, with the advantage of being SQL and the schema goodness that comes along with that.
Projects like this(sparsely documented, tiny) always seem to be just internal tools that the company just threw out into the world just because they can. Not that I'm ungrateful, but without any sort of explanation as to why I'd use this over a clustered MySQL or Postgres it's a little hard to make heads or tails about whether I should care.
It's work someone else did and shared with you for free. It strikes me as lazy to complain about how you don't know what to do with it and want everything spoon-fed to you. It's go+sqlite+raft. That means it's an embedded database for Go that can be used across a cluster of Go processes while maintaining ACID goodness. I got that out of it in 2 minutes of looking at the readme, I think you could manage the same if you take the time it took you here to make a new account and comment.
I don't ever remember complaining, just saying I have no idea what it's even useful for. This is usually where a good, comprehensive, thoughtful readme helps people figure out why they should care about a project over others that do the same thing and what sets it apart. But I guess expecting explanations for libraries is spoonfeeding. Oh well!
I'm working towards a 1.0 release. At that point documentation will be more complete and the plumbing needed to use it in an app should be also reduced.
I think the best explanation of why you'd use it over PG or MySQL is the same as why you'd use SQLite over them:
It looks like a dependency of lxd, the successor to lxc, which is being developed by Canonical (of Ubuntu fame). You can see how they're using it in https://github.com/lxc/lxd/tree/master/lxd/cluster.
Dqlite and rqlite are not in the same space as postgres and mysql, they're more comparable to tools like Zookeeper, etcd, or Consul[0] - low throughput CP data stores that are generally used for coordinating distributed systems. In this case, it saves users from having to manage an external data store to run lxd clustered.
[0] in fact, dqlite uses the same raft implementation as Consul, which is cool
Hopefully someone more acquainted with the ecosystem can respond, but docker is built on lxc, and it appears that the intention of lxd is to do something similar - provide a low-level tool for docker and related tools to use, with some extra tools and security guarantees.
BedrockDB requires to operate a separate process, whereas you can embed dqlite in your Go application (pretty much in the SQLite philosophy). Also, afaik BedrockDB patches upstream SQLite with some more intrusive changes than dqlite (e.g. for supporting concurrent writes). The SQLite patch that dqlite requires is pretty minimal and just adds hooks to internal WAL events.
RocksDB is a storage engine (a non-distriuted ordered Key-Value store with transactions). SQLite is a non-distributed SQL database that uses a storage engine (see SQLite4). The linked tool here makes SQLite distributed.
What this project (and similar projects such as Rqlite) doesn't address is the hard problem -- sharding. Raft makes it quite trivial to write a master/slave system where every node is guaranteed to be identical, but it won't scale writes and won't distribute storage across your cluster.
You can build sharding on top of Raft, but in an SQL setting this would require distributed transactions, something that's difficult enough that relatively few projects have tackled it so far, the notable ones being CockroachDB and TiDB. (Google Spanner, being proprietary, is not relevant in this context.)