
For anyone with experience with FoundationDB: how does it compare to CockroachDB?


FoundationDB is a key/value store, and well-known for being fast, scalable, and incredibly reliable. It has passed distributed-database tests that many competitors fail. Here's a good HN thread from when Apple open-sourced it a few years ago: https://news.ycombinator.com/item?id=16877395

CockroachDB implements a Postgres-compatible SQL layer on top of its own key/value store, Pebble. This is similar to TiDB, which implements a MySQL-compatible layer on top of the TiKV key/value store.

The latter databases offer easier programming through SQL interfaces, but since you can model just about anything on top of a key/value store, there are already "layers" that implement similar abstractions on top of FDB, like the Record Layer [1], the Document Layer [2], and even a SQL layer [3] at one point.

1) https://www.foundationdb.org/blog/announcing-record-layer/ 2) https://foundationdb.github.io/fdb-document-layer/ 3) https://github.com/jaytaylor/sql-layer


There is one caveat I would add. While you can always build just about any kind of database or data model on top of a logical KV store, the scalability and performance are almost always going to be qualitatively worse for queries than using a storage engine purpose-built for the intended type of database or data model. It is simpler to implement, which is why people do it, but you give up a lot of selectivity and locality on the query side.

The target design case for KV stores is when all your queries are essentially trivial. Layering SQL on top of it implies the expected use case is complex queries of the type where a KV store won't perform well. Good for a compatibility interface, but not performant.


>a logical KV store

I've always considered KV to be an API. Under the hood you're still using data structures managing fixed-size blocks and byte offsets, such as LSM trees, and using binary searches to minimize disk reads (e.g. an index file holding every 128th key of each 64k key block).
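That sparse-index trick is easy to sketch. A toy version (illustrative block and index sizes, not taken from any real engine):

```python
import bisect

class SparseIndexedFile:
    """Toy SSTable-style lookup: a sparse index keeps the first key of
    each block, so a point read is one binary search over the index
    plus one small block scan, instead of a scan of the whole file."""
    def __init__(self, sorted_items, block_size=4):
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        self.index = [blk[0][0] for blk in self.blocks]  # first key per block

    def get(self, key):
        # Find the last block whose first key is <= key.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:   # scan just that one block
            if k == key:
                return v
        return None
```

Real engines layer bloom filters, compression, and caching on top, but the index-then-scan shape is the same.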

If you're building something more robust on top of a KV store, I'd be under the impression you're not doing so using the logical API but leveraging the underlying data structures with algorithms better suited to the task. An LSM tree is pretty decent at random seek/scan, and implementing something like JOINs is less about KV constraints as it is LSMT constraints.

For example, FoundationDB doesn't use SQLite directly but rather its B-tree implementation.


I intentionally added the word "logical" because I think it is important. All database engines are KV at the physical page/block I/O level but they almost never expose that to the user, and there is no requirement to limit the expression to what the lower level can express.

The primary limitation of KV as an API is that it can't directly express complex but elementary relationships between keys e.g. intersections when your keys are constraint types. Or graph traversal relationships, if efficiency on real hardware matters at all. Most complex databases, like SQL, are implicitly searching across relationships like this constantly. Not being able to express these types of relationships as first-class citizens is computationally expensive.

Putting a KV store under every database is a recent fad. Databases classically had more expressive but still inadequate storage architectures (e.g. first-class operations on constraint types were terrible there too), so this is a step backward in most ways -- but I understand why expediency incentivizes that approach. Kind of like discarding joins to achieve scale-out -- joins are critical operations for many things, but scaling out joins requires very challenging computer science. On the flip side, a lot of the cutting-edge research has been focused on jamming as much expressiveness into a singular index-organized store as possible, in as little space as possible. This has proven very effective, albeit more difficult to understand.

LSM trees are a good example of an incremental hack to address limitations in more traditional approaches while not solving the underlying fundamental issues with the way the indexing structures are constructed. LSM tree write throughput is okay but not great, empirically. Similarly, the query performance is okay but not great, for reasons that are well-understood from other contexts in database engine design. You can use it to solve a specific problem but it is unclear why it should be the solution to a general problem.
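For anyone unfamiliar, the LSM write/read path being discussed is roughly this (a deliberately minimal, single-threaded toy with no WAL, bloom filters, or leveled compaction):

```python
class TinyLSM:
    """Minimal LSM-tree sketch: writes go to an in-memory memtable;
    when full it is flushed as an immutable sorted run; reads check
    the newest data first so later writes shadow older ones."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []          # newest run first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:           # newest first
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        merged = {}
        for run in reversed(self.runs):  # oldest first; newer overwrites
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]
```

The write path is append-mostly (hence the decent write throughput), while a read may have to consult several runs until compaction merges them -- which is exactly the query-side cost being described.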

KV stores, LSM trees, etc. are stuck in a suboptimal local minimum.


Could you elaborate on this argument? For something like a graph, I didn't quite follow why a KV API is problematic or how you'd do it better. If you want to map node locality to KV locality you can certainly do that.


The core design problem for graph databases is data locality during join recursion. If you shard your data model by key, the cross-shard join operation asymptotically converges on a Cartesian product for each join iteration i.e. pathological data locality. This is why most graph databases have such poor scalability in practice.

A less intuitive approach is to shard the data model based on edge similarity measures, such that a specific key does not map to a single shard. While this obviously has poorer locality for simple key lookup, cross-shard join operations -- the most expensive operation and what actually matters -- only involve a tightly bounded number of shards and therefore have much better locality. While this was originally developed to make the problem scale on supercomputers (at IBM Research AFAIK), it is entirely amenable to use in graph databases.
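A toy way to see the locality difference (hypothetical six-vertex graph and made-up sharding functions, purely for illustration): count how many edges cross shard boundaries under hash-style vs similarity-style sharding, since every cross-shard edge is a network hop during join recursion.

```python
# Two tight communities joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2),      # community A: vertices 0-2
         (3, 4), (4, 5), (3, 5),      # community B: vertices 3-5
         (2, 3)]                      # one bridge between them

def cross_shard_edges(shard_of):
    """Edges whose endpoints land on different shards."""
    return sum(1 for a, b in edges if shard_of(a) != shard_of(b))

by_hash = cross_shard_edges(lambda v: v % 2)        # hash-style sharding
by_community = cross_shard_edges(lambda v: v // 3)  # similarity-style
```

Hashing scatters each community across shards, so most traversal steps leave the shard; grouping by edge similarity keeps traversals local except at the bridge.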

At the storage engine level, first-class support for indexing by similarity measures has different design requirements and tradeoffs than simple key lookup or ordered-tree indexing. While you could make it work on a KV store, the impedance mismatch would incur a significant performance drag. This cuts both ways; storage engines optimized for use with similarity measure designs are going to offer poor performance if you try to put b-trees or LSM-trees on top of them.


I think of FoundationDB as a low-level persistence layer on top of which you can build a fully functional database with more complex querying capability. It provides the core set of semantics -- data replication, globally ordered transactions, and horizontal scalability -- that are all complex to build in their own right.

If you look at the CRDB or TiKV design, under the hood they are also KV stores with SQL layered on top.


Very curious, what’s an alternative implementation strategy or more appropriate mental model?

My mental model (I am not a database engineer) is that every SQL database is fundamentally backed by a key value store. For Postgres, each value is a row of a relation, the key is the relation’s key, and the physical storage is the table file. Postgres builds b-trees on top to make it fast, but the b-tree itself is a relation, a key value store that Postgres works hard to keep sorted.
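To make that mental model concrete, here's a rough sketch of mapping a relational row onto an ordered KV store, loosely in the spirit of FDB's tuple layer (the encoding and names here are illustrative, not any real engine's format):

```python
def pack(*parts):
    """Naive order-preserving key encoding; real encodings handle
    types, escaping, and collation far more carefully."""
    return b"\x00".join(str(p).encode() for p in parts)

store = {}  # stands in for a sorted KV store

def insert_row(table, pk, row):
    # One KV entry per (table, primary key, column).
    for col, val in row.items():
        store[pack(table, pk, col)] = str(val).encode()

def read_row(table, pk):
    # A row read becomes a prefix range scan over the sorted keyspace.
    prefix = pack(table, pk) + b"\x00"
    return {k.split(b"\x00")[-1].decode(): v.decode()
            for k, v in sorted(store.items()) if k.startswith(prefix)}
```

Because keys sort lexicographically, all of a row's columns (and all of a table's rows) are contiguous, which is what makes range scans over the KV store act like table scans.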


Not an expert, but as an enthusiast I have read a bit about this (for https://tablam.org).

KV is truly ill-suited for this.

It's very easy to see if you write out the data layouts:

    --data
    pk  city    country
    1   miami   USA
    2   bogota  Colombia

    --As Kv (naive):

    pk1:    1
    pk2:    2
    city1:  miami
    city2:  bogota

    --As btree, with "value" being N (as you say):

    Key    Value

    1      1    miami   USA
    2      2    bogota  Colombia

    --As paged btree

    Page1
        Low: 1 High:2   
            //think this is a block
            1   miami   USA
            2   bogota  Colombia

    --As columnar:

    pk:     1       2
    city:   miami   bogota

    --As PAX (hybrid columnar/row)

    Page1
        Low: 1 High:2   
            //think this is a block
            pk:     1       2
            city:   miami   bogota
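A quick way to see the row-vs-columnar difference numerically (toy data, plain Python stand-ins for real page layouts): measure how many bytes a query for just the `city` column has to touch under each layout.

```python
# Same two rows as the layouts above.
rows = [(1, "miami", "USA"), (2, "bogota", "Colombia")]

# Row layout: each record is stored whole, so scanning one column
# still drags every row's full record through the read path.
row_bytes_touched = sum(len(str(r)) for r in rows)

# Columnar layout: only the `city` column's contiguous values are read.
cities = [r[1] for r in rows]
col_bytes_touched = sum(len(c) for c in cities)
```

The gap grows with column count and row width, which is why analytic engines favor columnar or PAX layouts.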


Many data models don't have obvious singular keys. The critical search relationships cannot be reduced to an order relationship on a single column a priori. Much of the most interesting research in storage models is around increasingly efficient ways of indexing complex information theoretic features across multiple columns in a single structure, with an underlying storage implementation to match.

At some level, every database contains a key-value store. For performance reasons, the hardware always has to be treated this way. Databases work at the level of blocks/pages, but those abstractions are usually hidden from users with a lot of clever logic in the middle that is more opinionated to enable optimization. That doesn't change.

An interesting and important property of search data structures is that, at the limit, a single index that can optimally satisfy all possible queries is equivalent to general AI. It is also completely intractable. Fortunately real-world queries tend to be much more limited in nature. A corollary is that the distinction between indexing, storage, and scheduling in databases is a fiction -- useful for making some things simple but not necessary in any database. In essence, the practice of treating indexing, storage, and scheduling as discrete functions in a database is the opposite extreme. There are a vast number of possible implementations between these two extremes with better properties than either in practical real-world databases.

As a general design principle for scalability and performance reasons, you want to organize your data model around a single indexing mechanic. Consequently, it is critical to maximize the expressiveness and efficiency of any particular indexing mechanic. At the limit, with a good algorithm, it is equivalent to having an index for each column, with the ability to efficiently search more features of the data and without the overhead of actually having an index for each column.

I think we are entering a new golden age of database technology where the boundaries between elements we treated as discrete are much fuzzier.


Any papers you can recommend that go into this in more detail?


i’m also not a db engineer, but i think this is true-ish. however building and maintaining those index tables is hard and probably prone to issues if you can’t update multiple as part of the same transaction.

the other major thing you’d miss is smarter joins. the distributed databases do a lot of work to be able to push down predicates as far down as possible to the physical nodes storing the data.

there’s probably more as well.


> probably prone to issues if you can’t update multiple as part of the same transaction

IIRC one of FoundationDB's features is that it does support such transactions, so you can easily implement indexing on top of it.
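The pattern being described -- write the record and its secondary-index entry in the same transaction -- looks roughly like this (toy in-memory transaction; FDB's real API differs, but the atomicity idea is the same):

```python
class KV:
    """Toy transactional KV store: buffer all writes, apply atomically."""
    def __init__(self):
        self.data = {}

    def transact(self, fn):
        staged = {}
        fn(staged)                 # buffer all writes in the transaction
        self.data.update(staged)   # apply all-or-nothing

db = KV()

def insert_user(db, uid, email):
    def txn(tr):
        tr[("users", uid)] = email        # primary record
        tr[("email_idx", email)] = uid    # secondary index entry
    db.transact(txn)

def lookup_by_email(db, email):
    return db.data.get(("email_idx", email))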


Yes, agreed, although a lot of storage modalities naturally map to key/value in some way.

Ironically FDB uses SQLite for the storage servers. Abstractions all the way down.


though as with things like the CAP theorem, there are always clever ways to place a given implementation at various points along the spectrum, rather than just at one end or the other.


Adding some clarifications to that: Pebble is a local K/V library akin to RocksDB/LevelDB and provides no services or distributed capabilities, so comparing the two does not make sense.

On the other hand, TiKV is more comparable to the internal distributed transactional K/V layer in CockroachDB that its SQL layer is coupled to. That, of course, is not a usable external interface. You could use CockroachDB like a K/V store with a simple table to forgo the SQL functionality and planning (e.g. only a primary-column index, only a key column and a value column, and so on), but I'm not sure how practical that is, as I haven't kept up with CockroachDB development.

(disclaimer: I interned at Cockroach several years back)


The comparison is about providing a relational data model on top of a key/value store.

Yugabyte also does something similar on top of its underlying DocDB document store.


An important difference between CockroachDB and TiDB is that the former has an embedded key-value store; everything runs in the same process.

TiDB is split into multiple parts: TiDB proper (stateless SQL/query plan layer), TiKV (the key-value store), and PD (placement driver, the main control plane and metadata layer). TiKV also isn't a pure key-value store, as it handles some particular query operations that can be "pushed down" from TiDB to reduce network round trips.

FoundationDB's role is similar to TiKV's. It's a pure key-value store with replication built in.


It's actually not that fast, since it has multiple layers and that implies overhead. It's about 4x slower than MySQL.

Personally I prefer CockroachDB: Postgres-compatible and easy to scale.


What is better, FoundationDB, CockroachDB, or Hypercore?


One of these is not like the others.


I've never used CockroachDB, but FoundationDB has been an absolute dream to work with. There are a few downsides to using it:

  - There is (currently) no hosted solution you can use, so you have to run it yourself.
  - There is no great "layer" that provides a nice document/SQL layer so you kind of have to work with the raw sorted KV abstraction.
  - You have to structure your workload to avoid a high rate of conflicts on individual keys.
  - It won't scale to extremely high write volume workloads like Cassandra/Scylla will.
That said, in exchange you get:

  - Interactive linearizable transactions (highest consistency level possible) over arbitrary key ranges (none of this single partition transactions nonsense) with a guarantee that FDB will basically never violate this property. Jepsen refuses to test FDB because FDB's simulation testing is much more rigorous than anything Jepsen does, but AFAIK the most recent releases of Cockroach/Mongo/Cassandra/Fauna/etc have never fully passed a Jepsen test.

  - Automatic range partitioning. So unlike Cassandra, you don't have to pick a static partitioning key yourself; FDB just automatically detects and splits ranges for you based on storage volume and write throughput.

  - A strongly consistent store you can actually use in anger and that you don't have to baby. We've never managed to actually break a FoundationDB cluster, the worst case scenario we've ever seen is that we slam it with too many writes/reads and it starts to apply backpressure by increasing latency.

  - Write/Read throughput scales ~linearly as nodes are added.

  - Built in backup/restore from blob storage.

  - Strongly consistent disaster recovery. This one is amazing and it's hard to go back to other databases that don't support it. Basically you can run 2 FDB clusters in DR mode where one is the leader (accepting writes + reads) and the other is passively replicating from it. You can then fail over from the primary FDB to the secondary *consistently*, in such a way that the two clusters coordinate to lock the primary and then unlock the secondary, so that your application is guaranteed to not see any inconsistency or data loss with no participation from the application required (except teaching it how to talk to 2 different FDB clusters at once).
Basically it's lacking some niceties for building traditional applications on top of it, but it's an amazing primitive for building rock-solid distributed systems and will almost never let you down once you've developed some operational experience with it. It's the closest thing to "magic" that I've ever worked with in the distributed storage space.
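On the "avoid a high rate of conflicts on individual keys" caveat above: the standard workaround is a sharded counter -- spread increments over N subkeys so concurrent transactions rarely touch the same key, then sum the shards on read. A toy in-memory sketch of the pattern (note FDB also offers atomic ADD mutations for exactly this case):

```python
import random

N_SHARDS = 8
store = {}

def increment(counter):
    """Write to a random subkey instead of one hot key."""
    key = (counter, random.randrange(N_SHARDS))
    store[key] = store.get(key, 0) + 1

def read(counter):
    """A read sums all shards of the counter."""
    return sum(v for (name, _), v in store.items() if name == counter)
```

The trade is a slightly more expensive read (a small range scan) for writes that almost never conflict with each other.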


to the folks asking why you'd use FDB, i'd highlight the main points as:

> Interactive linearizable transactions (highest consistency level possible) over arbitrary key ranges (none of this single partition transactions nonsense) with a guarantee that FDB will basically never violate this property.

the transaction guarantees on FDB are just the best there are in any product you can get for free. not to fault the rest; FDB just went past what ~98% of folks will ever need. but if you're in that last 2%, godspeed trying to get anything else to offer you the same semantics.

> A strongly consistent store you can actually use in anger and that you don't have to baby.

i've seen some folks trying to run a cluster without proper monitoring, who were nearly driving the poor thing into the ground, with hosts dying, and having so much pressure put on them they failed out, but it never lost anything. i don't think it ever even became unavailable.

> Automatic range partitioning. So unlike Cassandra you don't have to pick a static partitioning key yourself, FDB just automatically detects and split ranges for you based on storage volume and write throughput

eh, in defense of everything else, FDB is a key value store, and the keys are byte strings, so the situation is a bit easier for them.
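The automatic splitting under discussion can be sketched abstractly. This is a toy size-based policy with made-up thresholds, not FDB's actual algorithm, which also weighs write throughput:

```python
SPLIT_BYTES = 100  # illustrative threshold

def maybe_split(ranges):
    """ranges: list of shards, each a sorted list of (key, size_bytes).
    Split any shard whose total size exceeds the threshold at its
    median key, leaving everything else untouched."""
    out = []
    for r in ranges:
        if sum(size for _, size in r) > SPLIT_BYTES and len(r) > 1:
            mid = len(r) // 2
            out.extend([r[:mid], r[mid:]])
        else:
            out.append(r)
    return out
```

Because the keys are opaque sorted byte strings, the system can pick any boundary key it likes; a store with user-chosen partition keys can't rebalance this freely.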


Sure, I’m not really knocking Cassandra. I’m just pointing out that Cassandra’s approach favors raw write throughput over everything else, which is great for some use cases, but FDB’s approach is much more forgiving and a better fit for many use cases, albeit at the cost of reduced throughput.


FoundationDB is just a key value store, so it's not comparable.

There is a relational layer you can run, though, called the FoundationDB Record Layer. But this doesn't support SQL AFAICT; the examples on GitHub just show a query-builder interface in Java [0].

There does seem to be a Lucene interface but it doesn't seem to be used [1].

[0] https://github.com/FoundationDB/fdb-record-layer/blob/main/e...

[1] https://forums.foundationdb.org/t/lucene-layer-on-foundation...


It's comparable based on the fundamentals, and I have the same question.

They both use LSM-based KV stores, do range queries, support ACID transactions, and handle resharding and clustering for you.

They're both based on Spanner fundamentals. They both actually have an SQL layer - FoundationDB's is the Record Layer. Just because one has a primary SQL interface doesn't mean we can't compare.

I'd really like to see a comparison of the fundamentals - hotspot management, resharding, read/write performance, etc. - and have been looking for one.


> They both actually have an SQL layer - foundationdb's is the record layer.

Can you show me their SQL layer? I was looking for it.


https://github.com/FoundationDB/fdb-record-layer is the SQL/ORM-like interface. It handles things in a DB-like way - joins, indexes. I misspoke when I said SQL regarding this - sorry.

There's a go and rust library floating around that's not as good. I've tried em :)


The readme says

> Queries - The Record Layer does not provide a query language, however it provides query APIs with the ability to scan, filter, and sort across one or more record types, and a query planner capable of automatic selection of indexes.

So I'm not sure how you can call that a SQL layer when SQL is a query language and FDB Record Layer says it doesn't have a query language?


Yeah, I did say "I misspoke when I said SQL regarding this - sorry.". I should have said "SQL-like", or at least relation-like.

There were a couple SQL-ish things I found in other languages and got them mixed up. Sorry about that :)


I don’t think it ever got open sourced, although I do remember reading about it years back.


FWIW I don't think FoundationDB is based off Spanner at all (it might even predate it? I think FDB was started in 2008 or something). Its transaction handling system is completely different (and unique from what I can tell) and not dependent on well behaved clocks.



Neat, here [0] is a 3rd party one at the end of the thread but this isn't the one the thread is about. Doesn't seem like that one is public.

[0] https://github.com/rustdesk/opentick


It was open source and if you Google a bit you can find forks with the relevant code snapshots, here's one: https://github.com/AydinSakar/sql-layer


I was just looking into this, and found this really interesting post about foundationdb and sql.

https://www.voltactivedata.com/blog/2015/04/foundationdbs-le...

Also, this excellent comparison of distributed database consistency methods mentions Cockroach, but not Foundation: http://dbmsmusings.blogspot.com/2018/09/newsql-database-syst...

This brings up the question: what method/protocol does Foundation use for distributed consistency?


"What method/protocol does foundation use for distributed consistency?"

This is a really tough question to answer succinctly. It took me a long time to wrap my head around it, but the TLDR is:

  1. It uses a quorum of Coordinator nodes to elect another node (via Paxos) whose job is to then assign roles to other nodes.
  2. Some of those roles are replicated (like TLogs) and some are singletons (like the thing that advances the clock and hands out versionstamps).
  3. If any node in the "transaction subsystem" ever fails, the cluster goes through a recovery where it becomes unavailable for ~3s, during which it reassigns the roles, and then keeps chugging.
In terms of what it looks like for a transaction to actually commit, that's another long conversation. It's hard to draw analogies to existing systems because it really is kind of its own thing.
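One piece of the commit path that is easy to sketch is the optimistic conflict check a resolver-style component performs: a transaction fails to commit if any key it read was written after its read version. (Heavily simplified toy; the real protocol shards this across resolvers and bounds the version history it keeps.)

```python
last_write_version = {}   # key -> version of most recent committed write
commit_version = 0

def try_commit(read_version, read_keys, write_keys):
    """Return the new commit version, or None if the transaction
    conflicts and must be retried by the client."""
    global commit_version
    for k in read_keys:
        if last_write_version.get(k, 0) > read_version:
            return None               # someone wrote k after we read it
    commit_version += 1
    for k in write_keys:
        last_write_version[k] = commit_version
    return commit_version
```

Clients never hold locks; they just retry on conflict, which is why workloads with hot keys need restructuring (as mentioned upthread).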


Awesome answer, thanks! Foundation seems almost like magic, from just choosing very smart trade-offs and coordination.


Yeah that’s basically it. It really is a wonderful piece of free software that was (and still is) way ahead of its time in many ways.

They basically made a ton of good architectural decisions really early on and then just nailed down the details super tight over many years.

Their technique of running cluster-wide recoveries to handle failures is a great example. Most distributed systems engineers find the idea of designing a database that becomes unavailable (even for a few seconds) when a single node is lost completely appalling, but it turns out this small trade off leads to massive simplifications elsewhere and the resulting system is net/net much more reliable than anything else I’ve ever seen.



