It is a fork of Cassandra written in C++ on the Seastar framework and is drop-in compatible with Cassandra. It claims a 10x increase in performance.
I always thought there was a difference of a few percentage points between Java and C++, never a 10x performance difference. And that for a project with as many man-hours and as much Facebook-scale tuning as Cassandra.
I don't believe their claims. Many benchmarks (including those done by ScyllaDB) are done badly. They'll take a database built to operate on larger than memory data (e.g. 10x) and run on a dataset that can fit entirely in memory. So whoever optimized for in memory wins. But run on an appropriately sized dataset or reduce system memory and you see little difference.
This might seem like a good thing (ScyllaDB gives you extra performance when you have the memory for it), but it does mean that if your dataset grows, performance falls off a cliff. Something to keep in mind.
"it does mean that if your dataset grows, performance falls off a cliff."
Are you saying you know ScyllaDB does not handle larger datasets and Cassandra is better in this respect? Or are you saying that their benchmarks are not yet conclusive?
I am saying that when you go from fully in memory (due to having a small dataset) to having to move things to and from disk, disk increasingly becomes your bottleneck rather than memory. And disk is much slower than memory.
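To put rough numbers on that, here is a minimal sketch of the average read latency as a function of how many reads are served from memory. The function names and the latencies used with it are illustrative assumptions, not measurements of ScyllaDB or Cassandra, and the uniform-access model is a deliberate simplification.

```cpp
#include <cassert>

// Expected latency of one read (in microseconds) when a fraction `hit_ratio`
// of reads is served from memory and the rest go to disk.
// Hypothetical helper for illustration only.
double effective_latency_us(double hit_ratio, double mem_us, double disk_us) {
    assert(hit_ratio >= 0.0 && hit_ratio <= 1.0);
    return hit_ratio * mem_us + (1.0 - hit_ratio) * disk_us;
}

// Fraction of reads served from memory, ASSUMING uniformly random access
// over the whole dataset (real workloads are usually skewed, which helps).
double uniform_hit_ratio(double memory_gb, double dataset_gb) {
    assert(memory_gb > 0.0 && dataset_gb > 0.0);
    return memory_gb >= dataset_gb ? 1.0 : memory_gb / dataset_gb;
}
```

With an assumed 0.1 us memory access and 100 us disk access, a 90% hit ratio already averages about 10 us per read, roughly 100x the all-in-memory case. With a dataset 10x the size of memory and uniform access, the average is about 90 us. The exact figures depend entirely on the hardware and the request distribution.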
I thought a main point of Cassandra was to be distributed so the working dataset could stay in memory across the cluster. And the smaller memory footprint you typically get when you're not in the JVM means more of your working dataset can be cached in memory. So I would expect superlinear speedups compared to Java for exactly the reason you describe (depending on the request distribution).
But yeah, I'm always up for poring over more benchmarks. :)
The choice of C++ is responsible for only a very small part of the performance difference. ScyllaDB uses different low-level algorithms, many of which could have been done in Java as well. That the Cassandra data model works well with the sequential processing approach of Seastar makes the effort of implementing in C++ manageable. In general, concurrent data structures in C++ require significantly more effort than in Java, and rarely yield performance improvements that are worth it, unless you're memory-bound. In sequential code it's easier to surpass Java's performance, but even that difference is diminishing (and expected to be dramatically reduced when value types are added to the JVM). Usually, the only significant overhead you must be prepared to pay is in RAM, and in exchange you get better performance-per-effort.
If something sounds too-good-to-be-true, it probably is.
I don't love Cassandra, but it's not because it's written on the JVM (full disclosure, we roll our own JVM key-value datastore https://github.com/liveramp/hank).
If your random-access keystore is limited by anything except network and disk latency, you have bigger problems.
If you investigate what they actually do to achieve those numbers, it's much less simple than just rewriting Cassandra in C++. For example, they use their own TCP stack and make use of vector intrinsics.
Hey.. "vector intrinsics" looks very cool. Thanks for mentioning that!
so what you mean is that, even after throwing facebook scale resources at java.. it is possible for a <10 people team to get 10X performance over java using the features that you mention.
> so what you mean is that, even after throwing facebook scale resources at java.. it is possible for a <10 people team to get 10X performance over java using the features that you mention.
That is not what I mean. Writing a userspace TCP stack isn't a feature of C++.
Explicit use of vector intrinsics is not the source of any 10x performance boost, nor does it have anything to do with C++ in particular. Again, only a small (but probably not minuscule) portion of the difference has to do with Java vs. C++. The bulk of the difference is due to all sorts of optimizations, most of which could have been done in Java as well. But the ScyllaDB people are more experienced in C++ than in Java, and as they use sequential code anyway, there isn't a big downside to using C++ -- certainly not for them -- so it was the better choice. From what little I know, the reasons why such optimizations weren't done in Cassandra are 1. the people working on it aren't low-level optimization experts, but more importantly, 2. the performance was good enough.
You aren't comparing the same program written in two languages. Seastar stuff is written by C++ performance experts who are fanatical about tuning, and does all kinds of unusual far-out things that Cassandra doesn't do to get high performance.
They also have never caught up feature-wise, and have actually fallen further behind since the initial release. Also the benchmarks are lies (tbf, all benchmarks are lies).
Characterizing what the performance difference between C++ and Java is, or will normally be, is really hard.
Naive translations from Java to C++ will normally result in only a small % difference.
But with clever rewrites that leverage control of memory locality and SIMD intrinsics (either via pragmas to induce vectorization automatically, or by hand), plus a good understanding of compiler settings for given architectures, etc., the differences can get quite large, depending on the problem domain.
Then again, there are ways around some of the performance limitations in the JVM, but they often involve very painful coding styles. You could narrow the gap a bit with that effort (but if you are going to add effort, maybe just do it in C++?).
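As a small illustration of what "control of memory locality" can mean in C++, here is a sketch comparing an array-of-structs layout with a struct-of-arrays layout for summing a single field. The `Row` and `Columns` types are hypothetical and are not taken from ScyllaDB or Seastar. Whether the second loop actually gets vectorized depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Array-of-structs: the fields of each record sit together, so summing one
// field also drags the unused fields through the cache.
struct Row {
    std::int64_t key;
    std::int64_t timestamp;
    std::int64_t value;
};

std::int64_t sum_values_aos(const std::vector<Row>& rows) {
    std::int64_t total = 0;
    for (const Row& r : rows) total += r.value;
    return total;
}

// Struct-of-arrays: each field lives in its own contiguous array, so the
// loop below reads only the bytes it needs, in order.
struct Columns {
    std::vector<std::int64_t> keys;
    std::vector<std::int64_t> timestamps;
    std::vector<std::int64_t> values;
};

std::int64_t sum_values_soa(const Columns& cols) {
    assert(cols.keys.size() == cols.values.size());
    const std::int64_t* v = cols.values.data();
    const std::size_t n = cols.values.size();
    std::int64_t total = 0;
    // A simple reduction over contiguous integers is the kind of loop
    // compilers can often auto-vectorize at higher optimization levels.
    for (std::size_t i = 0; i < n; ++i) total += v[i];
    return total;
}
```

Both functions return the same result. The difference is only in how much memory traffic the loop generates and how easy the access pattern is for the compiler to turn into SIMD instructions. Getting a similar layout on the JVM today typically means hand-managing parallel primitive arrays, which is one example of the painful coding style mentioned above.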