I used to think this too, until I came across http://www.scylladb.com/ It is a f...

uluyol · on Feb 1, 2017

I don't believe their claims. Many benchmarks (including those done by ScyllaDB) are done badly. They'll take a database built to operate on larger-than-memory data (e.g. 10x) and run it on a dataset that fits entirely in memory. So whoever optimized for the in-memory case wins. But run on an appropriately sized dataset, or reduce system memory, and you see little difference.

This might seem like a good thing (ScyllaDB gives you extra performance when you have the memory for it), but it does mean that if your dataset grows, performance falls off a cliff. Something to keep in mind.

fnord123 · on Feb 2, 2017

"it does mean that if your dataset grows, performance falls off a cliff."

Are you saying you know ScyllaDB does not handle larger datasets and Cassandra is better in this respect? Or are you saying that their benchmarks are not yet conclusive?

uluyol · on Feb 2, 2017

I am saying that when you go from fully in memory (due to having a small dataset) to having to move things to and from disk, disk increasingly becomes your bottleneck rather than memory. And disk is much slower than memory.
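
To put rough numbers on this, here is a minimal sketch of average read latency as a function of the fraction of reads served from memory. The latency figures (about 100 ns for memory, about 100 us for disk) and the function name are illustrative assumptions, not measurements of ScyllaDB or Cassandra.

```python
# Illustrative assumptions, not measurements of any real database.
MEM_LATENCY_US = 0.1     # assumed ~100 ns for a read served from RAM
DISK_LATENCY_US = 100.0  # assumed ~100 us for a read that goes to disk


def avg_latency_us(hit_ratio: float) -> float:
    """Average read latency when `hit_ratio` of reads are served from memory."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Weighted average of the fast path (memory) and the slow path (disk).
    return hit_ratio * MEM_LATENCY_US + (1.0 - hit_ratio) * DISK_LATENCY_US


if __name__ == "__main__":
    for ratio in (1.0, 0.99, 0.9, 0.5):
        print(f"hit ratio {ratio:.2f}: {avg_latency_us(ratio):.2f} us")
```

Under these assumed numbers, dropping from a 100% to a 99% hit ratio already makes the average read about 11x slower, which is the cliff in question.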

fnord123 · on Feb 2, 2017

I thought a main point of Cassandra was to be distributed so the working dataset could stay in memory across the cluster. And the smaller memory footprint you typically get when you're not in the JVM means more of your working dataset can be cached in memory. So I would expect superlinear speedups compared to Java for exactly the reason you describe (depending on the request distribution).
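
A rough sketch of that reasoning, assuming uniformly random requests so that the hit ratio is simply the fraction of the dataset that fits in cache. The per-entry sizes, latencies, and function names are made up for illustration and are not measurements of Cassandra or ScyllaDB.

```python
# All numbers are hypothetical, chosen only to illustrate the argument.
MEM_LATENCY_US = 0.1     # assumed latency of a cache hit
DISK_LATENCY_US = 100.0  # assumed latency of a cache miss


def hit_ratio(cache_bytes: int, bytes_per_entry: int, dataset_entries: int) -> float:
    """Fraction of uniformly random reads that are served from cache."""
    cached_entries = cache_bytes // bytes_per_entry
    return min(1.0, cached_entries / dataset_entries)


def avg_latency_us(ratio: float) -> float:
    """Average read latency for a given cache hit ratio."""
    return ratio * MEM_LATENCY_US + (1.0 - ratio) * DISK_LATENCY_US


def speedup(cache_bytes: int, dataset_entries: int,
            fat_entry_bytes: int, lean_entry_bytes: int) -> float:
    """Latency ratio between a larger and a smaller in-memory entry footprint."""
    fat = avg_latency_us(hit_ratio(cache_bytes, fat_entry_bytes, dataset_entries))
    lean = avg_latency_us(hit_ratio(cache_bytes, lean_entry_bytes, dataset_entries))
    return fat / lean


if __name__ == "__main__":
    # Same cache and dataset; the lean layout uses 1.6x less memory per entry.
    print(speedup(cache_bytes=1_000_000, dataset_entries=10_000,
                  fat_entry_bytes=200, lean_entry_bytes=125))
```

With these made-up numbers, a 1.6x smaller footprint raises the hit ratio from 0.5 to 0.8 and gives roughly a 2.5x lower average latency, i.e. more than the footprint reduction alone. A skewed request distribution would change the numbers.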

But yeah, I'm always up for poring over more benchmarks. :)

There are more details on the benchmarks here:

https://qconsf.com/system/files/presentation-slides/avikivit...

The YCSB benchmark suite they use is the same one as used in this paper from the Cassandra homepage:

http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf

pron · on Feb 1, 2017

The choice of C++ is responsible for only a very small part of the performance difference. ScyllaDB uses different low-level algorithms, many of which could have been done in Java as well. That the Cassandra data model works well with the sequential processing approach of Seastar makes the effort of implementing in C++ manageable. In general, concurrent data structures in C++ require significantly more effort than in Java, and rarely yield performance improvements that are worth it, unless you're memory-bound. In sequential code it's easier to surpass Java's performance, but even that difference is diminishing (and expected to be dramatically reduced when value types are added to the JVM). Usually, the only significant overhead you must be prepared to pay is in RAM, and in exchange you get better performance-per-effort.

bpodgursky · on Feb 1, 2017

If something sounds too-good-to-be-true, it probably is.

I don't love Cassandra, but it's not because it's written for the JVM (full disclosure, we roll our own JVM key-value datastore https://github.com/liveramp/hank).

If your random-access keystore is limited by anything except network and disk latency, you have bigger problems.

echlebek · on Feb 1, 2017

If you investigate what they actually do to achieve those numbers, it's much less simple than just rewriting Cassandra in C++. For example, they use their own TCP stack and make use of vector intrinsics.

sandGorgon · on Feb 1, 2017

Hey.. "vector intrinsics" looks very cool. Thanks for mentioning that!

so what you mean is that, even after throwing facebook scale resources at java.. it is possible for a <10 people team to get 10X performance over java using the features that you mention.

That's a huge loss of face for java IMHO

echlebek · on Feb 1, 2017

> so what you mean is that, even after throwing facebook scale resources at java.. it is possible for a <10 people team to get 10X performance over java using the features that you mention.

That is not what I mean. Writing a userspace TCP stack isn't a feature of C++.

pron · on Feb 1, 2017

Explicit use of vector intrinsics is not the source of any 10x performance boost, nor does it have anything to do with C++ in particular. Again, only a small (but probably not minuscule) portion of the difference has to do with Java vs. C++. The bulk of the difference is due to all sorts of optimizations, most of which could have been done in Java as well. But the ScyllaDB people are more experienced in C++ than in Java, and as they use sequential code anyway, there isn't a big downside to using C++ -- certainly not for them -- so it was the better choice. From what little I know, the reasons such optimizations weren't done in Cassandra are that 1. the people working on it aren't low-level optimization experts, and more importantly, 2. the performance was good enough.

zigzigzag · on Feb 1, 2017

You aren't comparing the same program written in two languages. Seastar stuff is written by C++ performance experts who are fanatical about tuning, and does all kinds of unusual far-out things that Cassandra doesn't do to get high performance.

throwawayish · on Feb 1, 2017

> That's a huge loss of face for java IMHO

Because the JVM JIT, like every other compiler, sucks at developing vectorized algorithms ad hoc, a task usually carried out by human experts in that area?

cnlwsu · on Feb 1, 2017

They also have never caught up feature-wise, and have actually gotten further behind since the initial release. Also the benchmarks are lies (tbf, all benchmarks are lies).

jackmott · on Feb 1, 2017

Characterizing what the performance difference between C++ and Java is, or will normally be, is really hard.

Naive translations from Java to C++ will normally result in only a small % difference.

But with clever rewrites that exploit control over memory locality, SIMD intrinsics (either via pragmas to induce vectorization automatically, or by hand), and a good understanding of compiler settings for given architectures, the differences can get quite large, depending on the problem domain.

Then again, there are ways around some of the performance limitations in the JVM, but they often involve very painful coding styles. You could narrow the gap a bit with that effort. (But if you are going to add effort, maybe just do it in C++?)