If designing a new system is there any reason to choose Kafka over Pulsar at this point?
Apart from Confluent wanting you to use Kafka so they can keep leeching money off you by hijacking de facto ownership of an open source project, of course.
As someone who evaluated Pulsar to replace Kafka, my thoughts...
More moving parts. Brokers, and Bookies, and ZK, plus proxies etc. Plus an additional ZK for inter-cluster replication.
Immaturity - it's still early days for Pulsar, and there are still a lot of bugs being found - and then rapidly fixed, full credit to them, but yeah, not yet as stable. Documentation is often out of date, and I found myself having to read the code to figure out what was actually going on.
More complex workflows - there's only really one model for a developer consuming or producing against Kafka. With Pulsar, there are multiple different subscription modes, and choosing the wrong one could cause problems.
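To show what "choosing the wrong one" can look like, here is a toy dispatcher in plain Python. It is only a sketch: `dispatch` and the two mode strings are names made up for illustration, not the Pulsar client API, and it models just two of the modes in a simplified way (an exclusive subscription here simply feeds the first consumer).

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Toy model of how a subscription hands messages to its consumers."""
    received = {consumer: [] for consumer in consumers}
    if mode == "exclusive":
        # Simplification: only the first consumer ever gets messages.
        received[consumers[0]].extend(messages)
    elif mode == "shared":
        # Round-robin across consumers: no single consumer sees the full order.
        for msg, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(msg)
    else:
        raise ValueError(f"mode not modelled in this sketch: {mode}")
    return received


print(dispatch([1, 2, 3, 4], ["c1", "c2"], "exclusive"))
print(dispatch([1, 2, 3, 4], ["c1", "c2"], "shared"))
```

With the shared mode each consumer only sees every other message, so code that quietly assumed it would see everything in order breaks.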
Also, the need to explicitly ack messages is something you'd always have to watch for to avoid duplicated reads. Also, when I was looking at Pulsar, batch receive meant you had to acknowledge either the entire batch or none of it, so a failure during batch processing would lead to the whole batch being reprocessed, but I think acking within a batch is in development.
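To make the batch problem concrete, here is a toy simulation in plain Python. No Pulsar client is involved; `ToySubscription`, `receive_batch`, `ack_batch` and `redeliver_unacked` are hypothetical names for this sketch, not the real API. It only models the all-or-nothing ack behaviour described above.

```python
from collections import deque


class ToySubscription:
    """In-memory stand-in for a subscription; NOT the real Pulsar client API."""

    def __init__(self, messages):
        self._pending = deque(messages)  # not yet delivered
        self._unacked = []               # delivered, awaiting ack

    def receive_batch(self, max_size):
        batch = []
        while self._pending and len(batch) < max_size:
            batch.append(self._pending.popleft())
        self._unacked = list(batch)
        return batch

    def ack_batch(self):
        # All-or-nothing: the whole delivered batch is marked as done.
        self._unacked = []

    def redeliver_unacked(self):
        # Nothing was acked, so every message of the batch comes back, in order.
        self._pending.extendleft(reversed(self._unacked))
        self._unacked = []


def run(messages, fail_once_on):
    """Process all messages; fail a single time when `fail_once_on` is first seen."""
    sub = ToySubscription(messages)
    processed = []  # side effects the application actually performed
    failed_already = False
    while True:
        batch = sub.receive_batch(max_size=3)
        if not batch:
            return processed
        try:
            for msg in batch:
                if msg == fail_once_on and not failed_already:
                    failed_already = True
                    raise RuntimeError(f"processing {msg} failed")
                processed.append(msg)
            sub.ack_batch()
        except RuntimeError:
            sub.redeliver_unacked()


print(run(["a", "b", "c", "d"], fail_once_on="c"))
```

The failure on "c" means "a" and "b" get processed twice, which is why the processing side has to be idempotent if you only have whole-batch acks.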
No Pulsar IO S3 sink yet.
That said, there's a lot of cool things it's doing, like the built-in schema registry and far easier multitenancy, and offloading older data into S3 etc. transparently to the consumers, so I'm definitely keeping an eye on it.
Lastly, you're taking aim at Confluent, you realise Pulsar is largely controlled by people employed by StreamNative, yeah?
StreamNative doesn’t lock critical functionality behind enterprise agreements while still advertising the software as open source, and doesn’t openly lie about what the system capabilities are.
Not saying that they won’t turn evil at some point, but so far they’re leagues ahead of Confluent in terms of earning developer trust. At a minimum this developer, but also others that I’ve worked with.
Maybe I’m the minority opinion here and that’s fine, but Confluent has been far too shady for me to ever consider contracting with them.
We used FOSS Kafka for yonks without hitting any limitations - At one point we were looking at Confluent Replicator, but decided it was just easier to go with Mirror Maker 1 (and you know, no massive licensing fees) - and Mirror Maker 2 largely emulates Replicator in terms of functionality.
I'm aware of a few other things like the MQTT KC Connector, but that was never part of Kafka in the first instance, it's something that Confluent built for paying customers.
And I could argue that the StreamNative "critical functionality" that they lock away behind enterprise agreements is "quick bug fixes for the many bugs we're still finding", if I was feeling mean spirited.
But anyway, it seems your preference for Pulsar is due to Confluent, but they're not the only ones offering managed Kafkas - AWS, IBM, RedHat, etc. etc.
I personally think Kafka has the edge in many ways. It will soon be possible to run a single-process Kafka cluster, which will unlock a lot of applications for which people previously used older systems, simply because it was easier than standing up a full ZK cluster + Kafka cluster. The broader Kafka ecosystem has features like exactly-once support, KSQL, Kafka Connect, Cluster Linking, and excellent client support that are very valuable.
The Kafka community is huge and the velocity of development is very high. It's easy to forget now, but in the beginning, Kafka didn't even have replication. That's a good reminder that things that seem like permanent advantages of system X over Kafka (for various values of X) may very well prove to be temporary. For example, in this very thread, I see people talking about how various system X'es have the advantage over Kafka because they can run without ZK. Those discussions are almost out of date.
Finally, I work at Confluent and I think the company has always been a positive force in the open source community. I respect the Pulsar people as well, but I think they have a difficult challenge to overcome.
> The broader Kafka ecosystem has features like exactly-once support
No. No it doesn’t. It has at-least-once delivery with client-side deduplication. That’s not new, it’s what TCP does FFS. Why would you lie to people about supporting something long established at best and demonstrably impossible at worst?
> Finally, I work at Confluent....
Oh, that’s why. Never mind then. Continue selling digital snake oil.
This is one of the differences between Kafka (Confluent) and Pulsar.
Confluent make big bold claims "Exactly once delivery" and have aggressive marketing.
Pulsar on the other hand would say we have "effectively-once". Reading Pulsar docs vs Kafka, Pulsar are very modest about functionality and have no commercial marketing at all.
These days I have noticed Confluent in blog posts do use effectively once but marketing is as aggressive as ever.
Credit where credit is due: Confluent's marketing and big bold claims are why almost everyone is using Kafka and not Pulsar, and may not have even heard of Pulsar. I do find Pulsar's architecture more interesting. Since Splunk bought them, though, it's remained in the background like it always has, with no huge push to sell it.
> No. No it doesn’t. It has at-least-once delivery with client-side deduplication. That’s not new, it’s what TCP does FFS.
It's not just deduplication, you can atomically commit a consumer from one topic + produce of records resulting from that. Which is exactly the same exactly-once guarantee that you get from e.g. an SQL database in linearizable mode (a lot of SQL databases will do the same thing internally - optimistically execute transactions and then re-run them in the case of a conflict).
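A rough sketch of that consume + produce atomicity, in plain Python with no Kafka client involved. `ToyLog` and `process` are hypothetical names, and the "transaction" is just a single in-memory method call, so this only models the visible behaviour (output records and the consumer offset move together or not at all), not how Kafka implements it.

```python
class ToyLog:
    """In-memory stand-in for an output topic plus a committed-offset store."""

    def __init__(self):
        self.output = []           # records visible to downstream readers
        self.committed_offset = 0  # next input offset to read after a restart

    def commit(self, staged_records, new_offset):
        # Both changes become visible together; if we never get here, neither does.
        self.output.extend(staged_records)
        self.committed_offset = new_offset


def process(input_records, log, crash_at=None):
    """Resume from the committed offset; commit output + offset together per record."""
    offset = log.committed_offset
    while offset < len(input_records):
        staged = [input_records[offset].upper()]  # the "transform" step
        if crash_at == offset:
            # Crash before commit: staged output and offset advance are both lost.
            raise RuntimeError("crashed mid-transaction")
        log.commit(staged, offset + 1)
        offset += 1


log = ToyLog()
try:
    process(["a", "b", "c"], log, crash_at=1)
except RuntimeError:
    pass
process(["a", "b", "c"], log)  # restart: picks up at the committed offset
print(log.output)
```

After the crash and restart the output contains each transformed record once, because the record that was in flight was never made visible and its offset was never advanced.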
One advantage of Kafka is the ecosystem effect. There are many systems (Flink, Kafka Streams, Pinot, Druid, Presto, etc) that connect to Kafka. I'm not sure about the extent of Pulsar support here, although I'd love to learn more!
The problem I see with kafka is that it was built before cloud architectures were commonly adopted (with distributed systems everywhere). Confluent has put a lot of effort dragging kafka's architecture to the present, but some major features are missing:
- Auto-scaling: Confluent finally introduced "elastic scaling" a few months ago but it only allows you to scale up and must be triggered by the admin (no threshold-based auto-scaling).
- Multi-tenancy: Planning for a multi-tenant kafka cluster is not for the faint of heart. Achieving isolation tends toward liberal usage of topics, which starts to become unmanageable in the low thousands. This isn't crazy when you've got a few hundred microservices and several tenants to keep isolated.
- Decoupled brokers and storage: Any broker scaling or failure can lead to downtime while event storage is redistributed.
Confluent's Cloud service reduces operational overhead but isn't always feasible due to cost, resource limits (like service accounts or schemas for instance), data controls, etc.
There's actually no reason to choose Pulsar anymore. Pulsar has even more layers with ZooKeeper + BookKeeper, which require something like Kubernetes to run well. It was great 5 years ago for heavy users who needed better scalability and features than Kafka, however the development has become a mess.
With the removal of zookeeper and tiered storage (separated from compute), Kafka has caught up on scalability while being simpler to deploy. It also has a far bigger ecosystem with more polished features like ksqldb.
Apache Pulsar is hardly an ideal comparison in this context, considering that Pulsar requires ZooKeeper and Apache BookKeeper (which also requires ZooKeeper).
One of the benefits of the Kafka rearchitecture effort is to allow Kafka to "scale down" to run without external dependencies. Using Pulsar would add more dependencies.
They are, but I think they have some ways to go still. BookKeeper is still on ZooKeeper, though they've abstracted the API and have added support for using Etcd instead (not sure if this is production-ready).
Pulsar also requires you to run ZooKeeper and BookKeeper, so TFA has at least one reason you might choose Kafka.
(That said, unlike many I consider depending on ZooKeeper to be a positive sign. "We wrote our own consensus protocol" belongs in roughly the same bucket as "we wrote our own crypto." Using ZooKeeper doesn't automatically mean your distributed system will work but at least you'll have a fighting chance.)
> Pulsar also requires you to run ZooKeeper and BookKeeper, so TFA has at least one reason you might choose Kafka.
BookKeeper is a feature though. Allows to scale the partition beyond the capacity of a storage unit. Effectively unlimited retention for a partition. The problem with Kafka is that the broker is tied to storage.
Yep, it's available from Confluent if you pay for it, but like how Mirror Maker 2 is awfully similar to Confluent Replicator, I believe that Kafka will (eventually) get tiered storage under the Apache licence (I know there's a KIP for it[1]). It's a hard issue to solve, and not sure how much effort in the community is being directed towards it. But bear in mind that it's not just Confluent who have a stake in Kafka - there's a bunch of big corps selling managed/supported Kafka and all of them would probably quite like tiered storage in core Kafka as a feature to help them sell their support/management, so I have some faith in their enlightened self-interest.
The fact that MM2 happened, and Confluent didn't try to stop it, despite it being awfully similar to Replicator, makes me think that Confluent are acting in good faith.
Incidentally, I quite like how Pulsar solved tiered storage, and it's a definite tick in the Pulsar box - it's transparent from a consumer's POV, although there's somewhat of a delay in rehydrating the offloaded block, I don't think anyone's expecting near-realtime performance when loading historical data.
Thanks for putting things in perspective, EdwardDiego.
> The fact that MM2 happened, and Confluent didn't try to stop it, despite it being awfully similar to Replicator, makes me think that Confluent are acting in good faith.
Let me share an anecdote related to this example. We (Confluent) were actually the ones who contributed the documentation for MirrorMaker v2 to the Apache Kafka docs (https://kafka.apache.org/documentation/#georeplication). The development lead on MM2 was (an engineer at) Cloudera, yet they never spent the time to provide user-facing documentation to the Kafka project. I don't want to speculate about reasons, yet I noticed that MM2 was documented in the Cloudera docs.
If we didn't care for the Kafka community at Confluent, we would not have spent our own resources and time to fill that gap, given that we have a proprietary product similar to MM2 (i.e., Confluent Replicator).
Shit, wait, there's documentation for Mirror Maker 2 now? I spent most of my time implementing it by reading hypothetical examples in a KIP, and then diving into the actual code.
Hardly the most straightforward, and it was rather a gaping hole. Thanks for the background on how that hole developed.
I really appreciate Confluent putting that time into documenting something vital, that could compete with your own product, and IMO that does put a nail in the previous commenter's assertions about Confluent's alleged attempts to wall off necessary features of Kafka.
I think you are completely missing the point.
These kinds of systems need to be designed in layers.
You can host all three layers on each of your VM instances if you want, but they should not be mixed in the same process.
One layer, BookKeeper, provides an abstraction similar to HDFS.
That is, it provides reliable append-only files that scale horizontally in size and throughput.
Pulsar is a service built on top of BookKeeper, but it could run on top of HDFS or something like Amazon S3 ...
It is only responsible for making sure there is only one writer per BookKeeper file, even if multiple processes send requests to Pulsar to write to the same partition.
It also tries to balance requests across all the brokers.
I’ve been following Kafka for 5+ years now. Confluent has done their level best to stack the Kafka PMC with their own employees. Only recently have they added other members. They should not be an Apache project due to how much gaming of ownership they are doing with the project.
Fair enough, not hijacking. Just presenting the software as open source then requiring you to pay substantially for features that it’d be irresponsible to use the software in production without (e.g. geo replication via mirror maker).
Add to that their insistence in claiming “exactly once delivery semantics” from Kafka despite that being provably impossible and I don’t see any reason to trust them as a company or pay for their software.
I’ve been sticking to Pulsar for all new projects and have yet to hit a single drawback. It scales better, has fewer fiddly knobs needing adjusting, has cluster management already built in, and supports traditional pub/sub as well as worker queue semantics. It even has Kafka compatible adapters so it’s relatively easy to migrate existing systems.
Kafka played an important role in the history of distributed system design but it’s time to move to something better built and better managed IMO.
> requiring you to pay substantially for features that it’d be irresponsible to use the software in production without (e.g. geo replication via mirror maker).
MM1 and MM2 are free. You might be getting confused with Confluent Replicator.
> Add to that their insistence in claiming “exactly once delivery semantics” from Kafka despite that being provably impossible
Exactly once delivery is impossible. Exactly once processing is possible. TBH, semantically there's very little difference between those two from an end user perspective.
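A minimal sketch of that distinction, assuming every message carries a unique ID (the `msg_id` field and the function name are made up for illustration). Delivery is at-least-once, so duplicates arrive, but each message's effect is applied only once.

```python
def apply_effectively_once(deliveries):
    """Apply each (msg_id, amount) once, even if it is delivered repeatedly."""
    seen_ids = set()
    balance = 0
    for msg_id, amount in deliveries:
        if msg_id in seen_ids:
            continue  # duplicate delivery: its effect was already applied
        seen_ids.add(msg_id)
        balance += amount
    return balance


# Messages 1 and 2 are each delivered twice.
deliveries = [(1, 10), (2, 5), (1, 10), (3, 1), (2, 5)]
print(apply_effectively_once(deliveries))     # 16
print(sum(amount for _, amount in deliveries))  # 31 if duplicates are applied naively
```

In a real system the set of seen IDs and the resulting state would have to be stored durably and updated together, otherwise a crash between the two brings the duplicates back.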
I think Pulsar is a much better design and further investment in Kafka is a mistake at this point.
Kafka has a lot of downsides:
1- the size of a single topic is limited to the size of one machine
2- a complex, stateful client library that needs to know which machine is currently the master for each partition.
....
For #1, if you need message order to be maintained, you cannot use more than one partition.
For #2, 99% of the production issues we had were caused by bugs in librdkafka not talking to the right brokers.
It also means that the brokers all need public IPs, and that write throughput and latency are limited by the capacity of the machine that is the master for the partition you are trying to write to.
If the machine hosting your partition becomes overloaded, you have to switch the master for that partition to another machine. Unlike Pulsar, this is not done automatically, and if your replication factor is 3, your choices are limited to two other machines unless you want to copy the whole partition to a new machine, which would take hours.
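To illustrate the ordering constraint in #1: order is only kept within a partition, so a total order across everything means one partition. A sketch of keyed routing shows what you get with more than one: order per key survives, order across keys does not. The `crc32` hash and the function names are assumptions for this sketch; real clients use their own partitioner.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # crc32 is a stand-in hash chosen for this sketch only.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = defaultdict(list)
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [
    ("user-1", "created"),
    ("user-2", "created"),
    ("user-1", "updated"),
    ("user-1", "deleted"),
]
for partition, records in sorted(route(events, 4).items()):
    print(partition, records)
```

All "user-1" events land in the same partition in the order they were sent, but nothing ties their position to the "user-2" event if it lands in a different partition.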
Yeah, that's true, using librdkafka from C# I hit a few issues where librdkafka was somewhat behind Java in terms of features, I think the one I hit was multi-topic subscriptions.
IIRC Confluent has started putting resources into it - I would hope so, given how .NET Core is going.
That said, the state of Pulsar clients outside of the official Java ones was far worse, I was looking into .NET Core ones and the "official" one (Pulsar-DotPulsar) lacked some key features, whereas a third party one, pulsar-client-dotnet, had far more features, but was still somewhat behind the Java clients.
Caveat is that I looked into all of this when Pulsar was at version 2.6, it's now at 2.7.1, so my comments may well be out of date.