
Does anyone use https://nats.io here? I have heard good things about it. I would love to hear about the comparisons between nats.io and kafka


I don't have Kafka experience, but NATS is absolutely amazing. Just a complete pleasure to use, in every way.

https://www.synadia.com/blog/nats-and-kafka-compared


NATS is very good. It's important to distinguish between core NATS and Jetstream, however.

Core NATS is an ephemeral message broker. Clients tell the server what subjects they want messages about, producers publish. NATS handles the routing. If nobody is listening, messages go nowhere. It's very nice for situations where lots of clients come and go. It's not reliable; it sheds messages when consumers get slow. No durability, so when a consumer disconnects, it will miss messages sent in its absence. But this means it's very lightweight. Subjects are just wildcard paths, so you can have billions of them, which means RPC is trivial: Send out a message and tell the receiver to post a reply to a randomly generated subject, then listen to that subject for the answer.
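To make the subject model concrete, here's a toy Go sketch of NATS-style wildcard matching (`*` matches exactly one dot-separated token, `>` matches the rest). This is simplified for illustration, not the server's actual matching code:

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a concrete subject matches a NATS-style
// pattern: "*" matches exactly one dot-separated token, ">" matches one
// or more trailing tokens. Simplified illustration only.
func subjectMatches(pattern, subject string) bool {
	pt := strings.Split(pattern, ".")
	st := strings.Split(subject, ".")
	for i, p := range pt {
		if p == ">" {
			return i < len(st) // ">" must match at least one token
		}
		if i >= len(st) {
			return false
		}
		if p != "*" && p != st[i] {
			return false
		}
	}
	return len(pt) == len(st)
}

func main() {
	fmt.Println(subjectMatches("orders.*.created", "orders.eu.created")) // true
	fmt.Println(subjectMatches("orders.>", "orders.eu.created.v2"))      // true
	fmt.Println(subjectMatches("orders.*", "orders.eu.created"))         // false
	fmt.Println(subjectMatches("_INBOX.abc123", "_INBOX.abc123"))        // true
}
```

The `_INBOX.abc123` line hints at the RPC trick: the requester subscribes to a randomly generated inbox subject and tells the responder to reply there.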

NATS organizes brokers into clusters, and clusters can form hub/spoke topologies where messages are routed between clusters by interest, so it's very scalable; if your cluster doesn't scale to the number of consumers, you can add another cluster that consumes the first cluster, and now you have two hubs/spokes. In short, NATS is a great "message router". You can build all sorts of semantics on top of it: RPC, cache invalidation channels, "actor" style processes, traditional pub/sub, logging, the sky is the limit.

Jetstream is a different technology that is built on NATS. With Jetstream, you can create streams, which are ordered sequences of messages. A stream is durable and can have settings like maximum retention by age and size. Streams are replicated, with each stream being a Raft group. Consumers follow from a position. In many ways it's like Kafka and Redpanda, but "on steroids", superficially similar but just a lot richer.
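For a rough idea of what those settings look like, here's a hypothetical stream definition using the JetStream API's JSON field names (the stream name and limits are made up for illustration; `max_age` is in nanoseconds, so this is roughly 7 days, 10 GiB, 3 Raft replicas):

```json
{
  "name": "ORDERS",
  "subjects": ["orders.>"],
  "retention": "limits",
  "max_age": 604800000000000,
  "max_bytes": 10737418240,
  "num_replicas": 3,
  "storage": "file"
}
```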

For example, Kafka is very strict about the topic being a sequence of messages that must be consumed exactly sequentially. If the client wants to subscribe to a subset of events, it must either filter client-side, or you have some intermediary that filters and writes to a topic that the consumer then consumes. With NATS, you can ask the server to filter.

Unlike Kafka, you can also nack messages; the server keeps track of what consumers have seen. Nacking means you lose ordering, as the nacked messages come back later. Jetstream also supports a Kafka-like strictly ordered mode. Unlike Kafka, clients can choose the routing behaviour, including worker style routing and deterministic partitioning.

Unlike Kafka's rigid networking model (consumers are assigned partitions and they consume the topic and that's it), as with NATS, you can set up complex topologies where streams get gatewayed and replicated. For example, you can have streams in multiple regions, with replication, so that consumers only need to connect to the local region's hub.

While NATS/Jetstream has a lot of flexibility, I feel like they've compromised a bit on performance and scalability. Jetstream clusters don't scale to many servers (they recommend max 3, I think) and large numbers of consumers can make the server run really hot. I would also say that they made a mistake adopting nacking into the consuming model. The big simplification Kafka makes is that topics are strictly sequential, both for producing and consuming. This keeps the server simpler and forces the client to deal with unprocessable messages. Jetstream doesn't allow durable consumers to be strictly ordered; what the SDK calls an "ordered consumer" is just an ephemeral consumer. Furthermore, ephemeral consumers don't really exist. Every consumer will create server-side state. In our testing, we found that having more than a few thousand consumers is a really bad idea. (The newest SDK now offers a "direct fetch" API where you can consume a stream by position without registering a server-side consumer, but I've not yet tried it.)

Lastly, the mechanics of the server replication and connectivity are rather mysterious, and it's hard to understand when something goes wrong. And with all the different concepts — leaf nodes, leaf clusters, replicas, mirrors, clusters, gateways, accounts, domains, and so on — it's not easy to understand the best way to design a topology. The Kafka network model, by comparison, is very simple and straightforward, even if it's a lot less flexible. With Kafka, you can still build hub/spoke topologies yourself by reading from topics and writing to other topics, and while it's something you need to set up yourself, it's less magical, and easier to control and understand.

Where I work, we have used NATS extensively with great success. We also adopted Jetstream for some applications, but we've soured on it a bit, for the above reasons, and now use Redpanda (which is Kafka-compatible) instead. I still think JS is a great fit for certain types of apps, but I would definitely evaluate the requirements carefully first. Jetstream is different enough that it's definitely not just a "better Kafka".


> Jetstream clusters don't scale to many servers (they recommend max 3, I think)

Jetstream is even more limited than most Kafka deployments in the number of streams it supports: https://github.com/nats-io/nats-server/discussions/5128#disc...


That's an amazing analysis, thank you so much!

What are your impressions of Redpanda?

We're particularly interested in NATS' feature of working with individual messages and have been bitten by Kafka's "either process the entire batch or put it back for later processing", which doesn't work for our needs.

Interested if Redpanda is doing better than either.


Redpanda is fantastic, but it has the exact same message semantics as Kafka. They don't even have their own client; you connect using the Kafka protocol. Very happy with it, but it does have the same "whole batch or nothing" approach.

NATS/Jetstream is amazing if it fits your use case and you don't need extreme scalability. As I said before, it offers a lot more flexibility. You can process a stream sequentially but also nack messages, so you get the best of both worlds. It has deduping (new messages for the same subject will mark older ones as deleted) and lots of other convenience goodies.
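The dedupe behaviour (keeping only the latest message per subject) can be modelled with a toy Go sketch; this illustrates the semantics only, not the server implementation:

```go
package main

import "fmt"

// latestPerSubject models per-subject retention where a new message on a
// subject supersedes the older one, as described above. Each entry is a
// {subject, payload} pair, in publish order. Toy model only.
func latestPerSubject(msgs [][2]string) map[string]string {
	latest := map[string]string{}
	for _, m := range msgs {
		latest[m[0]] = m[1] // later message replaces earlier for same subject
	}
	return latest
}

func main() {
	msgs := [][2]string{
		{"config.db", "v1"},
		{"config.cache", "v1"},
		{"config.db", "v2"}, // marks the earlier config.db message deleted
	}
	fmt.Println(latestPerSubject(msgs)["config.db"]) // v2
}
```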


Thank you so much again. Yes, we are not Google scale; our main priority is durability, and scalability only up to a (I'd say fairly modest) point. I.e. be able to have one beefy NATS server do it all and only add a second one when things start getting bad. Even 3 servers we'd see as a strategic defeat. Plus, we do have data, but again, very far from Google scale.

We've looked at Redis streams but me and a few others are skeptical as Redis is not known for good durability practices (talking about the past; I've no idea if they pivoted well in the last years) and sadly none of us has any experience with MQTT -- though we heard tons of praise on that one.

But our story is: some tens of terabytes of data, no more than a few tens of millions of events / messages a day, aggressive folding of data into multiple relational DBs, and a very dynamic and DB-heavy UI (I will soon finish my Elixir<=>Rust SQLite3 wrapper, so we're likely going to start sharding the DB-intensive customer data to separate SQLite3 databases, and I'm looking forward to spearheading this effort; off-topic). For our needs NATS Jetstream sounds like exactly the right fit, though time will tell.

I still have the nagging feeling of missing out by not having tried MQTT, though...


At that scale, Jetstream should work well. In my experience, Jetstream's main performance weakness is the per-stream/consumer overhead: Too many and NATS ends up running too hot due to all the state updates and Raft traffic. (Each stream is a Raft group, but so is each consumer.)

If it's tens of TB in a stream: I've not personally stored that much data in a stream, but I don't see why it wouldn't handle it. Note that Jetstream has a maximum message size of 1MB (this is because Jetstream uses NATS for its client/server protocol, which has that limit), which was a real problem for one use case I had. Redpanda has essentially no upper limit.

Note that number of NATS servers isn't the same as the replication factor. You can have 3 servers and a replication factor of 2 if you want, which allows more flexibility. Both consumers and streams have their own replication factors.

The other option I have considered in the past is EMQX, which is a clustered MQTT system written in Erlang. It looks nice, but I've never used it in production, and it's one of those projects that nobody seems to be talking about, at least not in my part of the industry.


Well I work mainly with Elixir in the last 10-ish years (with a lot of Rust and some Golang here and there) so EMQX would likely be right up my alley.

Do you have any other recommendations? The time is right for us and I'll soon start evaluating. I only have NATS Jetstream and MQTT on my radar so far.

Kafka I already used and rejected for the reasons above ("entire batch or nothing / later").

As for data, I meant tens of terabytes of traffic on busy days, sorry. Most of the time it's a few hundred gigs. (Our area is prone to spikes and the business hours matter a lot.) And again, that's total traffic. I don't think we'd have more than 10-30GB stored in our queue system, ever. Our background workers aggressively work through the backlog and chew data into manageable (and much smaller) chunks 24/7.

And as one of the seniors I am extremely vigilant about payload sizes. I had to settle on JSON for now, but I push back, hard, on any and all extra data; anything and everything that can be loaded from the DB or even caches is delegated as such with various IDs. This also helps us with e.g. background jobs that are no longer relevant because a certain entity's state has moved too far forward due to user interaction and the enriching job no longer needs to run; when you have only references in your message payload, this enables (and even forces) the background job to load data exactly at the time of its run and not assume a potentially outdated state.

Anyhow, I got chatty. :)

Thank you. If you have other recommendations, I am willing to sacrifice a little weekend time to give them a cursory research. Again, utmost priority is 100% durability (as much as that is even possible of course) and mega ultra speed is not of the essence. We'll never have even 100 consumers per stream; I haven't ever seen more than 30 in our OTel tool dashboard.

EDIT: I should also say that our app does not have huge internal traffic; it's a lot (Python wrappers around AI / OCR / others is one group of examples) but not huge. As such, our priorities for a message queue are just "be super reliable and be able to handle an okay beating and never lose stuff" really. It's not like in e.g. finance where you might have dozens of Kafka clusters and workers that hand off data from one Kafka queue to another with a ton of processing in the meantime. We are very far from that.


Those are the two I can think of.

Jetstream is written in Go and the Go SDK is very mature, and has all the support one needs to create streams and consumers; never used it from Elixir, though. EMQX's Go support looks less good (though since it's MQTT you can use any MQTT client).

Regarding data reliability, I've never lost production data with Jetstream. But I've had some odd behaviour locally where everything has just disappeared suddenly. I would be seriously anxious if I had TBs of stream data I couldn't afford to lose, and no way to regenerate it easily. It's possible to set up a consumer that backs up everything to (say) cloud storage, just in case. You can use Benthos to set up such a pipeline. I think I'd be less anxious with Kafka or Redpanda because of their reputation in being very solid.

Going back to the "whole batch or nothing", I do see this as a good thing myself. It means you are always processing in exact order. If you have to reject something, the "right" approach is an explicit dead-letter topic — you can still consume that one from the same consumer. But it makes the handling very explicit. With Jetstream, you do have an ordered stream, but the broker also tracks acks/nacks, which adds complexity. You get nacks even if you never nack manually; all messages have a configurable ack deadline, and if your consumer is too slow, the message will be automatically bounced. (The ack delay also means that if a client crashes, the message will sit in the broker for up to the ack delay before it gets delivered to another consumer.)

But of course, this is super convenient, too. You can write simpler clients, and the complicated stuff is handled by the broker. But having written a lot of these pipelines, my philosophy these days is that — at least for "this must not be allowed to fail" processing, I prefer something that is explicit and simpler and less magical, even if it's a bit less convenient to write code for it. Just my 2 cents!

This is getting a bit long. Please do reach out (my email is in my profile) if you want to chat more!


Thanks for appreciating NATS.io, and thanks for your comment!

> Jetstream clusters don't scale to many servers (they recommend max 3, I think)

You can have clusters with many servers in them; 3 is actually the minimum required if you want fault tolerance. That's how you scale JetStream horizontally: you let the streams (which can be replicated 1, 3 or 5 times) spread themselves over the servers in the cluster.

JetStream consumers create state on the server (or servers if they are replicated) and are either durable with a well known name or 'auto-cleanup after idle time' (ephemeral), and indeed allow you to ack/nack each message individually rather than just having an offset.

However, that is with the exception of 'ordered consumers', which are really the closest equivalent to Kafka consumers in that the state is kept in the client library rather than on the servers. They deliver messages to the client code strictly in order; they take care of re-deliveries and recovering from things like getting disconnected from a server, and there is no need to explicitly ack the messages. And if you want to persist your offset (which is the sequence number of the last message the ordered consumer delivered), just like the consumer group clients in Kafka persist their offset in a stream, you would persist your offset in a NATS KV bucket.

And indeed you can now even go further and use batched direct gets to get very good read speed from the stream with no extra server state besides an entry in the offset KV; the performance of batched direct gets is very high and can match the ordered consumer's speed. Besides incurring no server state, another advantage of stateless consuming is that all the servers replicating the stream will be used to process direct get requests, not just the currently elected leader (don't forget to enable direct gets for the stream; it's not on by default). So you can scale the read throughput horizontally by increasing the number of replicas.

The mechanics of replication: streams and stateful consumers can be replicated over 1, 3 or 5 servers. Servers connect directly together to form a cluster, and JetStream assets (streams/consumers) are spread out over the servers in the cluster. Clusters can be connected together to form super-clusters. Super-cluster means that access to JetStream assets is transparent: streams/consumers located in one cluster can be accessed from any other cluster. You can have streams that mirror or source from other streams, and those mirrors could be located in other clusters to offer faster local access. You can easily move JS assets on the fly from one cluster to another. Leaf nodes are independent servers (which can be clustered) that connect to a cluster like a client would. Being independent means they have their own security for their own clients to connect to them, they can have their own JS domain, and you can source to/from streams between the leaf node's domain and the hub (super-cluster). Leaf nodes can be daisy-chained.


> You can have clusters with many servers in them

Sorry, what I meant was that each stream (which forms a Raft group) doesn't scale to more servers. I thought it was 3, but thanks for the correction.

Everything else you wrote confirms what I wrote, no? As for batch direct gets, that's great, but I'm not sure why you didn't go all the way and offer a Kafka-type consumer API that is strictly ordered and persists the offset natively. I've indeed written an application that uses ordered consumers and persists the offset, but it is cumbersome.

Every time I've used Jetstream, what I've actually wanted was the Kafka model: Fetch a batch, process the batch, commit the batch. Having to ack individual messages and worry about AckWait timeouts is contrary to that model. It's a great programming model for core NATS, but for streams I think you guys made a design mistake there. A stream shouldn't act like pub/sub. I also suspect (but can't prove) that this leads to worse performance and higher load on the cluster, because every message has to go through the ack/nack roundtrips.

I'd also like to point out that Jetstream's maximum message size of 1MB is a showstopper. Yes, you can write big messages somewhere else and reference them. But that's more work and more complexity. With Kafka/Redpanda, huge messages just work, and are not particularly a technical liability.


> Sorry, what I meant that each stream (which forms a Raft group) doesn't scale to more. I thought it was 3, but thanks for the correction.

Streams can have more than 3 replicas. Technically they can have any number of replicas but you only get extra HA when it's an odd number (e.g. 6 replicas doesn't offer more HA than 5, but 7 does). Typically the way people scale to more than one stream when a single stream becomes a bottleneck is by using subject transformations to insert a partition number in the subject and then creating a stream per partition.

Point taken about wanting to have the 'ordered consumer + persist the offset in a KV' built-in, though it should really not be cumbersome to write. Maybe that could be added to orbit.go (and we definitely welcome well written contributions BTW :)).

> Having to ack individual messages and worry about AckWait timeouts is contrary to that model

Acking/nacking individual messages is the price to pay for being able to have proper queuing functionality on top of streams (without forcing people to create partitions), including automated re-delivery of messages and one-to-many message consumption flow control.

However, it is not mandatory: you can set any ack policy you want on a consumer. AckAll is somewhat like committing an offset in Kafka (it acks the given sequence number and all prior sequence numbers), or you can simply use AckNone, meaning you forgo message acknowledgements completely (the server will still remember the last sequence number delivered, i.e. the offset, automatically).

For example, using a pull consumer with ack policy=none and doing 'fetch' to get batches of messages is exactly what you describe wanting to do (and functionally not different from using an ordered consumer and persisting the offset).

And yes, having acks turned on or off on a consumer does have a significant performance impact: nothing comes for free and explicit individual message acking is a very high quality of service.

As for the max message size, you can easily increase that in a server setting. Technically you can set it all the way up to 32 MB if you want to use JetStream, and up to 64 MB if you just want to use Core NATS. However, many would advise you not to increase it over 8 or 16 MB, because the larger the messages are, the greater the potential for things like latency spikes (think 'head-of-line blocking'), increased memory pressure, more slow consumers, etc.


I got really pissed off with their field CTO for essentially trying to pull the wool over my eyes regarding performance and reliability.

Essentially their base product (NATS) has a lot of performance, but trades away reliability to get it. So they add Jetstream to NATS to get reliability, but quote the performance numbers of pure NATS.

I got burned by MongoDB for doing this to me, I won’t work with any technology that is marketed in such a disingenuous way again.


Don't implement any distributed technology until aphyr has put it through its paces, and even then... pilot it first.


https://aphyr.com/about

"Unavailable Due to the UK Online Safety Act"

:(


You mean Jetstream?

Can you point to where they are using core NATS numbers to describe Jetstream?


Yes, I meant Jetstream (I even typed it but second-guessed myself, my mistake). I'm typing these when I get a moment as I'm at a wedding, so I apologise.

The issue in the docs was that there are no available Jetstream numbers, so I talked over a video call to the field CTO, who cited the base NATS numbers to me. When I pressed him on whether it was with Jetstream, he said that it was without; so I asked for them with Jetstream enabled and he cited the same numbers back to me. Even when I pressed him again that "you just said those numbers are without Jetstream", he said that it was not an issue.

So, I got a bit miffed after the call ended; we spent about 45 minutes on the call, and this was the main reason to have the call in the first place, so I am a bit bent about it. Maybe it's better now; this was a year ago.


This doesn’t really support your position as far as most readers are concerned - it sounds like a disconnect. If they didn’t do this in any ad copy or public docs it’s not really in Mongo territory.


I don’t really care.

I'm telling you why I am skeptical of any tech that intentionally obfuscates trade-offs; I'm not making a comparison of which of these is worse. And I don't really care if people take my anecdote seriously either: they should draw their own conclusions.

However, it might help people go into a topic about performance and reliability from a more informed position.


I don't doubt your experience. But I think it might have been more just that guy, than NATS in general.

The other day I was listening to a podcast with their CEO from maybe 6 months ago, and he talked quite openly about how Jetstream and consumers add considerable drag compared to normal pub/sub. And, more generally, how users unexpectedly use and abuse NATS, and how they've been able to improve things as a result.


It's deceptive if true. Why are you trying to spin it as OK because the deception wasn't published?


It's not OK if there was deception, but it sounds just as likely that it's a communication disconnect in their call. We only have one side of it.


As the person in question I feel compelled to answer this: first of all, my apologies if I managed to piss you off; I certainly didn't mean to!

It looks like you got frustrated by my refusing to give performance figures for JetStream. I always say in meetings that, because there are too many factors that greatly affect JetStream performance (especially compared to Core NATS, which mostly just depends on network I/O), I cannot just give a number, as it would likely not accurately reflect (for better or worse!) the number that you would actually see in your own usage. Rather, you should use the built-in `nats bench` tool to measure the performance for yourself, for your kind of traffic requirements and usage patterns, in your target deployment environment and with your HA requirements.

On top of that, the performance of the software itself is still evolving as we release new versions that improve things and introduce new features (e.g. JetStream publication batches, batched direct gets) that greatly improve some performance numbers.

I assure you that I just don't want to give anyone some number and then have you try it for yourself and be unable to match that number, nothing more! We literally want you to measure the performance for yourself rather than give you some large number. And that's also why the docs don't have any JetStream performance numbers. There is no attempt at any kind of disingenuity, marketing, or pulling wool over anyone's eyes.

And I would never ever claim that JetStream yields the same performance numbers as Core NATS, that's impossible! JetStream does a lot more and involves a lot more I/O than Core NATS.

However, if I get pressed for numbers in a meeting: I do know the orders of magnitude that NATS and JS operate at, and I will even be willing to say with some confidence that Core NATS performance numbers are pretty much always going to be up to the 'millions of messages per second'. But I will remain very resistant to claiming any specific JS performance numbers, because in the end the answers are 'it depends' and 'how long is a piece of string', and you can scale JetStream throughput horizontally using more streams just like you can scale Kafka's throughput by using more partitions.

Now in some meetings some people don't like that non-answer and really want to hear some kind of performance number, so I normally turn the question around and ask them what their target message rates and sizes are going to be. If their answer is in the 'few thousands of messages per second' range (like it is in your case, if I'm not mistaken about the call in question) then, as I do know that JetStream typically comfortably provides performance well in excess of that, I can say with confidence that _at those kinds of message rates_ it doesn't matter whether you use Core NATS or JetStream: JetStream is plenty fast enough. That's all I mean!


And I would add: as soon as you are using more than one stream (e.g. sharding using Core NATS subject transformation), because JetStream throughput scales horizontally just like Kafka throughput scales horizontally as you add more partitions and more servers in the cluster, I feel reasonably confident saying that _in most cases_ it doesn't really matter what the target number of messages per second is, as you can create a cluster large enough to provide that aggregated throughput. In properly distributed systems, the answer to the benchmark-number question truly is 'how long is a piece of string'.


There is a good comparison between NATS, Kafka, and others here: https://docs.nats.io/nats-concepts/overview/compare-nats


Maybe needs a neutral party comparison :)

The delivery guarantees section alone doesn't make me trust it. You can do at-least-once or at-most-once with Kafka. Exactly-once is mostly a lie; it depends on the downstream system: unless you're going back to the same system, the best you can do is at-least-once with idempotency.


It’s on the NATS website and “nats” appears in the URL three times, so maybe this isn’t the most objective source.



