Hacker News
by annnoo 1808 days ago
This is really interesting for small scale projects.

If you are a solo dev and really want/need data streams, then you can only choose between Kafka and Redis Streams at the moment. While there are client libraries for Kafka for nearly every programming language, you really don't wanna manage your own Kafka instance and, to be fair, it really feels like overkill.

So RabbitMQ seems to fit into the same spot as Redis Streams: easily manageable and usable streams, even for smaller projects. I really hope that this will be usable from many different client libraries soon.

6 comments

Redpanda [1] is fantastic if you want a solid Kafka clone. It's written in C++, with no external dependencies; it has Raft consensus built in. The API is Kafka-compatible.

There are some missing features compared to Kafka (no dynamic rebalancing of partitions being the top one), but they're also rolling out some new features that Kafka does not have, such as transformation pipelines [2].

NATS's brand new JetStream [3] feature is also looking very promising. It uses Raft internally. NATS itself is rock solid, and JetStream builds on that foundation.

NATS also has very interesting support for replication topologies, meaning you can build graphs of streams that each feed into other streams, so you can do the publishing and consumption in different locations, with different availability constraints.

[1] https://vectorized.io/redpanda

[2] https://vectorized.io/blog/wasm-architecture/

[3] https://docs.nats.io/jetstream/jetstream

Do you have actual production experience with RedPanda? If so I would love to hear about it. We found out that there is something of a wall at 10k partitions per broker before things start failing/under-replicating without warning or any issues outside of the URPs (under-replicated partitions). This appears to be a limitation of zookeeper and fetching metadata. We are fighting this by raising timeouts and such, but this blindsided us, and really the solution is to get people to stop creating topics with dozens of partitions when they aren't needed.

I took a look at redpanda this week, it sounded nice on paper, but them being a young company, I am concerned about what "gotchas" we are going to run into.

I've only run it as part of stress tests when evaluating it for a new application, and I found it to be a pleasure to work with.

We did push it to more than 10k partitions, but I honestly don't remember how that affected it; that's when I discovered that partitions cannot be dynamically rebalanced, which meant we'd have to change the way we would use it.

Interestingly, we also did a similar test with NATS JetStream, which did start struggling around 10,000 consumers. (A consumer in NATS is similar to a partition, as it has its own Raft group.) What I tried to do with JetStream goes against the grain a bit, mind you; I still think it's an excellent piece of software.

We have some great work coming that allows lighter weight consumers to scale to that level and beyond. Happy to chat with folks on how we can make that work today.
You might want to check out NATS[0], it starts out with RabbitMQ-like functionality (message queue) and can also do Kafka things (distributed persistent log) via NATS JetStream[1] or Liftbridge[2].

It's riding the wave of CNCF cloud-buzzword projects but don't let that scare you off -- generally that means that it is actually really easy to set up and operate, and does most of the things you'd expect via well structured pre-inserted configuration which is a plus. The devil is still in the details though, so read the docs thoroughly to make sure it fits your use case.

[0]: https://nats.io/

[1]: https://docs.nats.io/jetstream/jetstream

[2]: https://liftbridge.io/

NATS has different characteristics. AFAIR, it disconnects clients that are not consuming or not consuming fast enough.

For regular messaging

So there is base NATS and abstractions built upon NATS. Which are you referring to?

There's documentation on this bug/feature[0] and it looks like using NATS Streaming is a way to fix this. I will admit that it's a bit annoying to figure out the difference between NATS Streaming, JetStream and Liftbridge, but I don't think this issue affects all 3. JetStream is essentially NATS Streaming built into the NATS binary itself, so switching to it should produce the feature set you're looking for.

[0]: https://docs.nats.io/nats-server/nats_admin/slow_consumers
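For anyone who hasn't hit the slow consumer behaviour before, here is a toy in-memory model of the idea in Python. It is not NATS code and doesn't use any NATS client; `ToyServer`, `ToySubscriber` and `max_pending` are made-up names, and the assumption is simply "bounded buffer per subscriber, drop the subscriber when it overflows". The real limits and behaviour are in the linked docs.

```python
from collections import deque


class ToySubscriber:
    """A subscriber with a bounded buffer of undelivered messages."""

    def __init__(self, name, max_pending):
        self.name = name
        self.max_pending = max_pending  # hypothetical per-client limit
        self.pending = deque()
        self.connected = True

    def drain(self, n):
        """Simulate the client reading up to n messages; returns them."""
        taken = []
        while self.pending and len(taken) < n:
            taken.append(self.pending.popleft())
        return taken


class ToyServer:
    """Fire-and-forget publisher that protects itself from slow readers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, name, max_pending):
        sub = ToySubscriber(name, max_pending)
        self.subscribers.append(sub)
        return sub

    def publish(self, message):
        for sub in self.subscribers:
            if not sub.connected:
                continue
            if len(sub.pending) >= sub.max_pending:
                # Buffer is full: cut the slow client off instead of
                # letting it hold back everyone else.
                sub.connected = False
                sub.pending.clear()
            else:
                sub.pending.append(message)
```

A subscriber that keeps calling `drain` stays connected; one that never reads gets dropped once `max_pending` messages are waiting, and the messages it missed are simply gone. That is the trade-off a persistent log on top avoids.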

NATS JetStream is very different, implementation-wise, from NATS Streaming. The latter is deprecated and no longer under development.

JetStream is impressive and looks very promising. I did some stress testing recently and found some performance issues and possible bugs, but I wouldn't hesitate to put it into production.

Completely agree with that. I was referring to base NATS.

RabbitMQ plays in the field of traditional general-purpose messaging systems. NATS specialized in performance. Kafka focused on streams. Each focus comes with trade-offs, which is all fine. That is all I wanted to point out.

>you really don't wanna manage your own Kafka instance

So much of my life would be better right now if you didn't just describe me.

Care to provide some details on this?
Let's say you have a small team and dev, test, prod environments. You want high availability. You need kafka and zookeeper servers, 2-3 of each so let's say 5 total. Dev can be single servers, but you should probably have an environment that closely mirrors prod, so for your Kafka stack alone you have to manage 12 servers: 5 each for prod and test, 2 for dev.

Then you probably have Kafka connect running somewhere. That's another handful of servers. Maybe Kafka streams is a few more servers. Then you're going to have servers that collect events and publish them to the server. How many more servers is that per environment?

Congratulations, you now have a state of the art event streaming enterprise grade platform, and 20 servers to manage. Better hope your company gets on board with the real time model, otherwise you're now the owner of these pet servers until the end of time.

What's that, you ran into a rare kafka bug that caused your offsets to be lost after a reboot and now the bus is pushing millions of messages down the pipe to all the consumers? Wow, that sure sucks, hope you can juggle your day job with this massive production issue.

Probably because....zookeeper
I find zookeeper to be the easy part. Managing Kafka’s byzantine auth systems, and applying upgrades to clients and servers that change a million things every release, and dealing with the shitstorm of gigabytes of text logs, and dealing with tuning mistakes (don’t you dare leave the file descriptor limit at the OS default!) just sucks up so much attention.
I thought Kafka stopped using zookeeper.
They're working toward that goal.
2.8.0 is out and it supports zk-less. But I'm not too sure the alternative is much better.
I think there are others?

I use google pubsub as a stream - it's just me developing our tools. I find it very easy to use and just works.

But maybe that doesn't count as a 'stream' under that definition?

I use GAE to accept super fast, bursty webhook data (1000s/second) and pass it to pubsub, which triggers a cloud function to write to DBs. The cloud function retries if there is a write error or timeout or something.

It worked so well that I now use this as a kind of micro-service for all the DB writes I have to do. I'm now also splitting out other 'processing' services that don't need to respond to the request with data; for instance, one 'service' verifies and formats cell phone numbers with Twilio and then updates that user's profile.
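Roughly, the retry part looks like the sketch below. This is plain Python with a stand-in database, not Google Cloud Pub/Sub or Cloud Functions code; `FlakyDB`, `handle_message` and `max_attempts` are hypothetical names, and a real setup would normally lean on the platform's own redelivery rather than a loop like this.

```python
class FlakyDB:
    """Stand-in database that fails the first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.rows = []

    def write(self, row):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("simulated write timeout")
        self.rows.append(row)


def handle_message(db, message, max_attempts=5):
    """Try to write one message; return the number of attempts used.

    Re-raises the last error if every attempt fails, which is the
    point where a real system would hand the message back for
    redelivery.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            db.write(message)
            return attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise
```

Since a retried message can end up being processed more than once, it helps if the write is idempotent (for example, keyed on an ID carried in the message).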

For me one of the nicest things is that you can have both "traditional" messaging _and_ streaming in the same system. Feeding messages published to exchanges into both queues (for processing) and streams for archiving, auditing and analysis.

Best of both worlds :)
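To illustrate the difference between the two destinations, here is a small in-memory model in Python. It does not talk to RabbitMQ or use any client library; `ToyQueue`, `ToyStream` and `ToyFanoutExchange` are invented names that only mimic the semantics: a queue hands each message out once, a stream keeps it and can be re-read.

```python
from collections import deque


class ToyQueue:
    """Classic queue: a message is gone once it has been consumed."""

    def __init__(self):
        self._items = deque()

    def deliver(self, message):
        self._items.append(message)

    def consume(self):
        return self._items.popleft() if self._items else None


class ToyStream:
    """Stream: an append-only log that can be read again from any offset."""

    def __init__(self):
        self._log = []

    def deliver(self, message):
        self._log.append(message)

    def read_from(self, offset):
        return list(self._log[offset:])


class ToyFanoutExchange:
    """Copies every published message to all bound destinations."""

    def __init__(self):
        self._bindings = []

    def bind(self, destination):
        self._bindings.append(destination)

    def publish(self, message):
        for destination in self._bindings:
            destination.deliver(message)
```

Bind one `ToyQueue` and one `ToyStream` to the same exchange and publish a few messages: the workers empty the queue, while the stream still holds the full history for auditing or analysis later.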

can you give some examples why i'd want a data stream for a 'small scale project' ?
Integrations and webhooks are very well suited for streams. Having your core product emit messages as events has many benefits:

- You can have a team working on the core product sending messages and another working on integrations triggering actions.

- If your integrations fail, your core product is not impacted. You can also replay old messages once you've fixed the stream consumer.

- Having streams allows you to do all kinds of experiments too: you can connect a new project to a stream and go through a week of data almost instantly and see the result... rinse and repeat as much as necessary
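The "replay once you've fixed the consumer" point can be sketched like this. It's a minimal Python model of an append-only log with consumer-side offsets, not the API of any particular broker; `EventLog` and `run_consumer` are hypothetical names.

```python
class EventLog:
    """Append-only log; consumers track their own read position."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, start=0):
        """Yield (offset, event) pairs from `start` onwards."""
        for offset in range(start, len(self._events)):
            yield offset, self._events[offset]


def run_consumer(log, handler, start=0):
    """Feed events to `handler`; return the offset to resume from.

    Stops at the first event the handler cannot process, so nothing
    after a failure is skipped.
    """
    next_offset = start
    for offset, event in log.read(start):
        try:
            handler(event)
        except Exception:
            break
        next_offset = offset + 1
    return next_offset
```

The producer keeps appending no matter what the consumer does. When a handler blows up, you keep the returned offset, deploy a fixed handler, and call `run_consumer` again with `start` set to that offset (or to 0 to reprocess everything).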

Event-based architectures, for example: https://martinfowler.com/eaaDev/EventSourcing.html For some types of applications moving to a model with a stream of events as the source of truth can solve a ton of hairy distributed computing problems.
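A minimal sketch of what "stream of events as the source of truth" means, in Python. The bank-balance domain and the names `Event`, `apply` and `replay` are made up for illustration and are not taken from the linked article: current state is never stored directly, it is recomputed by folding over the event history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # "deposited" or "withdrawn"
    amount: int


def apply(balance, event):
    """Return the new balance after one event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, initial=0):
    """Rebuild current state by folding over the full event history."""
    balance = initial
    for event in events:
        balance = apply(balance, event)
    return balance
```

Replaying a prefix of the history gives you the state as of that point in time, which is where a lot of the auditing and debugging benefits come from.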
How is that not possible in the exchanges + queues model that Rabbit has supported forever?
I was recently thinking about this: let’s say you built a chrome extension and want to collect some basic usage analytics (with explicit user consent and knowledge, while preserving privacy) - you could batch-send activities at intervals to a REST-style API, but it would be nicer to handle them as a stream (eg to respond in real-time somehow).
Something lightweight like MQTT is also well suited for this. It was originally designed for telemetry messages in IoT situations but it also supports websockets so it can be used in web applications or browser extensions.
> eg to respond in real-time somehow

Make the api call in real-time. No batching. Pipe the api request into your streaming service.