by 66fm472tjy7 1442 days ago
We use RMQ for most of our asynchronous processing. In most cases, we get an HTTP call and publish a message to RMQ after committing the DB transaction, then we send the response to the HTTP client.

We found out the hard way that RMQ does not behave like a transactional DB. Just because publishing worked does not mean the message will be delivered.

Our solution is to also write the message into an outbox table in the DB. We then publish the message using confirms[0]. RMQ asynchronously sends us a confirmation when it has really persisted the message. We then delete the outbox entry. If we do not receive the confirmation in time, a timer will re-publish the message.
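
A minimal sketch of the outbox flow described above, not the poster's actual code. It assumes an SQLite table named `outbox`, a `resend_after` timeout, and a `publish` callable standing in for a broker client with publisher confirms enabled; all of these names are hypothetical.

```python
import sqlite3


class Outbox:
    """Outbox table plus the operations the comment describes (illustrative only)."""

    def __init__(self, conn, resend_after=5.0):
        self.conn = conn
        self.resend_after = resend_after  # seconds; arbitrary illustrative value
        conn.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY, body TEXT NOT NULL, last_sent REAL)"
        )

    def add(self, body):
        # Meant to run inside the same DB transaction as the business change.
        cur = self.conn.execute(
            "INSERT INTO outbox (body, last_sent) VALUES (?, NULL)", (body,)
        )
        return cur.lastrowid

    def due(self, now):
        # Never sent, or sent but still unconfirmed after resend_after seconds.
        rows = self.conn.execute(
            "SELECT id, body FROM outbox "
            "WHERE last_sent IS NULL OR last_sent <= ? ORDER BY id",
            (now - self.resend_after,),
        )
        return rows.fetchall()

    def mark_sent(self, msg_id, now):
        self.conn.execute(
            "UPDATE outbox SET last_sent = ? WHERE id = ?", (now, msg_id)
        )

    def confirm(self, msg_id):
        # The broker confirmed the message, so the entry can be deleted.
        self.conn.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))


def publish_due(outbox, publish, now):
    """Timer body: (re)publish every due entry. `publish` is a hypothetical callable."""
    sent = []
    for msg_id, body in outbox.due(now):
        publish(msg_id, body)
        outbox.mark_sent(msg_id, now)
        sent.append(msg_id)
    return sent
```

In this sketch the confirm callback of the client would call `confirm(msg_id)`. Because unconfirmed entries are re-published, a consumer may see the same message more than once, so handlers probably need to be idempotent.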

Therefore I disagree with the suggestion of using a library wrapping the native RMQ one. We are using spring-amqp and this made it harder to understand what is going on. In the end, for a large project you will have to understand nuances of RMQ (and other infrastructure you are using). Using a leaky abstraction over it means you now have to understand both the underlying product and the abstraction.

[0] https://www.rabbitmq.com/confirms.html#publisher-confirms

5 comments

I agree. The pattern I've seen more than once is:

1) Adopt RabbitMQ without any experts on the team
2) Conceal Rabbit/AMQP functionality as much as possible behind a simplifying abstraction, often in multiple layers, often written by non-experts
3) Run into some intractable reliability or scaling problem
4) Have no idea how to solve it because you still don't have any experts
5) Throw a lot of money at the problem, fail
6) Decide to do a very expensive migration to a different system (SNS+SQS, Kafka, etc.)

At that point, you go back to step 1. If you're lucky, somebody has expertise in the new system and the migration can be pulled off successfully. Otherwise, you either end up repeating the whole process or everything goes off the rails when you're halfway migrated to the new system.

This same process happens for all kinds of stuff, not just RabbitMQ, of course.

Kafka is at least a bit simpler than RabbitMQ, though both are very square-ish shaped pegs and are usually forced into very round-ish holes. People regularly think they need a message queue, when they really need a job queue, or message bus. Or even all three, but then they try to hack everything on top of Kafka (or RMQ) ... which can be done, of course, but "results may vary".

For tracking state at scale (but still per-job, per-thing) a Cassandra-like system works best (but preferably a better implementation, e.g. ScyllaDB or Aerospike or some other KV store).

> People regularly think they need a message queue, when they really need a job queue, or message bus

This is one of the reasons I am a really, really big fan of Google Cloud's Task Queues. It allows the stupidest, simplest temporal execution of HTTP invocations.

Currently working on a project in AWS and it's stunning how complicated it is to achieve the same simple need of "I want to execute this HTTP call at this time in the future". It's either AmazonMQ -- using either ActiveMQ or RabbitMQ with plugins -- or hacking around SQS's 15 minute delay limit. In our case, we are going to end up wrapping our messages in an envelope with a delivery time and if it hasn't met the delivery time, we put it back into SQS.
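
A rough sketch of the envelope approach just described, under stated assumptions: the `deliver_at` field name, the `process` callable (the eventual HTTP call), and the `requeue` callable (an SQS send with a delay) are all hypothetical. The only fixed number is the 900-second cap on SQS delays.

```python
import json
import math

SQS_MAX_DELAY = 900  # seconds; SQS caps a message delay at 15 minutes


def wrap(payload, deliver_at):
    """Wrap a payload in an envelope carrying the intended delivery time (epoch seconds)."""
    return json.dumps({"deliver_at": deliver_at, "payload": payload})


def handle(body, now, process, requeue):
    """Process the message if it is due, otherwise put it back on the queue.

    `process` and `requeue` are hypothetical callables, not real SDK functions.
    """
    envelope = json.loads(body)
    remaining = envelope["deliver_at"] - now
    if remaining <= 0:
        process(envelope["payload"])
        return "processed"
    # Not due yet: re-send with the longest delay SQS allows, or less if
    # the delivery time is closer than that.
    requeue(body, min(SQS_MAX_DELAY, math.ceil(remaining)))
    return "requeued"
```

A message due in 2000 seconds would be requeued with delays of 900, 900, and 200 seconds before it is finally processed.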

GCP is highly underrated for how it simplifies control over execution of code. Pub/Sub and Task Queues both have HTTP delivery built in. Couple that with Google Cloud Run and it is a recipe for building almost any type of execution model with much less complexity and overhead

In case you want to avoid an extra envelope, you can also add custom headers to SQS messages. This can be handy if you want to implement that delay hack without parsing message bodies.
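
A small sketch of the header variant, assuming the argument shape of boto3's `send_message` (`QueueUrl`, `MessageBody`, `DelaySeconds`, `MessageAttributes`). The attribute name `deliver_at` and both helper functions are made up for illustration, and nothing here calls AWS.

```python
def build_send_args(queue_url, body, deliver_at, now):
    """Build keyword arguments for an SQS send, with the delivery time as an attribute."""
    remaining = max(0, int(deliver_at - now))
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,  # the body stays untouched, no envelope needed
        "DelaySeconds": min(900, remaining),  # 900 s is the SQS maximum
        "MessageAttributes": {
            # "deliver_at" is a hypothetical attribute name
            "deliver_at": {"DataType": "Number", "StringValue": str(int(deliver_at))}
        },
    }


def is_due(message, now):
    """Check a received message's attribute without parsing its body."""
    attr = message["MessageAttributes"]["deliver_at"]
    return int(attr["StringValue"]) <= now
```

As far as I know, the receiver has to request message attributes explicitly when polling, otherwise they are not included in the response.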
Lol to “Kafka is at least a bit simpler than RabbitMQ”. I’m sorry, what universe does this statement live in?
True, it’s actually a lot simpler than RabbitMQ. People seem to assume that “giant ball of enterprisey Java” means the experience of using it will be complicated. Kafka is extremely reliable, simple to cluster, battle tested (I haven’t hit an actual bug in Kafka in ages), the self-healing is turnkey, and it has way stronger guarantees for clients.

Where it bites people is that it’s not a queue, and scaling is harder than just adding more consumers.

I am curious if anyone has used confluent.io (Kafka as a service). Or is it too cost prohibitive?
Java is a bit simpler than Erlang
I looked it up but couldn't really find information, what would you say is the difference between a "message queue", a "job queue" and a "message bus"?
I just throw them on the wall based on my experiences, maybe they have some agreed upon precise definitions, but I'm not aware :o

message bus: firehose of events, (by default) no ACK. usually multi producer multi consumer. (see also DBus which is more of an RPC layer + service discovery + pub/sub via event listeners)

message queue: usually between components, ACK, but no selective ACK, backpressure, might even have support for "dead letters" (letters not ACKed by any consumer)

job queue: selective ACK, retry, etc.

(there's also the "enterprise service bus", which is similar, but mostly implemented on things like IBM MQ)

I ran into the exact same problem as the author and I fixed it by...reading the documentation.

In my case, a whole team of devs was using RMQ without knowing anything about it. Literally caused many sev-1 occurrences over the years until I resolved all of the issues. It took a datacenter migration that allowed me the opportunity to redesign the entire RMQ infrastructure before I was able to put the whole mess to bed.

Sounds exactly like the story for many adopting NoSQL: adopt RDBMS with no experts, hide it behind an ORM, run into scaling and performance problems, no idea how to solve it, scale the hardware vertically, move to NoSQL.
I’m currently wrestling with a thing at work where someone wrapped a frameworkish library with their own abstraction. Now I’m trying to add a cross-cutting concern that neither my coworker nor the authors thought about, and so instead of punching through three layers of inadequate data passing I’ve got six to deal with and a stutter as well (builder patterns are great, except when they are not).

Having this new failure mode added to all the other ones I’ve already met over the last few decades has colored my perception a bit, and I’m having opinions about how you shouldn’t try to wrap a wrapper, and maybe the best way to live with a bad API is to pass through the yucky bit as quickly as possible - preprocess to see if you can avoid calling it at all, and then avoid asking it to do anything extra the rest of the time.

That part doesn’t feel that transformative to me but maybe I’m wrong. What’s bigger and stickier for me is that I now have to think about some NIH code we wrote that deeply bothers me, and decide if I still don’t like it, or if the author had the same conclusion and this was their answer.

If you have to write all messages to the DB, why use RMQ at all and not just read the messages from the DB?
How quickly does RMQ ack the message? Obviously too long to delay an HTTP response, or you’d have skipped the DB part of this; but this seems kind of clunky. I know Kafka has (optional, tunable) acknowledgements for publication, for example, that you could use for this.
In the first iteration of using confirms, we did not have the outbox but only logged how long it took to get the confirmation. After 3 seconds, we would throw out the expected confirmation. If a confirmation took longer than that, we would log that we received an unknown confirmation.
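
Roughly what that first iteration could look like, as a sketch rather than the real code. The `ConfirmTracker` class, its method names, and the log wording are hypothetical; only the 3-second cutoff comes from the description above.

```python
class ConfirmTracker:
    """Tracks outstanding publisher confirms and logs their latency (illustrative only)."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.pending = {}  # delivery tag -> time the message was published
        self.log = []

    def published(self, tag, now):
        self.pending[tag] = now

    def expire(self, now):
        # Throw out expected confirmations that are older than the timeout.
        for tag, sent_at in list(self.pending.items()):
            if now - sent_at > self.timeout:
                del self.pending[tag]
                self.log.append(f"gave up waiting for confirm {tag}")

    def confirmed(self, tag, now):
        sent_at = self.pending.pop(tag, None)
        if sent_at is None:
            # Either it arrived after we gave up, or it was never expected.
            self.log.append(f"unknown confirm {tag}")
            return None
        latency = now - sent_at
        self.log.append(f"confirm {tag} after {latency:.3f}s")
        return latency
```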

We hoped it would be fast enough that we could just wait for the confirmation before committing the transaction.

The official documentation says:

> This means that under a constant load, latency for basic.ack can reach a few hundred milliseconds

I never did statistics, just looked at the log. IIRC most were acceptable but > 3s occurred frequently enough (and we even had instances of messages never being confirmed, IIRC) that we abandoned that plan.

We considered using Debezium[0], but decided on the current solution as it could be solved entirely with the current services and infrastructure whereas Debezium would have required us to deploy (writing this from memory so this might be inaccurate/incomplete) Kafka, Zookeeper, and a connector service.

[0] https://debezium.io/

Yep, Debezium is built on Kafka Connect, and yeah, it expects a Kafka cluster to talk to, which will have ZK present for maintaining cluster state.
Kafka has shipped the long-awaited ZooKeeper-free mode, but AFAIK it’s still beta and behind feature flags on the producer, broker, and consumer (like almost all Kafka config :( but that’s another story)
Yeah, it's shipped, but it's missing some existing ZK features that tooling around Kafka relied on, and I'm a bit embarrassed for Confluent that they pushed KRaft so hard without a replacement.

E.g. the ability to watch a ZK node for changes, which means in Kafka sans ZK, you can't detect changes to topics without continuously polling via the admin client.
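
To make the polling workaround concrete, here is a sketch that diffs two snapshots of topic metadata. `list_topics` is a hypothetical callable standing in for whatever admin client call returns a mapping of topic name to partition count; it is not a real Kafka API.

```python
def diff_topics(previous, current):
    """Compare two snapshots mapping topic name -> partition count."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    changed = sorted(
        t for t in set(previous) & set(current) if previous[t] != current[t]
    )
    return added, removed, changed


def poll_once(list_topics, previous):
    """One polling step: fetch a fresh snapshot and report what changed."""
    current = list_topics()
    return current, diff_topics(previous, current)
```

Calling `poll_once` on a timer is the "continuously polling" part: a change is only noticed on the next tick, which a ZK watch avoided.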

A coworker is working to implement something like this for KRaft, but it really demonstrates how an IPO can cause a company that was the steward of a FOSS project to do things detrimental to that project to keep the share price up. (Was also interesting how many key Confluent people left right after the IPO)

The other very notable change is how Confluent's dev effort has switched from the open source project to the Enterprise Edition, but they still have the majority of PMC members, while not having the corporate blessing to spend time reviewing PRs.

> you can't detect changes to topics without continuously polling via the admin client.

Yikes, that sounds like an oversight! Aren't topic configs written to a system topic that you could consume from?

Kafka's acks aren't between consumer/producer or consumer/cluster; they're solely between producer and cluster.

It's one of Kafka's strengths.

It depends on the current throughput of the system, how many queues a message is routed to, size of the message etc. But a mostly idle RabbitMQ cluster with fast disks should confirm a message published to a single quorum queue in a couple of ms.
Using MQTT, my Sonoff with Tasmota on it, as soon as it gets a message to switch, it will reply with its current state. Seems simple enough?