| HN Mirror


by 66fm472tjy7 1442 days ago

We use RMQ for most of our asynchronous processing. In most cases, we get an HTTP call and publish a message to RMQ after committing the DB transaction, then we send the response to the HTTP client.

We found out the hard way that RMQ does not behave like a transactional DB. Just because publishing worked does not mean the message will be delivered.

Our solution is to also write the message into an outbox table in the DB. We then publish the message using confirms[0]. RMQ asynchronously sends us a confirmation when it has really persisted the message. We then delete the outbox entry. If we do not receive the confirmation in time, a timer will re-publish the message.

Therefore I disagree with the suggestion of using a library wrapping the native RMQ one. We are using spring-amqp and this made it harder to understand what is going on. In the end, for a large project you will have to understand nuances of RMQ (and other infrastructure you are using). Using a leaky abstraction over it means you now have to understand both the underlying product and the abstraction.

[0] https://www.rabbitmq.com/confirms.html#publisher-confirms
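The outbox-plus-confirms flow described above can be sketched roughly as follows. This is a minimal illustration and not the poster's actual implementation: it uses an in-memory SQLite table as the outbox, and the broker is a hypothetical injected `publish(msg_id, body)` callable, not a real RabbitMQ or spring-amqp API. The class name, column names, and the 5-second timeout are all assumptions.

```python
import sqlite3
import time


class Outbox:
    """Outbox table plus re-publish logic; the broker is injected as a callable."""

    def __init__(self, db, publish, confirm_timeout=5.0, clock=time.time):
        self.db = db
        self.publish = publish  # hypothetical: publish(msg_id, body)
        self.confirm_timeout = confirm_timeout
        self.clock = clock
        db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY, body TEXT NOT NULL, published_at REAL)"
        )

    def enqueue(self, body):
        """Insert the message; the caller commits it with the business transaction."""
        cur = self.db.execute("INSERT INTO outbox (body) VALUES (?)", (body,))
        return cur.lastrowid

    def publish_pending(self):
        """Publish rows never sent, or sent but unconfirmed for too long."""
        now = self.clock()
        rows = self.db.execute(
            "SELECT id, body FROM outbox "
            "WHERE published_at IS NULL OR published_at <= ? ORDER BY id",
            (now - self.confirm_timeout,),
        ).fetchall()
        for msg_id, body in rows:
            self.publish(msg_id, body)
            self.db.execute(
                "UPDATE outbox SET published_at = ? WHERE id = ?", (now, msg_id)
            )
        self.db.commit()
        return [msg_id for msg_id, _ in rows]

    def on_confirm(self, msg_id):
        """The broker confirmed persistence, so the outbox entry can be deleted."""
        self.db.execute("DELETE FROM outbox WHERE id = ?", (msg_id,))
        self.db.commit()
```

A timer would call `publish_pending()` periodically, and the confirm callback of the real client would call `on_confirm()`. Since a late confirmation can arrive after a re-publish, a scheme like this delivers at least once, so consumers need to tolerate duplicates.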

5 comments

djur 1442 days ago

I agree. The pattern I've seen more than once is:

1) Adopt RabbitMQ without any experts on the team
2) Conceal Rabbit/AMQP functionality as much as possible behind a simplifying abstraction, often in multiple layers, often written by non-experts
3) Run into some intractable reliability or scaling problem
4) Have no idea how to solve it because you still don't have any experts
5) Throw a lot of money at the problem, fail
6) Decide to do a very expensive migration to a different system (SNS+SQS, Kafka, etc.)

At that point, you go back to step 1. If you're lucky, somebody has expertise in the new system and the migration can be pulled off successfully. Otherwise, you either end up repeating the whole process or everything goes off the rails when you're halfway migrated to the new system.

This same process happens for all kinds of stuff, not just RabbitMQ, of course.

pas 1442 days ago

Kafka is at least a bit simpler than RabbitMQ, though both are very square-ish shaped pegs and are usually forced into very round-ish holes. People regularly think they need a message queue, when they really need a job queue, or message bus. Or even all three, but then they try to hack everything on top of Kafka (or RMQ) ... which can be done, of course, but "results may vary".

For tracking state at scale (but still per-job, per-thing) a Cassandra-like system works best (but preferably a better implementation, e.g. ScyllaDB or Aerospike or some other KV store).

CharlieDigital 1442 days ago

> People regularly think they need a message queue, when they really need a job queue, or message bus

This is one of the reasons I am a really, really big fan of Google Cloud's Task Queues. It allows the stupidest, simplest temporal execution of HTTP invocations.

Currently working on a project in AWS and it's stunning how complicated it is to achieve the same simple need of "I want to execute this HTTP call at this time in the future". It's either Amazon MQ -- using either ActiveMQ or RabbitMQ with plugins -- or hacking around SQS's 15-minute delay limit. In our case, we are going to end up wrapping our messages in an envelope with a delivery time, and if a message hasn't reached its delivery time, we put it back into SQS.
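The envelope workaround could look something like the sketch below. It is only an illustration under assumptions: the envelope shape (`deliver_at`, `payload`) and the `requeue` and `execute` callables are hypothetical stand-ins for real SQS and HTTP calls. The only fixed fact used is that SQS caps a message's delay at 900 seconds.

```python
import math
import time

SQS_MAX_DELAY_SECONDS = 900  # SQS caps per-message delay at 15 minutes


def delay_for(deliver_at, now):
    """Seconds to delay the next hop: the remaining time, capped at the SQS limit."""
    remaining = max(0, math.ceil(deliver_at - now))
    return min(remaining, SQS_MAX_DELAY_SECONDS)


def handle(envelope, requeue, execute, now=None):
    """Run the payload if it is due, otherwise put the envelope back on the queue.

    `envelope` is a hypothetical dict: {"deliver_at": <epoch seconds>, "payload": ...}.
    `requeue(envelope, delay_seconds)` and `execute(payload)` come from the caller.
    """
    now = time.time() if now is None else now
    if envelope["deliver_at"] <= now:
        execute(envelope["payload"])
        return True
    requeue(envelope, delay_for(envelope["deliver_at"], now))
    return False
```

A message due in an hour would therefore hop through the queue several times, each hop waiting at most 15 minutes, before it is finally executed.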

GCP is highly underrated for how it simplifies control over execution of code. Pub/Sub and Task Queues both have HTTP delivery built in. Couple that with Google Cloud Run and it is a recipe for building almost any type of execution model with much less complexity and overhead.

grncdr 1441 days ago

In case you want to avoid an extra envelope, you can also add custom headers to SQS messages. This can be handy if you want to implement that delay hack without parsing message bodies.
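As a rough sketch of that idea: SQS calls these custom headers message attributes, and the helpers below build and read them in the dictionary shape that boto3's `send_message` and `receive_message` use. The attribute name `deliver_at` and both function names are made up for illustration, and no AWS call is made here.

```python
def build_send_kwargs(queue_url, body, deliver_at, delay_seconds):
    """Keyword arguments in the shape boto3's SQS `send_message` accepts."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "DelaySeconds": delay_seconds,
        "MessageAttributes": {
            # "deliver_at" is an illustrative attribute name, not an SQS built-in.
            "deliver_at": {
                "DataType": "Number",
                "StringValue": str(int(deliver_at)),
            },
        },
    }


def read_deliver_at(message):
    """Extract the delivery time from a received message without parsing its body."""
    attr = message.get("MessageAttributes", {}).get("deliver_at")
    return int(attr["StringValue"]) if attr else None
```

Note that the receiving side has to request the attribute (via `MessageAttributeNames` on `receive_message`), otherwise it is not returned with the message.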

davydog187 1442 days ago

Lol to “Kafka is at least a bit simpler than RabbitMQ”. I’m sorry, what universe does this statement live in?

Spivak 1441 days ago

True, it’s actually a lot simpler than RabbitMQ. People seem to assume “giant ball of enterprisey Java” means that the experience using it will be complicated. Kafka is extremely reliable, simple to cluster, battle tested (I haven’t hit an actual bug in Kafka in ages), the self-healing is turnkey, and has way stronger guarantees for clients.

Where it bites people is that it’s not a queue and scaling is harder than just adding more consumers.

rawgabbit 1441 days ago

I am curious if anyone has used confluent.io (kafka as a service). Or is it too cost prohibitive?

d8tltanc 1441 days ago

Java is a bit simpler than Erlang

dariusj18 1441 days ago

I looked it up but couldn't really find information, what would you say is the difference between a "message queue", a "job queue" and a "message bus"?

pas 1441 days ago

I just throw them on the wall based on my experiences, maybe they have some agreed upon precise definitions, but I'm not aware :o

message bus: firehose of events, (by default) no ACK. usually multi producer multi consumer. (see also DBus which is more of an RPC layer + service discovery + pub/sub via event listeners)

message queue: usually between components, ACK, but no selective ACK, backpressure, might even have support for "dead letters" (letters not ACKed by any consumer)

job queue: selective ACK, retry, etc.

(there's also the "enterprise service bus", which is similar, but mostly implemented on things like IBM MQ)

datavirtue 1442 days ago

I ran into the exact same problem as the author and I fixed it by...reading the documentation.

In my case, a whole team of devs was using RMQ without knowing anything about it. Literally caused many sev-1 occurrences over the years until I resolved all of the issues. It took a datacenter migration that allowed me the opportunity to redesign the entire RMQ infrastructure before I was able to put the whole mess to bed.

ako 1441 days ago

Sounds exactly like the story for many adopting NoSQL: adopt RDBMS with no experts, hide it behind an ORM, run into scaling and performance problems, no idea how to solve it, scale the hardware vertically, move to NoSQL.

hinkley 1442 days ago

I’m currently wrestling with a thing at work where someone wrapped a frameworkish library with their own abstraction. Now I’m trying to add a cross-cutting concern that neither my coworker nor the authors thought about, and so instead of punching through three layers of inadequate data passing I’ve got six to deal with and a stutter as well (builder patterns are great, except when they are not).

Having this new failure mode added to all the other ones I’ve already met over the last few decades has colored my perception a bit, and I’m having opinions about how you shouldn’t try to wrap a wrapper, and maybe the best way to live with a bad API is to pass through the yucky bit as quickly as possible - preprocess to see if you can avoid calling it at all, and then avoid asking it to do anything extra the rest of the time.

That part doesn’t feel that transformative to me but maybe I’m wrong. What’s bigger and stickier for me is that I now have to think about some NIH code we wrote that deeply bothers me, and decide if I still don’t like it, or if the author had the same conclusion and this was their answer.

grogers 1441 days ago

If you have to write all messages to the DB, why use RMQ at all and not just read the messages from the DB?

cmckn 1442 days ago

How quickly does RMQ ack the message? Obviously too long to delay an HTTP response, or you’d have skipped the DB part of this; but this seems kind of clunky. I know Kafka has (optional, tunable) acknowledgements for publication, for example, that you could use for this.

66fm472tjy7 1442 days ago

In the first iteration of using confirms, we did not have the outbox but only logged how long it took to get the confirmation. After 3 seconds, we would throw out the expected confirmation. If a confirmation took longer than that, we would log that we received an unknown confirmation.
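The bookkeeping described here might be sketched like this. It is an assumption-laden illustration, not the poster's code: `ConfirmTracker` and its method names are invented, and the hooks would need to be wired to whatever confirm callbacks the real client library provides.

```python
import time


class ConfirmTracker:
    """Tracks publish times and classifies confirmations as they arrive."""

    def __init__(self, timeout=3.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.pending = {}  # message id -> publish time

    def published(self, msg_id):
        """Record that we are now expecting a confirmation for msg_id."""
        self.pending[msg_id] = self.clock()

    def expire(self):
        """Forget confirmations outstanding for longer than the timeout."""
        now = self.clock()
        expired = [m for m, t in self.pending.items() if now - t > self.timeout]
        for m in expired:
            del self.pending[m]
        return expired

    def confirmed(self, msg_id):
        """Return the confirm latency in seconds, or None for an unknown confirmation."""
        started = self.pending.pop(msg_id, None)
        if started is None:
            return None  # already expired or never published
        return self.clock() - started
```

A caller would log the returned latency, and log an "unknown confirmation" whenever `confirmed()` returns `None`.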

We hoped it would be fast enough that we can just wait for the confirmation before committing the transaction.

The official documentation says

> This means that under a constant load, latency for basic.ack can reach a few hundred milliseconds

I never did statistics, just looked at the log. IIRC most were acceptable but > 3s occurred frequently enough (and we even had instances of messages never being confirmed, IIRC) that we abandoned that plan.

We considered using Debezium[0], but decided on the current solution as it could be solved entirely with the current services and infrastructure whereas Debezium would have required us to deploy (writing this from memory so this might be inaccurate/incomplete) Kafka, Zookeeper, and a connector service.

[0] https://debezium.io/

EdwardDiego 1442 days ago

Yep, Debezium is built on Kafka Connect, and yeah, it expects a Kafka cluster to talk to, which will have ZK present for maintaining cluster state.

cmckn 1441 days ago

Kafka has shipped the long-awaited ZooKeeper-free mode, but AFAIK it’s still beta and behind feature flags on the producer, broker, and consumer (like almost all Kafka config :( but that’s another story)

EdwardDiego 1438 days ago

Yeah, it's shipped, but it's missing some existing ZK features that tooling around Kafka relied on, and I'm a bit embarrassed for Confluent that they pushed KRaft so hard without a replacement.

E.g. the ability to watch a ZK node for changes, which means in Kafka sans ZK, you can't detect changes to topics without continuously polling via the admin client.

A coworker is working to implement something like this for KRaft, but it really demonstrates how an IPO can cause a company that was the steward of a FOSS project to do things detrimental to that project to keep the share price up. (Was also interesting how many key Confluent people left right after the IPO)

The other very notable change is how Confluent's dev effort has switched from the open source project to the Enterprise Edition, but they still have the majority of PMC members, while not having the corporate blessing to spend time reviewing PRs.
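For illustration, the polling workaround mentioned above (detecting topic changes without a ZK watch) amounts to diffing successive snapshots. In this sketch `list_topics` is a hypothetical callable wrapping whichever admin client is in use; it is not a real Kafka API name, and the 5-second interval is arbitrary.

```python
import time


def diff_topics(previous, current):
    """Return (added, removed) topic names between two snapshots."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)


def watch_topics(list_topics, on_change, interval=5.0, rounds=None, sleep=time.sleep):
    """Poll `list_topics()` and report differences between successive snapshots.

    `list_topics` is a hypothetical callable returning an iterable of topic names.
    `rounds` limits the number of polls; None means poll forever.
    """
    known = set(list_topics())
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        current = set(list_topics())
        added, removed = diff_topics(known, current)
        if added or removed:
            on_change(added, removed)
        known = current
        done += 1
```

The cost is the one the comment points out: changes are only seen at the next poll, and every watcher adds steady load on the cluster.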

cmckn 1437 days ago

> you can't detect changes to topics without continuously polling via the admin client.

Yikes, that sounds like an oversight! Aren't topic configs written to a system topic that you could consume from?

EdwardDiego 1442 days ago

Kafka's acks aren't between consumer / producer, or consumer / cluster; they're solely between producer and cluster.

It's one of Kafka's strengths.

kjnilsson 1440 days ago

It depends on the current throughput of the system, how many queues a message is routed to, size of the message etc. But a mostly idle RabbitMQ cluster with fast disks should confirm a message published to a single quorum queue in a couple of ms.

teekert 1442 days ago

Using MQTT, my Sonoff with Tasmota on it, as soon as it gets a message to switch, it will reply with its current state. Seems simple enough?