by vasco 2226 days ago
Celery can be backed by RabbitMQ, not sure if that's what you meant, but all of what you described can be abstracted away. I didn't have the same experience of needing months to get up to speed. Moreover, at work RabbitMQ is probably our most stable underlying tool, perhaps toe to toe with Redis. And that's saying a lot, since I consider Redis to be almost a piece of art in how great a tool it is.

Back to RabbitMQ though: we have run an HA 2-node deployment (just one active writer) for over 3 years, with minimal changes and hardly any maintenance. It has scaled to a hundred-plus queues, some with very high numbers of messages per second, some with only tens of messages per day. Some queues stay low and process fast; others are heavy jobs that get enqueued all at once and generate hundreds of thousands of jobs.

Sure, if you have a service that interacts with disks you should have an automated monitor that covers your IOPS consumption, but I don't see how that's specific to RabbitMQ; you should be doing this for all your instances.

All in all, these are two identical instances, one active, one failover, and in a world of Kafkas and Pulsars and understanding the ins and outs of SQS pricing and capacity allocation, RabbitMQ is a tool that I consider simple to administer and that lets me sleep at night.

Interesting how the same tool can evoke such different reactions, but whatever works - works.


You would think, until you get to a split brain issue. The master and failover lose connectivity, and they each then think they're the master.

There are ways to repair it (and it has happened to me exactly once in 4 years), but it does happen. I personally try to make my workers' message processing idempotent to help alleviate these situations.
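
To make the idempotency point concrete, here is a minimal sketch of a worker that ignores redelivered messages. It assumes every message carries a unique ID; the names `IdempotentWorker`, `credit`, and `balances` are made up for illustration, and the in-memory set stands in for whatever durable store a real worker would use.

```python
class IdempotentWorker:
    """Runs a handler at most once per message ID (hypothetical helper)."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: an in-memory set is enough for the sketch. A real
        # worker would keep processed IDs in a durable, shared store.
        self._seen = set()

    def handle(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False  # already processed, e.g. redelivered after a partition
        self._handler(payload)
        # Mark as seen only after the handler succeeded, so a crash
        # mid-handler leaves the message eligible for retry.
        self._seen.add(message_id)
        return True


balances = {"alice": 0}


def credit(payload):
    balances[payload["account"]] += payload["amount"]


worker = IdempotentWorker(credit)
worker.handle("msg-1", {"account": "alice", "amount": 10})
# The same message arrives again; the balance must not change.
worker.handle("msg-1", {"account": "alice", "amount": 10})
```

With this in place, a duplicate delivery after a split brain is repaired costs a set lookup instead of a double credit.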

haven't encountered it personally, so honest question here: how does a split brain situation become an issue in a message queue?

there are some possible situations from my naive viewpoint:

1. the 'active' queue keeps jumping between the nodes, consumers & producers keep reconnecting

=> everything is still consumed, but takes longer as producers write into alternating queues, which are consumed ... albeit slowly whenever the switch happens

2. they're database backed, so they'll try to write into the same table

=> usually software that does this (but can't handle several writers) also creates a `lock` which has to be manually reset before the failover can come up. if it's reset, the other node would fail. only one is up, so no issue?

3. producers/consumers don't notice that the 'active' mq changed, and keep running on the initial one

=> issue manifests as soon as any system is restarted. but only slowly so you got time to handle it with minor service degradation

none of them really sound that bad to me -- but as i said before, i haven't encountered it before, so i might just be overlooking something really obvious?

There is a reason why you're supposed to run an odd number of nodes: so that one side will hopefully still have a majority in case of a failure.
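
The arithmetic behind the odd-number advice can be sketched in a few lines. This is only an illustration of majority quorum in general, not of how RabbitMQ decides anything internally; the function names are invented for the example.

```python
def majority(cluster_size):
    """Smallest number of nodes that is more than half the cluster."""
    return cluster_size // 2 + 1


def has_quorum(reachable, cluster_size):
    """True if a group of `reachable` nodes forms a majority."""
    return reachable >= majority(cluster_size)


def partition_outcome(cluster_size, side_a):
    """Split the cluster in two and report which side keeps a majority.

    Returns a (side_a_has_quorum, side_b_has_quorum) tuple.
    """
    side_b = cluster_size - side_a
    return has_quorum(side_a, cluster_size), has_quorum(side_b, cluster_size)


# Two nodes that lose sight of each other: neither side is a majority,
# so there is no safe way to pick a single master.
two_node_split = partition_outcome(2, 1)

# Three nodes split 2/1: exactly one side keeps the majority.
three_node_split = partition_outcome(3, 2)
```

With an even split no side can claim a majority, which is the situation where both nodes may decide they are the master.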
Once every four years sounds like a no-brainer, to be honest.
I have a simple single-node deployment and I was floored by how easy it was to set up with Celery. Really surprised. I was kicking myself for not using it sooner.

Granted I don't know all the intricacies of RabbitMQ and this was just one step beyond os.popen, but it was painless, like half an hour painless to set up and it has worked really well.

*edit: reading some of the other posts now I'm waiting for the other shoe to drop. but so far it's worked wonderfully.

I also got my first queue set up and running within a reasonable period of time with Celery. I have no idea about the internals of RabbitMQ and honestly spent more of the time on Celery (back on Python 2.7), but that system has been in prod for 6 years now without really needing any maintenance.
Same experience. Single node with a few clients and Celery. Works well.

My main issue in the beginning was occasional network timeouts. Those went away after tuning some TCP settings.