by jdotjdot 2077 days ago
Quite a few of these issues also strike me as Celery problems rather than RabbitMQ problems. I’ve run into many similar issues, and in every case the cause was Celery’s implementation not using RabbitMQ properly, and the fix was an internal patch to Celery.

The most blatant example is countdown tasks. Celery has a very strange implementation of these (meant to be broker-agnostic): it consumes the task from the queue, sees in the task's custom headers (which are meaningless to RabbitMQ) that it should be delayed, and then sits on the task and takes a new one. That results in heavy memory load on your Celery client, which holds all these tasks in memory, and if you have acks_late set, RabbitMQ will be sitting on many tasks that are claimed by a client but not acked and that _also_ have to sit in memory. But that is 100% a Celery problem, not a Rabbit one, and we solved it by overriding countdowns to use DLX queues instead so that we could use Rabbit-native features. Not surprisingly, Rabbit performs a lot better when you’re using native built-in features.
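For anyone unfamiliar with the DLX trick, here is a minimal sketch of the queue arguments involved. It is not the actual patch described above; the function names, queue-naming scheme, and the "tasks"/"default" values are made up for illustration. The three `x-` argument keys are the standard RabbitMQ ones: a queue with no consumers holds each message for the TTL, then dead-letters it to the exchange where the real workers listen.

```python
def delay_queue_arguments(delay_ms, target_exchange, target_routing_key):
    """Build the x-arguments for a consumer-less 'holding' queue.

    Messages published to a queue declared with these arguments stay in
    RabbitMQ (not in a worker's memory) until the per-queue TTL expires,
    then get dead-lettered to target_exchange with target_routing_key.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    return {
        "x-message-ttl": int(delay_ms),                     # per-queue TTL, in ms
        "x-dead-letter-exchange": target_exchange,          # where expired messages go
        "x-dead-letter-routing-key": target_routing_key,    # routing key used on expiry
    }


def delay_queue_name(delay_ms):
    """Hypothetical naming scheme: one holding queue per distinct delay."""
    return f"delay.{int(delay_ms)}ms"


if __name__ == "__main__":
    print(delay_queue_name(100_000))
    print(delay_queue_arguments(100_000, "tasks", "default"))
```

With a client such as pika, the dict would be passed as `arguments=` to `channel.queue_declare(...)`. Note that one queue per delay only works for a fixed set of delays, which is exactly the resolution problem raised in the reply below.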


With your countdown implementation, how did you solve the resolution issue? I.e. if I delay a task for 111 seconds, I can publish it to a RabbitMQ queue with a per-queue TTL of 100 seconds, after which it'll fall out to a DLX. What happens with the remaining 11 seconds? Does the DLX deliver (somehow) to a series of increasingly shorter per-queue TTL queues?

I, too, would love an implementation of arbitrary-granularity delays without buffering (potentially huge, mem-wise) tasks in "unacknowledged" in a giant binary heap: that's expensive for consumers and terrifying for RabbitMQ stability when a broker has to e.g. re-"ready" millions of messages because a delay-buffer OOMed.
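To make the "series of increasingly shorter queues" idea concrete: one possible scheme (an assumption on my part, not necessarily what the parent did) is a fixed ladder of holding queues whose per-queue TTLs are powers of two, with each message hopping through only the rungs whose bit is set in its delay. The sketch below only computes the hops; how the remaining hops travel with the message (routing key, headers) is left open, and `delay_levels` and `max_level` are hypothetical names.

```python
def delay_levels(delay_seconds, max_level=20):
    """Split a whole-second delay into distinct power-of-two hops, largest first.

    Each returned value is the TTL (in seconds) of one holding queue the
    message would pass through. With max_level=20 the ladder needs only
    21 queues and covers delays up to 2**21 - 1 seconds at 1 s resolution.
    """
    if not isinstance(delay_seconds, int) or delay_seconds < 0:
        raise ValueError("delay_seconds must be a non-negative integer")
    if delay_seconds >= 2 ** (max_level + 1):
        raise ValueError("delay exceeds what this ladder of queues can express")
    return [
        2 ** level
        for level in range(max_level, -1, -1)
        if delay_seconds & (1 << level)   # this rung's bit is set in the delay
    ]


if __name__ == "__main__":
    hops = delay_levels(111)
    print(hops, "sum:", sum(hops))
```

For the 111-second case this gives hops of 64, 32, 8, 4, 2 and 1 seconds. Since every message in a given rung has the same TTL, per-queue expiry stays in order, and nothing is buffered unacked in a consumer.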

One solution we used was the official delay plugin provided by RabbitMQ. It’s working well so far.

https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages...
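For reference, a rough sketch of what using the plugin looks like from the publisher side, assuming the plugin is enabled on the broker. The exchange type `x-delayed-message`, the `x-delayed-type` argument and the `x-delay` header (milliseconds) are the plugin's own names; the helper functions and the "delayed.tasks" exchange name are made up for illustration.

```python
def delayed_exchange_declaration(name, underlying_type="direct"):
    """Keyword arguments for declaring a delayed-message exchange.

    underlying_type is the routing behaviour the exchange uses once the
    delay has elapsed (direct, topic, fanout, ...).
    """
    return {
        "exchange": name,
        "exchange_type": "x-delayed-message",
        "arguments": {"x-delayed-type": underlying_type},
    }


def delayed_headers(delay_seconds):
    """Message headers asking the plugin to hold the message for delay_seconds."""
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    return {"x-delay": round(delay_seconds * 1000)}  # the plugin expects milliseconds


if __name__ == "__main__":
    print(delayed_exchange_declaration("delayed.tasks"))
    print(delayed_headers(111))
```

With pika, the first dict could be unpacked into `channel.exchange_declare(**...)` and the second passed as `headers=` in `pika.BasicProperties`. There is no resolution problem here since each message carries its own delay.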

Hopefully you're aware of some of the shortcomings of this plugin when operating in a multi-node cluster.

The plugin keeps a node-specific database of the messages that are to be delayed. If the node is unavailable or lost, so too are the messages the node was keeping for future publish.

Also, there is no visibility into the messages that are awaiting delay. 10? 100? 5000? Your guess is as good as mine, and you'll never be able to figure out what's in the to-be-published pipeline.

Signed, someone who has dealt with these issues.

Awesome! I was peripherally aware of that, I think, but had filed it as "not ready" for some reason. Will definitely give it another look, thanks!