by Nouser76 1483 days ago
>SQS has a many-to-one relationship. You can send messages to a queue from many different producers but only one consumer can be defined. A consumer is another application, most often some compute instances such as Lambda, EC2, or Fargate.

My understanding is that you can have multiple consumers of an SQS queue through the use of visibility timeouts[0]. Once a message is consumed, it is as if that message doesn't exist for all other consumers until it reaches a timeout period or is marked done by that consumer. You can also manually mark a message as being ready for other consumers. This moves the message back into the queue for the other consumers to see.
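
The behaviour described above can be sketched with a toy in-memory queue. This is only an illustration, not the SQS API: `ToyQueue` and its method names are made up, and time is passed in explicitly so the example is deterministic. In real SQS, making a message visible again early is done by changing its visibility timeout to 0.

```python
class ToyQueue:
    """In-memory stand-in for a queue with visibility timeouts (hypothetical, not the SQS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}         # message id -> body
        self.invisible_until = {}  # message id -> time at which it becomes visible again

    def send(self, msg_id, body):
        self.messages[msg_id] = body
        self.invisible_until[msg_id] = 0.0

    def receive(self, now):
        """Return one visible message and hide it from other consumers for the timeout."""
        for msg_id, body in self.messages.items():
            if self.invisible_until[msg_id] <= now:
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """The consumer marks the message as done; it never reappears."""
        self.messages.pop(msg_id, None)
        self.invisible_until.pop(msg_id, None)

    def release(self, msg_id, now):
        """Make an in-flight message visible again immediately."""
        if msg_id in self.messages:
            self.invisible_until[msg_id] = now


q = ToyQueue(visibility_timeout=30)
q.send("m1", "hello")
first = q.receive(now=0)    # consumer A gets m1
second = q.receive(now=10)  # consumer B sees nothing: m1 is in flight
third = q.receive(now=30)   # timeout lapsed without a delete, m1 is visible again
```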

I'm going to be linking this article to my team. We've been talking about moving to SNS/SQS/etc. and this article helps understand the use cases and distinctions better.

[0]: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQS...

4 comments

This is a bad idea.

1. The main point of the visibility timeout is to handle failure. A message is read by a consumer; the visibility timeout starts; that consumer finishes some processing; then deletes the message from the queue. But, what happens if the consumer encounters a fault during processing which destroys its ability to even tell the queue it encountered a fault? The visibility timeout protects against that; the message just naturally reappears in the queue for processing by another consumer. If one overloaded the visibility timeout to also mean "other consumers should process this", you'd lose the ability to handle faults.

2. It also screws up dead-letter redrive policies, which are primarily based on visibility timeout lapses (in addition to communicated failures). You basically could not reliably put a dead-letter redrive on your queue, which again just means you're protecting against fewer failure modes.

3. There would be natural, avoidable latency in waiting for the visibility timeout on every fan-out, whatever you set it to. 1 second? 100 consumers? That message is just clogging up the queue for over a minute as it gets fanned-out to everyone.

4. Consumer1 eats the first message, then its visibility times out; it's back in the queue; there's no way to ensure that message isn't just processed again by Consumer1 instead of Consumer2! You're basically tossing a coin and hoping that, eventually, Consumer2 gets its turn at the message, all the while having Consumer1 reprocess the message an indefinite number of times.

5. Someone has to delete the message. Who? The "last" component to touch it? Once all the other components are done? How do you coordinate that? There's no guarantee of ordering on when each component sees the message. You'd need some kind of external state, and at that point, why are you even using SQS?
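
Points 3 and 4 can be put into rough numbers. This is a back-of-the-envelope sketch with made-up function names. The first function assumes the best case, where every lapse hands the message to a consumer that has not seen it yet. The second assumes each delivery goes to a uniformly random consumer, which is an assumption for illustration and not something SQS promises.

```python
def best_case_fanout_seconds(visibility_timeout_s, consumers):
    """Time until the last consumer sees the message if no consumer ever gets it twice.

    Every consumer after the first has to wait for one visibility-timeout lapse.
    """
    return visibility_timeout_s * (consumers - 1)


def chance_never_delivered(consumers, deliveries):
    """Probability that one specific consumer received none of `deliveries`.

    Assumes each delivery picks a consumer uniformly at random (illustrative only).
    """
    return (1 - 1 / consumers) ** deliveries


wait = best_case_fanout_seconds(visibility_timeout_s=1, consumers=100)
missed = chance_never_delivered(consumers=2, deliveries=3)
```

With a 1-second timeout and 100 consumers the best case is already 99 seconds, and with two consumers there is still a 1-in-8 chance under this model that Consumer2 has seen nothing after three deliveries.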

You could theoretically have each consumer read from a queue, process the message, delete that message from the queue, then redrive the message into a new queue for processing by another consumer. This may make sense if you have strict ordering needs for processing but still want the benefits of SQS. You could even have it redrive into N queues for N consumers at the same time. But, at that point, why? We're trying to put a square peg in a round hole; SQS is designed for single consumers. There are far better and simpler tools out there if what you're looking for is multi-consumer fan-out.

I have used SQS in the parent's suggested fashion for many years. I feel like your points are overstated. Visibility timeout's "main point" is not only to handle failure, nor do the AWS docs themselves state that. AWS's built-in redrive policies have been more than sufficient to correctly handle error scenarios.

> there's no way to ensure that message isn't just processed again by Consumer1 instead of Consumer2!

Correct, but this isn't the job of the pipe. Smart endpoints, dumb pipes.

You're welcome to design systems however you want. But this is, put simply, bad advice; and when sharing advice like this to people who may be learning these things for the first time it's critical to communicate not just what these complex components are capable of, but how to best work with them to build reliable and effective systems.

If you have multiple heterogeneous consumers, do not use a single SQS queue.

I can't even comprehend how you would engineer around the issue of consumer re-processing. You can quote metaphors all day; if you love the idea of dumb pipes, why doesn't the city transport clean and gray water in the same pipe? Do you want to wash your hands using flushed toilet water?

Similarly, you can't engineer around heterogeneous consumers grabbing a message, putting it back in the queue, then consuming it again. You can make them smart! You can have them say "woah hold on, I already saw that message I don't need to see it again put it back". Or, you can make them idempotent so reprocessing isn't undesirable. But it's still reprocessing; it's still a huge waste, and will probably require external state to manage. Moreover, there's literally no system guarantee that Consumer2 will ever see that message. It'll probably see it, fifty-fifty; then again, if one consumer is faster at accessing the AWS API than the other, who knows, anything could happen. But at least it's convenient?

The city doesn't require every household to have gray water filtration. Because that would be insane. The pipes don't have to be "smart". We just build two pipes!

This just blows my mind too. The pipe analogy is apt. Using logic to dispatch to whatever pipe -> consumer you want is the way to use queues. Turning it upside down and using properties of the queue to have consumers decide what they want to take and sending it back to others is just unquestionably bad design when you could just make more queues!
Conceptually it is still a terrible idea to have multiple consumers on a single queue (by multiple consumers I mean things doing different actions on a message, not concurrent consumers doing the same action). Why overload a queue like that for 2 different actions when one can fan out on an action to 2 queues with SNS? Then your consumer does not have to determine whether the message is for them or not. Visibility timeouts are for concurrency/errors within a single action. Yes, you could hijack them and have 2 consumers act on one message and do different things, but that is confusing and has no benefit over just having 2 queues.
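
A toy model of that fan-out, using in-memory stand-ins rather than the real SNS or SQS APIs (`ToyTopic` and the queue names are hypothetical): every published message is copied into each subscribed queue, so each action has its own queue and never has to decide whether a message is meant for it.

```python
from collections import deque
from copy import deepcopy


class ToyTopic:
    """In-memory stand-in for a pub/sub topic (hypothetical, not the SNS API)."""

    def __init__(self):
        self.queues = []

    def subscribe(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        # Each subscriber gets its own copy, so one consumer deleting or
        # mutating its message cannot affect the other.
        for queue in self.queues:
            queue.append(deepcopy(message))


queue_a = deque()  # read only by the consumer doing action A
queue_b = deque()  # read only by the consumer doing action B

topic = ToyTopic()
topic.subscribe(queue_a)
topic.subscribe(queue_b)
topic.publish({"id": "m1", "body": "hello"})

seen_by_a = queue_a.popleft()  # consumer A takes (and removes) its copy
```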
The primary reasons for multiple consumers on a queue are availability and SLA needs, as well as easier horizontal scaling. Otherwise you’ll need a queue-scheduler type of system that can signal or serve out queue locators to idle consumers, and you start getting into technical scenarios similar to freakin’ ESBs. At enough scale you already have that setup for multi-region failover purposes, sure, but there the granularity of queue consumer routing is based less on concurrency to the queues than on concurrency and routing across several regions, with n queues in between that serve as priority queues.

Also, two different queues are two different buffers with their own durability issues, and in an improperly conceived architecture that can amount to a distributed RAID0 of messages.

It really depends upon the tolerance for message duplication, SLA needs, and how prioritization should be handled. At a previous place we had multiple consumers for multiple SQS queues representing different priorities within the same region, and it worked fine for many years, with the primary headache being that message de-duplication was tricky.
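
For readers wondering what de-duplication handling can look like, here is a minimal sketch of a consumer that skips message IDs it has already processed. `DedupingConsumer` is a made-up name, and the in-memory set is a stand-in for whatever shared state a real deployment would need; it is not how any particular system described here was built.

```python
class DedupingConsumer:
    """Wraps a handler so that redelivered messages are processed only once (illustrative)."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_ids = set()  # stand-in for external state shared between workers

    def handle(self, msg_id, body):
        """Return True if the message was processed, False if it was a duplicate."""
        if msg_id in self.seen_ids:
            return False
        self.handler(body)
        # Record the ID only after the handler succeeds, so a crash mid-handler
        # leaves the message eligible for a retry.
        self.seen_ids.add(msg_id)
        return True


processed = []
consumer = DedupingConsumer(processed.append)
first_attempt = consumer.handle("m1", "update row 7")
redelivery = consumer.handle("m1", "update row 7")
```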

This discussion is, at least it seems, mostly about multiple heterogeneous consumers, not homogeneous consumers/replicas/horizontal scaling. So, if Slack sends a queue message for every DM that's sent, it's the difference between having 1 consumer that updates the database and 1 consumer that sends a push notification, versus having 2 consumers that both only update the database.

The idea of having multiple homogeneous consumers shouldn't be controversial; that's just horizontal scaling. And, well, at least until a few hours ago I also would have said that the idea of having multiple heterogeneous consumers is uncontroversially bad. But I guess everyone has "their way" of doing things.

It's also important to note that there's a third situation I see somewhat often: maybe call it homogeneous delegated consumers, whereby you've got messages like '{"type":"SendDM", "content": {}}'. Or maybe: '{"type":"SendDM", "action": "UpdateDB", "content":{}}'. The consumers are still homogeneous, they all run the same code, but they may internally delegate the message to do different things depending on enums within the message. This is pretty ok; it's different because at least you'd never have a consumer hit a message and be like "I don't want this take it back".

Though I'd caution against it; just understand that it's something of a 'hack' to make one queue act like N queues, and that's ok if you're small and have a good grasp on the problem domain. The big issue it will inevitably run into is: some queue message "kinds" will take a lot longer to process than others; and so if you're e.g. overloading a queue to handle both a simple email send and a much more complex asynchronous database update, you'll inevitably get delayed emails. Absolutely inevitable. But, it can work for a time.
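
A rough sketch of both halves of that, with invented handler names and invented processing costs (100 ms for an email, 5000 ms for a database update): every worker runs the same code and dispatches on the message's "type" field, and a single worker draining the queue in order shows the email finishing late because it sat behind the slow update.

```python
def send_email(content):
    return "email sent"


def update_db(content):
    return "db updated"


# Message type -> (handler, assumed processing cost in milliseconds).
HANDLERS = {
    "SendEmail": (send_email, 100),
    "UpdateDB": (update_db, 5000),
}


def drain(messages):
    """Process messages in order with one worker; return (type, finish time in ms) pairs."""
    clock_ms = 0
    finished = []
    for message in messages:
        handler, cost_ms = HANDLERS[message["type"]]
        handler(message["content"])
        clock_ms += cost_ms  # simulated time, no real sleeping
        finished.append((message["type"], clock_ms))
    return finished


timeline = drain([
    {"type": "UpdateDB", "content": {}},
    {"type": "SendEmail", "content": {}},
])
```

Here the 100 ms email does not finish until the 5100 ms mark, purely because of what was ahead of it in the shared queue.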

Weird example. Do people actually use multiple consumers doing different things to a single message? You just queue multiple messages with different properties, and consumers process things the same way.
I’m curious why you would do it this way vs publishing to SNS and having that fan out to multiple queues where each consumer can listen for the things it needs to work on (as mentioned in the original article.)
I found it just was not necessary for most cases. And the way I got there was working backwards from the "web scale" technologies like Kafka, Kinesis, and DynamoDB.

I built a data ingestion system that handled an average of 300 messages/sec, peaking at 1,000, and writing to a single R3 RDS instance. You can do a lot by pushing simple scaling strategies to their limit. Everyone thinks they need to handle web scale, but really you just need to handle your scale.

Excellent write-up. I think when dealing with messaging systems it's important to know the difference between Pub/Sub vs Point to Point models or Topics vs Queues.

Can you technically use a Queue as a topic for pub/sub? Yes. But should you? Probably not. You're much better off not using SQS for that and instead using SNS.

I wish SNS had a way to have a process receive a message, or have a Watcher for it in an AWS SDK like Boto. Feels like a big hole in actually using SNS as a pub/sub mechanism. Much simpler than having to set up and maintain an HTTP endpoint.
Yes, and you can group the messages such that messages within a group are (almost?) always consumed in order. I think the distinction though is that with SNS each message is consumed by each consumer, whereas with SQS each message is consumed by one node (so you can only really have one system that reads from the queue).
Why not go pub/sub model if you need multiple consumers?
I would still view it as many-to-one. Visibility timeouts are for concurrency. Semantically speaking, I would consider one consumer with n concurrent workers as one consumer function/service. In workflow terms, an SNS topic is a fan-out, and SQS is a queue.
You can subscribe multiple lambdas to an SQS queue. It's not recommended, but it's doable. Many-to-many or many-to-one depends on your choice of infrastructure.
I guess we can debate the semantics of it because it is technically possible. But it is terrible design to have one SQS queue feeding many different consumers. If someone did that I would reject it on review. In any proper usage of SQS it is many-to-one.
There are edge cases where that's desirable. I won't enumerate them here but they're discoverable on the Googles. I also would advise against that kind of passionate adherence to infrastructure dogma, and suggest taking a more analytical approach to review.
If you could name even one I would remove the "dogma". I cannot think of why anyone would want to do that. And if someone did want to do that, they would have to have a very compelling reason to complicate what is usually an easy thing (one action listening on a queue).
How do you scale out processing if you can't have more than one consumer?
I am not referring to multiple homogeneous consumers processing a queue. That is fine. You have a pool of consumers that can pick up from the queue. That is still considered one entity/actor. The people here are proposing having multiple heterogeneous consumers consume from the queue. That is bad.
Yes, but only one of them will receive the message.
Only one would receive a particular message, but all will receive messages.
That's true, but I wouldn't say that fits the definition of 1:N messaging. Just as the name says, it's a queue.
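
The point being made in this last exchange (a given message goes to exactly one worker, while every worker gets some messages) can be illustrated with a toy model. The round-robin hand-out and the names below are assumptions for the sake of a deterministic example; a real queue makes no promise about which worker gets which message.

```python
from collections import deque
from itertools import cycle


def distribute(messages, workers):
    """Hand each message to exactly one worker, taking turns (illustrative only)."""
    queue = deque(messages)
    received = {worker: [] for worker in workers}
    for worker in cycle(workers):
        if not queue:
            break
        # popleft removes the message, so no other worker can receive it.
        received[worker].append(queue.popleft())
    return received


received = distribute(["m1", "m2", "m3", "m4"], ["worker-a", "worker-b"])
```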