IMO, SQS is one of the best products that AWS offers. Airbnb built a queueing + scheduling system on SQS to great success [0] (disclaimer, I wrote this post). There are many others, e.g., Slack, that are building on top of SQS.
> SQS is one of the best products that AWS offers.
I would never trust anyone who says this, regardless of who did what with it.
SQS Standard (which was the only option for a long time) has historically been a nightmare for almost any use case, including job queueing. I love creating a whole separate cache to ensure that when SQS delivers a message multiple times, I don't act on it more than once. That's just ONE issue that I don't need to have. When you get into FIFO to avoid that, now your costs are ridiculous at scale AND you have the (relatively small) number of issues with FIFO that are still outstanding.
It's really not hard to pick a natural key to use for idempotency in your persistence layer. If you don't want at-least-once delivery, why are you picking SQS instead of some random ephemeral queue like NATS?
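To make the natural-key point concrete, here is a minimal sketch, assuming the message describes an order and that `order_id` is the natural key. The table, column names, and `handle_order_message` are hypothetical, and SQLite stands in for whatever persistence layer you actually use. The primary key makes a redelivered message a no-op, so no separate dedupe cache is involved.

```python
import sqlite3


def make_store():
    # In-memory SQLite stands in for the real persistence layer (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id TEXT PRIMARY KEY,"  # natural key doubles as the idempotency key
        " amount_cents INTEGER NOT NULL)"
    )
    return conn


def handle_order_message(conn, body):
    """Persist the order; return True if this delivery created the row,
    False if the row already existed (i.e. a duplicate delivery)."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO orders (order_id, amount_cents) VALUES (?, ?)",
            (body["order_id"], body["amount_cents"]),
        )
    # rowcount is 0 when the insert was ignored because the key already exists.
    return cur.rowcount == 1
```

Calling `handle_order_message` twice with the same body leaves exactly one row, which is all at-least-once delivery asks of the consumer.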
I second the parent post. I've run billions of messages through SQS at a previous job and I can remember having issues with SQS availability exactly once, due to a systemic failure involving DynamoDB where cascading failures took down almost everything (and all of our integrations). That would have hosed us completely if not for our durable-queue/s3-journal[0] at the edge. SQS is simple to use, does what it says, and has very good SDK support. A++++ Would build a business on top of again!
This is a surprising take. Accounting for duplicate events is such a common pattern in distributed systems that it never occurred to me to think of it as an issue. What is your preferred approach?
Sharing an SQS queue between services is ok then? Even the most basic example for SQS queues from the AWS SA training course recommends using multiple queues for different classes of video transcoding. All of those benefits evaporate once you share a single queue; if you're just reading that queue to create a job in a database, then the queue provides no benefit.
I'm not quite clear what you mean by "sharing a single queue between services". If you mean having multiple different services reading from the same queue then no, I wouldn't do that. If you mean having a sending service and a receiving service share the same queue (either directly or indirectly) in the sense that the sender sends a message and the receiver reads it then I can't see any alternative to doing that. If I've misunderstood then would you be able to share what you were thinking of?
I interpret arduinomancer's post above to mean that if the receiving service must action a single message once and only once (which is sometimes but not always necessary) then you need to give each SQS message a unique ID and that ID needs to be stored in a DB by the receiving service. I've used that pattern a few times and it works well (and doesn't result in unnecessary coupling between sender and receiver). In the past, I've used messaging systems that gave stronger transactional guarantees, but those systems were things like JMS or MQSeries and I don't really want to go back to those days.
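For work that has no natural key (say, incrementing a balance), the unique-ID pattern described above might look roughly like this. It is a sketch under assumptions: `processed_messages`, `balances`, and `apply_credit_once` are made-up names, and SQLite stands in for the receiver's DB. The important part is that the ID is recorded in the same transaction as the side effect, so a redelivery either sees the ID and skips, or the whole thing rolls back together.

```python
import sqlite3


def make_db():
    # In-memory SQLite stands in for the receiving service's DB (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER NOT NULL)"
    )
    return conn


def apply_credit_once(conn, message_id, account, cents):
    """Apply a credit at most once per message_id.

    Returns True if the credit was applied, False if the ID was seen before."""
    try:
        with conn:  # one transaction: ledger row and side effect commit together
            # Raises IntegrityError if this message ID was already recorded.
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
            conn.execute(
                "INSERT OR IGNORE INTO balances (account, cents) VALUES (?, 0)",
                (account,),
            )
            conn.execute(
                "UPDATE balances SET cents = cents + ? WHERE account = ?",
                (cents, account),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery: the transaction was rolled back, nothing changed.
        return False
```

The sender only has to attach a stable unique ID to each message; it never needs to know how the receiver stores it, which is why this doesn't couple the two services.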
The one legitimate "problem" in that list is #3, the hard limit on FIFO messages (20,000). The rest are expected behavior for a 'deliver exactly once and in order' queue. The use cases offered (client fails to process and therefore consume the next message, blocking subsequent messages) and somehow (?) expecting the queue to facilitate reprocessing previously consumed messages are both misuses of such a queue.
SQS (standard and FIFO) has another problem that precludes using it in one case I have; the 256KB message size limit. Amazon has a Java-only workaround (Amazon SQS Extended Client Library for Java) that will spool large messages to/from S3 -- so clearly my case isn't unreasonable -- but there is nothing for the general case.
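For anyone outside Java, the general idea (store the big payload elsewhere, send a pointer) is small enough to sketch by hand. This is only an illustration of the idea, not the Extended Client Library's actual wire format: the envelope fields `inline` and `blob_key`, the key prefix, and the `pack`/`unpack` helpers are all invented here, and a plain dict stands in for S3 so the example runs on its own.

```python
import json
import uuid

MAX_BODY_BYTES = 256 * 1024  # the SQS message size limit quoted above


def pack(payload: str, blob_store: dict, limit: int = MAX_BODY_BYTES) -> str:
    """Return a message body that fits under `limit`.

    Small payloads travel inline; large ones are written to blob_store
    (a dict standing in for S3) and replaced by a pointer."""
    inline = json.dumps({"inline": payload})
    if len(inline.encode("utf-8")) <= limit:
        return inline
    key = f"sqs-payloads/{uuid.uuid4()}"  # hypothetical key naming scheme
    blob_store[key] = payload
    return json.dumps({"blob_key": key})


def unpack(body: str, blob_store: dict) -> str:
    """Inverse of pack: resolve the pointer if there is one."""
    envelope = json.loads(body)
    if "blob_key" in envelope:
        return blob_store[envelope["blob_key"]]
    return envelope["inline"]
```

In a real setup the sender would write to S3 before sending and the receiver would fetch from S3 after receiving. You would also need to decide who deletes the stored object and when, which this sketch ignores.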
I’m no longer with Airbnb / the project, but we did in fact use SQS Standard exclusively for job queueing (I don’t think we ever did explicitly add any support for FIFO). Our SLA was at-least-once (as most job queues are), so this didn’t matter as much. In practice it also doesn’t happen that often.
For the outstanding FIFO issues, see e.g. https://tomgregory.com/3-surprising-facts-about-aws-sqs-fifo...