by philipbjorge 1157 days ago
We were comfortably supporting millions of jobs per day as a Postgres queue (using select for update skip locked semantics) at a previous role.

Scaled much, much further than I would’ve guessed at the time when I called it a short-term solution :) — now I have much more confidence in Postgres ;)
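
For anyone who hasn't seen the pattern, here is a minimal sketch of a worker built on `SELECT ... FOR UPDATE SKIP LOCKED`. The `jobs` table, its columns, and the `process_one_job` helper are hypothetical and not taken from the comment above; `conn` is assumed to be a DB-API connection such as one from psycopg, which uses the `%s` placeholder style and opens a transaction implicitly.

```python
# Assumed (hypothetical) schema, not from the thread:
#   CREATE TABLE jobs (id bigserial PRIMARY KEY, payload text NOT NULL);

# Lock one pending row. SKIP LOCKED makes concurrent workers pass over rows
# that another transaction has already locked instead of blocking on them.
CLAIM_SQL = """
SELECT id, payload
  FROM jobs
 ORDER BY id
 LIMIT 1
   FOR UPDATE SKIP LOCKED
"""


def process_one_job(conn, handler):
    """Claim, run, and delete one job in a single transaction.

    Returns True if a job was processed, False if no unlocked job was found.
    If the handler raises, the transaction is rolled back, the row lock is
    released, and the job becomes visible to other workers again.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        if row is None:
            conn.rollback()  # nothing to do; end the transaction
            return False
        job_id, payload = row
        handler(payload)
        cur.execute("DELETE FROM jobs WHERE id = %s", (job_id,))
        conn.commit()  # the delete and the lock release happen together
        return True
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Holding the row lock for the duration of the handler keeps the sketch short, but it also ties up a connection per in-flight job; a status column updated in a short transaction is a common alternative.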


> We were comfortably supporting millions of jobs per day as a Postgres queue (using select for update skip locked semantics) at a previous role.

That's very refreshing to hear. In a previous role I was in a similar situation to yours, but I pushed for RabbitMQ instead of Postgres due to scaling concerns, with hypothetical ceilings smaller than the ones you faced. My team had to make a call without hard numbers to support any decision and with no time to put together a proof of concept. The design pressures were the simplicity of Postgres vs paying for the assurance of a working message broker with added complexity. In the end I pushed for the most conservative approach and we went with RabbitMQ, because I didn't want to be the one having to explain why we had problems getting an RDBMS to act as a message broker when we could get a real message broker for free with a docker pull.

I was always left wondering if that was the right call, and apparently it wasn't, because RabbitMQ also put up a fight.

If there were articles out there showcasing case studies of real world applications of implementing message brokers over RDBMS then people like me would have an easier time pushing for saner choices.

> RabbitMQ also put up a fight.

I'm interested in hearing more about this (making a similar decision right now!). What pains did RabbitMQ give you?

> showcasing case studies of real world applications of implementing message brokers over RDBMS

You mean "industrial scale RDBMS" that you can license for thousands of dollars? No, you can't really implement message brokers on those.

You will never see those showcase articles. Nobody paying wants them.

No, industrial scale RDBMSes like PostgreSQL, that you can license for free. Obviously?
Those don't have money to fund studies about industry best practices. So you don't get many.

Almost everything you see on how to use a DBMS is an amateur blog or one of those studies. One of those is usually dismissed in any organization with more than one layer of management.

> Those don't have money to fund studies about industry best practices. So you don't get many.

Your comment reads like a strawman. I didn't need "studies". It would have been good enough if there had been a guy with a blog saying "I used postgres as a message broker like this and I got these numbers", with a gitlab project page providing the public with the setup and benchmark code.

Just out of curiosity (as someone who hasn't done a lot of this kind of operational stuff) how does this approach to queueing with Postgres degrade as scale increases? Is it just that your job throughput starts to hit a ceiling?
Throughput is less of an issue than queue size. Postgres can handle a truly incredible amount of throughput as long as the jobs table is small enough that it can safely remain in memory for every operation. We can handle 800k jobs/hr with Postgres, but if you have more than 5k or 10k jobs in the table at any given time, you're in dangerous territory. It's a different way of thinking about queue design than some other systems, but it's definitely worth it if you're interested in the benefits Postgres can bring (atomicity, reliability, etc.)
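
To put those numbers side by side, here is a back-of-the-envelope calculation using Little's law (items in system = arrival rate × time in system). It assumes a steady state where jobs are removed from the table as soon as they finish, which is my assumption and not something the comment states; the function name is made up for illustration.

```python
def mean_time_in_table_seconds(queue_depth, jobs_per_hour):
    """Little's law rearranged: W = L / lambda, with lambda in jobs/second.

    Written as depth * 3600 / jobs_per_hour to keep the arithmetic exact
    for round inputs.
    """
    return queue_depth * 3600 / jobs_per_hour


# 800k jobs/hr is roughly 222 jobs per second.
print(round(800_000 / 3600))                        # 222
print(mean_time_in_table_seconds(5_000, 800_000))   # 22.5
print(mean_time_in_table_seconds(10_000, 800_000))  # 45.0
```

Under those assumptions, staying below 5k to 10k rows at that throughput means a job spends on average well under a minute in the table.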
With Postgres, you also need to worry a lot about tombstoning and your ability to keep up with the vacuums necessary to deal with highly mutable data. This can depend a lot on what else is going on with the database and whether you have more than one index on the table.
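
One way to see whether vacuum is keeping up is to watch the dead-tuple counters Postgres exposes in the `pg_stat_user_tables` view. A rough sketch follows, assuming a table named `jobs` and a DB-API connection with `%s` placeholders (as in psycopg); the `dead_tuple_ratio` helper is hypothetical, and what counts as "too high" depends on your workload.

```python
# n_live_tup, n_dead_tup and last_autovacuum are columns of the built-in
# pg_stat_user_tables statistics view; the counts are estimates.
DEAD_TUPLE_SQL = """
SELECT n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
 WHERE relname = %s
"""


def dead_tuple_ratio(conn, table_name="jobs"):
    """Return dead / (live + dead) for the table, or None if it is unknown."""
    cur = conn.cursor()
    try:
        cur.execute(DEAD_TUPLE_SQL, (table_name,))
        row = cur.fetchone()
    finally:
        cur.close()
    if row is None:
        return None
    live, dead, _last_autovacuum = row
    total = live + dead
    return dead / total if total else 0.0
```

A ratio that keeps climbing between checks suggests autovacuum is falling behind the churn on the queue table.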
One strategy for mitigating vacuum costs would be to adopt an append-only strategy and partition the table. Then you can just drop partitions and avoid the vacuum costs.

Really depends on the needs but this can unlock some very impressive and sustainable throughputs.
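
A rough sketch of what that could look like with Postgres declarative partitioning, using one partition per day. The `jobs_log` table, its columns, and the helper names are all hypothetical, and how consumers track their position in an append-only table is left out entirely.

```python
from datetime import date, timedelta

# Parent table, range-partitioned on insertion time. Rows are only appended.
PARENT_DDL = """
CREATE TABLE jobs_log (
    payload    text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at)
"""


def partition_name(day):
    """Name of the daily partition, e.g. jobs_log_2024_01_31."""
    return f"jobs_log_{day:%Y_%m_%d}"


def create_partition_sql(day):
    """DDL for the partition covering [day, day + 1 day)."""
    next_day = day + timedelta(days=1)
    return (
        f"CREATE TABLE IF NOT EXISTS {partition_name(day)} "
        f"PARTITION OF jobs_log "
        f"FOR VALUES FROM ('{day.isoformat()}') TO ('{next_day.isoformat()}')"
    )


def drop_partition_sql(day):
    """Dropping a whole partition discards its rows without leaving dead
    tuples behind for vacuum to clean up."""
    return f"DROP TABLE IF EXISTS {partition_name(day)}"


print(create_partition_sql(date(2024, 1, 31)))
print(drop_partition_sql(date(2024, 1, 31)))
```

In practice you would create partitions ahead of time and drop old ones on a schedule once every job in them is known to be finished.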

This! Most haven't tried. It goes incredibly far.
Because all popular articles are about multi million tps at bigtech scale, and everybody thinks they're big tech somehow.
That's the original problem, but then there are the secondary effects. Some of the people who made decisions on that basis write blog posts about what they did, and then those blog posts end up on StackOverflow etc, and eventually it just becomes "this is how we do it by default" orthodoxy without much conscious reasoning involved - it's just a safe bet to do what works for everybody else even if it's not optimal.