Hacker News new | ask | show | jobs
by thinkingkong 1153 days ago
I love these types of techniques. Need a basic no nonsense queue? Postgres. Need a basic reporting infrastructure? Postgres. Need a document store? Postgres.

But every single time this comes up, people on the engineering teams Ive been on all throw their hands up and accuse folks of overengineering or underengineering. You need rabbit or kafka. We should move to mongo. Etc.

Thats the part thats hard.

5 comments

Just had this conversation with one of my juniors. He brought up scalability concerns with a PG queue and wanted to know if EG REDIS was a better choice. My whole tack was to lay out the actual context for the system—we’re going to soft launch, it’s for a hardware product, and we’ll have some one post-launch to make changes. Then I brought up my values since I’m basically in charge: maintainability, speed to first release, and ease of understanding. He made the decision that a PG queue was better for now because it’s easy to change our minds later, we already know PG, and we’re not anticipating fast scaling in this application.

I think a lot of the problem in industry is simply lack of rigor. We talk a big game about being software engineers but nobody takes the time to talk about values, requirements, trade-offs or business context for decisions. It’s a shame because these topics are the actual engineering of the system.

Most people on modern engineering teams haven't built systems services themselves. I've seen this cause a lot of insecurity on making choices, because they don't have the experience in how to actually assess what will work or won't. It's far easier to fall back on the industry hive mind, e.g. "nobody ever got fired for buying $X".
I agree. I think good software architecture makes this very viable too. I worked on a system where we had a queue that would hold our jobs to be processed, extremely common. This was a POC product to show to investors so we needed this thing out fast. I implemented the basic queue in Postgres but made sure to write a solid interface around it for the queuing methods, and the queue would only be interacted with via that interface. When we moved to something a bit more heavy duty, we just changed the underlying implementation and kept the interface and everything flowed. Not to mention the fact that it was nice and testable since we'd pass a mocked queue around.

To some this is the most obvious thing to do, but you'd be surprised that some people wouldn't do this (I wouldn't have before reading a few books) and how I even got pushback at first, despite it being a 10-20 minute to wrapper logic in a class.

A good abstraction for things like this makes it really justifiable to take advantage of Postgres and Redis for things that aren't their forte for the time being until you eventually need to swap the out for a more robust solution. My experience is at startups mostly, and that ability to make complete, but small implementations to get going and being able to make them more robust over time is an essential skill.

Depending on the team I'd still opt for using "anything else" rather than Postgres as a queue. The tendency to have long-lived connections combined with a flow of short-lived messages will yield a runaway queue (due to MVCC) at low enough message rates that even an early-stage startup might notice.
Maybe, but at “mid” production scale I’ve used Postgres plenty as the substrate for queues and managing requests for FSMs and their state changes, and didn’t run into these problems.

Like everything, it depends on the application.

It can work great, but if not carefully introduced it's one typo away from a disaster in prod that nobody understands. You just need somebody to introduce a code path with longish transactions interacting with the queue and not have a reasonable prolonged load test in your deployment pipeline. Given how easy other queues are to set up I wouldn't default to Postgres on many teams.
I think you're projecting an implementation of a queue in Postgres, which isn't how most people implement these things. [0] We're not doing table level locks, or creating contention with multiple queue producers or consumers, and they're not "one typo away from disaster in prod".

To do this right, you're using row level locking e.g. SELECT FOR UPDATE/SKIP LOCKED [1], and hopefully you're already using idle_in_transaction_session_timeout to deal with total consumer failures. A properly designed queue in Postgres runs more-or-less in parallel, and supports (really fantastic features like) atomic row locks across all resources needed to serve the queue request.

If you need extremely long consumer timeouts, it's also totally fine to use RLLs in addition to state on the job itself.

[0] - https://www.crunchydata.com/blog/message-queuing-using-nativ... [1] - https://www.2ndquadrant.com/en/blog/what-is-select-skip-lock...

Even that first link explicitly calls out MVCC table bloat with that strategy? Like, you can do it right (and I have no comment on your implementation in particular, let's assume for the sake of argument it's fantastic), but it's easy to write a right-looking solution with the property that innocuous-looking one-line changes cause major issues. I wouldn't want an average development team to pursue that approach without good reason, especially when the alternatives are so easy to do right.
The first link also calls out how to deal with said bloat. It’s part of administering a Postgres DB yourself (which may or may not be something a team should be doing).

> I wouldn't want an average development team to pursue that approach without good reason, especially when the alternatives are so easy to do right.

Agreed. Good reasons I’ve had in the past are:

- wanting transactional guarantees across your DB data and queue (which you don’t get if your queue is external)

- not wanting to add more stack complexity (this has been an issue when you have to support on-prem deployments that you aren’t allowed to interact with).

I’m sure there are others.

Im sort of confused. Youre mentioning MVCC and deployment pipelines not having load but what portion of an application MVCC has anything to do with a queue, postgres or otherwise? Same with deployments? Maybe theres a specific model or deployment strategy thats in use that I am unaware of and its a blind spot?
Postgres uses multi version concurrency control. The amount of garbage kept around to ensure each transaction can be handled independently is roughly proportional to the length of the transaction multiplied by how much work is happening with any transaction overlap. That's potentially a problem with queues implemented in Postgres because the for-all-intents-and-purposes-dead rows being tracked by MVCC slow down each piece of work on the queue, increasing transaction times, and circularly causing a runaway scenario at lower thresholds than you might expect.

The comment about deployment pipelines was just that some shops explicitly load test with a multiple of prod-like data to find that sort of issue before prod. Doing so for a non-negligible amount of time is important though to catch qualitative shifts in the RDBMS behavior as it approaches a steady or runaway state as a result of any software change.

How do these techiques work with replication and sharding? Or do you just use cloud managed pg like AWS RDS to not think about it?
Just don't keep all your old data in Postgres forever. Set up a nightly cron job to archive old crap out.

Now you'll never* don't need to shard your Postgres.

* Unless you work at a very rare company.