| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by monksy 486 days ago
	Famious last words. There are database as a queue antipattern warnings about this.

4 comments

srhtftw 485 days ago

> Famious last words.

These weren't his last words, but Jim Gray had this to say about this so-called "antipattern".

Queues Are Databases (1995)

Message-oriented-middleware (MOM) has become an small industry. MOM offers queued transaction processing as an advance over pure client-server transaction processing. This note makes four points: Queued transaction processing is less general than direct transaction processing. Queued systems are built on top of direct systems. You cannot build a direct system atop a queued system. It is difficult to build direct, conversational, or distributed transactions atop a queued system. Queues are interesting databases with interesting concurrency control. It is best to build these mechanisms into a standard database system so other applications can use these interesting features. Queue systems need DBMS functionality. Queues need security, configuration, performance monitoring, recovery, and reorganization utilities. Database systems already have these features. A full-function MOM system duplicates these database features. Queue managers are simple TP-monitors managing server pools driven by queues. Database systems are encompassing many server pool features as they evolve to TP-lite systems.

https://arxiv.org/abs/cs/0701158

reval 486 days ago

Why is that an anti-pattern? Databases have added `SKIP LOCKED` and `SELECT FOR UPDATE` to handle these use cases. What are the downsides?

maniacalhack0r 485 days ago

as with everything, it depends on how you're processing the queue.

eg we built a system at my last company to process 150 million objects / hour, and we modeled this using a postgres-backed queue with multiple processes pulling from the queue.

we observed that, whenever there were a lot of locked rows (ie lots of work being done), Postgres would correctly SKIP these rows, but having to iterate over and skip that many locked rows did have a noticeable impact on CPU utilization.

we worked around this by partitioning the queue, indexing on partition, and assigning each worker process a partition to pull from upon startup. this reduced the # of locked rows that postgres would have to skip over because our queries would contain a `WHERE partition=X` clause.

i had some great graphs on how long `SELECT FOR UPDATE ... SKIP LOCKED` takes as the number of locked rows in the queue increases, and how this partiton work around reduced the time to execute the SKIP LOCKED query, but unfortunately they are in the hands of my previous employer :(

izacus 485 days ago

How did you get from original post of "low level of load" to overengineering for "150 million objects/hr".

Is the concept of having different solutions for different scales not familiar to you?

pritambaral 485 days ago

I did sth similar. Designed and built for 10 million objects / hour. Picked up by workers in batches of 1k. Benchmark peaked above 200 million objects / hour with PG in a small VM. Fast forward two years, the curse of success strikes, and we have a much higher load than designed for.

Redesigned to create batches on the fly and then `SELECT FOR UPDATE batch SKIP LOCKED LIMIT 1` instead of `SELECT FOR UPDATE object SKIP LOCKED LIMIT 1000`. And just like that, 1000x reduction in load. Postgres is awesome.

----

The application is for processing updates to objects. Using a dedicated task queue for this is guaranteed to be worse. The objects are picked straight from their tables, based on the values of a few columns. Using a task queue would require reading these tables anyway, but then writing them out to the queue, and then invalidating / dropping the queue should any of the objects' properties update. FOR UPDATE SKIP LOCKED allows simply reading from the table ... and that's it.

maniacalhack0r 485 days ago

smart. although, i guess that pushes the locking from selecting queue entries to making sure that objects are placed into exactly 1 batch. curious if you ran into any bottlenecks there?

pritambaral 483 days ago

> ... making sure that objects are placed into exactly 1 batch. curious if you ran into any bottlenecks there?

A single application-layer thread doing batches of batch creation (heh). Not instant, but fast enough. I did have to add 'batchmaker is done' onto the 'no batch left' condition for worker exit.

> ... that pushes the locking from selecting queue entries to ...

To selecting batches. A batch is immutable once created. If work has to be restarted to handle new/updated objects, all batches are wiped and the batchmaker (and workers, anyway) start over.

ninju 485 days ago

I believe the article and parent comment were discussing queue solutions for low-volume situations.

maniacalhack0r 485 days ago

completely missed this. apologies.

stickfigure 485 days ago

40,000 per second is waaaaay beyond where you should use a dedicated queuing solution. Even dedicated queues require tuning to handle that kind of throughput.

(or you can just use SQS or google cloud tasks, which work out of the box)

pritambaral 485 days ago

I hit 60k per second in 2020 on a 2-core, 100GB SSD installation of PG on GCP. And "tuning" PG is way easier than any dedicated queueing system I've seen. Does there exist a dedicated queueing system with an equivalent to EXPLAIN (ANALYZE)?

stickfigure 485 days ago

If that's true, you managed to do much better than these folks:

https://softwaremill.com/mqperf/

Maybe you should write a letter?

winrid 485 days ago

It's possible the person you're replying to wasn't using replication, so it's entirely different. Those folks also used "synchronous_commit is set to remote_write" which will have a performance impact

kstrauser 485 days ago

I worked at a shop that had to process about 6M RPS for 5 seconds at a time, once a minute or so. That looked a lot like a boatload of Python background threads queueing work in memory then flushing them out into Cassandra. That was a fun little project.

chupasaurus 485 days ago

> 150 million objects / hour

Is not a low volume unless this could be done in batches of hundreds.

maniacalhack0r 485 days ago

completely missed this. apologies.

makeitdouble 486 days ago

I suppose you are referring to this:

https://mikehadlow.blogspot.com/2012/04/database-as-queue-an...

The main complaint seems to be that it's not optimal...but then, the frame of the discussion was "Until you hit scale", so IMHO convenience and simpler infra trumps having the absolute most efficient tool at that stage.

golergka 486 days ago

Can you elaborate? I guess it has to do with connection pooling?