Hacker News new | ask | show | jobs
Ask HN: 120M rows Postgres – how can I speed up queries?
23 points by dbnewbie 2419 days ago
The table is flat (no related), but many queries we make (including aggregates) take hundreds of milliseconds.

Forgetting the specific queries for a moment (basically all queries in this table are relatively slow):

How would you handle such a scenario? Shard the database?

6 comments

Have you checked if you are using the correct settings in order to extract the max from your hardware?

https://pgtune.leopard.in.ua/#/

Are you monitoring your machine to check if it is not starving on cpu?

Re the queries, we have been using pgbadger to collect metrics about the usage of the dB and the slowest queries by type. This is helpful as it guides where you should put your efforts.

https://github.com/darold/pgbadger/blob/master/README.md

This is very good ref about scaling Postgres.

https://pyvideo.org/pycon-ca-2017/postgres-at-any-scale.html

Thank you an absolute ton! I’m gonna be watching that scaling video from Pycon this afternoon!

We definitely have a decent configuration, in terms of using the hardware best for our typical workload. We also have indexes on the fields we are using for filtering. It’s crazy, I put this question here because I feel like this is an inevitable thing that everybody just runs into over and over with scaling and the only real out is sharding, so I’m glad to see you a bunch of suggestions here and I’m going to watch that video as I mentioned. Thanks again.

Check out https://www.pgmustard.com/ if you haven't already. It has helped me optimize queries as much as possible when querying large datasets.
Thanks! That’s awesome. I know roughly how to understand PG query plans, but it doesn’t ever seem to actually help me figure out a solution. This service looks like it will help quite a bit.
There is not a lot of specific details here but have you already tried and exhausted things like analyzing execution plans, index optimizations, index encoding, value encoding, denormalization, read replicas, and [front] caching?
I have not looked into “index encoding”. In fact, I haven’t even heard of that, thank you for the suggestion!

In terms of execution plan, the query we are doing is relatively basic even though it includes some aggregation. The aggregation is rule (CASE) based and very simple. It feels like there is no way to quickly (sub-50ms) retrieve information from a database once you reach the high tens of millions of rows.

By index encoding I just mean putting “encoded” values into the indexed column of your table. Something like this: “city:department:sku”.

This technique is a bit advanced, borrowed from hierarchical databases, and optimizes for specific queries known upfront, so it’s cool but not very flexible. There is a lot more to making it work. You can watch [1], if interested.

But I’d also +1 other suggestions here on fine tuning your db engine and just scaling up the server.

[1] https://youtu.be/jzeKPKpucS0

Disclaimer: I’m with AWS.

Interesting, that does sound like an advanced indexing technique, but also sounds like a really good idea. It reminds me of the old flat file database formats I read about.
Have you thought about materialized views that pre-aggregate your queries and then updating those views with triggers? I got this idea from a forum so it’s not mine but it works.

Sharding by month or other bucket of time could help.

We have a very similar situation except it’s billions of rows. One benefit is it’s a bit denormalized in that we store the meat of our data in a hstore field

Thanks a ton for the help, I will take a look into what we might be able to do in terms of adding a materialized aggregate view!
You’ll get the most benefit from pre-aggregating not for a specific query but in a general way. Sum orders group by user in a view and then use triggers to update that view. That way to get the orders for a given user it’s a simple lookup and all the calculation is done. And this way you can build up meta reports from these general aggregates.
Ok awesome, and thanks again!
Just to be super clear don’t use the triggers to rebuild the view every time but merely to add one to the sum’d account when a row matches — I wasn’t sure I made that clear. Good luck! And do put up a blog post or something if it works out
OK that makes sense, and I’ll do a bit of reading on materialized view functionality before actually implementing anything.
From my experience 7+ years ago, I saw an order of magnitude difference running Postgres on higher end bare metal hardware (think $6k rack mounted server w/ nice RAID controller) vs a supposed high end AWS VM w/ an EBS volume. That was testing with spinny disks before SSDs were the norm, but if you are on a cloud VM, it's probably worth investigating.
Thank you for the suggestion. I don’t know for certain that it’s a tenable solution, because we are using some services that would take some time to vertically integrated if we were to go to a colo. But when we reach that scale that will definitely be an effort worth exploring.
Use c_store fdw to store your data in columns and marvel at the 10x performance improvement without indexes.
Excellent, thanks for the suggestion. I hadn’t heard of this lib until now. And, seeing that this is developed by Citus makes it automatically 10x better.