Hacker News new | ask | show | jobs
by drob 3316 days ago
Batch inserts usually increase throughput by reducing the number of write IOPS, fsyncs, etc. They usually aren't associated with a 10x _CPU_ savings, which is the finding here.

There are a million 'db best practices' you can go implement blindly, but the point is that this methodology – determining the bottlenecking resource and then profiling to determine exactly what is consuming it – will _reliably_ yield huge wins, whereas implementing 'best practices' on gut alone is a very inefficient way to improve performance.

2 comments

I'm not saying you need to do all "best practices" blindly but if insert speed is your problem (at a high level) batching is the very first thing to explore.

Without any specific numbers backing up the 10x we can only guess what improved 10x. All of those things you listed show up as CPU wait events as well. Without specifics I assume he means they inserted the same row count in 1/10th of the time. Not that there was a direct drop in CPU tasks.

Author here. We were under the assumption that CPU was mostly being used for evaluating the partial index predicates. Under this assumption, we figured batching was unlikely to yield much of a benefit. It wasn't until we actually profiled Postgres did we realize batching would be worth a try.

As for the numbers, we specifically got a 10x improvement in ingestion throughput.

> Batch inserts usually increase throughput by reducing the number of write IOPS, fsyncs, etc. They usually aren't associated with a 10x _CPU_ savings, which is the finding here.

Often the saving on roundtrips is bigger than any of these. Most database client libraries work synchronously, so if you insert via single row INSERT statements you'll approximately get a two context switches, and a roundtrip for each row. That's often more costly wall clock time wise than the insertion itself.