Hacker News new | ask | show | jobs
by mbell 2015 days ago
I feel like the article misses most of the 'meat' of bulk loading setups. e.g. Using partitioned tables, attaching partitions with little to no locking, indexing after loading, `COPY ... FREEZE`, etc.

One other note:

> Goto solution for bulk loading into PostgreSQL is the native copy command. But one limitation with the copy command is that it requires the CSV file to be placed on the server.

You can use `COPY ... FROM STDIN` and stream the data from the client, this is basically what `/copy` does in psql.

1 comments

The article did compare loading with index vs index after load.

What good do partitions do? Loading in parallel?

One use case I've used them for is when you have a table that is 'under fire', and you need to add bulk data to it. Doing a large insert into a table like that can cause various issues. Instead I a partitioned table and each bulk load is a separate partition. This allows doing the bulk load to a 'clean' table out of the line of fire. You can then apply indexes, do cache warming or whatever else you may want to do to it, then attach the partition to the parent table to make it 'active'. As long as you have covering constraint on the partition that matches the partition constraint, this results is minimal impact to query performance. We do bulk loads of partitions with ~100-200M rows this way and it's barely noticeable to the query side latency.
The best use case I've had for partitioning tables is when there is a lot of churn that would otherwise result in significant vacuuming. For example, a table holding the past 7 days of data would require deleting 1/7 of its rows every day.

Swapping out a child table and dropping the old one is free in that sense, but does require an exclusive lock which can cause issues if you have long running queries.