
by zeroq 456 days ago

It's roughly 10 GB across several CSV files.

I create a new in-memory db, run the schema, and then import every table in one single transaction (my testing showed that it doesn't matter whether it's a single batch or many single inserts, as long as they are part of a single transaction).

I do a single string replacement on every CSV line to handle an edge case. This results in roughly 15 million inserts per minute (give or take, depending on table length and complexity). 450k inserts per second is a magic barrier I can't break.
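For anyone who wants to try the import step, here is a minimal sketch of it. It assumes SQLite through Python's `sqlite3` module, which the comment doesn't actually name. The `items` table, the `import_csv` helper, and the `\N` replacement are all made up for illustration, since the real schema and edge case aren't given.

```python
import csv
import io
import sqlite3


def import_csv(conn, table, lines, fix=lambda s: s):
    """Insert every CSV line into `table` inside one transaction.

    `fix` is a hypothetical per-line string replacement hook.
    """
    rows = csv.reader(fix(line) for line in lines)
    header = next(rows)  # first line holds the column names
    placeholders = ", ".join("?" for _ in header)
    sql = f"INSERT INTO {table} ({', '.join(header)}) VALUES ({placeholders})"
    with conn:  # one commit at the end, rollback on error
        cur = conn.executemany(sql, rows)
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")

# Stand-in for a real CSV file; "\N" plays the role of the edge case.
sample = io.StringIO("id,name\n1,foo\n2,\\N\n")
count = import_csv(conn, "items", sample, fix=lambda s: s.replace("\\N", ""))
```

The rows are streamed straight from the reader into `executemany`, so nothing is buffered beyond what the database itself holds.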

I then run several queries to remove unwanted data, trim orphans, add indexes, and finally run optimize and vacuum.
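The cleanup pass could look roughly like this, again assuming SQLite via Python's `sqlite3`. The `authors` and `books` tables, the index name, and the delete conditions are invented for the example; only the order of the steps comes from the comment.

```python
import sqlite3

# isolation_level=None gives autocommit mode, so transactions are explicit
# and VACUUM is never issued inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE authors (id INTEGER, name TEXT);
    CREATE TABLE books (id INTEGER, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'kept'), (2, 'unwanted');
    INSERT INTO books VALUES (10, 1, 'a'), (11, 2, 'b'), (12, 3, 'c');
""")

conn.execute("BEGIN")
# Remove unwanted data (hypothetical condition).
conn.execute("DELETE FROM authors WHERE name = 'unwanted'")
# Trim orphans: books whose author no longer exists.
conn.execute("DELETE FROM books WHERE author_id NOT IN (SELECT id FROM authors)")
conn.execute("COMMIT")

# Indexes are added after the bulk work, then the db is optimized and compacted.
conn.execute("CREATE INDEX idx_books_author ON books (author_id)")
conn.execute("PRAGMA optimize")
conn.execute("VACUUM")

remaining = [row[0] for row in conn.execute("SELECT id FROM books ORDER BY id")]
```

Creating indexes only after the deletes means the deletes don't have to maintain them, which is presumably why the comment orders the steps this way.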

Here's a fairly recent log (on a stock Ryzen 5900X):

   08:43 import
   13:30 delete non-essentials
   18:52 delete orphans
   19:23 create indexes
   19:24 optimize
   20:26 vacuum
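If the timestamps are cumulative elapsed time in mm:ss (an assumption; the comment doesn't say), the time spent in each stage can be worked out with a short sketch like this:

```python
# Log entries copied from the comment; interpreted as cumulative mm:ss.
log = [
    ("import", "08:43"),
    ("delete non-essentials", "13:30"),
    ("delete orphans", "18:52"),
    ("create indexes", "19:23"),
    ("optimize", "19:24"),
    ("vacuum", "20:26"),
]


def to_seconds(stamp):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


durations = {}
previous = 0
for stage, stamp in log:
    elapsed = to_seconds(stamp)
    durations[stage] = elapsed - previous  # seconds spent in this stage alone
    previous = elapsed

for stage, secs in durations.items():
    print(f"{stage:<22} {secs // 60:02d}:{secs % 60:02d}")
```

Under that reading, the import and the two delete passes account for almost all of the run, while index creation, optimize, and vacuum together take about a minute and a half.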