Hacker News new | ask | show | jobs
by cuchoi 3070 days ago
I think what the article is proposing is that it can be quicker to drop the indexes and recreate them than to load a lot of data in an indexed table.
3 comments

I thought this is pretty common knowledge and I don't even do much with databases.
and I think that's what what johnny555 is questioning - isn't this common knowledge already? apparently not.

here's another tip, at least on mysql, but possibly other databases that have memory tables. Import stuff in to memory tables, then insert from the memory table to a disk-based table. I took a process that was naively importing data via SQL commands which took close to 24 hours down to around 20 minutes by breaking it up, chunking imports to memory tables, then copying those to permanent disk. This was years ago (12?) and mysql is probably better about insert handling than it was, but that approach (plus the drop/recreate indexes) meant this was a smallish process vs a 24 hour import cycle.

> apparently not.

Just becasue one blog post making it seem that way makes it not known ?

not just because of that - cuchoi also seemed to imply that.

relatedly/anecdotally, I still run in to people who aren't aware of this, as they don't understand what happens when you're inserting data in to a database or what indexes are in the first place.

"Batch-mode processing is always more efficient."

Not sure where I first heard that, but it applies here. Essentially it is almost the same thing as saying that computers are often set up to exploit economies of scale.

Thus building an index all at once after a large set of changes are made is more efficient than incrementally updating an index as each change is made.

It is more accurate to say that sequential I/O is always more efficient than random - even if there is no physical seek time anymore
The reasons why batch mode processing can be more efficient go beyond just sequential I/O.

If I do a little of something now and a little later and yet more even later, I will probably have to deal with caches that don't have my data in them because other things happen in the intervening time. If I do it all at once, then a lot of the work benefits from already-warm caches. (This can apply to disk caches, CPU data caches, and even instruction caches.)

Not to mention that sometimes batch mode processing opens up opportunities to use a more efficient algorithm (even if sometimes just by a constant factor). For example, if you maintain an index in a balanced tree and keep adding to it piecemeal, you do extra work continually rebalancing that tree. Whereas in theory if you built an index all at once, you could collect all the data, sort it using some kind of fast sort like mergesort, and then write out a final tree which is already balanced as desired and doesn't need to be rearranged as data comes in in random order.