Hacker News new | ask | show | jobs
by crazygringo 20 hours ago
> It takes far more work to delete/update than insert.

Updating rows of text data is going to be more work, because variable-length text can't be updated in-place. So in terms of allocating space, it's more like a delete plus an insert. That's not surprising. (An in-place update that doesn't touch indices is generally going to be faster than an insert, though.)

I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.

3 comments

> So in terms of allocating space, it's more like a delete plus an insert.

Unless you're using zHeap, you have a narrow Heap-only-Tuples scenario where the indexes stay the same. TOAST kinda helps there, if the update is off the tuple area itself. The original zHeap docs have a lot of detail about why an UNDO log can help with long running transactions from the past etc.

That is a postgresql specific thing though. Mysql indexes were created with the idea of different storage engines in mind, so Mysql doesn't suffer from the index update ovehead on update/delete the same way.

Uber had a long blog post about switching to Mysql from Postgres for wide tables with hundreds of indexes. The HN entry is still there[1], but I can't read the original post now.

As a side note, I've used postgres partitions to the same effect to drop old data periodically - detach and then drop the partition instead of a direct DELETE (similar tricks in HBase existed).

[1] - https://news.ycombinator.com/item?id=10894047

> The HN entry is still there, but I can't read the original post now.

The post on IA - https://web.archive.org/web/20160304013342/https://eng.uber....

Not directly database related, but when it comes to writing files on disks, deletes on SSDs can be rather expensive because of the delete block size vs a simple write.
Doesn't the block simply get marked as deleted and only wiped on a write?
> I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.

Transactionally across related items with constraints it can explode fast.

If you've ever used FoundationDB this rapidly becomes the defining PITA due to the transaction size limits. Adding/inserting/updates are all far more predictably bounded.

But in that case, you need to compare like-for-like with the situation where you need to insert all the prerequisite rows too. You can't just compare a delete cascade with a single insert where all the foreign keys are already satisfied.
The whole problem with the delete cascade is you can't tell how big it will be until you have entered the transaction to do it. An insert you either know or it will fail and you can retry.
That's true, but now you have moved the goalposts. The original claim upthread was "it takes just as much work to delete a row as it takes to insert a row", not "it's hard to predict the performance of a delete with cascade effects". And the obvious rebuttal to that is that it's equally hard to provide an upper bound for the runtime of a single insert: an application cannot control the other processes running on the database, some of which may delay, interfere with or even invalidate your query and you must account for that. A delete operation is just as much "it might fail and you can retry" as an insert, or the database you're working with isn't ACID-compliant.
> And the obvious rebuttal to that is that it's equally hard to provide an upper bound for the runtime of a single insert

This is precisely where you're going wrong. The insert is upper boundable in advance (you know the set of everything you might potentially have to insert), the delete isn't because you don't know what's in the db until you look.

I strongly recommend poking around with Foundation for this, because it becomes clear that this problem is the defining flaw with the way they tried to architect with layers, to the point they have a queuing system for processing large jobs of this type.

> The insert is upper boundable in advance

A concurrent DML happening then suddenly your MERGE INTO WHEN NOT MATCHED INSERT/INSERT INTO SELECT is way larger that you thought? I thought "some workloads can suddenly be way larger that I expected" was supposed to be a thing in all non-trivial DML.