Hacker News new | ask | show | jobs
by russell 6048 days ago
Dingwall lumps a number of separate issues under "soft delete": undo, audit trails, soft create, and performance in the presence of historical data. He presents several solutions, not all of which I would buy.

The is_deleted column is a pretty simple solution that we all use and there are a number of solutions to the problem of retrieving only the active columns, such as views.

Audit trails and performance are more interesting. A while back I worked for a web analytics company and we had the problem of the storage and performance costs of historical data. Only the 5% of the data was of any real interest, but the 95% historical data made writes slow because of the large number of indexes. They adopted the solution of historical tables with fewer indexes on cheaper drives.

I like the solution of serializing historical, deleted, and audit data and storing them in a NonSQL database of your choice. Then you can bring them back as individual undos, or into a data mining database for scenario playing.

I dont particularly like his suggestion of creating separate tables for each state of an element. I think that's needless complexity.

1 comments

> but the 95% historical data made writes slow because of the large number of indexes.

Why does updating an index take more time if there are many table rows?

i.e. if you have 5 indices on a table and 1 write/second, what's the difference in the work done by the db whether there are 1000 rows or 100k rows?

We were dealing with terabytes of data and 10's of billions of writes per month. Most tables had 4 to 6 indexes, resulting in multiple updates for each write. If your data sets and indexes are huge there is little locality of data. Each row returned may have several disk hits. If you can keep your table sizes small enough that the indexes can remain in memory, performance is mush better.

Even worse were the autogenerated queries that joined a dozen tables together. I saw some that were 1500 lines long.

Simple explanation: Most database indexes are forms of B-trees. Insertion into a binary tree isn't constant time O(1), it's usually O(log n).
Thanks, I was thinking in terms of hashes (which doesn't really make sense for searching/ordering).
An index scales in proportion with the data it is indexing. The large size might produce more I/O and thus less than constant algorithmic complexity.