Hacker News new | ask | show | jobs
by Spivak 1517 days ago
It's crazy how many apps opt for an RDBMS for append-only data like audit events. It's so tantalizing at the beginning but turns into a nightmare time marches forward.

   audit events -> queue -> elastic -> blob storage
is so easy to maintain and we save TBs from living in the DB.
4 comments

Actually, I've seen more problems with folks mixing lots of different tools up then I have from folks doing an append only audit event in a RDBMS.

When your audit trail is in DB, you can pretty easily surface audit events to your customers. Who changed what when is just another feature. Capturing audit events is also usually pretty smooth.

The folks doing the blob storage route, you would not BELIEVE the complexity they have to spin up to expose very simple histories etc. This matters a LOT in some spaces (financial etc), less so in others.

In my RDBMS model, who changed this field when from what to what is a basic select. You can even shard by recordID or similar if you want to reduce table scans, good select of indexes etc can be a huge help as well. In most cases users don't mind a bit of latency on these queries.

My only experience in the financial sector indicates the opposite. The firm held its trading history for tens of thousands of accounts going back 60 years in a SQL Server. Anyone who wanted a question answered had to submit it for overnight analysis and get the answer the next day. But in an optimal non-RDMS representation, said trading history could be condensed to a single 200MB flat file that could be queried interactively, in microseconds. Dumping the RDBMS for most use cases pretty much revolutionized the daily experience for the people at that firm.
This seems borderline impossible? I'd be curious if there were missing indexes or bad index selection or efforts to do things like joins? Normally for very large data sets you can shard by account if needed if that's the common filter if there is some insane table scan. Audit stuff tends to be "sparse" so if you can get an index which just tells you which pages have results, that is usually a 100X speedup with a pretty small index size.

But agreed, a daily dump to something can go a long way to unlocking other tools - in govt this is especially true not because the DBMS is hard to use, but because so many layers of consultants and others in the way it's not usable.

SQL Server can easily handle such datasets if columnstore is employed. I wouldn't be surprised if a single weekend of building a proper index (...being very generous here) wouldn't make their DB go literally 100x faster.
The problem at that firm wasn't the SQL Server it was the RDBMS mindset which led them to do joins between tables of journaled trades and much larger tables of corporate actions going back decades, instead of materializing a more useful intermediate result. This is my main beef with RDBMSs: they lead to cognitive hazards of this type. It's not that the databases are themselves naturally bad systems.
Were they running this on a 386? Why didn't they optimize their database?
It was "optimized" for the aesthetic concerns of RDBMS purists and for ease of implementation of certain other systems, in other words it was optimal for the developers and severely sub-optimal for the users, which is a problem endemic to RDBMS fandom.
So it was basically "write optimized", fully normalized, required at least a half dozen joins to get anything out of it?
One of the biggest thing that keeping audit records in your DB gives you is transactionality around your audit logs. Sending audit events to an external system (quite often) loses this, and the resources to address this before you have to are way larger than a slightly larger AWS/GCP/Azure/<insert-computer-provider-here> bill.
We're implementing something similar to what OP describes, but we'll keep the "queue" in the DB in order to insert the application audit event in the same transaction as the data change. A background process then uploads to secondary storage.

We won't have billions of rows though, so once uploaded to secondary storage we'll just clear the blob field and set a "processed" flag.

This way we can find all the relevant keys for a given order, invoice etc quickly based on a partial key search in the database, and transparently fetch from either db directly or secondary storage as needed.

We (Retool) are going to be doing this very soon and cut our database size by 50%+. And you're exactly right: it's so easy to get started by sticking audits (or other append-only data schema) in an RDBMS, but it quickly becomes a headache and bottleneck.
up until some size the headache from maintaining a separate datastore is bigger. everything should be in the RDBMS until proven otherwise for the sake of simplicity. it's actually amazing how much you can squeeze out of 'old school' databases.

e.g. https://docs.microsoft.com/en-us/azure/azure-sql/database/le...

What will you use as a queue and storage?
To be determined!
Mind expanding on what the "nightmare" is?