by shawnz 2509 days ago
Denormalization typically improves performance. Normalization isn't done for performance reasons but for consistency reasons, i.e. so that data isn't duplicated and there is only one source of truth.

Yes, exactly: normalization is really useful for reasons of quality and correctness, but generally not so important for data like logs, which rotate through the system on a pretty constant basis.

And addressing the parent's point on databases: they don't look like an RDBMS, but you can kind of think of log management/querying systems like Splunk et al. as a specialized database with specific properties:

- Flexible indexing: Logs change frequently, which makes keys come and go, so it's convenient not to have to be constantly building new indexes to make them searchable.

- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so it's generally a good idea for these systems to rotate data through different tiers of storage as it ages: the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.

- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained: it scales horizontally much more easily because it doesn't have to concern itself with things like consensus around write consistency.)
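
To make the "optimized for recent data" point a bit more concrete, here's a rough sketch of age-based tiering. This isn't how Splunk or any particular system does it; the tier names, the 7-day and 90-day cutoffs, and the `storage_tier` function are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tier boundaries (assumed values, not from any real system).
# Each entry is (maximum age, tier name), ordered from newest to oldest.
TIERS = [
    (timedelta(days=7), "hot"),    # fast machines with fast disks
    (timedelta(days=90), "warm"),  # slower machines with large disks
]
ARCHIVE_TIER = "archive"  # very old data, e.g. object storage such as S3


def storage_tier(log_time, now):
    """Return the name of the tier that a log entry of this age belongs in."""
    age = now - log_time
    for max_age, tier in TIERS:
        if age < max_age:
            return tier
    return ARCHIVE_TIER


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (1, 30, 365):
        print(days, storage_tier(now - timedelta(days=days), now))
```

A periodic job could run something like this over each chunk of log data and move anything whose tier has changed, so that the fast disks only ever hold the recent data that actually gets queried.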

How many columns does the average canonical log entry at Stripe have? What does the mix of low/high cardinality string fields look like vs. the number of metric/counter fields?
On the order of many dozens of fields and it's a pretty good mix of all of those.

Lots of low cardinality fields, lots of counters and numbers (e.g. request duration), and quite a few high cardinality fields too, e.g. IDs and IPs.
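
If you want to eyeball that mix on your own logs, a rough sketch like the one below counts distinct values per field. It assumes space-separated `key=value` lines with no spaces inside values, and the field names and sample lines are invented for illustration, not real Stripe data.

```python
from collections import defaultdict


def field_cardinality(lines):
    """Count the distinct values seen for each key across key=value log lines."""
    values = defaultdict(set)
    for line in lines:
        for pair in line.split():
            key, _, value = pair.partition("=")
            values[key].add(value)
    return {key: len(seen) for key, seen in values.items()}


# Hypothetical sample lines (made-up fields and values).
SAMPLE = [
    "http_method=GET http_status=200 duration=0.042 request_id=req_a1 remote_ip=10.0.0.1",
    "http_method=POST http_status=200 duration=0.180 request_id=req_b2 remote_ip=10.0.0.2",
    "http_method=GET http_status=404 duration=0.009 request_id=req_c3 remote_ip=10.0.0.3",
]

if __name__ == "__main__":
    for key, count in sorted(field_cardinality(SAMPLE).items()):
        print(key, count)
```

Fields like the method and status stay at a handful of distinct values no matter how many lines you feed in, while IDs and IPs keep growing with the input, which is the low/high cardinality split in practice.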