Hacker News new | ask | show | jobs
by brandur 2509 days ago
Yes, exactly — normalization is really useful for reasons of quality and correctness, but generally not so important for data like logs that's rotating through the system on a pretty constant basis.

And addressing the parent's point on databases: they don't look like an RDMS, but you can kind of think of log management/querying systems like Splunk et al. to be like a specialized database with specific properties:

- Flexible indexing: Logs change frequently which makes keys come and go, so it's convenient not to not have to be constantly building new indexes to make them searchable.

- Optimized for recent data: Newer logs tend to be accessed relatively frequently and older logs much more rarely (if ever), so it's generally a good idea for these systems will rotate data through different tiers of storage as they age — the new on fast machines with fast disks, the old on slower machines with large disks, and the very old probably just in S3 or something.

- High volume: Any of the traditional relational databases would have a lot of trouble with the volume of data that we put through Splunk. (That said, its problem domain is more constrained — it scales horizontally much more easily because it doesn't have have to concern itself with things like consensus around write consistency.)

1 comments

How many columns does the average canonical log entry at Stripe have? What's the mix of low/high cardinality string fields look like vs number of metric/counter fields?
On the order of many dozens of fields and it's a pretty good mix of all of those.

Lots of low cardinality fields, lots of counters and numbers (e.g. request duration), and quite a few high cardinality fields too. e.g. IDs, IPs.