by markelliot 4672 days ago
What's the raw scale of input data for your 300GB/day of stored events? (assuming that's 300GB on disk stored in Elasticsearch)

I think it was roughly 300 million events/day (about 1 KB per event). There is some overhead incurred by logstash (turning a log line into JSON, parsing it into fields) and by elasticsearch (analyzing/indexing data).
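As a quick sanity check on those figures, here is a minimal sketch of the arithmetic. It assumes "1 KB" means 1000 bytes and "GB" means 10^9 bytes; the function name is made up for illustration.

```python
# Back-of-the-envelope check of the daily volume quoted above.
# Assumptions: 1 KB = 1000 bytes, 1 GB = 10**9 bytes.

EVENTS_PER_DAY = 300_000_000   # "roughly 300 million events/day"
BYTES_PER_EVENT = 1_000        # "1 KB per event"


def daily_volume_gb(events_per_day: int, bytes_per_event: int) -> float:
    """Return the daily data volume in decimal gigabytes."""
    return events_per_day * bytes_per_event / 1e9


if __name__ == "__main__":
    print(f"{daily_volume_gb(EVENTS_PER_DAY, BYTES_PER_EVENT):.0f} GB/day")
```

With binary units (1 KiB = 1024 bytes) the result shifts by a few percent, which is well within the "roughly" in the numbers above.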

In practical terms, a plain-text Apache access log, fully parsed by logstash (fields broken out, etc.), has historically bloated quite a bit: I have measured 6.2x. Lately, however, with improvements to logstash, better default settings, and elasticsearch being awesome, the 'inflation' number comes down to something more like 1.5x, which isn't bad considering all the awesome you get with it.
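To make the 'inflation' number concrete, here is a rough sketch that applies the two ratios mentioned above to a raw log volume. The 100 GB input is a made-up figure for illustration, not a measurement, and the function name is hypothetical.

```python
# Stored size = raw size * inflation ratio.
# The ratios (6.2x, 1.5x) come from the comment above; the raw volume
# below is a hypothetical example value.

HYPOTHETICAL_RAW_GB = 100.0


def stored_size_gb(raw_gb: float, inflation_ratio: float) -> float:
    """Estimate on-disk size after parsing and indexing."""
    return raw_gb * inflation_ratio


if __name__ == "__main__":
    for label, ratio in (("historical", 6.2), ("recent", 1.5)):
        stored = stored_size_gb(HYPOTHETICAL_RAW_GB, ratio)
        print(f"{label}: {ratio}x -> about {stored:.0f} GB stored")
```

On the same input, going from 6.2x to 1.5x cuts the stored size to roughly a quarter.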

Long term, I am working towards getting the stored-to-raw ratio below 1x, so that the stored data takes less space than the raw input.

You can see some experiments I did a year ago on this: https://github.com/jordansissel/experiments/blob/master/elas...

I will repeat these experiments after the next release of logstash, and I expect storage ratios to improve significantly.