by jordansissel 4686 days ago
At my last job (prior to joining elasticsearch), I had a cluster of 7 machines (16 cores, 16gb ram, 2TB raid1), each running logstash and elasticsearch.

The event rate going into this cluster was about 5000 events/sec on average (burst up to 10,000 events/sec sometimes).

During a maintenance window (two machines offline for disk repairs), I benchmarked the surviving 5-node cluster at 88,000 events/sec peak performance.

In terms of capacity planning, this means that we could have a 9x increase in normal event load and still not need to grow the cluster's processing capacity.
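As a quick sanity check of that headroom figure, here is a minimal sketch that uses only the numbers quoted above. The function and constant names are illustrative, and the benchmark was taken on 5 of the 7 nodes, so the full cluster may have more margin than this shows.

```python
def headroom(peak_capacity, load):
    """Return how many times the given load fits into the measured peak capacity."""
    return peak_capacity / load


BENCHMARK_PEAK = 88_000  # events/sec, measured on the surviving 5-node cluster
AVERAGE_LOAD = 5_000     # events/sec, typical ingest rate
BURST_LOAD = 10_000      # events/sec, occasional bursts

print(f"headroom vs average load: {headroom(BENCHMARK_PEAK, AVERAGE_LOAD):.1f}x")
print(f"headroom vs burst load:   {headroom(BENCHMARK_PEAK, BURST_LOAD):.1f}x")
```

The 9x figure lines up with the burst rate (8.8x); measured against the 5,000 events/sec average, the ratio is closer to 17.6x.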

Persistent storage is another story. We stored about 300GB/day of events, getting us roughly 45 days of data retention before we would run out of space (2TB * 7 nodes / 300GB/day). I'm working on improving storage efficiency of logstash and elasticsearch, too, so retention should improve greatly in the long term.
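The retention estimate is simple division. A rough sketch, assuming 1TB = 1000GB, that the full 2TB per node is usable for event data, and that no replica copies are counted (none of which is stated above); the function name is made up for illustration:

```python
def retention_days(disk_per_node_gb, nodes, ingest_gb_per_day):
    """Days of data that fit on the cluster at a constant daily ingest volume."""
    return disk_per_node_gb * nodes / ingest_gb_per_day


# 7 nodes with 2TB each, storing about 300GB/day
days = retention_days(disk_per_node_gb=2_000, nodes=7, ingest_gb_per_day=300)
print(f"about {days:.1f} days of retention")
```

This comes out slightly above 45 days; any space reserved for the OS, replicas, or merge overhead would pull the real number down.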

For other experiences, it's useful to ask the community what others have done - the #logstash irc channel on freenode is very active, as is the logstash-users@googlegroups.com mailing list.

Hope this helps!

3 comments

Thanks for the detailed reply! My use case is a stream of distinct, ordered events identified by a UUID, where the first event makes up about 95% of the volume; that is, we don't often receive subsequent events with the same UUID.

The initial event and any subsequent ones tend to arrive close together in time, so the challenge is to find something that can handle a high insertion rate and a relatively low update rate while providing fast aggregations suitable for charting in a web frontend. In Riak, Couchbase or HyperDex we'd use a secondary index and do our own math, but Elasticsearch is attractive because it appears to support the kind of queries we're interested in out of the box, in addition to having a good reported write rate.

Persistence is less of an issue, because after a short period of time (a couple of hours) we would summarise the events into our analytics DB (Infobright) and so we could set a TTL on the data stored in Elasticsearch.

Again, thanks for the response and I'll check out the mailing-list and IRC channel.

Edit: Grammar

What's the raw scale of input data for your 300GB/day of stored events? (assuming that's 300GB on disk stored in Elasticsearch)

I think it was roughly 300 million events/day (1KB per event). There is some overhead incurred by logstash (turning a log into json, parsing it into fields) and by elasticsearch (analyzing/indexing data).

In practical terms, and by way of example, a plain text apache access log, fully parsed by logstash (breaking out fields, etc), has historically bloated by quite a bit (6.2x I have measured). Lately, however, with improvements to logstash, better default settings, and elasticsearch being awesome, the 'inflation' number gets down to something more like 1.5x - which isn't bad considering all the awesome you get with it.
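To make the inflation ratios concrete, here is a small sketch. The 100GB raw volume is a hypothetical input chosen for illustration, not a figure from my cluster, and the function name is invented:

```python
def stored_size_gb(raw_gb, inflation_ratio):
    """Estimate on-disk size after parsing and indexing, given a raw log volume."""
    return raw_gb * inflation_ratio


RAW_GB = 100  # hypothetical raw log volume, for illustration only

for label, ratio in (("historical", 6.2), ("recent", 1.5)):
    print(f"{label}: {RAW_GB}GB raw -> about {stored_size_gb(RAW_GB, ratio):.0f}GB stored")
```

The same raw input takes roughly four times less disk at 1.5x than at 6.2x, which translates directly into longer retention on the same hardware.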

Long term, I am working towards making the 'raw data to stored data' ratio something less than 1x.

You can see some experiments I did a year ago on this: https://github.com/jordansissel/experiments/blob/master/elas...

I will repeat these experiments after the next release of logstash, and I expect storage ratios to improve significantly.

If you're taking feature requests, a plugin to archive to S3 would be really nice for long-term data retention. And more props to the lumberjack-go port.