by giancarlostoro 3014 days ago
Ok, so maybe someone can tell me what I (we) did wrong at my job. We tried using the ELK stack, and it's probably still running, but it is such a resource hog. I do not understand why they built Elasticsearch. I've read in a couple places you need like 32GB of RAM[0] just to run this thing to do queries, and having crashed Kibana / Elasticsearch a dozen times I believe it's designed poorly. I had hoped I could drop in MongoDB instead, but saw no indication that this would be a smooth change. How many resources are any of you allocating towards your 'ELK' stack (I say 'ELK' because now they have other software in the mix)?

Needless to say, having experienced all of this, I'd rather build my own solution for logging instead, using a database that's not in-house.

[0]: https://www.elastic.co/guide/en/elasticsearch/guide/current/...

Oh wow, now it says 64GB of RAM is the sweet spot.... What the heck is this thing doing that couldn't have been accomplished with MongoDB or PostgreSQL? I've got busier data sets that don't need 16GB of RAM, and yes, we pound the database with logs of all sorts and query in all sorts of ways, and still I don't get it... I wouldn't recommend this stack to a friend unless they've got plenty of hardware to spare.

12 comments

> tell me what I (we) did wrong at my job. We tried using the ELK stack, and it's probably still running, but it is such a resource hog.

Part of the problem is what it's promoted for. It's a great drop-in, horizontally-scalable, full-text search engine, that's inexplicably become popular for log ingestion and analysis.

To those ends, I hate it, I hate every bit of it, from the atrocious JSON-based query DSL (seriously thought it was a joke at first) to its unpredictable timeouts, shard storms, mapping conflicts and other problems at scale. Elementary SQL concepts aren't possible ('select someone_elses_poorly_named_key as first_name', nope, you gotta reindex). High-cardinality aggregations fail in spectacular ways. High-anything aggregations fail in spectacular ways. The scroll API returns results unordered. There's no way to properly spec your cluster; the docs explicitly take a trial-and-error approach to design.

It's not just you. Elasticsearch does me no favors with the task of log analysis. I'd sooner normalize and grep a pile of gzipped log files than keep dealing with this mess, but this is the second job I've been at that's built their logging infrastructure on top of it.

> I do not understand why they built Elasticsearch.

"You know, for search."

It's great for searching proper text. Documents, comments, blog posts, etc.

> I've read in a couple places you need like 32GB of RAM[0] just to run this thing to do queries, and having crashed Kibana / Elasticsearch a dozen times I believe it's designed poorly.

You don't necessarily need 32GB to do anything. The required heap size scales with your intended workload, but it's not like you can do a back-of-napkin calculation to figure out the relation. I run a 1GB instance for development.

It uses the memory to do a lot of caching so the queries you throw at it are lightning fast. Mongo does something similar (but crappier IMO) with its concept of working sets.

> atrocious JSON-based query DSL (seriously thought it was a joke at first)

I could not agree more, I found this in my codebase https://i.imgur.com/44stiHv.png

I've never seen an odder choice of data structure.
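For anyone who hasn't run into the DSL yet, here is a minimal sketch of what a fairly ordinary "filter, then group and count" looks like when built as a Python dict. The field names (`status`, `@timestamp`, `host`) are made up for illustration and would have to match your own mapping.

```python
import json

# Hypothetical field names ("status", "@timestamp", "host"); adjust to your mapping.
# Roughly equivalent to:
#   SELECT host, COUNT(*) FROM logs
#   WHERE status = 500 AND "@timestamp" >= now() - 1 hour
#   GROUP BY host ORDER BY COUNT(*) DESC LIMIT 10
query = {
    "size": 0,  # return only the aggregation, no individual hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": 500}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "by_host": {"terms": {"field": "host", "size": 10}}
    },
}

# The JSON printed here is what would go in the body of a _search request.
print(json.dumps(query, indent=2))
```

Four levels of nesting for one WHERE clause and one GROUP BY is probably what people are reacting to.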

I've deployed it a few times. In some cases it randomly spikes to 100% cpu usage until you restart it.

Is there a way to know how much RAM you are going to need for your dataset? I think I was using it for looking up restaurant names from a database of 100,000 and I wanted to factor in misspellings and partial matches.

Not really, it depends on the size of the dataset and the complexity of the operations. When did you use it? Sounds like you were on a dodgy build or setup.

I think Elasticsearch's policy on sizing is to pay them or a partner to have a look and give you a guesstimate, which is pretty standard.

We ran into this same problem trying to build and run a data management platform on it... IDK if you want to try what we built, but it's a hell of a lot faster and less cumbersome. We went from 30 ES servers down to 4 Pilosa servers.

https://github.com/pilosa/pilosa

I spent so much time resurrecting Elasticsearch, and then developers wouldn't even use it because the query language is based on ngrams and they want grep-like capabilities instead.

Unless there are hundreds of GB/day to index it's much simpler to forward the logs using syslog or journald and then use grep on the collector.
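As a rough sketch of the "just grep on the collector" approach, assuming the forwarded logs end up as plain or gzip-rotated files in one directory (the function name and directory layout are hypothetical, not something from this thread):

```python
import gzip
import re
from pathlib import Path
from typing import Iterator, Tuple


def grep_logs(log_dir: str, pattern: str) -> Iterator[Tuple[str, str]]:
    """Yield (file name, line) for every line matching the regex `pattern`.

    Handles both plain and gzip-compressed files directly inside `log_dir`.
    """
    regex = re.compile(pattern)
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # Rotated logs are usually gzip-compressed; pick the opener by suffix.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if regex.search(line):
                    yield path.name, line.rstrip("\n")
```

Something like `for name, line in grep_logs("/var/log/remote", r"Failed password"): print(name, line)` would then scan every file, with the path being whatever your collector writes to. It is a full scan on every query, which is exactly the trade-off being suggested here.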

I definitely used it on 8GB of RAM but only about 10M documents or so. It's pretty kick-ass for ad-hoc queries about data for which you would typically have to set up a star schema. I tell you, probably the best thing you can do is set up a Kafka queue, Apache Spark and Elasticsearch (do the research around these 3), and you'll love the ability to find out things like how many M(ale) patients above the age of 30 who have diabetes have died shortly after a surgery. Trying to set all that up with complicated star schemas etc. really sucks compared to just building a JSON format that you pipe through Kafka or Spark.

Edit: And yes, I originally used it for log processing, but really people should definitely try it out for replacing very expensive BA stacks. For log analysis, try something called Graylog, which actually uses ELK internally, or just go with Splunk, which gives you out-of-the-box primitives for session length calculations. In ELK, if you want to do something that sounds as simple as session length, you'll end up having to reprocess documents using Kafka or Spark, with background jobs that reprocess documents greater than 1 hour old (or something to that effect).
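To make the session-length point concrete, here is a small sketch of the calculation itself in plain Python. The 30-minute inactivity gap and the `(user, timestamp)` event shape are assumptions for illustration. The thing to notice is that it needs all of a user's events side by side, which is awkward when every event is indexed as an independent document.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def session_lengths(
    events: Iterable[Tuple[str, datetime]],
    gap: timedelta = timedelta(minutes=30),  # assumed inactivity cutoff
) -> Dict[str, List[timedelta]]:
    """Group each user's events into sessions and return their durations.

    A new session starts when a user's next event is more than `gap`
    after the previous one.
    """
    by_user: Dict[str, List[datetime]] = {}
    for user, ts in events:
        by_user.setdefault(user, []).append(ts)

    result: Dict[str, List[timedelta]] = {}
    for user, stamps in by_user.items():
        stamps.sort()
        start = prev = stamps[0]
        lengths: List[timedelta] = []
        for ts in stamps[1:]:
            if ts - prev > gap:
                # Gap exceeded: close the current session, open a new one.
                lengths.append(prev - start)
                start = ts
            prev = ts
        lengths.append(prev - start)  # close the final session
        result[user] = lengths
    return result
```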

So many of us have been through this, just like you.

My realization was also what others have mentioned, that I'm trying to use a search engine backend for log storage.

Specifically 3 months, which is the law here. That meant that ES had to keep 3 months of logs readily searchable. It's just not feasible when you're generating 25-30GB of logs each day.

I still use ES for search engines but I've stopped using ELK for logs unless it's for small environments.

This is why I never got into the ELK craze, especially coming from old-school syslog/syslog-ng/rsyslog/journald/etc., which I usually searched from the command line.

For some reason, though, over the past few years companies and people have become obsessed with GUIs, and Nagios/Cacti were falling out of favor, so people started just dumping in ELK/Graylog so randoms could quickly and easily get that GUI... without considering the resource requirements and lack of scalability. (Graylog2 fixed a lot of the Graylog problems.)

It seems a lot of managers also started wanting GUI dashboards, which is probably the big reason behind the push. Regardless, I don't think the trend is going away, so the market is ripe for disruption by something that does the same thing but faster and with fewer resources. The real problem is that most of the competitors for some reason decided proprietary was the way to go, and the people using these tools don't want more proprietary bullshit in their stack.

The feature that makes them different from other web-gui-graphing tools though is the search/query customization.

I'm not sure why the Elasticsearch documentation recommends such memory usage. For small apps I have had Elasticsearch (version 1.x) running for years: it's been on a shared 2GB virtual machine for the past 4 years. I've had to restart it a few times, but seriously, you don't need 32GB or 64GB. It depends on your use case.

Originally the most popular (and only) use case for Elasticsearch was full text search. Then a bunch of people thought it was good for analyzing logs and decided to hijack the brand and market it as a logging analyzer.

But at its core, Elasticsearch is meant for full text search and analytics. Not logging. Logging is not even an interesting use case.

You don't actually need 64GB or 32GB or whatever of RAM. The docs should be clearer. What they mean to say is that if you have a large enough dataset in production, 64GB of RAM per node is the ideal maximum size. That is because roughly 32GB is the largest Java heap that can still use compressed 32-bit object pointers. So 32GB for the Java heap and 32GB for everything else. More RAM is still generally better, though, because the OS cache will use it, making queries faster overall.
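The rule of thumb above fits in a one-line function. This is only a sketch: the function name is made up, and the 31GB cap is an assumed safety margin, since the exact compressed-pointer cutoff sits a little below 32GB and varies by JVM.

```python
def suggested_heap_gb(node_ram_gb: float, compressed_oops_limit_gb: float = 31.0) -> float:
    """Half of the node's RAM for the JVM heap, capped under the ~32GB
    compressed-pointer limit; the other half is left to the OS file cache."""
    return min(node_ram_gb / 2, compressed_oops_limit_gb)


# A 64GB node lands at the cap; small nodes simply get half their RAM.
for ram in (2, 8, 64, 128):
    print(f"{ram}GB node -> {suggested_heap_gb(ram)}GB heap")
```

So a 64GB box is the point where both halves of the rule max out, and the same rule gives a small heap on a small machine.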

I think ES's Java heap default is 1 or 2 GB, and that is more than enough for many use cases. Heap-heavy operations like sorting and aggregations may need more RAM depending on the index size. As far as I know, search isn't heap-heavy so you only need more RAM as query volume increases or index size increases.

For crashes, what version? Version 6.x has never crashed on me, but previous versions did have a tendency to crash for me.

Elasticsearch is a search engine built on Apache Lucene and basically makes it much more usable. It's similar to Apache Solr but easier to get up and running.

It's good at text but can be used as a similarity search system, so you can index and find similar images, audio waveforms, binary data, etc.

Over the years, search queries have become useful for log analytics and so the ELK stack has been developed to become a single solution to do both text search as well as ingest logs and run all kinds of queries, aggregations, and even machine learning.

I think that's what fundamentally hurts it as they are very separate use-cases trying to be served by the same system. We still use ES for search but wish it just focused on that with much better performance, reliable clustering and proper transactions instead.

By default elasticsearch stores data in 3 ways: rowstore (original doc/row), inverted-index (for fast queries), doc-values (columnar store for aggregations). So you need to configure it for what you want, for each field.
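A toy illustration of those three forms, using plain Python structures rather than anything Elasticsearch or Lucene actually does internally. The sample documents and field names are invented.

```python
from collections import defaultdict

docs = [
    {"level": "error", "host": "web1", "msg": "disk full"},
    {"level": "info", "host": "web2", "msg": "disk ok"},
    {"level": "error", "host": "web2", "msg": "timeout"},
]

# 1. Row store: the original document, fetched by id to display a hit.
row_store = dict(enumerate(docs))

# 2. Inverted index: term -> ids of documents containing it (fast lookup).
inverted = defaultdict(set)
for doc_id, doc in row_store.items():
    for word in doc["msg"].split():
        inverted[word].add(doc_id)

# 3. Columnar values: one array per field, indexed by doc id (fast aggregation).
columns = {field: [doc[field] for doc in docs] for field in ("level", "host")}

# Search uses the inverted index: ids of docs mentioning "disk".
matching = sorted(inverted["disk"])

# Aggregation walks the columns without touching the full documents.
errors_per_host = defaultdict(int)
for level, host in zip(columns["level"], columns["host"]):
    if level == "error":
        errors_per_host[host] += 1
```

Each field kept in all three forms is stored roughly three times, which is where a lot of the overhead comes from.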

While you can do all of that with plain pg and having no indexes and doing table-scans, it will be slow when you search (depends on how often you need to search and type of queries).

So es is very good at some stuff compared to every other type of db. For logging, yeah, there are some companies who just do full-scan on each query and are fine.

> plain pg and having no indexes

but you can have indexes in pg..

Yes, but I was talking about the overhead.

The cloud.elastic.co recommendations for 'production'-grade clusters suggest 4GB and upwards.