by bkeroack 4065 days ago
In my experience, Elasticsearch is the single most common source of infrastructure downtime and service failure. It's basically my arch nemesis.
5 comments

I am interested to hear a bit more about this, as I find it hard to believe. I have only run it at a fairly small scale: 8 servers, around 300 million documents indexed a day, peak index rate of 30k docs/sec. I found that you have to monitor it correctly, tune the JVM slightly (mostly GC), give it fast disks and lots of RAM, and use the correct architecture (search, index & data nodes) to get the most out of it. Once I did that, it was one of the most reliable components of my infrastructure, and still is. I would recommend chatting to people on the Elasticsearch IRC channel or mailing list; everyone there was a great help to me.
The full explanation deserves a blog post, but in a nutshell it revolves around the issue that ES contains a huge amount of complexity around a feature that is actually fairly useless (the "elastic" part) or at least difficult to use correctly. I've found that you need to be a deep expert in ES to architect and run it properly (or have access to such expertise) and even then it requires regular care and feeding to maintain uptime. In a short-deadline startup world you probably won't have time for any of that--once it's working it will lull you into a false sense of security and then completely blow up a few weeks/months later.
Same here. A single node failure has led to the whole cluster crashing down around me on more than one occasion.
Really? Perhaps I was never running it at a large enough scale, but even pre-v1.0 I've basically never had any trouble with it (outside of operational concerns like occasionally confusing query syntax). Then again, I never had more than 11 servers in the cluster, so I may simply never have run into the problems that show up at scale.
While I don't necessarily disagree, I do find that this depends entirely on how ES is used. All too often people dive headfirst into using Elasticsearch in ways it really should not be used.
It can't be worse than RabbitMQ... can it?