by unethical_ban 158 days ago
I work in infosec, and several popular platforms use Elasticsearch for log storage and analysis.

I would never. Ever. Bet my savings on ES being stable enough to always be online to take in data, or predictable in retaining the data it took in.

It feels very best-effort, and as a consultant I recommend orgs use some other system for retaining their logs, even a raw filesystem with rolling zips, before relying on ES, unless they have a dedicated team constantly monitoring it.
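
For illustration, the "raw filesystem with rolling zips" option can be sketched with nothing but Python's standard library. This is a minimal sketch, not a recommendation of specific settings: the function names, size limit, and backup count are all made up for the example, and it leans on the `rotator` and `namer` hooks of `logging.handlers.RotatingFileHandler`.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source: str, dest: str) -> None:
    """Compress the just-closed log file into dest, then delete the original."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def gzip_namer(default_name: str) -> str:
    """Turn 'app.log.1' into 'app.log.1.gz' so archives are recognizable."""
    return default_name + ".gz"


def build_rolling_logger(path: str, max_bytes: int = 1_000_000,
                         backups: int = 5) -> logging.Logger:
    """Hypothetical helper: a logger that rolls over by size and gzips old files."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.rotator = gzip_rotator  # called on every rollover
    handler.namer = gzip_namer      # decides the archive file name
    logger = logging.getLogger("rolling_archive:" + path)
    logger.setLevel(logging.INFO)
    logger.propagate = False        # keep records out of the root logger
    logger.addHandler(handler)
    return logger
```

With `backups=5` the oldest archive is dropped on rollover, so real retention would need a much larger count or a separate job that ships archives elsewhere.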

5 comments

Do you happen to know if ES was the only storage? It's been almost 8 years, but if I were building a log storage and analysis system, I'd push the logs to S3 or some other object store and build an ES index off of that S3 data. From the consumer's perspective, it may look like we're using ES to store the data, but we have a durable backup to regenerate ES if necessary.
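
A rough sketch of that shape, with stand-ins so it runs without any services: `DurableLogStore` plays the role of S3 (here just newline-delimited JSON files on disk) and `SearchIndex` plays the role of the ES index (here an in-memory token map). Both classes and the `ingest` / `rebuild_index` helpers are hypothetical names for this example, not any real client API.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


class DurableLogStore:
    """Stand-in for an object store: one NDJSON file per batch."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def append_batch(self, name: str, records: List[dict]) -> None:
        path = self.root / f"{name}.ndjson"
        with path.open("w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")

    def scan(self) -> Iterator[dict]:
        """Yield every stored record, batch by batch in name order."""
        for path in sorted(self.root.glob("*.ndjson")):
            with path.open(encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)


class SearchIndex:
    """Stand-in for a search index: token -> matching records."""

    def __init__(self) -> None:
        self.by_token: Dict[str, List[dict]] = {}

    def add(self, record: dict) -> None:
        for token in set(record["message"].lower().split()):
            self.by_token.setdefault(token, []).append(record)

    def search(self, token: str) -> List[dict]:
        return list(self.by_token.get(token.lower(), []))


def ingest(store: DurableLogStore, index: SearchIndex,
           batch_name: str, records: List[dict]) -> None:
    """Write to the durable store first; the index is a derived view."""
    store.append_batch(batch_name, records)
    for record in records:
        index.add(record)


def rebuild_index(store: DurableLogStore) -> SearchIndex:
    """Regenerate the index from scratch using only the durable copy."""
    index = SearchIndex()
    for record in store.scan():
        index.add(record)
    return index
```

The point of the ordering in `ingest` is that losing the index never loses data: `rebuild_index` can always recreate it from the durable copy.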
Searchable snapshots in Elasticsearch can be backed by S3 and they perform very well. No need to store the data on hot nodes any longer than it takes for the index to do a rollover, and from then it's all S3.
Dunno, I've had three-node clusters running very stable for years. Which issues did you have that require a full team?
Even most toy databases "built in a weekend" can be very stable for years if:

- No edge case is thrown at them

- No part of the system is stressed (software modules, OS, firmware, hardware)

- No plug is pulled

Crank the requests to 11 or import a billion rows of data with another billion relations and watch what happens. The main problem isn't the system refusing to serve a request or throwing "No soup for you!" errors, it's data corruption and/or wrong responses.

I'm talking about production loads, but thanks.
Production loads mean a lot of different things to a lot of different people.
To be fair, I think it is chronically underprovisioned clusters that get overwhelmed by log forwarding. I wasn't on the team that managed the ELK stack a decade ago, but I remember our SOC having two people whose full-time job was curating the infrastructure to keep it afloat.

Now I work for a company whose log storage product has ES inside, and it seems to shit the bed more often than it should - again, could be bugs, could be running "clusters" of 1 or 2 instead of 3.

There are no 2-node clusters (it needs a quorum). If your setup has 2-node clusters, someone is doing this horribly wrong.
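
The quorum point is just majority arithmetic. A minimal sketch of the general rule follows; it is not Elasticsearch's actual voting code, and the function names are made up for the example.

```python
def majority(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - majority(nodes)


# 2 nodes: majority is 2, so losing either node loses the quorum.
# 3 nodes: majority is 2, so one node can fail.
for n in (1, 2, 3, 5):
    print(n, majority(n), tolerated_failures(n))
```

Going from one node to two adds no failure tolerance under this rule, which is why three is the usual minimum.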
I'm not even sure "get overwhelmed" is a problem, unless you need real time analytics. But yeah, sounds like a resources issue.
> I work in infosec and several popular platforms use elasticsearch for log storage and analysis.

Storing logs in Elasticsearch is just stupid, as it does not preserve order:

https://logstash.jira.com/browse/LOGSTASH-192

You have to slap something durable and a queue in front of it.

Elastic’s own consultants will tell you this …
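
To make "something durable and a queue in front" concrete, here is a minimal sketch under stated assumptions: a hypothetical file-backed `Journal` stamps every record with a sequence number before anything is forwarded, and a hypothetical `drain` helper forwards records in order and stops at the first failure. The `send` callback stands in for whatever actually writes to the index; none of these names come from a real library.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterator


class Journal:
    """Append-only file; each line is a JSON record with a sequence number."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.touch(exist_ok=True)
        with self.path.open(encoding="utf-8") as f:
            self.next_seq = sum(1 for _ in f)  # resume after a restart

    def append(self, message: str) -> int:
        seq = self.next_seq
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps({"seq": seq, "message": message}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to disk before acknowledging
        self.next_seq += 1
        return seq

    def read_from(self, offset: int) -> Iterator[dict]:
        """Yield records whose sequence number is at least offset, in order."""
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["seq"] >= offset:
                    yield record


def drain(journal: Journal, offset: int,
          send: Callable[[dict], bool]) -> int:
    """Forward records in order; stop at the first failure and return
    the offset to resume from, so nothing is skipped or reordered."""
    for record in journal.read_from(offset):
        if not send(record):
            break
        offset = record["seq"] + 1
    return offset
```

Because the sequence number is stored with each record, the original order can be recovered by sorting on `seq` even if the downstream store returns documents in a different order.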

Meh, I run hundreds of ES nodes. It's gotten a lot more friendly these days, but yes, it can be a bit unforgiving at times.

Turns out running complicated, large distributed systems requires a bit more than a ./apply. Who would have guessed?