by bryanh 4515 days ago
The thing that worried me the most about Elasticsearch was how fragile it got around the limits of its performance. Run out of memory because of a nasty query? Boom, data corrupted. I hope you weren't using it as your primary persistence layer...

Otherwise, we love ES. The other comment about it being a better Mongo than Mongo rings true. With the backup/restore API and some of the circuit breakers, I'm hopeful that my fears will be abated.


FWIW, this is a place ES devs are spending a lot of time thinking about. For example, 1.0 introduces a new "Circuit Breaker" [1] feature which will help prevent over-eager facets from blowing out the heap. It's just one part of a very large effort to make ES handle exceptional events more gracefully (in particular, memory related).

Another example is disk-based doc values [2], which are essentially pre-computed field data structures that are stored on disk. This moves Field Data off heap and allows the OS to manage memory evictions, to help minimize GCs and OOM blowouts.

[1] http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

[2] http://www.elasticsearch.org/blog/disk-based-field-data-a-k-...
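To make the circuit breaker idea concrete, here is a minimal sketch of the general pattern: estimate the memory a request needs, and refuse it up front if that would exceed a limit, rather than letting the heap blow out. This is not Elasticsearch's implementation; the class name, method names, and byte accounting are all hypothetical, chosen only for illustration.

```python
class CircuitBreakingError(Exception):
    """Raised when a request would push estimated memory use past the limit."""


class MemoryCircuitBreaker:
    """Toy breaker (hypothetical): tracks estimated bytes, rejects over a limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, estimated_bytes):
        # Check before allocating, so an oversized request fails fast
        # instead of exhausting the heap.
        projected = self.used_bytes + estimated_bytes
        if projected > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {projected} bytes, limit is {self.limit_bytes}"
            )
        self.used_bytes = projected

    def release(self, estimated_bytes):
        # Hand the reservation back once the data is no longer held.
        self.used_bytes = max(0, self.used_bytes - estimated_bytes)
```

The point of the design is that a rejected request costs one failed query, while an unchecked one can take the whole node down.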

Ditto open file handles, which are easy to exhaust when aggressively over-sharding. Not an uncommon mistake for the enthusiastic newbie.
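A rough back-of-the-envelope sketch of why over-sharding eats file handles: each shard is a Lucene index made of segments, and each segment is several files on disk. The function names and the numbers below are assumptions for illustration, not measurements from any real cluster.

```python
def estimate_open_files(shards_on_node, segments_per_shard, files_per_segment):
    """Rough upper bound on index files a node may hold open at once."""
    return shards_on_node * segments_per_shard * files_per_segment


def fits_within_limit(estimated_files, fd_limit, headroom=0.5):
    """True if the estimate stays under a fraction of the descriptor limit.

    The headroom factor (assumed here) leaves room for sockets, logs,
    and files that are open briefly during segment merges.
    """
    return estimated_files <= fd_limit * headroom


# Hypothetical node: 200 shards, ~20 segments each, ~10 files per segment.
estimate = estimate_open_files(200, 20, 10)
ok = fits_within_limit(estimate, fd_limit=65536)
```

The estimate grows linearly with shard count, so doubling the shards on a node doubles the pressure on its file descriptor limit.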

Having supported Solr/ES/Lucene in production for 4+ years now (websolr.com / bonsai.io) I would be pretty hesitant to trust Lucene in general as a primary data store. Beautiful for secondary indexing, but otherwise, Why Not Postgres?™ ;)

Complexity. Having two copies of the data means more dev time, more resources required to shift the data around, etc. Having just 1 data store that can also handle all your searching is like the holy grail. As you say, not sure if Solr/ES/Lucene are there yet - but they're definitely very very close. There is no theoretical barrier either - it just comes down to closing bugs, and the ES/Lucene team are very good at closing bugs.

EDIT: I don't think MongoDB is there yet either. There are definite benefits and drawbacks between Postgres and ES, tipping heavily towards Postgres for structured, write-heavy data. But for ES and MongoDB? I think MongoDB falls a bit short there.

Sometimes, I actually find it easier to have more systems that do their job really well and sync things between them, rather than trying to get a single system to do everything.

For example, Postgres lets you reason about integrity, atomicity and transactional boundaries, and whether things are really safely stored with synchronous replication. If Postgres returns after a commit, I trust it. However, that requires me to have two servers working, which is harder to keep highly available.

ZooKeeper, on the other hand, I can rely on being available. But that's not really something you want to be putting lots of load on, nor try to do anything but trivial "queries". And the more servers you add, the slower writes get.

I don't trust Elasticsearch enough for those tasks, yet I wouldn't want to do searches in Postgres (Yep, I'm familiar with tsearch) even though it can. Elasticsearch is simple to scale out and awesome for searching.

Logs and metrics we shove straight into Elasticsearch, however. Other things go from ZooKeeper to Postgres and then to Elasticsearch, or from just Postgres to Elasticsearch.
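A minimal sketch of that "primary store feeds the search index" arrangement, with the index treated as disposable and rebuildable from the source of truth. To stay self-contained it uses sqlite3 as a stand-in for Postgres and an in-memory inverted index as a stand-in for Elasticsearch; the table name, function names, and sample rows are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def build_index(conn):
    """Rebuild the whole search index from the primary store."""
    index = defaultdict(set)
    for doc_id, body in conn.execute("SELECT id, body FROM docs"):
        for token in body.lower().split():
            index[token].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))


# The primary store owns the data and the transactional guarantees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO docs (id, body) VALUES (?, ?)",
        [
            (1, "Elasticsearch is great for search"),
            (2, "Postgres is great for transactions"),
        ],
    )

# The secondary index can be thrown away and rebuilt at any time.
index = build_index(conn)
```

If the index is lost or corrupted, calling `build_index(conn)` again recovers it, which is exactly why losing the search side is tolerable when it is not the primary store.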

Separate tools for separate jobs. I'm one of the co-founders of www.found.no, one of the hosted Elasticsearch providers. We absolutely love Elasticsearch and find new use cases for it all the time, but it's not going to be the one store to rule them all, at least not very soon.

I'd like to point out that two competing founders of hosted Elasticsearch as a service agree: ES is great, but not a general-purpose data store :-)
Hi, Nick. :)

Indeed!

That said, it's great that more people are picking up Elasticsearch for new exciting things.

Elasticsearch has really pushed what constitutes a "search problem", and deserves lots of kudos for that! :)

Sure, that's a fair point. Data consistency and reliability in ES and Lucene will only get better over time.

But I personally suspect Lucene won't ever get away from the dreaded "just reindex." And to the larger point, I think the recent resurgent interest in data stores and distributed systems has shown pretty clearly that there is no holy grail. No single data store can provide all the semantics necessary for all use cases. Maybe not even for most use cases. There are just too many tradeoffs to consider.

Believe me, I earn a living hosting Elasticsearch, so I'd love to see it become a robust primary data store. There are some use cases where it actually does make sense—just look at the amazing traction ES is experiencing for storing and indexing time-series data.

But as a general-purpose primary store, I'm not really holding my breath. Maybe I'm just becoming battle-worn and bitter. I would love to be proven otherwise over the next few years!

I'd like to learn from you about "general-purpose primary store". Do you mean for storing any type of data? Here is what I think regarding the case you brought up in the previous post:

ES is suitable for full-text document indexing for enterprise-level sites or any website that has a reasonable amount of data to index in a given timeframe. A complete re-index won't take a couple of days.

So the basic idea behind a NoSQL database is to dump the data into the database quickly and return, so you see very fast responses for inserts and deletes. Then it loads the data into memory to process it for real-time retrieval, which also produces fast responses for selects. I'm not sure about updates.

If the data volume grows, they quickly add shards or make the number of pre-shards big enough to allocate enough memory resources to handle the queries, or let the OS swap memory by adding more server nodes.

So if you want to use a NoSQL database, you must be bound by its system requirements, make your application fit them, and take the most advantage of it. Otherwise, if you are running a highly structured data store, it is better to use a relational database.

Another point is: if the documents are collected from the web, as with a search engine, NoSQL will not fit the large volume of data, and a relational database is also used to store the indexed data for fast retrieval. I guess this is what you meant by "general-purpose primary store".

Correct me if I'm wrong.

I think the distinction made in this comment is valuable, and echoes our experience. ES is not (at least so far) suitable as a general-purpose data store, but it is suitable (and very good) for more than search. For some use cases, it's the best available.
I think the CSS for bonsai.io is not loading.
Thanks. I screwed up our CDN settings while trying to push out some changes. Working on that :-)