by wyldfire 2612 days ago
> Where elasticsearch shines is in complex queries ...

If the "Multi-tenant indexing benchmark" is accurate it seems like it might be a robustness concern for ES. "Elasticsearch crashed after 921 indices and just couldn’t cope with this load." -- does that mean memory exhaustion or some other crash? If it's the latter, it seems like a quality problem more than a performance one.

5 comments

This is exactly why Elasticsearch has a soft limit of 1000 shards per node since version 7.0: https://www.elastic.co/guide/en/elasticsearch/reference/7.0/...

This benchmark used 4605 shards (5 per index) on a single node, which is way above the recommended number.

Also, to prevent oversharding, the default number of shards per index has been changed to 1 in 7.0.
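
To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted in this thread (921 indices, 5 shards per index, a soft limit of 1000 shards per node) and assumes replicas are ignored; the variable names are purely illustrative.

```python
import math

# Figures quoted in the thread; replicas are ignored (assumption).
indices = 921
soft_limit_per_node = 1000

def total_shards(index_count, shards_per_index):
    """Total primary shards for a given number of indices."""
    return index_count * shards_per_index

old_default = total_shards(indices, 5)  # pre-7.0 default of 5 shards per index
new_default = total_shards(indices, 1)  # 7.0 default of 1 shard per index

# Nodes needed to stay at or under the soft limit with the old default.
nodes_needed = math.ceil(old_default / soft_limit_per_node)

print(old_default, old_default > soft_limit_per_node)   # 4605 True
print(new_default, new_default > soft_limit_per_node)   # 921 False
print(nodes_needed)                                     # 5
```

So with the 7.0 default the same 921 indices would sit just under the soft limit on one node, while the benchmark's layout would want at least five nodes.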

Yeah, anyone creating 921 indexes in the same cluster hasn't read the ES docs[0]. Utilizing aliases and possibly routing is a significantly better design.

I think we can all agree that misusing a tool, after appropriate documentation has been published, shouldn't be considered a fault of the tool.

[0] https://www.elastic.co/guide/en/elasticsearch/guide/current/...
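
As a hedged illustration of the alias-plus-routing design mentioned above, the sketch below only builds the JSON body one might send to the `_aliases` endpoint; it does not talk to a cluster. The shared index name `tenants`, the `tenant_id` field, and the `tenant-<id>` alias naming are all hypothetical choices for the example, not something from the benchmark or the docs.

```python
import json

def tenant_alias_action(tenant_id, shared_index="tenants"):
    """Build one 'add' action: a filtered, routed alias for a single tenant.

    The index name, alias pattern, and 'tenant_id' field are hypothetical.
    """
    return {
        "add": {
            "index": shared_index,
            "alias": f"tenant-{tenant_id}",
            # Only documents belonging to this tenant are visible via the alias.
            "filter": {"term": {"tenant_id": tenant_id}},
            # Route reads and writes for this tenant to a single shard.
            "routing": str(tenant_id),
        }
    }

# One shared index, many aliases, instead of one index per tenant.
body = {"actions": [tenant_alias_action(t) for t in (1, 2, 3)]}
print(json.dumps(body, indent=2))
```

Each tenant then queries its own alias as if it were a dedicated index, while the cluster only carries the shards of the one shared index.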

Very, very few customers actually have 921 indices in production. That is an insane amount, by a large factor.

Judging from what I see on IRC and when I get called for “our ES cluster is on fire, can you put it out?”, 921 indices is not much. I sometimes joke that I could replace myself with a bot that answers “less indices, less shards” to each and every question about performance, and that bot could solve 90% of the problems at a fraction of my cost. But alas, nobody wants to pay for a visit from my bot.

Each ES shard is actually a Lucene index, and it uses memory... why would anyone need thousands of indices on a single node?

What's the difference, memory-wise, between a single shard and two shards holding half the data each?

I'll give a very, very basic example of why two shards, each with "half" the data, are less optimal than one. More complicated optimizations can be left as an exercise to the reader.

Let's pretend the only data structure within a Lucene shard is a trie.

Given 4 strings, ["Hello", "World", "Help", "Thanks"]; A total of 20 chars.

With one shard, Lucene can utilize prefixing to find the overlap between "Hello" and "Help" (the shared prefix "Hel" is stored once, saving 3 chars). Meanwhile, "World" and "Thanks" are always fully stored. The result is a trie of only 17 chars, i.e. a whopping (1 - 17/20) = 15% storage optimization!

With two shards, Lucene potentially loses that optimization.

If the split is: ["Hello", "Help"], ["World", "Thanks"] then Lucene needs to store two Tries with 6 chars and 11 chars. Totaling: 17 chars and we still get a 15% optimization.

However, if the split is: ["Hello", "World"], ["Help", "Thanks"] then Lucene needs to store two Tries with 10 chars and 10 chars. Totaling: 20 chars for a 0% optimization :(
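
The counts above can be checked with a minimal sketch. This is a toy trie built from nested dicts that simply counts one stored char per node; it is only an illustration of the example in this comment, not how Lucene actually stores terms, and the function names are made up for the sketch.

```python
def trie_node_count(words):
    """Count the chars (nodes) stored in a toy trie built from `words`."""
    root = {}
    count = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                # A new node is only created when the prefix is not shared.
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


def stored_and_saved(shards):
    """Return (chars stored across all shards, fraction saved vs. raw chars)."""
    raw = sum(len(word) for shard in shards for word in shard)
    stored = sum(trie_node_count(shard) for shard in shards)
    return stored, 1 - stored / raw


layouts = {
    "one shard": [["Hello", "World", "Help", "Thanks"]],
    "lucky split": [["Hello", "Help"], ["World", "Thanks"]],
    "unlucky split": [["Hello", "World"], ["Help", "Thanks"]],
}

for name, shards in layouts.items():
    stored, saved = stored_and_saved(shards)
    print(f"{name}: {stored} chars stored, {saved:.0%} saved")
```

This prints 17 chars (15% saved) for the single shard and the lucky split, and 20 chars (0% saved) for the unlucky split, matching the numbers worked out by hand.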

Now let's get back to reality. Lucene uses a LOT of optimizations (for both storage and query performance), and the amount of data being indexed is generally so large that these optimizations are extremely powerful. Meanwhile, pre-processing the data to find an optimal shard placement is (for many reasons) generally not an option.

Just to make sure this comment is never used out of context: Sharding is still extremely important, and using a single shard is only recommended if you have insignificant amounts of data.

I can’t quantify it in bytes, but a shard comes with quite a bit of baggage. It has a mapping and other data stored in the cluster state, some other bookkeeping data attached to it (where it lives, whether it is a primary or a replica, in sync or not, ...), and each shard allocates a chunk of the heap for index operations. That chunk's size depends on whether the shard has received writes or not and ranges from 5MB to 256(?)MB. The exact maximum varies from version to version and I don’t think it’s in the ES docs.

In a production setting, I wouldn't recommend doing Elasticsearch multi-tenancy in this manner. Indexes aren't free.