by showerst 2616 days ago
I've seen a lot of ES competitor posts pop up on HN lately, and I think they're missing the point of Elastic.

If you only need very basic word search, ES is probably not worth the complexity in your stack, especially if you're already running a SQL database with decent plaintext search.

Where elasticsearch shines is in complex queries: "Show me every match where this field contains 'extinction' within 10 words of 'impact crater' but NOT containing 'oceanic' and the publish date is > last month and one of the subjects is anthropology"
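For illustration, here is a rough sketch of how that example query might look in the Elasticsearch query DSL, built as a plain Python dict. The field names (`body`, `publish_date`, `subjects`) are assumptions, "last month" is read as "newer than one month ago", and the span clauses match analyzed (typically lowercased) terms, so treat this as one possible encoding rather than the canonical one.

```python
import json


def build_query():
    """Build the example query as a plain dict (no client library needed)."""
    # "impact crater" as an exact two-word phrase, expressed with spans.
    impact_crater = {
        "span_near": {
            "clauses": [
                {"span_term": {"body": "impact"}},
                {"span_term": {"body": "crater"}},
            ],
            "slop": 0,
            "in_order": True,
        }
    }
    # 'extinction' within 10 positions of that phrase, in either order.
    proximity = {
        "span_near": {
            "clauses": [{"span_term": {"body": "extinction"}}, impact_crater],
            "slop": 10,
            "in_order": False,
        }
    }
    return {
        "query": {
            "bool": {
                "must": [proximity],
                "must_not": [{"term": {"body": "oceanic"}}],
                # Filters narrow the result set without affecting scoring.
                "filter": [
                    {"range": {"publish_date": {"gt": "now-1M"}}},
                    {"term": {"subjects": "anthropology"}},
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query(), indent=2))
```

The resulting dict would be sent as the body of a search request against whatever index holds the documents.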


Not to mention that Elasticsearch is excellent for non-text search.

One application I worked on indexes a Postgres database into Elasticsearch for live front-end queries. We index every single field, sometimes hundreds of fields in a single index. ES does this easily. Thanks to Lucene's quasi-columnar/quasi-LSM tree storage, new indexed fields aren't very expensive, and searches -- even fairly complicated ones -- are very fast.

ES is also extremely fast at aggregations. Even complex multi-level aggregations (e.g. group by date, then multiple nested buckets by different fields with "top k" results for each) take just a few hundred milliseconds for large, million-document datasets.
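As a sketch of what such a multi-level aggregation could look like, the request body below groups by month, then buckets by two different fields, with the top few documents per bucket. The field names (`publish_date`, `subjects`, `author`) and bucket names are made up for illustration, and `calendar_interval` assumes ES 7.x or later (older versions used `interval`).

```python
import json


def build_aggregation(top_k=3):
    """Return a search body with nested bucket aggregations as a plain dict."""
    return {
        "size": 0,  # skip the hits themselves, return only aggregation results
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": "publish_date",
                    "calendar_interval": "month",
                },
                "aggs": {
                    # First nested bucket: split each month by subject.
                    "by_subject": {
                        "terms": {"field": "subjects", "size": 10},
                        "aggs": {
                            "top_docs": {"top_hits": {"size": top_k}},
                        },
                    },
                    # Second nested bucket: split the same month by author.
                    "by_author": {
                        "terms": {"field": "author", "size": 10},
                        "aggs": {
                            "top_docs": {"top_hits": {"size": top_k}},
                        },
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_aggregation(), indent=2))
```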

Where ES has problems is in areas like replication, consistency and memory usage. It's very hard to tune ES; due to JVM GC and caches, it's basically impossible to predict how much RAM ES will need, and OOMs are common. There's also still no way to ask for a consistent index at query time; the best you can do is use "refresh=wait_for" on indexing, which is the wrong time for it. I'd love a consistent Raft-based ES.

Could you talk about the use case here? This is very interesting from a db query tuning perspective. What kind of queries work well in scenarios like this? I thought search engines are only useful for ranking-based searches, so you accept a degree of error margin compared to databases.
Any non-joining OLTP query will perform very well with ES. It is particularly effective with low-cardinality fields where in a traditional relational database you would not benefit from a B-tree index and a database like Postgres typically would revert to a sequential scan over the entire table. Column intersections in Lucene are extremely efficient, basically streaming sorted vectors of document IDs from RAM.
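To make the "streaming sorted vectors of document IDs" idea concrete, here is a toy sketch of intersecting two sorted posting lists. It is a deliberate simplification: the lists and field values are invented, and real Lucene postings are compressed and considerably more elaborate than plain Python lists.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of document IDs in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance whichever side is behind
        else:
            j += 1
    return out


# Hypothetical postings for two low-cardinality fields.
status_published = [1, 2, 4, 7, 9]
language_en = [2, 3, 4, 9, 10]

if __name__ == "__main__":
    print(intersect_sorted(status_published, language_en))
```

Each list is walked once, which is why filtering on several low-cardinality fields at the same time stays cheap.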

Where ES is not optimal is when you need joins. That said, doing left outer joins -- which is typical in web workloads where you may have something like an "articles" table that you want to query with filters and then join against "authors" and "categories" without filters to fetch connected data -- on the client side with some basic parallelization is surprisingly effective. Currently doing that in some apps where we get <100 millisecond performance even when fetching maybe 5-6 related objects per result.

Do you do the left outer join in Elasticsearch, or do you do it in the client code? I'm trying to figure out if Elasticsearch supports these query types. It's something I never thought about.
In client code. ES doesn't do joins (ignoring a rather weak feature called "nested documents").
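A minimal sketch of that kind of client-side left outer join, assuming the search hits carry `author_id` and `category_id` fields. The in-memory `AUTHORS` and `CATEGORIES` dicts and the two `fetch_*` helpers are hypothetical stand-ins for whatever lookups the real application performs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the related data stores.
AUTHORS = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
CATEGORIES = {10: {"title": "Geology"}}


def fetch_authors(ids):
    """Look up authors by ID; unknown IDs are simply absent from the result."""
    return {i: AUTHORS[i] for i in ids if i in AUTHORS}


def fetch_categories(ids):
    """Look up categories by ID; unknown IDs are simply absent from the result."""
    return {i: CATEGORIES[i] for i in ids if i in CATEGORIES}


def left_outer_join(hits):
    """Attach author and category to every hit, keeping hits with no match."""
    author_ids = {h["author_id"] for h in hits}
    category_ids = {h["category_id"] for h in hits}
    # Run the two lookups in parallel, one batched request per relation.
    with ThreadPoolExecutor(max_workers=2) as pool:
        authors_future = pool.submit(fetch_authors, author_ids)
        categories_future = pool.submit(fetch_categories, category_ids)
        authors = authors_future.result()
        categories = categories_future.result()
    return [
        {
            **h,
            "author": authors.get(h["author_id"]),        # None if missing
            "category": categories.get(h["category_id"]),  # None if missing
        }
        for h in hits
    ]


if __name__ == "__main__":
    hits = [
        {"title": "Craters", "author_id": 1, "category_id": 10},
        {"title": "Oceans", "author_id": 2, "category_id": 99},
    ]
    for row in left_outer_join(hits):
        print(row)
```

Batching the IDs means one lookup per relation rather than one per result, and a hit with no matching row is kept with `None`, which is what makes it a left outer join.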
This. The best benchmarking for search engines is:

    1. Does it return relevant results?
    2. Can it handle complex queries?
2) is only required in specific use-cases, but when it's needed it's _really needed_.

1) is the main measure users care about, and in my experience is best evaluated by building a search in each system with the same corpus and giving to subject-matter experts.

The classic metrics here are recall and precision. Does it return all of the results that it should and does it list the best results first.

Without a good search engine you might have the results you needed plus lots of other results. You'd have to scroll to page 20 of your results to actually see the result that you wanted, which means it wasn't very precise.
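A small sketch of how these two metrics are commonly computed, using precision at a cutoff k to capture the "best results first" part. The document IDs and relevance judgments are invented for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned results that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall(ranked, relevant):
    """Fraction of all relevant documents that were returned at all."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


# Hypothetical ranking: both relevant documents are returned, but at the bottom.
ranked = ["d7", "d2", "d9", "d1", "d4"]
relevant = {"d1", "d4"}

if __name__ == "__main__":
    print(precision_at_k(ranked, relevant, 2))  # nothing relevant in the top 2
    print(recall(ranked, relevant))             # every relevant doc was found
```

Here recall is perfect while precision in the top two results is zero, which is the "scroll to page 20" situation in miniature.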

Think of internet search engines pre-Google. With e.g. AltaVista you had great recall but extremely poor precision. You'd often be scrolling multiple pages of results. Google turned that around by having great precision and similar recall. They made it so good that they implemented the "I'm Feeling Lucky" button.

The trick with search is to have great precision and still good enough recall. That's super hard because what is precise is very subjective and highly dependent on your usecases, data, languages, etc.

This is why Elasticsearch is such a hugely complicated product: it includes a lot of solutions for essentially any use case you can imagine around search.

I have no experience with Redisearch; so I'll reserve my judgment. But this article is not doing it any favors.

There are competing things out there for Elasticsearch. Most of the serious ones also use Apache Lucene (e.g. Solr). Some of the upcoming ones are attempting to rebuild what Lucene does and may or may not be good enough depending on your use case. There have been some lucene ports over the years, including a C port. Most of those have fallen behind or are no longer maintained. The Java implementation is actually pretty good as is and has had a lot of performance and optimization work done to it over the years. You'd be hard pressed to build something as good and as fast without essentially using the same algorithms and reinventing a lot of the same wheels.

IMHO the current effort to build a search engine in Rust makes a lot of sense. The language is uniquely suited to doing the kinds of things Lucene does and they seem to be pretty serious about doing things properly.

Definitely. In my mind the very first questions you should ask when evaluating search are "Do I need faceted search?" "Do I need boolean logic? proximity? stemming?". Because the answers to those questions will cut the field way down.

That's why some of these benchmarks (redis and the go search engine posted last week) seem a little apples/oranges to me.

I'm not very familiar with Redis but wouldn't it shine at set-based operations like faceting? You could also pre-tokenize the input data to RediSearch with Lucene analyzers.
Not to mention spelling correction, synonyms, nested taxonomies, etc. Search is incredibly complex, and I always snort when I see someone trying to create one from scratch.
I was just going to ask: Will SQL work with spelling corrections?

I was under the impression that if you wanted to do auto-complete, you need to handle mis-spellings, and that ElasticSearch is one of the best options for this.

SQL doesn't. Some relational databases have full-text search extensions like Postgres and SQL Server but they offer the basic stemming and trigram stuff, no spelling or synonyms. You can get an autocomplete working using wildcard matches but you won't be able to recognize that a word is misspelled without maintaining your own dictionary.
AFAIK, SQL engines will do a fairly reasonable full-text search, but if you need anything more, you have to upgrade to a full-featured search engine.
This X 2. I feel all these people who feel anything else is a viable alternative to Elasticsearch have a dumb, simple, small-scale use case, where even full-text search over Postgres would suffice.
>> all these people who feel anything else is a viable alternative to Elasticsearch have a dumb, simple, small-scale use case

I have a search use case. I want to create a simple language model where each token in the lexicon gets a unique ID (or ordinal) that I can use to create a more sophisticated model where each document is represented as a vector as wide as there are unique tokens and use clustering and give each cluster a unique ID (or ordinal) so that I can create an even more sophisticated language model, one with built-in semantic understanding. A natural language data structure, if you will, with multiple layers. I want to store the entire WWW in such a model. So I'm building a language model framework that is not built on Lucene because I'm not obliged to use ES in that capacity.

I feel you are wrong to call my use case simple and small scale.

> Where elasticsearch shines is in complex queries ...

If the "Multi-tenant indexing benchmark" is accurate it seems like it might be a robustness concern for ES. "Elasticsearch crashed after 921 indices and just couldn’t cope with this load." -- does that mean memory exhaustion or some other crash? If it's the latter, it seems like a quality problem more than a performance one.

This is exactly why Elasticsearch has a soft limit of 1000 shards per node since version 7.0: https://www.elastic.co/guide/en/elasticsearch/reference/7.0/...

This benchmark used 4605 shards (5 per index) on a single node, which is way above the recommended number.

Also, to prevent oversharding, the default number of shards per index has been changed to 1 in 7.0.
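The arithmetic behind those numbers, as a quick sketch. The constants come straight from the comments above; the function name is just for illustration.

```python
INDICES = 921               # indices created before the benchmark crashed
SOFT_LIMIT_PER_NODE = 1000  # soft shard limit per node mentioned above


def total_shards(indices, shards_per_index):
    """Total primary shards for a number of equally sharded indices."""
    return indices * shards_per_index


if __name__ == "__main__":
    for shards_per_index in (5, 1):  # 5 as in the benchmark, 1 as in 7.0
        total = total_shards(INDICES, shards_per_index)
        verdict = "over" if total > SOFT_LIMIT_PER_NODE else "under"
        print(f"{shards_per_index} per index: {total} shards ({verdict} the limit)")
```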

Yeah, anyone creating 921 indexes in the same cluster hasn't read the ES docs[0]. Utilizing aliases and possibly routing is a significantly better design.

I think we can all agree that misusing a tool, after appropriate documentation has been published, shouldn't be considered a fault of the tool.

[0] https://www.elastic.co/guide/en/elasticsearch/guide/current/...

Very, very few customers actually have 921 indices in production. That is an insane amount, by a large factor.
Judging from what I see on irc and when I get called for “our ES cluster is on fire, can you put it out?”, 921 indices is not much. I sometimes joke that I could replace myself with a bot that answers “less indices, less shards” to each and every question about performance and that bot could solve 90% of the problems at a fraction of my cost. But alas, nobody wants to pay for a visit from my bot.
Each ES shard is actually a Lucene index, and it uses memory... why would anyone need thousands of indices on a single node?
What's the difference, memory-wise, between a single shard and two shards holding half the data each?
I'll give a very, very basic example of why two shards with "half" the data is less optimal. More complicated optimizations can be left as an exercise to the reader.

Let's pretend the only data structure within a Lucene shard is a Trie.

Given 4 strings, ["Hello", "World", "Help", "Thanks"]; A total of 20 chars.

With one shard, Lucene can utilize prefixing to find overlap between "Hello" and "Help". Meanwhile, "World" and "Thanks" are always fully stored. Resulting in a Trie of only 17 chars, i.e. a whopping (1 - 17/20) = 15% storage optimization!

With two shards Lucene potentially loses that optimization.

If the split is: ["Hello", "Help"], ["World", "Thanks"] then Lucene needs to store two Tries with 6 chars and 11 chars. Totaling: 17 chars and we still get a 15% optimization.

However, if the split is: ["Hello", "World"], ["Help", "Thanks"] then Lucene needs to store two Tries with 10 chars and 10 chars. Totaling: 20 chars for a 0% optimization :(
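The counts above can be checked with a toy trie that stores one node per character. This is only the simplified model from this comment, not how Lucene actually lays out its term dictionary.

```python
def trie_chars(words):
    """Count the character nodes needed to store all words in one trie."""
    root = {}
    count = 0
    for word in words:
        node = root
        for ch in word:
            if ch not in node:
                node[ch] = {}  # new node: this character is not shared
                count += 1
            node = node[ch]
    return count


WORDS = ["Hello", "World", "Help", "Thanks"]

if __name__ == "__main__":
    raw = sum(len(w) for w in WORDS)
    print("raw chars:", raw)
    print("one shard:", trie_chars(WORDS))
    print("good split:", trie_chars(["Hello", "Help"]) + trie_chars(["World", "Thanks"]))
    print("bad split:", trie_chars(["Hello", "World"]) + trie_chars(["Help", "Thanks"]))
```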

Now let's get back to reality, and remember that Lucene not only uses a LOT of optimizations (for both storage and query performance), but also (for many reasons) pre-processing the data to find optimal shard placement is generally not an option, and the amount of data being indexed is generally so large that these optimizations are extremely powerful.

Just to make sure this comment is never used out of context: Sharding is still extremely important, and using a single shard is only recommended if you have insignificant amounts of data.

I can’t quantify it in bytes, but a shard comes with quite a bit of baggage: it has a mapping and other data stored in the cluster state, some other bookkeeping data attached to it (where it lives, whether it is primary or replica, in sync or not, ...), and each shard allocates a chunk of the heap for index operations. That chunk’s size depends on whether the shard received writes or not and ranges from 5MB to 256(?)MB. The exact maximum varies from version to version and I don’t think it’s in the ES docs.
In a production setting, I wouldn't recommend doing ElasticSearch multi-tenancy in this manner. Indexes aren't free.
To counter some of the comments here, and after looking at the sources, the RediSearch module is pretty capable and matches a lot of the Elasticsearch features: https://github.com/RedisLabsModules/RediSearch

Agree with the general claim that this benchmark is poor though. A real study of complex searches with faceting, ranking and ordering against both databases in a distributed setup would be much more interesting.

> "Show me every match where this field contains 'extinction' within 10 words of 'impact crater' but NOT containing 'oceanic' and the publish date is > last month and one of the subjects is anthropology"

...and then aggregate into time-based buckets, and within each bucket split the results by this field, and then...

RediSearch can do all of that.
I've seen people use it as a key-value store, a time series database (I think they have some APM support too), and a NoSQL datastore.
What would be your go-to solution for a basic word search? Let's say you only have a few MBs of data, not GBs...
Full-text search from MySQL or another similar database. When that gets overloaded, then consider something like Elasticsearch. That is my rule.
A few MBs: just use Lucene in memory if you're using Java.