by porker 2612 days ago
This. The best benchmarking for search engines is:

    1. Does it return relevant results?
    2. Can it handle complex queries?
2) is only required in specific use cases, but when it's needed it's _really needed_.

1) is the main measure users care about, and in my experience it is best evaluated by building a search in each system with the same corpus and giving it to subject-matter experts.

2 comments

The classic metrics here are recall and precision: does it return all of the results that it should, and does it list the best results first?

Without a good search engine you might have the results you needed plus lots of other results. You'd have to scroll to page 20 of your results to actually see the result that you wanted, which means it wasn't very precise.
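To make that concrete, here is a rough sketch of how you could score a single query once someone has judged which documents are relevant. The function name, the document ids, and the cutoff `k` are all made up for illustration; this is not any particular engine's API.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Score one query: ranked_ids is the engine's result order,
    relevant_ids is the set of documents a human judged relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Precision: what fraction of the results shown were relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: what fraction of all relevant documents were shown.
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical judgments for a single query.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d1", "d2", "d3"}

print(precision_recall_at_k(ranked, relevant, k=3))
print(precision_recall_at_k(ranked, relevant, k=5))
```

Going deeper into the result list (a larger `k`) can only hold or raise recall, but it usually drags precision down, which is the "scroll to page 20" problem in numbers.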

Think of internet search engines pre-Google. With e.g. AltaVista you had great recall but extremely poor precision. You'd often be scrolling through multiple pages of results. Google turned that around by having great precision and similar recall. They made it so good that they implemented the "I'm Feeling Lucky" button.

The trick with search is to have great precision and still good enough recall. That's super hard because what is precise is very subjective and highly dependent on your use cases, data, languages, etc.

This is why Elasticsearch is such a hugely complicated product: it includes a lot of solutions for essentially any use case you can imagine around search.

I have no experience with RediSearch, so I'll reserve my judgment. But this article is not doing it any favors.

There are competitors to Elasticsearch out there. Most of the serious ones also use Apache Lucene (e.g. Solr). Some of the upcoming ones are attempting to rebuild what Lucene does and may or may not be good enough depending on your use case. There have been some Lucene ports over the years, including a C port. Most of those have fallen behind or are no longer maintained. The Java implementation is actually pretty good as is and has had a lot of performance and optimization work done to it over the years. You'd be hard-pressed to build something as good and as fast without essentially using the same algorithms and reinventing a lot of the same wheels.

IMHO the current effort to build a search engine in Rust makes a lot of sense. The language is uniquely suited to doing the kinds of things Lucene does and they seem to be pretty serious about doing things properly.

Definitely. In my mind the very first questions you should ask when evaluating search are "Do I need faceted search?" and "Do I need boolean logic? Proximity? Stemming?", because the answers to those questions will cut the field way down.

That's why some of these benchmarks (Redis and the Go search engine posted last week) seem a little apples/oranges to me.

I'm not very familiar with Redis, but wouldn't it shine at set-based operations like faceting? You could also pre-tokenize the input data to RediSearch with Lucene analyzers.
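For what I mean by faceting as a set operation, here is a minimal sketch using plain Python sets. I'm assuming one set of document ids per (field, value) pair; the documents, field names, and helper functions are hypothetical, and this says nothing about how RediSearch actually implements facets. In Redis itself the same idea would map onto native sets and an intersection command such as SINTER.

```python
from collections import defaultdict


def build_facet_index(docs):
    """Map each (field, value) pair to the set of doc ids that carry it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            index[(field, value)].add(doc_id)
    return index


def facet_counts(index, matching_ids, field):
    """For one field, count how many matching docs fall under each value."""
    counts = {}
    for (f, value), ids in index.items():
        if f == field:
            n = len(ids & matching_ids)  # set intersection does the work
            if n:
                counts[value] = n
    return counts


# Hypothetical corpus and a hypothetical set of ids matching some query.
docs = {
    1: {"lang": "java", "license": "apache"},
    2: {"lang": "rust", "license": "mit"},
    3: {"lang": "java", "license": "mit"},
    4: {"lang": "c", "license": "apache"},
}
index = build_facet_index(docs)
matching = {1, 3, 4}

print(facet_counts(index, matching, "lang"))     # {'java': 2, 'c': 1}
print(facet_counts(index, matching, "license"))  # {'apache': 2, 'mit': 1}
```

Each facet count is just the size of an intersection between the query's result set and a precomputed set, which is why a store with fast set operations looks like a natural fit.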