by eskibars 1322 days ago

> I'm sure people have thought about these things: what technical challenges exist in improving search in these areas? is it just a matter of integrating engines like the one that's linked here? Or maybe keyword searches are often Good Enough, so no one is really clamoring for something better

Several technical challenges.

First, keyword search has decades of history behind it, and that history has produced a huge amount of tuning that users have gotten used to (mostly for the better in terms of results, mostly for the worse in terms of configuration difficulty). For example, keyword systems have evolved to support synonyms (unidirectional and bidirectional), a huge number of stemming algorithms for various languages (and some that cross languages), dictionaries for decompounding, various ngram/shingling methods, phrase matching and term overlap analysis, and the ability to combine all of these with tunable weights. As a result, a lot of keyword systems have stayed "as good as it gets" for a long time. People generally like fiddling with these knobs and dials because it gives them a sense of control... until they realize the combinatorial mess they've gotten themselves into, where they're essentially acting as human hyperparameter tuners. Recently, some automated systems have appeared that take the "human" out of that loop, but even then, most systems aren't set up to "learn" which synonyms to introduce, whether/when/how to take word order into account, and in particular when these signals should be combined and when they shouldn't.
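To make the "knobs and dials" concrete, here is a minimal sketch of a hand-tuned keyword scorer. Everything in it is hypothetical and chosen for illustration: the synonym tables, the naive suffix-stripping `stem` function (a stand-in for a real stemmer), and the `w_term`, `w_synonym`, and `w_phrase` weights are assumptions, not the configuration of any real engine.

```python
import re

# Hypothetical synonym configuration, purely for illustration.
BIDIRECTIONAL_SYNONYMS = [{"laptop", "notebook"}]
# Unidirectional: a query for "tv" also matches "television", not the reverse.
UNIDIRECTIONAL_SYNONYMS = {"tv": {"television"}}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Very naive English suffix stripper (stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def expand(token):
    """Return the token plus any configured synonyms for it."""
    expanded = {token}
    for group in BIDIRECTIONAL_SYNONYMS:
        if token in group:
            expanded |= group
    expanded |= UNIDIRECTIONAL_SYNONYMS.get(token, set())
    return expanded


def shingles(tokens, n=2):
    """Return the set of word n-grams (shingles) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def score(query, doc, w_term=1.0, w_synonym=0.5, w_phrase=2.0):
    """Combine stemmed term matches, synonym matches, and phrase overlap."""
    q_tokens = tokenize(query)
    q_stems = [stem(t) for t in q_tokens]
    d_stems = [stem(t) for t in tokenize(doc)]
    d_stem_set = set(d_stems)

    total = 0.0
    for token, stemmed in zip(q_tokens, q_stems):
        if stemmed in d_stem_set:
            total += w_term  # direct (stemmed) match
        elif any(stem(s) in d_stem_set for s in expand(token)):
            total += w_synonym  # matched only through a synonym
    # Reward query word pairs that appear in the same order in the document.
    total += w_phrase * len(shingles(q_stems) & shingles(d_stems))
    return total
```

Even in this toy version there are three weights, two synonym tables, a stemmer choice, and a shingle size, and they all interact: `score("cheap laptop", "cheap laptop deals")` and `score("laptop cheap", "cheap laptop deals")` differ only because of the phrase weight. That interaction is the combinatorial tuning problem described above.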

Semantic large language models "solve" some of these problems (automatic synonyms, built-in linguistic understanding of root words, etc.) if you build them right, but they have a lot of hidden technical depth. Most people try to throw something like BERT into their search and find that hardware costs and complexity go through the roof in ways they weren't ready to handle. History also weighs on operators' expectations ("where's my synonym configuration?", etc.): the answers are either very different ("go through a fine-tuning step for your model") or, on most commercial platforms, sometimes nonexistent (how do you ensure only relevant results are returned?). And because semantic/large language models don't know everything in the world, out-of-the-box models still underperform keyword search on certain query types (those heavy on obscure names of people, etc.) until they're retrained.
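The "only relevant results" question comes up because a nearest-neighbor lookup over embeddings always returns its top k, whether or not anything is actually relevant. One common workaround is a similarity cutoff. The sketch below assumes hand-made 3-dimensional vectors in place of real model embeddings, and the `min_score` value of 0.7 is an arbitrary illustrative threshold, not a recommended setting.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, doc_vecs, k=3, min_score=0.7):
    """Return up to k (doc_id, score) pairs whose similarity passes the cutoff."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, s) for s, doc_id in scored[:k] if s >= min_score]


# Toy "embeddings", made up for this example (a real model would produce these).
DOCS = {
    "laptop-review": [0.9, 0.1, 0.0],
    "notebook-deals": [0.8, 0.2, 0.1],
    "banana-bread": [0.0, 0.1, 0.9],
}

hits = search([1.0, 0.0, 0.0], DOCS)
```

Without the cutoff, the unrelated `banana-bread` document would still come back as the third "result". Picking the cutoff is itself a tuning problem, since similarity scores are not calibrated across models or query types.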

There's good research and companies/products coming out, though, that are changing a lot of this. See https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGq... for example, where the BM25 rows are traditional keyword search and rows 8+ are zero-shot language models. In some of the recent developments, you can start to see semantic/neural/large language models outperforming keyword search on the things keyword used to be better at. My sense (though I'm biased) is that these solutions are going to rapidly evolve to eliminate many of these technical challenges.
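For readers who haven't seen what a BM25 baseline computes, here is a minimal sketch of the scoring function. It assumes the Lucene-style IDF variant and the commonly used defaults `k1=1.2` and `b=0.75`; the benchmark in the spreadsheet may use a different variant or parameters, and the function name and toy documents are made up for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every document against the query with BM25.

    docs_tokens is a non-empty list of token lists, one per document.
    """
    n_docs = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        total = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes for document length.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            total += idf * tf[term] * (k1 + 1) / denom
        scores.append(total)
    return scores


docs = [
    ["neural", "search", "engine"],
    ["keyword", "search"],
    ["banana", "bread", "recipe"],
]
scores = bm25_scores(["neural", "search"], docs)
```

Note that the score depends only on exact token overlap: a document about "semantic retrieval" would score zero for this query, which is the gap the zero-shot language models in the later rows are meant to close.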

Disclosure/source: I led product management for Elasticsearch for several years and am currently leading product management for Vectara (a neural search SaaS platform)