by trengrj 1206 days ago
I work at Weaviate, a few comments on why we implemented hybrid search [1].

- Running two separate systems for traditional BM25 and vector search and keeping them in sync is pretty difficult from an operations perspective. A combined system is much easier to manage and will have better end-to-end latency.

- For combining scores, a linear combination like the one this article suggests is not recommended. Rank fusion https://rodgerbenham.github.io/bc17-adcs.pdf is used instead, where you care about what each method ranks first rather than the absolute score.

- The point of combining both search methods is to deal with what researchers term "out of domain data", meaning datasets the model producing the vectors was not trained on. Research from Google https://arxiv.org/abs/2201.10582 suggests hybrid search with rank fusion helps in this case by around 20.4%. For "in domain" data, the model (usually transformer based) will outperform BM25.

- A cross encoder [2] is a good component to add to improve relevance. It will only rerank the final results, though, so if the initial search returns 100 garbage results the cross encoder won't be able to help.

[1] https://weaviate.io/blog/hybrid-search-explained [2] https://www.sbert.net/examples/applications/cross-encoder/RE...
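To make the rank fusion point concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common rank-fusion method. Whether it matches the exact variant Weaviate or the linked paper uses is an assumption; the function name, the document ids, and the constant `k=60` are illustrative choices, not taken from the comment.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids using only rank positions.

    Raw BM25 or cosine scores are never looked at, so the two score
    scales do not need to be comparable.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks add more.
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical result lists from the two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused_ranking)
```

Documents that both retrievers rank highly float to the top, while a document found by only one method still makes the fused list.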

2 comments

The latency of a combination of parallel systems is equal to that of the slowest component. And obviously, specialized tools will be faster than a component of a multi-tool system, because while dedicated engines can invest in optimizing specific functionality, multi-tool engines are stuck in integration hell.
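The latency claim can be illustrated with a toy simulation. This is only a sketch: `fake_search` and its delays are made-up stand-ins for a keyword engine and a vector engine, and it ignores network overhead and the merge step.

```python
import asyncio
import time


async def fake_search(name, delay_s):
    # Hypothetical backend: sleeps to simulate query time.
    await asyncio.sleep(delay_s)
    return name


async def hybrid_query():
    start = time.perf_counter()
    # Both backends are queried concurrently, not one after the other.
    results = await asyncio.gather(
        fake_search("bm25", 0.1),
        fake_search("vector", 0.3),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(hybrid_query())
print(results, f"{elapsed:.2f}s")
```

The total wall time tracks the slower (0.3 s) call rather than the 0.4 s sum, which is the fan-out behaviour described above.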

A solution assembled for a specific task from highly specialized components will always be more optimal than 'one-size-fits-all' pipelines. Meilisearch solves search-as-you-type better than anyone else, so why compromise? Not to mention that the scalability pattern of BM25 and vector search is entirely different.

This, by the way, is pretty obvious from the fact that you don't publish comparative benchmarks.

The cross-encoder will only rerank the results, that's right. And you're also correct that if the initial search returns 100 garbage results, it won't be able to help. But that's true for any reranking method. Even the rank fusion you use will rerank only the results returned by keyword and vector search. So what is the advantage of it over cross-encoders?
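The limitation both sides agree on, that a reranker can only reorder what retrieval already returned, can be sketched as follows. `toy_overlap_score` is a deliberately crude, hypothetical stand-in for a cross-encoder or any other scoring function; the documents and query are invented for illustration.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Reorder an existing candidate list; it never adds documents."""
    ordered = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ordered[:top_k]


def toy_overlap_score(query, doc):
    # Hypothetical scorer: number of words shared by query and document.
    return len(set(query.lower().split()) & set(doc.lower().split()))


query = "hybrid search bm25"
relevant = "hybrid search combines bm25 and vector search"

# First-stage retrieval missed the relevant document entirely.
candidates = ["how to bake bread", "bread prices in 2020", "history of ovens"]

reranked = rerank(query, candidates, toy_overlap_score)
print(reranked)
```

However good the scorer is, `relevant` cannot appear in `reranked` because it was never in `candidates`; the same holds for rank fusion applied to the keyword and vector result lists.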
Adding a cross-encoder to your app means including pytorch/transformers and a model as dependencies. For people using the OpenAI or Cohere embedding APIs and lightweight infrastructure, this can be a big pain.
The proposed architecture doesn't limit you to self-hosted transformers; you can use OpenAI just as easily. And you don't need to install a "module" for that.