by trengrj
1206 days ago
I work at Weaviate; a few comments on why we implemented hybrid search [1].

- Running two separate systems for traditional BM25 and vector search and keeping them in sync is pretty difficult from an operations perspective. A combined system is much easier to manage and will have better end-to-end latency.
- For combining scores, a linear combination like the one this article suggests is not recommended. Rank fusion https://rodgerbenham.github.io/bc17-adcs.pdf is used instead, where you care about what each method ranks first rather than the absolute scores.
- The point of combining both search methods is to deal with what researchers term "out of domain data": datasets the model producing the vectors was not trained on. Research from Google https://arxiv.org/abs/2201.10582 suggests hybrid search with rank fusion helps in this case by around 20.4%. For "in domain" data, the model (usually transformer based) will outperform BM25.
- A cross encoder [2] is a good component to add to improve relevance. It only reranks the final results, though, so if the initial search returns 100 garbage results the cross encoder won't be able to help.

[1] https://weaviate.io/blog/hybrid-search-explained
[2] https://www.sbert.net/examples/applications/cross-encoder/RE...
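
To make the rank-fusion point concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common rank-fusion method. The comment does not say which variant is used, so RRF itself, the function name, the document ids, and the constant k=60 are all assumptions for illustration. Each result list contributes only the position of a document, never its raw BM25 or vector score.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Only the position of each document in each list is used, so the
    incomparable raw scores of BM25 and vector search never get mixed.
    k is a smoothing constant (60 is a conventional default, assumed here).
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical result lists from the two retrieval methods.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused_ranking)
```

Documents that both methods rank highly (doc_a and doc_c here) rise to the top, while a document found by only one method still appears further down the fused list.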
|
A solution assembled for a specific task from highly specialized components will always beat a 'one-size-fits-all' pipeline. Meilisearch solves search-as-you-type better than anyone else, so why compromise? Not to mention that the scaling patterns of BM25 and vector search are entirely different.
This, by the way, is pretty obvious from the fact that you don't publish comparative benchmarks.