Hacker News new | ask | show | jobs
by CShorten 1201 days ago
1. Weaviate does all these things as well, please see - https://weaviate.io/blog/pulling-back-the-curtains-on-text2v.... Although there are interesting optimizations around fitting a tokenizer, these are all relatively simple things whereas the inverted index is more so the interesting thing where the optimizations of specialized systems come into play e.g. WAND scoring.

2. Following on #1 what optimizations does BM25 require that justify an entirely separate tool that requires maintenance of two separate search systems? Also helps that merge overhead to have both searches in the same system.

3. Any company's report of benchmarking itself against its competitors should be taken with a grain of salt.. this is obviously bad practice. The purpose of these benchmarks are to compare in this case hyperparameters of HNSW and in future works around the BEIR numbers I provided earlier, Hybrid Search performance.

4. Again, no company can seriously benchmark itself against its competitors due to obvious conflicts of interest. Maybe this competition could be hosted here - https://big-ann-benchmarks.com/index.html#organizers.

1 comments

Benchmark is open source - https://github.com/qdrant/vector-db-benchmark You are welcome to fork it and make your own measurements, if you suspect something.

Benchmarks like https://big-ann-benchmarks.com/index.html#organizers are good for comparing algorithms, but not engines. They are focused on a single use scenario and do not cover variety of possible applications. Like, for example, how filtering affect the performance.