I apologize for the repeated posts. The reason I posted it 6 times is that I aim to announce every release and significant commit. The reason my entire history is centered around this benchmark is that I wanted to introduce my project to the community, admittedly with some bias. I began my Hacker News journey at that time and wanted to share what I was working on.
Generally, these overly "proactive" moves to artificially draw attention to your own GitHub projects make me less likely to try them out, so I'd rather stay with the mainstream options.
Yeah... this, combined with the fact that this benchmark happens to rank their cloud offering the highest by a wide margin, sounds a bit like they are submitting it to market themselves.
The reason https://ann-benchmarks.com is so good is that we can see a plot of recall vs latency. I can see you have some latency numbers in the leaderboard at the bottom, but it's very difficult to make a decision based on those alone.
As a practitioner who works with vector databases every day, latency alone is meaningless to me, because I need to know if it's fast AND accurate, and what the tradeoff is! You can't have it both ways. So it would be helpful if you showed plots of this tradeoff, similar to ann-benchmarks.
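To make the tradeoff concrete, here is a minimal sketch of how one point on such a curve could be measured. It is not how ann-benchmarks or the benchmark under discussion works: the "index" is a toy stand-in that scores only a random fraction of the dataset, and the names `sweep`, `top_k`, and `recall_at_k` are made up for illustration. Recall@k is the share of the exact nearest neighbours that the approximate search returns.

```python
import random
import time

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, vectors, candidate_ids, k):
    """Return the ids of the k candidates closest to the query (exact distances)."""
    ranked = sorted(candidate_ids, key=lambda i: squared_distance(query, vectors[i]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbours that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def sweep(vectors, queries, k, fractions, seed=0):
    """For each candidate fraction, return (fraction, mean recall, mean latency in seconds)."""
    rng = random.Random(seed)
    all_ids = list(range(len(vectors)))
    # Ground truth: brute-force search over the whole dataset.
    exact = [top_k(q, vectors, all_ids, k) for q in queries]
    rows = []
    for frac in fractions:
        n = max(k, int(len(vectors) * frac))
        recalls, elapsed = [], 0.0
        for q, truth in zip(queries, exact):
            # Toy assumption: a random subset plays the role of an index probe.
            candidates = rng.sample(all_ids, n)
            start = time.perf_counter()
            found = top_k(q, vectors, candidates, k)
            elapsed += time.perf_counter() - start
            recalls.append(recall_at_k(found, truth))
        rows.append((frac, sum(recalls) / len(recalls), elapsed / len(queries)))
    return rows

if __name__ == "__main__":
    rng = random.Random(42)
    vectors = [[rng.random() for _ in range(16)] for _ in range(2000)]
    queries = [[rng.random() for _ in range(16)] for _ in range(20)]
    for frac, recall, latency in sweep(vectors, queries, k=10, fractions=[0.1, 0.25, 0.5, 1.0]):
        print(f"fraction={frac:.2f} recall@10={recall:.3f} latency={latency * 1e3:.2f} ms")
```

Each row is one (recall, latency) point; plotting the rows gives the kind of curve ann-benchmarks shows. A real system would sweep its own search-effort parameter instead of a random candidate fraction.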
Thanks for your suggestion; this is a very good question. I have been asked this several times, so please allow me to quote one of my responses in the repo:
"
With respect to recall vs. performance, your idea is indeed correct. However, several reasons have guided us to our current approach:
1. We are not solely benchmarking open-source systems; we are also focusing on cloud services. Some of these services, such as Zilliz and Pinecone, don't allow users to customize the parameters that tune recall, aiming to simplify their usage. Consequently, creating a recall vs. performance graph is not feasible for them. For systems that do allow tuning, this benchmark lets users customize the parameters and produce their own results for comparison.
2. There already exist a number of benchmarks doing what you've suggested, which target individuals with ANN search backgrounds. Our goal is to make this benchmark as straightforward as possible and to assist people who lack an understanding of the inner workings of each system.
3. Concerning reproducibility, generating the recall vs. QPS graph you mentioned would require running a multitude of tests to obtain enough data points, which considerably reduces reproducibility.
"
If they’re going to rank themselves so much higher than their competition, they might as well call that out up front and explain why the discrepancy is so large.
It's really hard to benchmark this sort of a thing. There are so many layers of caching and external factors that play into it, from all manner of sources including the operating system load and configuration, disk firmware, hardware configuration, and so forth; and the harder you try to isolate these effects, the farther you get from a realistic benchmark because all the factors that were removed are affecting real world performance in a big way.
This is a big reason why, for a long time, many large DBMS providers had clauses in their licenses prohibiting third-party benchmarks. You can fairly easily construct a benchmark that makes any given DBMS seem great or awful, and there's no such thing as an objective test.
Fully agree with this idea. Any trick can also be a real-world strategy, and it is impossible for anyone to claim that they have an absolutely fair benchmark.
So the only way we can approach it is to provide more realistic cases and set aside whatever tricks vendors might play inside their systems.
Also, people will be concerned about how representative the benchmark's cases are. So, as a next step, we plan to make this benchmark more of a framework that supports customized cases.
Yes, of course vendors have bias. But IMHO, if a benchmark is reproducible and its use cases match users' needs, then we can say it helps decision making to some extent.
The inclusion of the OpenAI dataset in this benchmark adds a layer of realism that's often missing in standardized tests with datasets like SIFT and DEEP.