by peterldowns 659 days ago
Very cool project, I'm definitely going to try this out. One question: why use the OpenAI embeddings API instead of BGE (BERT) or another embedding model that can be run efficiently client-side? Was there a quality difference, or did you just default to OpenAI embeddings?

OP's cofounder here. For us, OpenAI embeddings worked best. When building a system that has many points of failure, I like to start with the highest-quality components (even if they're expensive / lack privacy) just to get an upper bound on how good the system can be. Then I start replacing pieces one by one and measure how much quality I'm losing.

P.S. I worked on BERT at Google and have PTSD from how much we tried to make it work for retrieval, and it never really did well. Don't have much experience with BGE though.

Understood, thanks for the clear answer. Very cool that you worked on BERT at Google. Thank you (and your team) for all of the open-source releases and publications you've put out over the years.

I'm using OpenAI embeddings right now in my own project and I'm asking because I'd like to evaluate other embedding models that I can run in/adjacent-to my backend server, so that I don't have to wait 200ms to embed the user's search phrase/query. I'm very impressed by your project and I thought I might save myself some trouble if you had done some clear evals and decided OpenAI is far-and-away better :)

I wish you could tell the stories of how you eval'ed BERT at Google. Sounds meaty.

Retrieval is rarely ever evaluated in isolation. Academics would indirectly evaluate it by how much it improved question answering. The really cool thing at Google is that there were so many products and use cases (beyond the academic QA benchmarks) that would indirectly tell you if retrieval is useful. Much harder to do for smaller companies with a smaller suite of products and user bases.

We ran some qualitative tests and there was a quality difference. In fact, benchmarks show that trend to generally hold: https://archersama.github.io/coir/

That being said, our goal was to make the library modular so you can easily add support for whatever embeddings you want. Definitely encourage experimenting for your use-case because even in our tests, we found that trends which hold true in research benchmarks don't always translate to custom use-cases.
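
To make "modular" a bit more concrete, here's a rough sketch of what a pluggable embedding backend could look like. This is not the library's actual API: `Embedder`, `register`, `build_embedder`, and the toy `CharCountEmbedder` are all hypothetical names made up for illustration. The idea is just that anything with an `embed(texts)` method can be registered under a name and swapped in.

```python
from typing import Callable, Dict, List, Protocol


class Embedder(Protocol):
    """Hypothetical interface: any object with this method can act as a backend."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        ...


# Maps a backend name to a zero-argument factory that builds it.
_REGISTRY: Dict[str, Callable[[], Embedder]] = {}


def register(name: str):
    """Decorator that records a backend factory under `name`."""
    def decorator(factory):
        _REGISTRY[name] = factory
        return factory
    return decorator


@register("char-count")
class CharCountEmbedder:
    """Toy stand-in for a real model: returns [length, vowel count] per text."""

    def embed(self, texts: List[str]) -> List[List[float]]:
        return [
            [float(len(t)), float(sum(ch in "aeiou" for ch in t.lower()))]
            for t in texts
        ]


def build_embedder(name: str) -> Embedder:
    """Look up a backend by name, failing loudly on unknown names."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(
            f"unknown embedder {name!r}; known: {sorted(_REGISTRY)}"
        ) from None


if __name__ == "__main__":
    embedder = build_embedder("char-count")
    print(embedder.embed(["hello"]))
```

A real backend (a hosted API or a local model) would register itself the same way, so the rest of the pipeline only ever sees the `embed` method.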

> we found that trends which hold true in research benchmarks don't always translate to custom use-cases.

Exactly why I asked! If you don't mind a followup question, how were you evaluating embedding models: was it mostly just vibes on your own repos, or something more rigorous? Asking because I'm working on something similar and based on what you've shipped, I think I could learn a lot from you!

Happy to help!

We started with qualitative "vibe" checks, which let us iterate quickly; at that stage the delta in quality was still large enough that we could obviously see what was performing better.

Once we stopped trusting our ability to discern differences, we actually bit the bullet and made a small eval benchmark set (~20 queries across 3 repos of different sizes) and then used that to guide algorithmic development.
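
For anyone who wants to picture a small benchmark like that, here's a minimal sketch. It is not the actual eval set or harness described above; it assumes each query has exactly one relevant file and scores an embedder with recall@k and MRR. The `toy_embed` function, the `CORPUS` files, and the `BENCHMARK` queries are all invented placeholders; in practice you'd plug in the real embedding call and your own labeled queries.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Placeholder embedder: L2-normalized bag of lowercase words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of sparse vectors (equals cosine when both are normalized)."""
    return sum(v * b.get(word, 0.0) for word, v in a.items())


def rank(query: str, corpus: Dict[str, str],
         embed: Callable[[str], Vector]) -> List[str]:
    """Return doc ids sorted by descending similarity (ties broken by id)."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id) for doc_id, text in corpus.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored]


def evaluate(benchmark: List[Tuple[str, str]], corpus: Dict[str, str],
             embed: Callable[[str], Vector], k: int = 1) -> Dict[str, float]:
    """Compute recall@k and mean reciprocal rank over (query, relevant_id) pairs."""
    hits, reciprocal_ranks = 0, 0.0
    for query, relevant_id in benchmark:
        position = rank(query, corpus, embed).index(relevant_id) + 1
        hits += position <= k
        reciprocal_ranks += 1.0 / position
    n = len(benchmark)
    return {"recall_at_k": hits / n, "mrr": reciprocal_ranks / n}


# Invented example data: three "files" and one labeled query per file.
CORPUS = {
    "auth.py": "def login user password check token session",
    "db.py": "def connect database query sql rows",
    "cache.py": "def get set key value ttl expire",
}
BENCHMARK = [
    ("how does user login work", "auth.py"),
    ("run a sql query", "db.py"),
    ("cache key expire", "cache.py"),
]

if __name__ == "__main__":
    print(evaluate(BENCHMARK, CORPUS, toy_embed, k=1))
```

Running the same `BENCHMARK` against two embedders and comparing the numbers is the point: even with only a couple dozen queries, it replaces "this one feels better" with something you can track across changes.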

Thank you, I appreciate the details.