Show HN: Playground for comparing embedding models on Wikipedia+book retrieval (embeds.ai)
5 points by davidtsong 967 days ago
Introducing embeds.ai: an embedding playground for comparing how embedding models perform on a real-world use case (retrieval-augmented generation over Wikipedia articles + Elad Gil's High Growth Handbook).

A few weeks ago, Shreyan and I were looking for an embedding model to use for RAG. We eventually came across the MTEB leaderboard, but we struggled to understand the benchmark scores.

We wanted a tool to test various embedding models with example queries on real-world datasets. After unsuccessfully looking for such a “playground”, we decided to just build one ourselves!

We embedded HuggingFace’s Simple Wikipedia dataset using @OpenAI, @Cohere, and 2 open-source models via @Baseten. We then stored the embeddings in @Supabase using pgvector. Finally, we built a web app using NextJS and deployed it on @Vercel.
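For readers who want a feel for the retrieval step, here is a minimal in-memory sketch of cosine-similarity top-k search, the kind of query a vector store answers. It is not the playground's code: the toy 3-dimensional vectors, the `store` dict, and the function names are all assumptions made for illustration, and a real setup would get its vectors from an embedding model and run the search inside the database.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float], store: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k stored documents most similar to the query vector."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real model output (assumed, not real embeddings).
store = {
    "paris": [0.9, 0.1, 0.0],
    "berlin": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], store, k=2)
```

Comparing models then amounts to running the same query against each model's own set of vectors and looking at which documents come back.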

Now we’re hosting the playground for anyone to use for free, as well as open-sourcing our work so people can try evaluating other models, datasets, or indexes.

Learn more in our full blog post here: https://shreyanjain.substack.com/p/announcing-embedding-batt...

And the repo is here: https://github.com/EGCap/playground

If you have other suggestions / pain points from working with embedding models, vector DBs, or RAG, or if you would like to collaborate on any of the above or unrelated projects, please reach out! @shreyanj98 @davidtsong on Twitter

6 comments

Awesome job guys, and thank you for creating it. Curious if you guys have any insights on long-term memory and if there are better ways to do retrieval apart from top-k.

Seems weird that every RAG app uses top-k, especially since you might pull in information irrelevant to the context (e.g. if you were asking for the names of the authors of a paper, you probably only want the top-1 embedding).

Definitely, top-k is a very naive way to do RAG. I think people have experimented with a cross-encoder-like approach or even letting the LLM choose the sources. We will experiment with more approaches like this :)
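A rough sketch of the two-stage idea, in case it helps: pull a generous top-k first, then rescore each candidate against the query and keep only what clears a cutoff. Everything here is an assumption for illustration. In particular, `word_overlap` is a hypothetical stand-in for a real cross-encoder or LLM judgment, and `keep` and `min_score` are made-up knobs.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    keep: int = 1,
    min_score: float = 0.0,
) -> List[Tuple[str, float]]:
    """Rescore first-stage candidates and keep the best ones above a cutoff."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored = [pair for pair in scored if pair[1] >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def word_overlap(query: str, doc: str) -> float:
    """Hypothetical scorer: fraction of distinct query words present in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Pretend these came back from a first-stage top-k vector search.
candidates = [
    "The authors of the paper are Ada Lovelace and Alan Turing",
    "The paper studies retrieval quality",
    "Bananas are rich in potassium",
]
query = "who are the authors of the paper"
best = rerank(query, candidates, word_overlap, keep=1, min_score=0.2)
```

With `keep=1` this behaves like the top-1 case from the parent comment, and the cutoff lets the pipeline return fewer than `keep` results when nothing scores well.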
Looks useful - will be cool to see the results as more models and datasets are added!
Thank you!
very cool work! if you used diff models to embed the docs, did they give you diff sized vectors? did this cause any problems in db storage or calculating vector distances?
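Not the authors' answer, just one way the different-size question is often handled: keep a separate collection per model, fix its dimension on first insert, and only ever compare vectors within the same collection. The class and the model names below are hypothetical.

```python
from typing import Dict, List


class PerModelStore:
    """Keeps one collection per embedding model and enforces its dimension."""

    def __init__(self) -> None:
        self._dims: Dict[str, int] = {}
        self._vectors: Dict[str, Dict[str, List[float]]] = {}

    def add(self, model: str, doc_id: str, vector: List[float]) -> None:
        # The first vector stored for a model fixes that model's dimension.
        expected = self._dims.setdefault(model, len(vector))
        if len(vector) != expected:
            raise ValueError(f"{model} expects {expected} dims, got {len(vector)}")
        self._vectors.setdefault(model, {})[doc_id] = vector

    def dimension(self, model: str) -> int:
        return self._dims[model]

    def vectors(self, model: str) -> Dict[str, List[float]]:
        return self._vectors[model]


store = PerModelStore()
store.add("model-a", "doc1", [0.1, 0.2, 0.3])
store.add("model-b", "doc1", [0.1, 0.2, 0.3, 0.4, 0.5])
```

Distances are then computed only between a query and documents embedded by the same model, so mismatched sizes never meet.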
Did u try VoyageAI’s new embedding models for this?
Not yet, we just saw their announcement today here (for context): https://blog.voyageai.com/2023/10/29/voyage-embeddings/.

We'll definitely work on adding this model next. Seems promising! Thanks for sharing.

nice tool! curious - what was your instruction prompt for instructor-large? did that change based on the document type at all?
We used a really simple prompt: "Represent the document for retrieval: <doc>". We did not get around to experimenting with it or changing it based on the document type; that's a great idea for future extension!
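For anyone who wants to try the per-document-type idea, a small sketch of how the instruction could be picked before embedding. The type-specific wordings and the `type` field are assumptions, not something we ran. The `[instruction, text]` pair shape is what the InstructorEmbedding package's `encode` call is assumed to accept, so the model call is left as a comment.

```python
from typing import Dict, List

# The prompt quoted above, used as the fallback.
DEFAULT_INSTRUCTION = "Represent the document for retrieval:"

# Hypothetical per-type instructions; the wording is illustrative only.
INSTRUCTIONS_BY_TYPE: Dict[str, str] = {
    "wikipedia": "Represent the Wikipedia document for retrieval:",
    "book": "Represent the book passage for retrieval:",
}


def build_inputs(docs: List[Dict[str, str]]) -> List[List[str]]:
    """Pair each document's text with an instruction chosen by its type."""
    pairs = []
    for doc in docs:
        instruction = INSTRUCTIONS_BY_TYPE.get(doc.get("type", ""), DEFAULT_INSTRUCTION)
        pairs.append([instruction, doc["text"]])
    return pairs


docs = [
    {"type": "wikipedia", "text": "Paris is the capital of France."},
    {"type": "book", "text": "Hire executives before you think you need them."},
    {"text": "A document with no type recorded."},
]
pairs = build_inputs(docs)
# Assumed usage with the real model (not executed here):
# embeddings = model.encode(pairs)
```

Queries would need their own matching instruction, which is a separate thing to experiment with.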
woah cool! What was the rationale for supabase vs vector db?
Supabase has pgvector which makes it pretty easy to get started :)!