Show HN: Playground for comparing embedding models on Wikipedia+book retrieval

5 points by davidtsong 967 days ago

Introducing embeds.ai: an embedding playground for comparing how embedding models perform on a real-world use case (retrieval-augmented generation over Wikipedia articles and Elad Gil's High Growth Handbook).

A few weeks ago, Shreyan and I were looking for an embedding model to use for RAG. We eventually came across the MTEB leaderboard, but we struggled to understand the benchmark scores.

We wanted a tool to test various embedding models with example queries on real-world datasets. After unsuccessfully looking for such a “playground”, we decided to just build one ourselves!

We embedded HuggingFace’s Simple Wikipedia dataset using @OpenAI, @Cohere, and two open-source models via @Baseten. We then stored the embeddings in @Supabase using pgvector. Finally, we built a web app using Next.js and deployed it on @Vercel.
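
To make the retrieval step concrete, here is a minimal in-memory sketch of ranking documents by cosine similarity to a query embedding. The toy 3-dimensional vectors and the `cosine_similarity` and `top_k` names are assumptions for illustration; the playground itself stores real model embeddings in pgvector rather than in Python lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=3):
    """docs is a list of (text, vector); returns up to k (score, text) pairs, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings (made up for this example, not produced by any real model).
docs = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("Mitochondria produce energy in cells.", [0.0, 0.2, 0.9]),
    ("France is a country in Western Europe.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]
results = top_k(query, docs, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Swapping the embedding model changes only the vectors, which is why the same query can return different neighbors per model.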

Now we’re hosting the playground for anyone to use for free, as well as open-sourcing our work so people can try evaluating other models, datasets, or indexes.

Learn more in our full blog post: https://shreyanjain.substack.com/p/announcing-embedding-batt...

And the repo is here: https://github.com/EGCap/playground

If you have other suggestions / pain points from working with embedding models, vector DBs, or RAG, or if you would like to collaborate on any of the above or unrelated projects, please reach out! @shreyanj98 @davidtsong on Twitter

6 comments

varunshenoy 967 days ago

Awesome job guys, and thank you for creating it. Curious if you guys have any insights on long-term memory and if there are better ways to do retrieval apart from top-k.

Seems weird that every RAG app uses top-k, especially since you might pull in information irrelevant to the context (e.g. if you were asking for the names of the authors of a paper, you probably only want the top-1 embedding).

davidtsong 966 days ago

Definitely, top-k is a very naive way to do RAG. I think people have experimented with a cross-encoder-like approach, or even with letting the LLM choose the sources. We will experiment with more approaches like this :)
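
As a rough sketch of what "beyond plain top-k" could look like, the snippet below shows two simple variations: dropping candidates whose score falls too far below the best hit, and reranking candidates with a pluggable scoring function. The names `filter_by_score` and `rerank`, the thresholds, and the word-overlap scorer are all hypothetical; a real cross-encoder would replace `overlap_score`, and this is not something the playground is confirmed to implement.

```python
def filter_by_score(scored, min_score=0.0, max_gap=0.05):
    """Keep results close to the best hit.

    scored is a list of (score, text) sorted best-first. A result survives if
    its score is at least min_score and within max_gap of the top score.
    """
    if not scored:
        return []
    best = scored[0][0]
    return [(s, t) for s, t in scored if s >= min_score and best - s <= max_gap]


def rerank(query, candidates, score_fn):
    """Re-order candidate texts using score_fn(query, text), best first."""
    rescored = [(score_fn(query, text), text) for text in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored


def overlap_score(query, text):
    """Stand-in scorer (shared-word count); a cross-encoder would go here."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# One clear winner: only the top hit survives, similar to top-1.
clear_winner = filter_by_score([(0.95, "author list"), (0.60, "abstract")])
# Two near-ties and one distant hit: the distant one is dropped.
near_ties = filter_by_score([(0.92, "a"), (0.90, "b"), (0.70, "c")])
reranked = rerank(
    "authors of the paper",
    ["The paper studies embeddings.", "The authors of the paper are listed here."],
    overlap_score,
)
```

The gap filter lets the number of retrieved chunks adapt to the query instead of always being k.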

clueless_stats 967 days ago

Looks useful - will be cool to see the results as more models and datasets are added!

davidtsong 967 days ago

Thank you!

sr33j 966 days ago

very cool work! if you used diff models to embed the docs, did they give you diff sized vectors? did this cause any problems in db storage or calculating vector distances?
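
For context on the question, one common way to cope with differently sized vectors is to keep each model's embeddings in their own collection and enforce the dimension on write, since distances are only meaningful between vectors from the same model. The sketch below illustrates that idea only; the class, the model names, and the dimensions are hypothetical, and nothing in this thread confirms how the playground handles it.

```python
class EmbeddingStore:
    """Keeps one collection of vectors per model and enforces its dimension."""

    def __init__(self):
        self._dims = {}
        self._rows = {}

    def register_model(self, model, dim):
        """Declare a model and the fixed size of the vectors it produces."""
        self._dims[model] = dim
        self._rows[model] = []

    def add(self, model, doc_id, vector):
        """Store a vector, rejecting it if the size does not match the model."""
        expected = self._dims[model]
        if len(vector) != expected:
            raise ValueError(f"{model} expects {expected} dims, got {len(vector)}")
        self._rows[model].append((doc_id, list(vector)))

    def count(self, model):
        return len(self._rows[model])


store = EmbeddingStore()
store.register_model("model_a", 4)  # hypothetical model names and sizes
store.register_model("model_b", 2)
store.add("model_a", "doc-1", [0.1, 0.2, 0.3, 0.4])
store.add("model_b", "doc-1", [0.5, 0.6])

try:
    store.add("model_b", "doc-2", [0.1, 0.2, 0.3, 0.4])  # wrong size for model_b
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```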

bigheadgpt 967 days ago

Did u try VoyageAI’s new embedding models for this?

davidtsong 967 days ago

Not yet, we just saw their announcement today (for context): https://blog.voyageai.com/2023/10/29/voyage-embeddings/

We'll definitely work on adding this model next. Seems promising! Thanks for sharing.

tigs_ 967 days ago

nice tool! curious - what was your instruction prompt for instructor-large? did that change based on the document type at all?

shreyanj 967 days ago

We used a really simple prompt: "Represent the document for retrieval: <doc>". We did not get around to experimenting with it or changing it based on the document type; that's a great idea for future extension!
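
A small sketch of how that prompt, and the per-document-type variation suggested above, might be wired up. Instruction-tuned embedders such as instructor-large are generally fed (instruction, text) pairs; the `build_inputs` helper and the per-type instruction strings below are hypothetical and were not part of what we ran.

```python
# The single instruction described above, applied to every document.
DOC_INSTRUCTION = "Represent the document for retrieval: "

# Hypothetical per-type instructions for the suggested extension (untested).
INSTRUCTIONS_BY_TYPE = {
    "wikipedia": "Represent the Wikipedia document for retrieval: ",
    "book": "Represent the book passage for retrieval: ",
}


def build_inputs(docs, doc_type=None):
    """Pair each document with an instruction.

    Falls back to DOC_INSTRUCTION when doc_type is missing or unknown.
    Returns a list of [instruction, text] pairs.
    """
    instruction = INSTRUCTIONS_BY_TYPE.get(doc_type, DOC_INSTRUCTION)
    return [[instruction, doc] for doc in docs]


default_pairs = build_inputs(["Paris is the capital of France."])
wiki_pairs = build_inputs(["Paris is the capital of France."], doc_type="wikipedia")
# The resulting pairs would then be handed to the embedding model's encode
# step; that call is omitted here so the sketch runs without the model.
```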

ankitd33 966 days ago

woah cool! What was the rationale for supabase vs vector db?

davidtsong 966 days ago

Supabase has pgvector which makes it pretty easy to get started :)!
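
To give a feel for how little is needed, here is a sketch of a pgvector table and a nearest-neighbor query, prepared from Python. The table name `doc_chunks`, its columns, and the 3-dimensional vector size are made up for illustration (real embeddings are much larger), and the `%s` placeholders assume a psycopg-style driver; `<=>` is pgvector's cosine distance operator.

```python
def to_pgvector_literal(vec):
    """Format a Python sequence as a pgvector text literal, e.g. '[1.0,0.5,0.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


# Hypothetical schema: one row per document chunk.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS doc_chunks (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(3)
);
"""

# Smaller cosine distance means more similar, so order ascending.
TOP_K_SQL = """
SELECT id, content, 1 - (embedding <=> %s::vector) AS cosine_similarity
FROM doc_chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def top_k_params(query_vec, k):
    """Parameters for TOP_K_SQL: the query vector (used twice) and the limit."""
    literal = to_pgvector_literal(query_vec)
    return (literal, literal, k)


params = top_k_params([1, 0.5, 0], k=5)
# With a database connection: cursor.execute(TOP_K_SQL, params)
```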
