Hacker News
by Euphorbium 543 days ago
Did they calculate embeddings for the entire archive? That must have cost a fortune.
3 comments

arXiv has about 2.6M articles. Assuming about 10 pages per article, that's 26M pages. According to OpenAI, their cheapest embedding model (text-embedding-3-small) costs a dollar per 62.5K pages. So the price of calculating embeddings for the whole of arXiv is about $416.
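
For anyone who wants to redo the arithmetic with different numbers, here is a rough sketch. The three constants are just the estimates quoted above (not measured values), and `embedding_cost_usd` is a made-up helper name for illustration.

```python
# Back-of-the-envelope cost estimate using the figures quoted in the comment.
ARTICLES = 2_600_000        # approximate number of arXiv articles (estimate)
PAGES_PER_ARTICLE = 10      # assumed average length (estimate)
PAGES_PER_DOLLAR = 62_500   # quoted figure for text-embedding-3-small


def embedding_cost_usd(articles, pages_per_article, pages_per_dollar):
    """Return the estimated cost in dollars of embedding every page once."""
    total_pages = articles * pages_per_article
    return total_pages / pages_per_dollar


full_archive = embedding_cost_usd(ARTICLES, PAGES_PER_ARTICLE, PAGES_PER_DOLLAR)
print(f"${full_archive:,.0f}")  # $416
```

Swapping in 300K papers and roughly one page each (abstracts only) gives a figure in the single digits of dollars at the same rate, though the older model they use is priced differently.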

I think doing it locally with an open-source model would be a lot cheaper still, especially because they wouldn't have to keep calling OpenAI's API for each new query.

Edit: I overlooked the about page (https://searchthearxiv.com/about), seems like they *are* using OpenAI's API, but they only have 300K papers indexed, use an older embedding model, and only calculate embeddings on the abstract. So this should be pretty cheap.

Embeddings are very cheap to generate
What are embeddings and why are they expensive?
Embeddings are vectors computed from chunks of documents: lists of, say, 1024 floating-point numbers (the length depends on the model) that represent a short snippet of text. This kind of search works by finding the vectors most similar to the query's vector. Calculating one embedding costs a fraction of a cent, but when you need to do it billions to trillions of times, it adds up.
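To make "finding the most similar vectors" concrete, here is a minimal brute-force sketch using cosine similarity. The 3-dimensional vectors and the `paper_*` ids are toy values invented for illustration (real embeddings have hundreds or thousands of dimensions), and cosine similarity is one common choice of metric, not necessarily what the site uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the ids of the k corpus vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy corpus: hypothetical ids mapped to made-up 3-d "embeddings".
corpus = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

results = top_k(query, corpus, k=2)
print(results)  # ['paper_a', 'paper_c']
```

This scans every vector for every query, which is fine for a toy corpus but is exactly the part that gets slow at scale.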
You could likely calculate them all on a modern MacBook easily enough.

Searching the embeddings is a different problem, but there are lots of specialised databases that can make it efficient.

You can, but it is a scale problem. Doing that would take an unreasonable amount of time at this scale.