by hskalin 536 days ago
arXiv has about 2.6M articles. Assuming about 10 pages per article, that's 26M pages. According to OpenAI, their cheapest embedding model (text-embedding-3-small) costs a dollar for 62.5K pages, so the price for calculating embeddings for the whole of arXiv is about $416.
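
For anyone who wants to redo the arithmetic with their own numbers, here is a rough sketch. The constants are just the figures quoted above (the 10 pages per article is a guess, not a measurement), and the function name is made up for illustration:

```python
# Back-of-the-envelope embedding cost, using the figures quoted in the comment.
ARTICLES = 2_600_000        # approximate number of arXiv articles
PAGES_PER_ARTICLE = 10      # assumed average length, not a measured value
PAGES_PER_DOLLAR = 62_500   # quoted figure for text-embedding-3-small


def embedding_cost_usd(articles: int, pages_per_article: float, pages_per_dollar: float) -> float:
    """Return the estimated one-off cost in dollars of embedding every page."""
    total_pages = articles * pages_per_article
    return total_pages / pages_per_dollar


full_cost = embedding_cost_usd(ARTICLES, PAGES_PER_ARTICLE, PAGES_PER_DOLLAR)
print(f"${full_cost:,.0f}")  # 26,000,000 / 62,500 = 416
```

Changing the article count or pages per article shows how quickly the estimate moves. It only covers the one-off indexing cost, not the per-query embedding calls.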

I think doing it locally with an open-source model would be a lot cheaper, especially because they wouldn't have to keep calling OpenAI's API for each new query.

Edit: I overlooked the about page (https://searchthearxiv.com/about), seems like they *are* using OpenAI's API, but they only have 300K papers indexed, use an older embedding model, and only calculate embeddings on the abstract. So this should be pretty cheap.