arXiv has about 2.6M articles. Assuming about 10 pages per article, that's 26M pages. According to OpenAI, their cheapest embedding model (text-embedding-3-small) costs a dollar per 62.5K pages, so the price of calculating embeddings for the whole of arXiv is about $416.
I think doing it locally with an open-source model would be a lot cheaper as well, especially because they wouldn't have to keep using OpenAI's API for each new query.
Edit: I overlooked the about page (https://searchthearxiv.com/about), seems like they *are* using OpenAI's API, but they only have 300K papers indexed, use an older embedding model, and only calculate embeddings on the abstract. So this should be pretty cheap.
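For anyone who wants to check the back-of-the-envelope math, here is a rough sketch in Python. The inputs are just the figures quoted in this comment (2.6M articles, roughly 10 pages each, 62.5K pages per dollar); the variable names are mine, and the page count per article is a guess rather than a measured number.

```python
# Rough cost estimate for embedding all of arXiv.
# All inputs are the approximate figures from the comment, not measured values.
articles = 2_600_000        # approximate number of arXiv articles
pages_per_article = 10      # assumed average, a guess
pages_per_dollar = 62_500   # quoted rate for text-embedding-3-small

total_pages = articles * pages_per_article
cost_usd = total_pages / pages_per_dollar

print(f"{total_pages:,} pages -> about ${cost_usd:,.0f}")
```

Changing `pages_per_article` is the quickest way to see how sensitive the total is to that assumption.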
Embeddings are vectors computed from chunks of documents: lists of floating-point numbers (1024 of them, depending on the model) that represent a short snippet of text. This kind of search works by finding the most similar vectors. Calculating them costs fractions of a cent, but when you need to do it billions to trillions of times, it adds up.
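To make "finding the most similar vectors" concrete, here is a minimal sketch using cosine similarity, which is one common similarity measure; I don't know which one the site actually uses. The 3-number vectors and chunk labels are made up for illustration (real embeddings would have hundreds or thousands of dimensions), and `most_similar` is a hypothetical helper, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, chunks, k=2):
    """Return the labels of the k chunks whose vectors are closest to the query."""
    scored = [(cosine_similarity(query, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy data: (label, embedding) pairs with invented values.
chunks = [
    ("chunk about transformers", [0.9, 0.1, 0.0]),
    ("chunk about protein folding", [0.0, 0.2, 0.9]),
    ("chunk about attention", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of the search query

top = most_similar(query, chunks, k=2)
print(top)
```

This brute-force version compares the query against every stored vector, which is fine for a toy example; at the scale of millions of chunks you would normally reach for an approximate nearest-neighbor index instead.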