| HN Mirror



	by Euphorbium 543 days ago
	Did they calculate embeddings for the entire archive? That must have cost a fortune.

3 comments

hskalin 543 days ago

arXiv has about 2.6M articles; assuming about 10 pages per article, that's 26M pages. According to OpenAI, their cheapest embedding model (text-embedding-3-small) costs a dollar per 62.5K pages. So the price of calculating embeddings for the whole of arXiv is about $416.
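For anyone who wants to plug in their own numbers, here is a minimal sketch of that estimate in Python. The three constants are just the figures quoted in this comment (the 10 pages per article is a guess, not a measurement), and the function name is made up for illustration.

```python
# Back-of-the-envelope embedding cost, using the figures quoted in the comment.
ARTICLES = 2_600_000       # approximate article count (from the comment)
PAGES_PER_ARTICLE = 10     # assumed average length, not measured
PAGES_PER_DOLLAR = 62_500  # quoted figure for text-embedding-3-small


def embedding_cost_usd(articles, pages_per_article, pages_per_dollar):
    """Return the estimated cost in dollars of embedding every page once."""
    total_pages = articles * pages_per_article
    return total_pages / pages_per_dollar


cost = embedding_cost_usd(ARTICLES, PAGES_PER_ARTICLE, PAGES_PER_DOLLAR)
print(f"${cost:.0f}")  # prints "$416"
```

Doubling the assumed page count doubles the estimate, so the result is only as good as the 10-page guess.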

I think doing it locally with an open source model would be a lot cheaper as well. Especially because they wouldn't have to keep using OpenAI's API for each new query.

Edit: I overlooked the about page (https://searchthearxiv.com/about), seems like they *are* using OpenAI's API, but they only have 300K papers indexed, use an older embedding model, and only calculate embeddings on the abstract. So this should be pretty cheap.


machiaweliczny 543 days ago

Embeddings are very cheap to generate


HeatrayEnjoyer 543 days ago

What are embeddings and why are they expensive?


Euphorbium 543 days ago

Embeddings are vectors computed from chunks of documents: lists of floats (1024 of them, depending on the model) that represent a short snippet of text. This kind of search works by finding the most similar vectors. Calculating each one costs a fraction of a cent, but when you need to do it billions to trillions of times, it adds up.
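To make "finding the most similar vectors" concrete, here is a toy sketch in Python. It assumes cosine similarity as the measure (a common choice, though the site may use something else) and uses made-up 3-dimensional vectors in place of real embeddings; the chunk names and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, index, k=1):
    """Return the names of the k stored vectors closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings": real ones would have hundreds or thousands of dimensions.
index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

print(most_similar(query, index, k=2))  # prints ['chunk_a', 'chunk_c']
```

This brute-force version compares the query against every stored vector, which is fine for a toy index but is exactly the part that gets slow at scale.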


orf 543 days ago

You could likely calculate them all on a modern MacBook easily enough.

Searching the embeddings is a different problem, but there are lots of specialised databases that can make it efficient.


Euphorbium 543 days ago

You can, but it is a scale problem: on a single machine it would take an unreasonable amount of time.
