Hacker News new | ask | show | jobs
by elashri 543 days ago
This seems interesting. I always thought of doing something similar but for specific topic in two specific arxiv categories. But for 300k papers like what is in here, Does anyone have an estimation on how much this will cost using OpenAI ada embeddings model?
1 comments

Ada is deprecated: use the text embedding models instead.

It depends on whether you do full text search, or abstract only. If you do full text, I’d guess about 1k tokens per page, 10 pages per paper? So that would be 3B tokens, which would cost you 60$ if you use the cheapest embedder.

If you just do abstracts, the costs will be negligible.

this project only uses kaggle metadata and abstract from arxiv. Moreover it is "focused" on only 5-6 categories in the arxiv. Therefore, the costs are marginal.

Plus you could use a mixed system: first you index the abstract of the most relevant 50 papers, then embedd the text of those 50 in order to asses which are truly relevant and/or meaningful.