


	by osmarks 493 days ago
	You could just run a local LLM over every document and ask it "is this related to this query", but I don't think you actually want to wait a week for results (and holding all the documents you might ever want to search would run to petabytes). The reasonable way is embedding search, which runs much faster with some precomputation, but you still have to store things.


amelius 493 days ago

A better way would be to ask the LLM to generate keywords (or queries). And then use old school techniques to find a set of documents, and then filter those using another LLM.
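The pipeline described above (LLM-generated keywords, a conventional index lookup, then an LLM filter) could be sketched roughly as follows. This is only an illustration under stated assumptions: `generate_keywords` and `is_relevant` are hypothetical callables standing in for the two LLM calls, and the "old school" part is a bare whitespace-tokenised inverted index rather than a real search engine.

```python
from collections import defaultdict
from typing import Callable, Iterable


def build_inverted_index(docs: list[str]) -> dict[str, set[int]]:
    """Map each lowercased token to the ids of the documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword_then_filter(
    query: str,
    docs: list[str],
    index: dict[str, set[int]],
    generate_keywords: Callable[[str], Iterable[str]],  # hypothetical LLM call
    is_relevant: Callable[[str, str], bool],            # hypothetical LLM call
) -> list[str]:
    """Expand the query to keywords, look them up, then filter the candidates."""
    candidates: set[int] = set()
    for keyword in generate_keywords(query):
        # .get avoids creating empty entries in the defaultdict
        candidates |= index.get(keyword.lower(), set())
    # The second LLM only sees the candidate set, not the whole corpus.
    return [docs[i] for i in sorted(candidates) if is_relevant(query, docs[i])]
```

The point of the design is that the expensive per-document LLM call runs only on the candidates the index returns, so its cost depends on how many documents match the keywords rather than on the corpus size.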


brookst 493 days ago

How is that better than embeddings? You’re using embeddings to get a finite list of keywords, throwing out the extra benefits of embeddings (support for every human language, for instance), using a conventional index, and then going back to embedding space for the final LLM?

That whole thing can be simplified to: compute and store embeddings for docs, compute embeddings for query, find most similar docs.
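A minimal sketch of those three steps, for illustration only. The `embed` function here is a toy stand-in (normalised word counts), not a learned embedding model, so it has none of the multilingual or semantic benefits mentioned above; in practice you would swap in a real model. All function names are assumptions made for this example.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def build_index(docs: list[str]) -> list[tuple[str, dict[str, float]]]:
    """Step 1: compute and store an embedding for every document."""
    return [(doc, embed(doc)) for doc in docs]


def search(index: list[tuple[str, dict[str, float]]], query: str, k: int = 3) -> list[str]:
    """Steps 2 and 3: embed the query, return the k most similar documents."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

The document embeddings are computed once up front, so each query costs one embedding plus a similarity scan. This sketch scans every stored vector; large collections typically use an approximate nearest-neighbour index instead.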


amelius 493 days ago

Yes, you can do the "old school search" part with embeddings.


brookst 493 days ago

Ah, I had interpreted “old school search” to mean classic text indexing and Boolean-style search. I’d argue that if it’s using embeddings and cosine similarity, it’s not old school. But that’s just semantics.


osmarks 493 days ago

https://arxiv.org/abs/2212.10496


kortilla 493 days ago

The entire Library of Congress is like 10 TB. You don’t need anything near petabytes until you get out of text into rich media.


osmarks 493 days ago

Common Crawl is petabytes. Anna's Archive is about a petabyte, but it includes PDFs with images.
