| HN Mirror


by marginalia_nu 927 days ago

> Traditional keyword-based search mechanisms often rely on algorithms like TF-IDF, BM25, or comparable methods. While these techniques internally utilize vectors, they typically involve sparse vector representations. In these methods, the vectors are predominantly filled with zeros, containing a relatively small number of non-zero values. Those sparse vectors are theoretically high dimensional, definitely way higher than the dense vectors used in semantic search. However, since the majority of dimensions are usually zeros, we store them differently and just keep the non-zero dimensions.
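To make the storage point in the quote concrete, here is a minimal sketch of keeping only the non-zero dimensions, assuming a plain dict from dimension index to weight. The names `to_sparse` and `sparse_dot` and the toy weights are hypothetical, made up for illustration; this is not how any particular search engine lays out its index.

```python
from typing import Dict, Sequence

# A sparse vector here is just {dimension index: weight}; absent keys mean 0.
SparseVector = Dict[int, float]


def to_sparse(dense: Sequence[float]) -> SparseVector:
    """Keep only the non-zero entries of a dense vector."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def sparse_dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product that only visits dimensions stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)


# Toy example: the vectors notionally live in a million-dimensional term
# space, but each one stores only a handful of entries.
doc = {17: 0.5, 4242: 1.25, 999_999: 2.0}
query = {4242: 2.0, 999_999: 0.5}
score = sparse_dot(doc, query)
```

The cost of the dot product depends on how many non-zero entries the smaller vector has, not on the nominal dimensionality.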

Yeah, Heaps' law is a bit of a bitch in these situations. You'll definitely want to make sure those vectors are 64 bit if you plan on indexing a properly large number of documents.

I'd also advise caution about leaning too much into the vector interpretation of these algorithms, as it's largely viewed as a quaint historical artifact that was a bit of a dead end (e.g. as in Croft, Metzler & Strohman 7.2.1).

1 comments

Tomte 927 days ago

> e.g. as in Croft, Metzler & Strohman 7.2.1

https://ciir.cs.umass.edu/downloads/SEIRiP.pdf


marginalia_nu 926 days ago

Actually I think I transposed two digits in that reference, it's 7.1.2.
