Hacker News
by thomasahle 973 days ago
> Into a giant "word OR word OR word OR word"

Does that mean you'd return other docs if they share just one word?

The idea of TF-IDF is that it gives you a vector (maybe combined with PCA or a random projection for dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.
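To make that concrete, here is a rough sketch of TF-IDF vectors plus a brute-force cosine search, in plain Python. The function names (`tfidf_vectors`, `cosine`, `nearest`), the whitespace tokenizer, the smoothed IDF formula, and the sample documents are all assumptions for illustration; the dimensionality reduction step is left out.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(vectors, query_index, k=3):
    """Brute-force vector search: rank every other document by cosine similarity."""
    scored = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

# Hypothetical sample corpus.
docs = [
    "sqlite full text search with bm25 ranking",
    "ranking related articles with sqlite full text search",
    "baking sourdough bread at home",
    "embedding vectors and vector search",
]
vectors = tfidf_vectors(docs)
print(nearest(vectors, 0, k=2))
```

The linear scan in `nearest` is the "vector search" part; on a large corpus you would presumably swap it for an approximate nearest-neighbor index.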

1 comment

My goal for related articles was to first filter to every document that shared at least one word with the target - which is probably EVERY document in the set - but then rank them based on which ones share the MOST words, scoring words that are rare in the corpus more highly. BM25 does that for free.

Then I take the top ten by score and call those the "related articles".
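Roughly, the filter-then-rank approach described above could be sketched like this in plain Python. This is a stand-in for illustration, not the actual setup: the function name `bm25_related`, the whitespace tokenizer, the `k1`/`b` values (common textbook defaults), and the sample documents are all assumptions.

```python
import math
from collections import Counter

def bm25_related(docs, target_index, top_n=10, k1=1.2, b=0.75):
    """Rank the other documents against every word of the target, using BM25."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # The giant "word OR word OR word" query: each distinct word of the target.
    query_terms = set(tokenized[target_index])

    scored = []
    for i, tokens in enumerate(tokenized):
        if i == target_index:
            continue
        tf = Counter(tokens)
        shared = query_terms.intersection(tf)
        if not shared:
            continue  # OR filter: keep only documents sharing at least one word
        score = 0.0
        for term in shared:
            # Rare words get a higher IDF, so they count for more.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))

    scored.sort(reverse=True)
    return [i for _, i in scored[:top_n]]

# Hypothetical sample corpus; document 0 is the target article.
docs = [
    "sqlite full text search with bm25 ranking",
    "ranking related articles with sqlite full text search",
    "baking sourdough bread at home",
    "vector search with embeddings",
]
print(bm25_related(docs, 0))
```

Documents that share more words with the target accumulate more terms in the sum, and each term is weighted by its rarity, which is the "for free" ranking behavior; the slice at the end is the top ten.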