by simonw 974 days ago
My previous implementation used TF-IDF - I basically took all the words in the post and turned them into a giant "word OR word OR word OR word" search query and piped that through SQLite full-text search. https://til.simonwillison.net/sqlite/related-content
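
A rough sketch of the "word OR word OR word" step described above. The function name `build_or_query` and the tokenizing regex are assumptions for illustration, not code from the linked TIL. Each word is double-quoted so the full-text engine reads it as a plain term rather than as query syntax.

```python
import re


def build_or_query(text: str) -> str:
    """Turn a post's text into one big OR query for full-text search."""
    # Lowercase and split on anything that is not a letter or digit.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Drop duplicates while keeping first-seen order.
    unique_words = list(dict.fromkeys(words))
    # Quote each term so it cannot be mistaken for an operator such as OR.
    return " OR ".join(f'"{word}"' for word in unique_words)


if __name__ == "__main__":
    print(build_or_query("SQLite full-text search, SQLite!"))
```

Deduplicating keeps the query short; how often a word appears in the target post does not change which documents match.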

I jumped straight from that to OpenAI embeddings. The results were good enough that I didn't spend time investigating other approaches.

2 comments

> into a giant "word OR word OR word OR word" search query

Does that mean you'd return other docs if they share just one word?

The idea of TF-IDF is that it gives you a vector (maybe combined with PCA or a random dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.
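
To make the "TF-IDF as a vector" idea concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization, sparse dict vectors, and a brute-force cosine scan standing in for vector search; the function names and sample documents are made up for illustration, and no dimensionality reduction is applied.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (word -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {w: (c / len(tokens)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target_index, vectors):
    """Brute-force 'vector search': rank every other document by cosine."""
    scored = [
        (cosine(vectors[target_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != target_index
    ]
    return [i for _, i in sorted(scored, reverse=True)]


if __name__ == "__main__":
    docs = [
        "sqlite full text search ranking",
        "sqlite search with fts5 ranking",
        "baking sourdough bread at home",
    ]
    print(most_similar(0, tfidf_vectors(docs)))
```

The linear scan is fine for a blog-sized corpus; a larger collection would want a real vector index.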

My goal for related articles was to first filter to every document that shared at least one word with the target - which is probably EVERY document in the set - but then rank them based on which ones share the MOST words, scoring words that are rare in the corpus more highly. BM25 does that for free.

Then I take the top ten by score and call those the "related articles".
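
The filter-then-rank step above could look roughly like this with Python's `sqlite3` module. This is a sketch under assumptions: the `articles` table, its `body` column, the sample rows, and both function names are hypothetical, and it needs a SQLite build with FTS5 compiled in. It relies on FTS5's built-in `rank` column, which defaults to BM25.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create an in-memory FTS5 table holding a few made-up articles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE articles USING fts5(body)")
    conn.executemany(
        "INSERT INTO articles(rowid, body) VALUES (?, ?)",
        [
            (1, "sqlite full text search with fts5"),
            (2, "ranking sqlite full text search results"),
            (3, "sqlite on a raspberry pi"),
            (4, "baking sourdough bread at home"),
            (5, "notes on growing tomatoes"),
            (6, "a guide to bicycle repair"),
            (7, "learning to play the ukulele"),
        ],
    )
    return conn


def related_articles(conn, or_query, exclude_id, limit=10):
    """Return rowids of the best matches for an OR query, best first."""
    rows = conn.execute(
        """
        SELECT rowid FROM articles
        WHERE articles MATCH ? AND rowid != ?
        ORDER BY rank  -- BM25 by default; lower values are better matches
        LIMIT ?
        """,
        (or_query, exclude_id, limit),
    ).fetchall()
    return [rowid for (rowid,) in rows]


if __name__ == "__main__":
    conn = build_demo_index()
    # OR query built from the words of article 1.
    query = '"sqlite" OR "full" OR "text" OR "search" OR "with" OR "fts5"'
    print(related_articles(conn, query, exclude_id=1))
```

The `rowid != ?` clause keeps the target article out of its own related list, since it would otherwise match every term in the query.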

That's not quite TF-IDF though. I agree you can get better results than that with Ada embeddings, but I would argue you can get even better results with embeddings from smaller chunks.

I guess technically it's BM25, since it's using the rank mechanism in SQLite FTS5: https://www.sqlite.org/fts5.html#sorting_by_auxiliary_functi...