| HN Mirror



	by simonw 974 days ago
	My previous implementation used TF-IDF: I basically took all the words in the post, turned them into a giant "word OR word OR word OR word" search query, and piped that through SQLite full-text search (https://til.simonwillison.net/sqlite/related-content). I jumped straight from that to OpenAI embeddings. The results were good enough that I didn't spend time investigating other approaches.
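For readers who want to see the shape of the "giant OR query" idea, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the code from the linked TIL: the `posts` table, the `body` column, and the helper names `or_query` and `related` are hypothetical, the tokenization is a plain regex, and it assumes the SQLite build bundled with Python was compiled with FTS5.

```python
import re
import sqlite3


def or_query(text):
    """Turn free text into an FTS5 query of the form '"w1" OR "w2" OR ...'."""
    words = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    # Double-quote every term so FTS5 treats it as a string, not as query syntax.
    return " OR ".join(f'"{w}"' for w in words)


def related(conn, post_id, limit=10):
    """Return rowids of other posts matching any word of the target post."""
    (body,) = conn.execute(
        "SELECT body FROM posts WHERE rowid = ?", (post_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT rowid FROM posts WHERE posts MATCH ? AND rowid != ? "
        "ORDER BY rank LIMIT ?",  # FTS5's default rank is BM25, best match first
        (or_query(body), post_id, limit),
    )
    return [row[0] for row in rows]


# Hypothetical three-post corpus, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(body)")
conn.executemany(
    "INSERT INTO posts(rowid, body) VALUES (?, ?)",
    [
        (1, "sqlite full text search with fts5"),
        (2, "ranking sqlite fts5 search results"),
        (3, "baking sourdough bread at home"),
    ],
)
print(related(conn, 1))
```

In this toy corpus only post 2 shares any words with post 1, so post 3 is filtered out by the `MATCH` before ranking happens.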

2 comments

thomasahle 974 days ago

> Into a giant "word OR word OR word OR word"

Does that mean you'd return other docs if they share just one word?

The idea of TF-IDF is that it gives you a vector (maybe combined with PCA or a random dimensionality reduction) that you can use just like an Ada embedding. But you still need vector search.
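To make the "TF-IDF gives you a vector" point concrete, here is a rough sketch in plain Python. It uses one common weighting (term frequency times `log(N / df)`) and brute-force cosine similarity in place of a real vector index; the function names and the toy documents are made up for illustration, and no dimensionality reduction is applied.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} TF-IDF vector."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical corpus, already split into tokens.
docs = [
    "sqlite full text search".split(),
    "sqlite search ranking".split(),
    "sourdough bread baking".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Comparing one vector against every other vector like this is the simplest possible "vector search"; it is fine for a small blog but scales linearly with the number of documents.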


simonw 973 days ago

My goal for related articles was to first filter to every document that shared at least one word with the target - which is probably EVERY document in the set - but then rank them based on which ones share the MOST words, scoring words that are rare in the corpus more highly. BM25 does that for free.

Then I take the top ten by score and call those the "related articles".
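As a rough illustration of what that ranking does, here is a small BM25 scorer in plain Python. It is a sketch, not SQLite's implementation: it uses the common `k1 = 1.2`, `b = 0.75` defaults and an IDF variant with `+ 1` inside the logarithm so that very common words still score slightly above zero. The function name `bm25_top` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bm25_top(query_terms, docs, k1=1.2, b=0.75, limit=10):
    """Return indexes of the best-scoring docs that share a term with the query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scored = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for t in set(query_terms):
            if tf[t] == 0:
                continue
            # Rare terms (small df) get a larger IDF, so they count for more.
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * length_norm)
        if score > 0:  # keep only docs sharing at least one word
            scored.append((score, i))
    # Highest score first; ties broken by document index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:limit]]


# Every document contains "the"; only the first contains the rare "fts5".
docs = [
    "the sqlite fts5".split(),
    "the bread recipe".split(),
    "the sqlite search".split(),
]
print(bm25_top(["the", "fts5"], docs))
```

All three documents pass the "shares at least one word" filter because of "the", but the document containing the rare word comes out on top. A real related-articles query would also drop the target document itself from the results.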


rolisz 974 days ago

That's not quite TF-IDF though. I agree you can get better results than that with Ada embeddings, but I would argue you can get even better results with embeddings from smaller chunks.


simonw 974 days ago

I guess technically it's BM25, since it's using the rank mechanism in SQLite FTS5: https://www.sqlite.org/fts5.html#sorting_by_auxiliary_functi...
