Hacker News new | ask | show | jobs
by hintymad 1770 days ago
> I match the query's embedding to the document embeddings,

I assume the doc size is relatively small, otherwise a document may contain too many different topics that make it hard to differentiate different queries?

1 comments

For my search use case, documents are mostly single topic and less than 10 pages. However I have found embeddings still work surprisingly well for longer documents with a few topics in them. But yes, multi-topic documents can certainly be an issue. Segmentation by sentence, paragraph, or page can help here. I believe there are ML-based topic segmentation algorithms too, but that certainly starts making it less simple.