Hacker News
by infecto 972 days ago
Thinking about it some more as I read through more comments: in the stated case of research papers, it can make sense if your task is finding common themes rather than specific details. If you embed a single sentence or paragraph, you miss the connections between sentences across the whole paper, or at least they're harder to manage. By encoding a large number of pages from the paper (or the entire paper), you can hopefully do a better job of capturing the theme of that paper.

This also opens up another question, though: how would that compare to using an LLM to summarize the paper and then embedding that summary?
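
To make the sentence-level vs. whole-paper distinction concrete, here is a rough sketch. Everything in it is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `DIM` is arbitrary, and the `paper` text is made up. It only shows the shape of the two indexing choices, not their retrieval quality.

```python
import hashlib
import math
import re

DIM = 64  # arbitrary toy dimensionality (assumption, not from any real model)


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised.

    Hypothetical stand-in for a real embedding model.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


def split_sentences(text: str) -> list[str]:
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Made-up example text standing in for a research paper.
paper = (
    "Transformers use attention. "
    "Attention cost grows with sequence length. "
    "Long inputs therefore need sparse variants."
)

# Option 1: one vector per sentence (good for specific details,
# but cross-sentence connections are not in any single vector).
sentence_vectors = [toy_embed(s) for s in split_sentences(paper)]

# Option 2: one vector for the whole paper (coarser, theme-level).
document_vector = toy_embed(paper)

query = toy_embed("attention cost for long inputs")
best_sentence_score = max(cosine(query, v) for v in sentence_vectors)
document_score = cosine(query, document_vector)
```

With a real model you would swap out `toy_embed` and keep the rest. The summary idea would be a third option: embed the LLM's summary in place of `paper`.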

1 comment

I would guess that the embedded summary is better, but for many tasks where you use embeddings (like document search), summarizing every document with an LLM is too expensive and slow.