by jadbox
649 days ago
Vector search had strange quirks for me: searching for "cat" would return mostly paragraphs unrelated to the word. I was using 3072-dimensional embeddings from OpenAI's text-embedding-3-large, and each entry was roughly 1-2 paragraphs. For my recent project, I found PGroonga more reliable for full-text document lookup (with some fuzzy matching support).
That said, the particular example you give may be either a configuration problem or simply a sign that "cat" is a particularly bad search term for a vector search over your corpus.
HNSW[1] is an approximate nearest neighbour search technique. If you visualise the embedding as distilling each document down to a list of numbers, that vector forms the coordinates of a point in a high-dimensional vector space. HNSW is going to take the word "cat" in your example, embed it, and find what are probably the closest other vectors (representing documents) in that space by your distance metric.[2]
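To make that concrete, here is a minimal sketch of the exact (brute-force) search that HNSW approximates. The 3-d vectors, the document names and the choice of cosine distance are all made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and HNSW avoids comparing the query against every document.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query, docs, k=2):
    """Exact k-nearest-neighbour search: rank every document by distance."""
    ranked = sorted(docs, key=lambda name: cosine_distance(query, docs[name]))
    return ranked[:k]

# Hypothetical toy "embeddings" (assumption: not produced by any real model).
docs = {
    "pets": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "compilers": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # pretend this is the embedding of "cat"

top = nearest(query, docs, k=2)
print(top)
```

Note that the search always returns k results, whether or not anything in the corpus is actually about the query.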
Now, it's easy to imagine the problem: if your document corpus is about something very unrelated to cats (say it's a bunch of programming books), the results of this search are going to be basically random. You have a big blob of embedded documents which are relatively close together, and your search term embeds way off in empty space, so the distance from the search term to any given document is extremely large and nearly the same for all of them. If you were to search for a topic that is reflected in your corpus, your search term could embed right in the middle of that blob, and the distances would be much more meaningful, so the results are likely to be more intuitive and useful.
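A toy sketch of that effect, assuming 2-d points and plain Euclidean distance purely so the numbers are easy to follow (the points and the "contrast" measure are my own illustration, not anything a vector database reports):

```python
import math

def contrast(query, docs):
    """(farthest - nearest) / nearest: how much the nearest hit stands out."""
    distances = [math.dist(query, d) for d in docs]
    return (max(distances) - min(distances)) / min(distances)

# Hypothetical corpus: a tight blob of documents near the origin.
docs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]

on_topic = (0.1, 0.1)       # query that lands inside the blob
off_topic = (100.0, 100.0)  # query that lands far away in empty space

on_contrast = contrast(on_topic, docs)
off_contrast = contrast(off_topic, docs)
print(f"on-topic contrast:  {on_contrast:.3f}")
print(f"off-topic contrast: {off_contrast:.3f}")
```

For the on-topic query the nearest document is many times closer than the farthest one, so the ranking carries real information. For the off-topic query every document is about 140 units away and the differences between them are around 1%, so whichever one comes "first" is close to arbitrary.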
[1] https://arxiv.org/abs/1603.09320 is, I think, the original paper that introduced the technique.
[2] The "probably"/"approximately" get-out is what allows the computational complexity of HNSW to stay reasonable even if the corpus is large.