|
|
|
|
|
by joshua_s_penman
299 days ago
|
|
The thing is — for very long documents, it's actually pretty hard for humans to find things, even with a hierarchical structure. This is why we made indexes — the original indexes! — on paper. What you're saying makes pretty hard assumptions about document content, and of course doesn't start to touch multiple documents. My feeling is that what you're getting at is actually the fact that it's hard to get semantic chunks and when embedding them, it's hard to have those chunks retain context/meaning, and then when retrieving, the cosine similarity of query/document is too vibes-y and not strictly logical. These are all extremely real problems with the current paradigm of vector search. However, my belief is that one can fix each of these problems vs abandoning the fundamental technology. I think that we've only seen the first generation of vector search technology and there is a lot more to be built. At Vectorsmith, we have some novel takes on both the comptuation and storage architecture for vector search. We have been working on this for the last 6 months and have seen some very promising resutls. Fundamentally my belief is that the system is smarter when it mostly stays latent. All the steps of discretization that are implied in a search system like the above lose information in a way that likely hampers retrieval. |
|