| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by joshua_s_penman 299 days ago

The thing is — for very long documents, it's actually pretty hard for humans to find things, even with a hierarchical structure. This is why we made indexes — the original indexes! — on paper. What you're saying makes pretty hard assumptions about document content, and of course doesn't start to touch multiple documents.

My feeling is that what you're getting at is actually the fact that it's hard to get semantic chunks and when embedding them, it's hard to have those chunks retain context/meaning, and then when retrieving, the cosine similarity of query/document is too vibes-y and not strictly logical.

These are all extremely real problems with the current paradigm of vector search. However, my belief is that one can fix each of these problems vs abandoning the fundamental technology. I think that we've only seen the first generation of vector search technology and there is a lot more to be built.

At Vectorsmith, we have some novel takes on both the comptuation and storage architecture for vector search. We have been working on this for the last 6 months and have seen some very promising resutls.

Fundamentally my belief is that the system is smarter when it mostly stays latent. All the steps of discretization that are implied in a search system like the above lose information in a way that likely hampers retrieval.

1 comments

zan2434 299 days ago

interesting, so you think the issue with the above approach is the graph structure being too rigid / lossy (in terms of losing semantics)? And embeddings are also too lossy (in terms of losing context and structure)? But you guys are working on something less lossy for both semantics and context?

joshua_s_penman 299 days ago

> interesting, so you think the issue with the above approach is the graph structure being too rigid / lossy (in terms of losing semantics)?

Yeah, exactly.

>And embeddings are also too lossy (in terms of losing context and structure)

Interestingly, it appears that the problem is not embeddings but rather retrieval. It appears that embeddings can contain a lot more information than we're currently able to pull out. Like, obviously they are lossy, but... less than maybe I thought before I started this project? Or at least can be made to be that way?

> But you guys are working on something less lossy for both semantics and context?

Yes! :) We're getting there! It's currently at the good-but-not-great like GPT-2ish kind of stage. It's a model-toddler - it can't get a job yet, but it's already doing pretty interesting stuff (i.e. it does much better than SOTA on some complex tasks). I feel pretty optimistic that we're going to be able to get it to work at a usable commercial level for at least some verticals — maybe at an alpha/design partner level — before the end of the year. We'll definitely launch the semantic part before the context part, so this probably means things like people search etc. first — and then the contextual chunking for big docs for legal etc... ideally sometime next year?