by ethanahte 1162 days ago
Hi, author here.

1. You make a great point about longer documents requiring multiple vectors, which I should've mentioned in the post. Depending on your use case, this can certainly explode your dataset size!

2. Good to know about the pgvector limitations. I haven't used it yet.

3. I guess "index" would be the more database-y term. That said, one thing I'll call out is that you have to re-index if you ever change your embedding model, and indexing can be slow. It took me ~20-30 minutes to index the 10 million embeddings in my benchmark.


I'm interested if anyone has some hard data on the "best" size of the document "fragments" that are used for embedding into a dense vector.

Obviously, embedding single words probably isn't particularly useful for reassembling portions of a document for submission to an LLM in the prompt. I'm currently pondering what size of string is best for embedding, and a variable size might be one option.

Testing with strings of around 512 characters seems to do pretty well, but storing multiple lengths of similar runs in the document might be a better way to do it.

This will depend on the specific model you're using, because:

- if a model has been trained on shorter paragraphs, it will likely do better on those than on longer ones, and vice versa

- each model has some maximum input length (e.g. 512 tokens, or about 350 words), and might silently discard words when it's given a longer chunk
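To make the truncation point concrete, here is a rough sketch that flags chunks likely to exceed a model's input limit. It assumes the example figures above (512 tokens, about 350 words) and simply counts whitespace-separated words; `MAX_TOKENS`, `TOKENS_PER_WORD`, `estimate_tokens`, and `likely_truncated` are made-up names for illustration. A real check should use the tokenizer that ships with your model.

```python
# Rough heuristic only: real token counts depend on the model's tokenizer.
MAX_TOKENS = 512               # example limit quoted above (an assumption for your model)
TOKENS_PER_WORD = 512 / 350    # assumed ratio, from "512 tokens, or about 350 words"


def estimate_tokens(text: str) -> int:
    """Estimate the token count from the number of whitespace-separated words."""
    return round(len(text.split()) * TOKENS_PER_WORD)


def likely_truncated(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the chunk probably exceeds the model's input limit."""
    return estimate_tokens(text) > max_tokens


short_chunk = "word " * 100   # about 100 words
long_chunk = "word " * 400    # about 400 words, over the ~350-word budget
print(likely_truncated(short_chunk), likely_truncated(long_chunk))
```

Anything flagged here is a candidate for splitting into smaller chunks before embedding, rather than letting the model drop the tail silently.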

I don't know whether or not processing multiple lengths is worthwhile, but you probably want to have some overlap when you turn your docs into chunks.
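For what overlapping chunks might look like, here is a minimal sketch using a sliding character window. The 512-character size echoes the figure from the question; the 64-character overlap and the name `chunk_with_overlap` are arbitrary choices for illustration, not recommended defaults.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into character windows where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 1200-character document, just to show the window arithmetic.
doc = "".join(str(i % 10) for i in range(1200))
chunks = chunk_with_overlap(doc, chunk_size=512, overlap=64)
print([len(c) for c in chunks])
```

Splitting on characters is the simplest option; splitting on sentences or tokens follows the same pattern and avoids cutting words in half.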

Maybe take a look at LangChain or LlamaGPT: someone has probably come up with sensible defaults for overlap and chunk size.

If you want to do embeddings locally, check out sentence-transformers/all-MiniLM-L6-v2

On your last point: I guess recalculating 10 million embeddings takes much longer than the 20-30 mins to re-index?

Or perhaps we care because calculating the embeddings can be done in parallel with no limit, but the indexing is somehow constrained?

Yeah, depending on the model, calculating the 10 million embeddings could take longer sequentially, but, as you mention, it's also an embarrassingly parallel operation. I don't think that indexing can be performed in parallel, but I may be wrong on that one.
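As a sketch of the "embarrassingly parallel" part: each batch of texts can be embedded independently and the results stitched back together in order. `fake_embed` below is a hypothetical stand-in for a real model call, and the batch size and worker count are arbitrary. Threads are used only to keep the example small; a CPU-bound model would more likely be spread across processes, GPUs, or machines.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for a model call: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def embed_batch(batch: list[str]) -> list[list[float]]:
    """Embed one batch; batches do not depend on each other."""
    return [fake_embed(text) for text in batch]


def embed_in_parallel(texts: list[str], batch_size: int = 1000, workers: int = 4) -> list[list[float]]:
    """Embed all texts by farming independent batches out to a worker pool."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(embed_batch, batches)  # map() returns results in input order
        return [vec for batch in results for vec in batch]


texts = [f"doc {i}" for i in range(2500)]
vectors = embed_in_parallel(texts)
print(len(vectors), len(vectors[0]))
```

Whether the indexing step can be split up the same way depends on the index and the database, so this only covers the embedding half.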