|
|
|
|
|
by ethanahte
1162 days ago
|
|
Hi, author here. 1. You make a great point about longer documents requiring multiple vectors which I should've mentioned in the post. Depending on your use case, this can certainly explode your dataset size!
2. Good to know about the pgvector limitations -- I haven't used it yet.
3. I guess "index" would be the more database-y term. That said, one thing I'll call out is that you have to re-index if you ever change your embedding model, and indexing can be slow. It took me ~20-30 minutes to index the 10 million embeddings in my benchmark. |
|
Obviously, embedding single words probably aren't particularly useful for reassembling portions of a document for submission to an LLM in the prompt. I'm currently pondering on what size of string is best for embedding, and considering a variable size might be one option.
Testing with strings around 512 characters seem to do pretty well, but it may be storing multiple lengths of similar runs in the document might be a better way to do it.