|
|
|
|
|
by romanhn
468 days ago
|
|
Thanks for posting this, very timely as I'm also playing around with pgvector for semantic search. I saw that you ended up trimming inputs longer than 8K tokens. Have you looked into chunking (breaking input into smaller chunks and doing vector search on the chunks)? Embedding models I'm playing with have a max of 512 tokens, so chunking is pretty much a must. Choosing a chunking strategy seems to be a deep rabbit hole of its own. |
|
Ohh I had not seriously considered this until reading this. I could have multiple embeddings per issue and search across those embeddings and if the same issue is matched multiple times, I would probably take the strongest match and dedupe it.
I could create embeddings for comments too and search across those.
Thanks for the suggestion, would be a good think to try!
> Choosing a chunking strategy seems to be a deep rabbit hole of its own.
Yes this is true. In my case, I think the metadata fields like Title and Labels are probably doing a lot of the work (which would be duplicated across chunks?) and, within an issue body, off the top of my head, I can't see any intuitive ways to chunk it.
I have heard that for standard RAG, chunking goes a surprisingly long way!