by xfalcox 294 days ago
Depends on your needs. You certainly don't want 32k-token chunks in a standard RAG pipeline.

My use case is basically a recommendation engine, where I retrieve a list of similar forum topics based on the one currently being read. As is typical with dynamic user-generated content, topic length can vary from 10 to 100k tokens. Ideally I would generate embeddings from an LLM-generated summary, but that would increase inference costs considerably at the scale I'm running it.
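
To make the retrieval step concrete, here is a minimal sketch of the "similar topics" lookup, assuming each topic already has one embedding vector stored. The names `cosine_similarity`, `similar_topics`, and the toy vectors are hypothetical and only for illustration; a real deployment would use model-produced embeddings and most likely a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_topics(current_id, topic_embeddings, k=5):
    """Return the ids of the k topics most similar to the one being read."""
    current = topic_embeddings[current_id]
    scored = [
        (cosine_similarity(current, emb), topic_id)
        for topic_id, emb in topic_embeddings.items()
        if topic_id != current_id  # never recommend the topic itself
    ]
    # Highest similarity first; sort on the score only.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [topic_id for _, topic_id in scored[:k]]


# Toy 3-dimensional vectors standing in for real topic embeddings (assumption).
toy = {
    "topic-a": [1.0, 0.0, 0.0],
    "topic-b": [0.9, 0.1, 0.0],
    "topic-c": [0.0, 1.0, 0.0],
}
print(similar_topics("topic-a", toy, k=2))  # ['topic-b', 'topic-c']
```

The point of the sketch is that each topic maps to a single vector, so how much of the topic the embedding model can actually read determines how much of it the recommendation can reflect.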

Having a larger context window out of the box meant that simply swapping embedding models greatly improved the quality of recommendations.