by andai 466 days ago
This appears to do no chunking. It just shoves the entire document (entire book, in my case) into the embedding request to Ollama. So it's only helpful if all your documents are small (i.e. no books).

The embedding model (bge-m3 in this case) has a sequence length of 8192 tokens, so although rlama tries to embed the whole book, Ollama can only fit the first few pages into the embedding request.

Then when retrieving, it retrieves the entire document instead of the relevant passage (because there is no chunking), but truncates this to the first 1000 characters, which in my case is the first half-page of the Table of Contents.

As a result, when queried, the model says: "There is no direct mention of the Buddha in the provided documents." (The word Buddha appears 44,121 times in the documents I indexed.)

A better solution (and, as far as I can tell, what every other RAG does) is to split the document into chunks that can actually fit the context of the embedding model, and then retrieve those chunks -- ideally with metadata about which part of the document it's from.
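
A minimal sketch of the chunking described above, assuming fixed-size character windows with overlap. The names `Chunk` and `chunk_text`, and the default sizes, are made up for illustration and are not rlama's code. A real pipeline would size chunks in tokens using the embedding model's tokenizer so each chunk fits the 8192-token limit.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str   # which document the chunk came from
    index: int    # position of the chunk within the document
    start: int    # character offset where the chunk begins
    end: int      # character offset where the chunk ends (exclusive)
    text: str


def chunk_text(text, doc_id, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        end = min(start + chunk_size, len(text))
        chunks.append(Chunk(doc_id, index, start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk is embedded and stored separately, and the `start`/`end` offsets are the metadata that lets a retrieved chunk be traced back to its place in the document.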

---

I'd also recommend showing the search results to the user (I think just having a vector search engine is already an extremely useful feature, even without the AI summary / question answering), and altering the prompt to provide references (e.g. based on chunk metadata like page number).
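
A rough sketch of what showing search results with references could look like, assuming each indexed chunk is a dict with `vector`, `text`, `source`, and `page` keys. That layout and the function names are assumptions for illustration, and the vectors would come from the embedding model rather than being written by hand.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, k=3):
    """Return the k highest-scoring (score, entry) pairs."""
    scored = [(cosine(query_vec, entry["vector"]), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def format_results(results):
    """Render hits as numbered references that can be shown to the user
    or pasted into the prompt so the model can cite them."""
    lines = []
    for rank, (score, entry) in enumerate(results, 1):
        lines.append(
            f"[{rank}] {entry['source']}, p. {entry['page']} "
            f"(score {score:.2f}): {entry['text']}"
        )
    return "\n".join(lines)
```

The same numbered list can be printed as plain search output and prepended to the question, with an instruction to cite passages by their bracketed number.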

5 comments

I have just implemented chunking with overlap for larger documents to split texts into smaller chunks and ensure access to all documentation in your RAG. It's currently in the testing phase, and I’d like to experiment with different models to optimize the process. Once I confirm that everything is working correctly, I can merge the PR into the main branch, and you’ll just need to update Rlama with `rlama update`.
Sadly, the hardest part of running local models with tools like Ollama appears to be longer context prompts.

Models that respond really quickly to a short sentence prompt need vastly more RAM and CPU/GPU time for significantly longer inputs. I'm finding this really damages their utility for me.

> A better solution (and, as far as I can tell, what every other RAG does) is to split the document into chunks that can actually fit the context of the embedding model, and then retrieve those chunks -- ideally with metadata about which part of the document it's from.

Books have author-provided logical chunking in chapters. You can further split/summarize smaller sections and then do a hierarchical search (naive chunking kind of sucks in my experience).
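
A toy sketch of that idea, under loud assumptions: chapters are detected by headings of the form `Chapter N ...`, sections are just blank-line-separated paragraphs, and a word-overlap count stands in for embedding similarity. The summarization step is left out, and all names here are hypothetical.

```python
import re

# Assumption: chapter headings look like "Chapter 3: Title" on their own line.
CHAPTER_RE = re.compile(r"^(Chapter \d+.*)$", re.MULTILINE)


def split_chapters(text):
    """Split a book into chapters, then each chapter into paragraph sections."""
    # With one capturing group, re.split yields
    # [preamble, title1, body1, title2, body2, ...].
    parts = CHAPTER_RE.split(text)
    chapters = []
    for i in range(1, len(parts), 2):
        body = parts[i + 1]
        sections = [p.strip() for p in body.split("\n\n") if p.strip()]
        chapters.append({"title": parts[i].strip(), "sections": sections})
    return chapters


def overlap_score(query, text):
    """Stand-in for vector similarity: count of shared lowercase words."""
    query_words = set(re.findall(r"\w+", query.lower()))
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(query_words & text_words)


def hierarchical_search(query, chapters):
    """Pick the best chapter first, then the best section inside it."""
    candidates = [c for c in chapters if c["sections"]]
    if not candidates:
        return None
    best_chapter = max(
        candidates,
        key=lambda c: overlap_score(query, c["title"] + " " + " ".join(c["sections"])),
    )
    best_section = max(
        best_chapter["sections"], key=lambda s: overlap_score(query, s)
    )
    return best_chapter["title"], best_section
```

Searching in two stages keeps the second comparison confined to one chapter, and the chapter title comes back as a reference alongside the matching passage.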

What's the gold standard paid offering that does this?
Not a paid solution, but great for testing models yourself: AWS Bedrock.

Wonky documentation (definitely released too early), but imo the best model-agnostic DIY solution out there.

yeah, chunking seems to be the key for any decent RAG implementation... it's interesting how much the retrieval strategy impacts the final answer quality. i've seen some community members mention that even with chunking, things like chunk overlap and smart metadata can significantly improve results. also, presenting search results to the user alongside the AI summary is a great point.
This is my next step. Currently, I’ve built an MVP to test the features and integrations, and to see how far I can go with Rlama. I’m already developing a RAG on my end by chunking the data, adding overlap, and using metadata to retrieve the best possible context. This should be deployed soon. The version on GitHub was pushed days ago, and it was only meant to showcase the features. I can’t wait to improve it and make it useful for everyone!