by gavmor 810 days ago
Thanks for sharing! I look forward to playing with this once I get off my phone. Took a look at the code, though, to see if you've implemented any of the tricks I've been too lazy to try.

`text_splitter=RecursiveCharacterTextSplitter( chunk_size=8000, chunk_overlap=4000)`

Does this simple numeric chunking approach actually work? Or are more sophisticated splitting rules going to make a difference?

`vector_store_ppt=FAISS.from_documents(text_chunks_ppt, embeddings)`

So we're embedding each 8000-char chunk as a single vector in the index. I wonder if certain documents perform better at this fidelity than others. To say nothing of missed "prompt expansion" opportunities.

1 comment

Of all the off-the-shelf text splitters I have tried, the recursive character splitter actually performs really well, especially when the chunk size is so large that a chunk will likely contain more than the context actually needed anyway.
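For anyone curious what "recursive" means here, this is a rough sketch of the idea in plain Python: split on the coarsest separator first and only fall back to finer ones for pieces that are still too long. It is not LangChain's actual implementation; the function name `recursive_split` and the separator list are my own assumptions, and chunk overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper for illustration only. Tries the coarsest
    separator first and recurses with finer ones on oversized pieces.
    Overlap between chunks is intentionally omitted.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Piece still fits: keep merging into the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
    for chunk in recursive_split(sample, chunk_size=12):
        print(repr(chunk))
```

With a large `chunk_size` like 8000, most paragraphs fit whole, so the finer separators rarely kick in. That is probably part of why this simple approach holds up.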

Regarding the index, a mix of BM25 and a vector index usually seems to perform best for most generic data.
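Here is a minimal sketch of one way to mix the two, assuming reciprocal rank fusion (RRF) as the combining step. That is my choice for illustration, not necessarily what anyone in this thread uses. `toy_embed` is a stand-in (character trigram counts) for a real embedding model, and every function name here is hypothetical. It assumes a non-empty list of documents.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n or 1.0
    # Document frequency: number of docs containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text, n=3):
    """Stand-in for a real embedding model: character n-gram counts."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ranking(scores):
    """Document indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each doc gets sum of 1 / (k + rank)."""
    fused = Counter()
    for rank_list in rankings:
        for rank, doc_id in enumerate(rank_list, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


def hybrid_search(query, docs):
    """Return document indices ordered by fused BM25 + vector rank."""
    bm25_rank = ranking(bm25_scores(query, docs))
    q_vec = toy_embed(query)
    vec_rank = ranking([cosine(q_vec, toy_embed(d)) for d in docs])
    return reciprocal_rank_fusion([bm25_rank, vec_rank])


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "quarterly revenue grew in europe",
        "stock prices fell sharply",
    ]
    print(hybrid_search("cat on a mat", docs))
```

Fusing ranks instead of raw scores sidesteps the fact that BM25 scores and cosine similarities live on different scales. A weighted sum of normalized scores is another common option.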