by bluejay2387 32 days ago
So about a year ago I wrote my own attempt at something like this using vector indexing and BM25 (the latest version uses CocoIndex; before that I had a custom-coded solution using ChromaDB). I wrote a reasonably comprehensive test set, and it showed better search result quality and lower token usage versus grep and rg. I haven't had time to really polish it, but it worked well enough, particularly for one project where I have around 250k documentation files and docs outnumber code files 1000 to 1 (about a 50% reduction in tokens and a 30% increase in successful searches). Yesterday, for grins, I tried this project and was fairly disappointed to see it blow away my kludged solution, particularly given that it doesn't have a lengthy indexing process. I haven't tested it on the 250k doc project yet, but in another project where I have a test suite for semantic search, it outperformed my solution by about 20% in successful search results, even on documentation (which I didn't expect, given that it seems to be tuned only for code). I haven't gone through the code to see what it's doing differently from what I tried, but whatever it's doing, it seems to have potential.
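For readers wondering how "vector indexing and BM25" results can be combined into one ranking, here is a minimal sketch using reciprocal rank fusion. This is an assumption for illustration only: the comment does not say which fusion method was used, and this is not code from either project. The function name, the file paths, and the two input rankings are hypothetical; `k=60` is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) per list it appears in
    (rank starts at 1); the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top hits from a keyword (BM25) index and a vector index.
bm25_hits = ["docs/install.md", "docs/api.md", "src/index.py"]
vector_hits = ["docs/api.md", "docs/faq.md", "docs/install.md"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still makes the list. Fusing on ranks rather than raw scores avoids having to normalize BM25 scores against vector similarities.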
1 comments

Wow, thanks for sharing, and cool that you're working on similar things! Feel free to drop any feedback on the repo if you want!