by dennisy 1047 days ago
I think this is a huge point! Surprised no one asked it sooner. Where does all the code go which you tokenise?
1 comment

Our code is messy (Sweep hasn't gotten around to it yet), but here's where we save the code: https://github.com/sweepai/sweep/blob/main/sweepai/core/vect...

So for context, this is running in an ephemeral function from Modal https://modal.com/docs/reference/modal.Function#modalfunctio....

We need a way to store the computed embeddings because the function doesn't persist any state by default, so we use Redis. We don't want to store the actual code as the key, though, so we hash the code and add some versioning. Because it's a cache, it supports concurrent reads and writes, which a lot of vector DBs do poorly.
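
To make the keying idea concrete, here's a rough sketch of what "hash the code + add some versioning" could look like. This is not the actual Sweep code: `CACHE_VERSION`, `cache_key`, and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib

# Hypothetical version tag: bump it whenever the embedding model or the
# chunking logic changes, so stale entries are simply never looked up again.
CACHE_VERSION = "v1"


def cache_key(code: str, version: str = CACHE_VERSION) -> str:
    """Build a cache key from a hash of the code, never the code itself."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"
```

The key is deterministic, so identical chunks always map to the same entry, and the stored key never contains the source text.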

So the actual code is only accessed at runtime (using the GitHub app authentication to clone the repo), and we also build the vector db in memory at runtime. It's slow (Redis call, embedding the misses, constructing the index), but 1-2s is negligible in the context of Sweep because a single OpenAI call could be 7s+.
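
Roughly, the runtime flow described here (look up the cache, embed only the misses, then assemble everything in memory) could be sketched like this. A plain dict stands in for Redis, and `toy_embed` stands in for the real embedding call; both, along with `chunk_key` and `embed_with_cache`, are made-up names for illustration, not the real implementation.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def chunk_key(code: str, version: str = "v1") -> str:
    # Hypothetical key scheme: version prefix plus a hash of the chunk.
    return f"{version}:{hashlib.sha256(code.encode('utf-8')).hexdigest()}"


def embed_with_cache(
    chunks: List[str],
    cache: Dict[str, Vector],  # stand-in for Redis in this sketch
    embed_fn: Callable[[List[str]], List[Vector]],
) -> Tuple[List[Vector], int]:
    """Return one vector per chunk plus the number of cache misses."""
    keys = [chunk_key(chunk) for chunk in chunks]

    # Collect each missing chunk once, even if it appears several times.
    misses: Dict[str, str] = {}
    for key, chunk in zip(keys, chunks):
        if key not in cache:
            misses.setdefault(key, chunk)

    # Embed only the misses and write them back to the cache.
    if misses:
        new_vectors = embed_fn(list(misses.values()))
        for key, vector in zip(misses.keys(), new_vectors):
            cache[key] = vector

    # Everything is now cached; the in-memory index is built from these.
    return [cache[key] for key in keys], len(misses)


def toy_embed(texts: List[str]) -> List[Vector]:
    # Placeholder "embedding": text length and newline count.
    return [[float(len(t)), float(t.count("\n"))] for t in texts]
```

On a second call with the same chunks, the miss count drops to zero and no embedding call is made at all.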

And one nice feature is that when you have Sweep running on 10+ branches (which probably share 95%+ of the code), we just use the cache hits/misses to automatically handle diffs in the vector db. It's super easy to set up, we don't need to manage different indices (imagine a new index per branch), and it's very cost-efficient.
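
As a toy illustration of the branch case: if keys are content hashes, the "diff" between two branches is just a set difference over keys. The chunk contents and names below (`key_set`, `main_chunks`, `feature_chunks`) are invented for the example.

```python
import hashlib
from typing import Iterable, Set


def key_set(chunks: Iterable[str], version: str = "v1") -> Set[str]:
    # Hypothetical key scheme: version prefix plus a hash of each chunk.
    return {
        f"{version}:{hashlib.sha256(c.encode('utf-8')).hexdigest()}"
        for c in chunks
    }


main_chunks = [
    "def a(): return 1\n",
    "def b(): return 2\n",
    "def c(): return 3\n",
]
# The feature branch changes only one of the three chunks.
feature_chunks = [
    "def a(): return 1\n",
    "def b(): return 2\n",
    "def c(): return 4\n",
]

already_cached = key_set(main_chunks)            # populated by the main run
feature_keys = key_set(feature_chunks)

shared = feature_keys & already_cached           # cache hits, no work needed
to_embed = feature_keys - already_cached         # cache misses, the "diff"
```

Only the changed chunk needs a new embedding, and no per-branch index has to be tracked anywhere.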