by williamzeng0 1046 days ago
Our code is messy (Sweep hasn't gotten around to cleaning it up yet), but here's where we save the code! https://github.com/sweepai/sweep/blob/main/sweepai/core/vect...

So for context, this is running in an ephemeral function from Modal https://modal.com/docs/reference/modal.Function#modalfunctio....

We need a way to store the computed embeddings, because the function doesn't persist any state by default, so we use Redis. But we don't want to store the actual code as the key, so we hash the code and add a version tag. Because it's a cache, it supports concurrent reads and writes, which a lot of vector DBs do poorly.
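Roughly, the hash-plus-version lookup could look like the sketch below. This is not our actual code: `CACHE_VERSION`, `cache_key`, `embed_with_cache`, and the key layout are made-up names for illustration, and a plain dict stands in for Redis so the snippet runs on its own.

```python
import hashlib

# Hypothetical version tag: bump it to invalidate every old entry,
# e.g. after switching embedding models.
CACHE_VERSION = "v1"


def cache_key(code: str, version: str = CACHE_VERSION) -> str:
    """Build a key from a hash of the code, so the code itself is never stored as the key."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    return f"embedding:{version}:{digest}"


def embed_with_cache(chunks, cache, embed_fn):
    """Return (vectors, miss_count), calling embed_fn only for chunks not in the cache.

    `cache` is any dict-like store (a stand-in for Redis here) and
    `embed_fn` takes a list of strings and returns one vector per string.
    """
    keys = [cache_key(chunk) for chunk in chunks]
    vectors = [cache.get(key) for key in keys]
    misses = [i for i, vec in enumerate(vectors) if vec is None]
    if misses:
        fresh = embed_fn([chunks[i] for i in misses])
        for i, vec in zip(misses, fresh):
            cache[keys[i]] = vec  # write back so the next run hits
            vectors[i] = vec
    return vectors, len(misses)
```

With a real Redis client the `get` and the write-back would become network calls (ideally batched), but the hit/miss logic stays the same.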

So the actual code is only accessed at runtime (using the GitHub app authentication to clone the repo), and we also build the vector DB in memory at runtime. It's slow (Redis call, embedding the misses, constructing the index), but 1-2s is negligible in the context of Sweep because a single OpenAI call could be 7s+.
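To give a feel for the "build the index in memory on every run" step, here is a toy sketch. It assumes a brute-force cosine-similarity search in pure Python, which is a stand-in and not the vector DB we actually use; `InMemoryIndex` and `cosine` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Throwaway index rebuilt from cached embeddings each time the function runs."""

    def __init__(self, chunks, vectors):
        self.items = list(zip(chunks, vectors))

    def search(self, query_vec, k=3):
        """Return the k chunks whose vectors are most similar to query_vec."""
        ranked = sorted(
            self.items,
            key=lambda item: cosine(query_vec, item[1]),
            reverse=True,
        )
        return [chunk for chunk, _ in ranked[:k]]
```

Nothing here outlives the function call, which is the point: the only persistent state is the embedding cache.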

And one nice feature is that when you have Sweep running on 10+ branches (which probably share 95%+ of the code), we just use the cache hits/misses to automatically handle diffs in the vector DB. It's super easy to set up, we don't need to manage different indices (imagine a new index per branch), and it's very cost-efficient.
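A small illustration of why branches come almost for free, under the same assumptions as before: content-hash keys, a dict standing in for Redis, and made-up names (`key`, `misses_for_branch`) and file contents. Only files whose contents differ from anything already cached count as misses.

```python
import hashlib


def key(code: str) -> str:
    """Versioned content hash; identical file contents map to the same key on any branch."""
    return "v1:" + hashlib.sha256(code.encode("utf-8")).hexdigest()


def misses_for_branch(files: dict, cache: dict) -> list:
    """Return the paths that would need embedding, then mark them as cached."""
    missing = [path for path, code in files.items() if key(code) not in cache]
    for path in missing:
        cache[key(files[path])] = True  # stand-in for storing the real embedding
    return missing


# Hypothetical repo contents: the feature branch changes only b.py.
main_branch = {"a.py": "print(1)", "b.py": "print(2)"}
feature_branch = {"a.py": "print(1)", "b.py": "print(3)"}

shared_cache = {}
first_run = misses_for_branch(main_branch, shared_cache)      # cold cache: every file misses
second_run = misses_for_branch(feature_branch, shared_cache)  # only the changed file misses
```

There is no per-branch bookkeeping anywhere: the diff between branches falls out of which keys are already present.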