Sweep's logs (which contain snippets of code) are kept for debugging purposes. We don't train on any of your code. The logs are only stored for 30 days. We send this data to OpenAI to generate code.
We're using the OpenAI API, and OpenAI has an agreement stating that they will not train on this data and will retain it for 30 days to monitor trust and safety.
We index your codebase for search, but we use a system that only reads your repo at runtime in Modal. This runs as a serverless function which is torn down after your request completes. Here's a blog we wrote about it! https://docs.sweep.dev/blogs/search-infra
The function doesn't persist any state by default, so we need a way to store the computed embeddings, and we use Redis for that. We don't want to store the actual code as the key, so we hash the code and add some versioning. Because Redis is a cache, it supports concurrent reads and writes, which a lot of vector DBs do poorly.
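To make the key scheme concrete, here is a minimal sketch of hashing the code and adding a version. The function name `cache_key`, the `CACHE_VERSION` value, the choice of SHA-256, and the `version:digest` layout are all assumptions for illustration, not necessarily what Sweep uses.

```python
import hashlib

# Hypothetical version tag: bump it whenever the embedding model or the
# chunking logic changes, so stale embeddings are never reused.
CACHE_VERSION = "v1"


def cache_key(code: str, version: str = CACHE_VERSION) -> str:
    """Return a cache key that identifies a code chunk without containing it."""
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"
```

The same chunk always maps to the same key, so any worker can look it up, while the key itself reveals nothing readable about the code. Changing the version produces a fresh key space without having to delete old entries.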
So the actual code is only accessed at runtime (using the GitHub app authentication to clone the repo), and we also build the vector DB in memory at runtime. It's slow (Redis call, embedding the misses, constructing the index), but 1-2s is negligible in the context of Sweep because a single OpenAI call could take 7s+.
And one nice feature is that when you have Sweep running on 10+ branches (which probably share 95%+ of the code), we just use the cache hits/misses to automatically handle diffs in the vector DB. It's super easy to set up, we don't need to manage different indices (imagine a new index per branch), and it's very cost-efficient.
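The hit/miss behavior across branches can be sketched roughly as follows. This is an illustration under stated assumptions: a plain dict stands in for Redis, `fake_embed` stands in for the real embedding call, and the names `embed_with_cache` and `fake_embed` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def fake_embed(texts: List[str]) -> List[Vector]:
    """Stand-in for a real embedding call (assumption, for illustration only)."""
    return [[float(len(t))] for t in texts]


def embed_with_cache(
    chunks: List[str],
    cache: Dict[str, Vector],  # a dict standing in for Redis
    embed: Callable[[List[str]], List[Vector]] = fake_embed,
    version: str = "v1",  # hypothetical version tag
) -> Tuple[List[Vector], int]:
    """Return one vector per chunk, plus how many chunks had to be embedded."""
    keys = [
        f"{version}:{hashlib.sha256(c.encode('utf-8')).hexdigest()}"
        for c in chunks
    ]
    # Only chunks whose hash is not cached yet need an embedding call.
    missing = {k: c for k, c in zip(keys, chunks) if k not in cache}
    if missing:
        vectors = embed(list(missing.values()))
        cache.update(zip(missing.keys(), vectors))
    # The per-request index is assembled in memory from cached vectors.
    return [cache[k] for k in keys], len(missing)
```

With this shape, indexing a second branch that differs in one chunk only embeds that one chunk; everything else is a cache hit, and no per-branch index has to be stored.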