| HN Mirror



	by jhales 1046 days ago
	What is your data privacy policy?

2 comments

williamzeng0 1046 days ago

Here it is: https://docs.sweep.dev/privacy

The logs from Sweep (which contain snippets of code) are kept for debugging purposes. We don't train on any of your code. These logs are only stored for 30 days. We send this data to OpenAI to generate code. We're using the OpenAI API, and OpenAI has an agreement stating they will not train on this data and will persist it for 30 days to monitor trust and safety.

We index your codebase for search, but we use a system that only reads your repo at runtime in Modal. This runs as a serverless function which is torn down after your request completes. Here's a blog we wrote about it! https://docs.sweep.dev/blogs/search-infra


dennisy 1046 days ago

I think this is a huge point! Surprised no one asked it sooner. Where does all the code you tokenise go?


williamzeng0 1046 days ago

Our code is messy (Sweep hasn't gotten around to it yet), but here's where we save the code! https://github.com/sweepai/sweep/blob/main/sweepai/core/vect...

So for context, this is running in an ephemeral function from Modal https://modal.com/docs/reference/modal.Function#modalfunctio....

We need a way to store the computed embeddings, because the function doesn't persist any state by default, so we use Redis. But we don't want to store the actual code as the key, so we hash the code + add some versioning. Because it's a cache, it supports concurrent writes + reads, which a lot of vector dbs do poorly.
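A minimal sketch of the keying idea described above, not Sweep's actual code. The SHA-256 choice, the `CACHE_VERSION` tag, and the `cache_key` name are all assumptions made for illustration; the point is only that the cache key is derived from the code rather than containing it.

```python
import hashlib

# Hypothetical version tag: bumping it makes every old entry a cache miss,
# e.g. after switching embedding models or chunking rules.
CACHE_VERSION = "v1"


def cache_key(code: str, version: str = CACHE_VERSION) -> str:
    """Return a key derived from the code's hash plus a version tag.

    The source text itself never appears in the key, only its digest.
    """
    digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"


snippet = "def add(a, b):\n    return a + b\n"
key = cache_key(snippet)
print(key)
```

The same snippet always maps to the same key, so identical code on any branch or repo hits the same cache entry, while changing the version tag invalidates everything at once.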

So the actual code is only accessed at runtime (using the GitHub app authentication to clone the repo), and we also build the vector db in memory at runtime. It's slow (Redis call, embedding the misses, constructing the index), but 1-2s is negligible in the context of Sweep because a single OpenAI call could be 7s+.

And one nice feature is that when you have Sweep running on 10+ branches (which probably share 95%+ of the code) we just use the cache hits/misses to automatically handle diffs in the vector db. It's super easy to set up, we don't need to manage different indices (imagine a new index per branch), and it's very cost-efficient.
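To illustrate how cache hits and misses can stand in for per-branch diffs, here is a rough sketch under stated assumptions: a plain dict plays the role of Redis, `toy_embed` is a placeholder for a real embedding call, and `chunk_key` and `embed_with_cache` are hypothetical names, not Sweep's API.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def chunk_key(chunk: str, version: str = "v1") -> str:
    # Key is a versioned hash of the chunk, so the code is not stored as the key.
    return version + ":" + hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def toy_embed(chunk: str) -> Vector:
    # Placeholder for a real embedding model (assumption for this sketch).
    return [float(len(chunk)), float(chunk.count("\n"))]


def embed_with_cache(
    chunks: List[str],
    cache: Dict[str, Vector],
    embed: Callable[[str], Vector] = toy_embed,
) -> Tuple[List[Vector], int]:
    """Return one vector per chunk and the number of chunks actually embedded."""
    keys = [chunk_key(c) for c in chunks]
    # Only chunks whose key is absent from the cache need embedding.
    missing = {k: c for k, c in zip(keys, chunks) if k not in cache}
    for k, c in missing.items():
        cache[k] = embed(c)
    return [cache[k] for k in keys], len(missing)


shared_cache: Dict[str, Vector] = {}  # stand-in for Redis
main_branch = ["def a(): pass", "def b(): pass", "def c(): pass"]
feature_branch = ["def a(): pass", "def b(): pass", "def c(): return 1"]

_, main_misses = embed_with_cache(main_branch, shared_cache)
_, feature_misses = embed_with_cache(feature_branch, shared_cache)
print(main_misses, feature_misses)
```

In this toy run the first branch embeds every chunk and the second only embeds the one chunk that changed, which is the effect described above: no per-branch index to manage, just a shared cache keyed by content.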
