Launch HN: Sarus (YC W22) – Work on sensitive data with differential privacy


136 points by maximeago 1553 days ago

Hi HN! Maxime, Nicolas, and Vincent here, founders of Sarus (https://www.sarus.tech). Sarus is privacy engineering software that lets data scientists work on data without needing direct access to it. It works like a proxy between the practitioner and the data. All queries and data processing jobs are executed on the original data with the privacy guarantees of differential privacy.

When data is sensitive, getting access can be a huge pain. It means going through a long manual validation process that includes designing and implementing an appropriate data anonymization scheme. This takes weeks to months, and some data utility may be lost to the masking requirements.

Sarus makes all of it irrelevant by letting analysts work on data that is never accessed. Analysts only access outputs of their data jobs, and those can be protected with appropriate privacy measures.

With past lives in healthtech, finance, and marketing, we’ve experienced first-hand that data governance has come to take up a huge part of data operations. Protecting data is a legitimate objective, but it should not have to hamstring all innovation. For most data science or analytics objectives, the analyst has no interest in the information of a given individual. They look for patterns that are valid across the dataset. Access to user-level information is just an unfortunate way to get there.

We decided to build Sarus so that data access is no longer a requirement.

The Sarus API proxies all queries, compiles them into a privacy-safe version, runs them on the original data (which never moves outside of our clients’ infrastructure) and outputs the protected results to the practitioner. The protection relies on differential privacy, a mathematical definition of privacy already used by leading tech companies. Differential privacy works by adding calibrated randomness to outputs so that the information of any given individual cannot be inferred. One of its main benefits is that it does not make any assumption about what is sensitive in the data or what the recipient of the output may already know or do. This makes it an ideal candidate for replacing manual data governance processes with something fully automated. Sarus rewrites each query in a way that implements the core principles of differential privacy.
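To make "calibrated randomness" concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a count query. It is only an illustration of the general idea, not Sarus's implementation; the function name `laplace_count` and its parameters are made up for this example.

```python
import numpy as np


def laplace_count(values, predicate, epsilon, rng=None):
    """Return a differentially private count of the values matching predicate.

    Adding or removing one individual changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    (Hypothetical helper for illustration only.)
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    # Smaller epsilon means more noise and a stronger privacy guarantee.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise


if __name__ == "__main__":
    ages = [23, 35, 41, 58, 62, 67, 71]
    noisy = laplace_count(ages, lambda a: a >= 60, epsilon=0.5)
    print(f"noisy count of people aged 60+: {noisy:.2f}")
```

The analyst only ever sees the noisy value. The noise is calibrated to how much one person can change the result, which is why no assumption about what is sensitive in the data is needed.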

For the core primitives of differential privacy, we leverage the latest research (Dwork & Roth 2014, Abadi 2016, Dong 2019, Koskela 2020 or Wilson 2019) and open source implementations (tensorflow-privacy, Google Differential Privacy, OpenDP, Smartnoise). Our key contribution is to bundle everything into an API that can be queried without seeing the data in the first place. This requires proper privacy accounting (we use PLD accounting as in Koskela 2020) but also setting all the technical parameters that are required by the framework (estimating range of input data, allocating privacy budget across computation steps…). We also optimize the privacy-utility trade-off by memoizing previous queries as much as possible.
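As a rough sketch of why budget allocation and memoization matter, the toy proxy below tracks a total privacy budget and caches noisy answers. It assumes basic sequential composition (epsilons simply add up), which is much looser than the PLD accounting mentioned above, and the class `BudgetedProxy` and its methods are hypothetical names for this example, not the Sarus API.

```python
import numpy as np


class BudgetedProxy:
    """Toy query proxy with a privacy budget and a cache of noisy answers.

    Assumption: basic sequential composition, i.e. the epsilons of all
    answered queries add up. Real accountants are tighter than this.
    """

    def __init__(self, data, total_epsilon, seed=None):
        self._data = list(data)
        self.remaining = float(total_epsilon)
        self._cache = {}
        self._rng = np.random.default_rng(seed)

    def noisy_count(self, name, predicate, epsilon):
        """Answer a count query, spending budget only on the first call.

        `name` is assumed to identify the query uniquely.
        """
        key = (name, epsilon)
        if key in self._cache:
            # Re-releasing an already published answer is post-processing:
            # it reveals nothing new, so it costs no extra budget.
            return self._cache[key]
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        true_count = sum(1 for row in self._data if predicate(row))
        answer = true_count + self._rng.laplace(loc=0.0, scale=1.0 / epsilon)
        self._cache[key] = answer
        return answer


if __name__ == "__main__":
    proxy = BudgetedProxy(range(100), total_epsilon=1.0)
    first = proxy.noisy_count("even", lambda x: x % 2 == 0, epsilon=0.5)
    again = proxy.noisy_count("even", lambda x: x % 2 == 0, epsilon=0.5)
    print(first == again, proxy.remaining)
```

Without the cache, asking the same question twice would spend the budget twice and let the analyst average out the noise. Returning the stored answer avoids both problems.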

Wait, but the first thing data scientists do is check out the data. How do I do that now? Not a problem: the API provides synthetic data samples with the same schema and statistical distribution by default. This effectively replaces the need to see any record, and data scientists can still do feature engineering, test and debug code with it. Of course, synthetic data is not something you would want to build insights or ML models on; you’d use the API to do that on the original data.

How it works: the app is deployed in the client’s cloud infrastructure (any cloud vendor is compatible). The data admin lists relevant data sources from the UI or the API, and grants learning access to practitioners by applying a privacy policy chosen from predefined templates. The synthetic data sample is automatically generated. From there, data scientists can run their analyses with their usual tools (pandas, numpy, TF, scikit-learn, Metabase, Redash, Tableau…), whether from a Python SDK or a hiveSQL connector.

Curious? We have released a self-serve demo for you to try it out. It lets you make a dataset available from the Sarus proxy, set up access policies and then, as a data practitioner, use it for analytics and machine learning. It is limited to a handful of datasets but should give you a good understanding of Sarus. You can sign up at https://demo.sarus.tech/signup and begin using Sarus for free, no credit card required (tutorial on https://www.sarus.tech/post/we-just-released-an-open-demo-tr...).

Our model is a software license to run on our clients’ cloud. Our pricing is on a per-dataset per-month basis and starts at $600/month.

Please let us know what you think! We look forward to hearing your questions, feedback, ideas, and experience!

14 comments

ZeroCool2u 1553 days ago

So, I work in an org that has truly sensitive data and this has been a barrier for us more times than I can count, so this is obviously very interesting to us and something we've thought about a lot. A couple questions I have are:

1. How well does Sarus work with data that is not in a database, like unstructured data such as documents/text?

2. How does Sarus handle 'legacy' DBs, where the schema for a table might not be quite right, but due to operational constraints these schemas can't be easily corrected? The canonical example I'm thinking of is date times that have been specified as strings and no one bothered to change them.

3. What kind of language support exists for interacting with the Sarus proxy? Obviously, you have Python support, but large enterprises that might need Sarus often have a few languages that are popular internally and all need equal support. I think the comprehensive list of analytics languages in use in large orgs would look something like [python, R, Julia, Matlab, SAS]. Rust/C++ support would be ideal as well, as they're commonly used in Python/R to accelerate hot code. Do you have plans to develop SDKs? Would they be hand crafted or do you plan to develop generated SDKs similar to how GCP does it?

4. Are you moving to get any security certs? Of course you're a startup right now, but I know from experience enterprise orgs will still blindly ask questions like, "Are you FedRAMP Moderate/High certified?" (This doesn't even make sense for your sales model and I'm certain you'll still have to answer this question and explain why over and over.) or "Do you have a SOC 2 Type 2 report we can look at?". The orgs that actually need something like this are going to be asking these questions pretty quickly.

5. When I use Sarus, do I have to use your IDE/interface? One of the things I noticed when looking at your demo gifs is there is a lot of use of notebooks, which of course are popular, but you'll be met with a lot of resistance if your users can't use the tooling they prefer (PyCharm Pro / DataGrip plugin to interact with DBs in my team's case).

6. How exactly is Sarus deployed? Terraform? Is it a containerized application? Does it scale vertically or horizontally? Can its logging mechanism integrate with StackDriver, Splunk, or Cloudtrail?

7. Have you proved out the technology with more complex time series data? I'm thinking of sensitive trading data.

8. Do you provide benchmarks showing that a model trained on a real dataset is equivalent in performance to a model trained on the synthetic dataset?

Super cool product and you're in a great position to make a ton of money if you nail the execution and get some large customers!