Launch HN: Sarus (YC W22) – Work on sensitive data with differential privacy
136 points by maximeago 1553 days ago
Hi HN! Maxime, Nicolas, and Vincent here, founders of Sarus (https://www.sarus.tech). Sarus is a privacy engineering software that lets data scientists work on data without the need to access it. It works like a proxy between the practitioner and the data. All queries and data processing jobs are executed on the original data with the privacy guarantees of differential privacy.

When data is sensitive, getting access can be a huge pain. It means going through a long manual validation process that includes designing and implementing an appropriate data anonymization scheme. It takes weeks to months, and some data utility may be lost to the masking requirements.

Sarus makes all of it irrelevant by letting analysts work on data that is never accessed. Analysts only access outputs of their data jobs, and those can be protected with appropriate privacy measures.

With past lives in healthtech, finance, and marketing, we’ve experienced first-hand that data governance has taken a huge part in data operations. It’s a rightful objective to protect data but it should not have to hamstring all innovation. For most data science or analytics objectives, the analyst has no interest in the information of a given individual. They look for patterns that are valid across the dataset. Access to user-level information is just an unfortunate way to get there.

We decided to build Sarus so that data access is no longer a requirement.

The Sarus API proxies all queries, compiles them into a privacy-safe version, runs them on the original data (which never moves outside of our clients’ infrastructure) and outputs the protected results to the practitioner. The protection relies on differential privacy, a mathematical definition of privacy already used by leading tech companies. Differential privacy works by adding calibrated randomness to outputs so that the information of any given individual cannot be inferred. One of its main benefits is that it does not make any assumption on what is sensitive in the data or what the recipient of the output may already know or do. This is the ideal candidate for replacing all manual data governance processes by something fully automated. Each query gets rewritten by Sarus in a way that implements its core principles.
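
To illustrate what "adding calibrated randomness" means, here is a minimal sketch of the textbook Laplace mechanism applied to a count. It is not Sarus's implementation: the function names, the example data, and the choice of epsilon are assumptions made for illustration, and plain floating-point noise like this is not hardened against known attacks on naive samplers.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Zero-mean Laplace sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of matching records.

    Adding or removing one record changes the true count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical usage: how many people are 40 or older?
ages = [23, 35, 41, 52, 67, 29, 38]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0, rng=random.Random())
```

A smaller epsilon means a larger noise scale, so each individual answer is less accurate but reveals less about any single record.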

For the core primitives of differential privacy, we leverage the latest research (Dwork & Roth 2014, Abadi 2016, Dong 2019, Koskela 2020 or Wilson 2019) and open source implementations (tensorflow-privacy, Google Differential Privacy, OpenDP, Smartnoise). Our key contribution is to bundle everything into an API that can be queried without seeing the data in the first place. It requires proper privacy accounting (we use PLD accounting as in Koskela 2020) but also setting all the technical parameters that are required by the framework (estimating range of input data, allocating privacy budget across computation steps…). We also optimize the privacy utility trade-off by memoizing previous queries as much as possible.
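
A rough sketch of how range clamping, budget tracking, and memoization fit together is below. It deliberately uses basic sequential composition (epsilons simply add up) rather than the PLD accounting mentioned above, and it assumes the value bounds are supplied by the data admin instead of being estimated privately. The class `BudgetedProxy` and all of its parameters are hypothetical.

```python
import random


class BudgetedProxy:
    """Hypothetical proxy answering noisy bounded sums under a total epsilon budget."""

    def __init__(self, values, lower, upper, total_epsilon, seed=None):
        self._values = values
        self._lower, self._upper = lower, upper   # assumed public bounds
        self.remaining = total_epsilon
        self._cache = {}
        self._rng = random.Random(seed)

    def noisy_sum(self, epsilon):
        key = ("sum", epsilon)
        if key in self._cache:
            # Replaying an already released answer consumes no extra budget.
            return self._cache[key]
        if epsilon > self.remaining:
            raise PermissionError("privacy budget exhausted")
        # Clamp each value so one record can shift the sum by a bounded amount.
        clamped = [min(max(v, self._lower), self._upper) for v in self._values]
        sensitivity = max(abs(self._lower), abs(self._upper))
        scale = sensitivity / epsilon
        noise = self._rng.expovariate(1.0 / scale) - self._rng.expovariate(1.0 / scale)
        self.remaining -= epsilon                 # basic sequential composition
        self._cache[key] = sum(clamped) + noise
        return self._cache[key]
```

Asking the same question twice returns the cached answer, which is the memoization idea: the budget is only spent on genuinely new releases.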

Wait, but the first thing data scientists do is to check out the data, how do I do that now? Not a problem, the API provides synthetic data samples with the same schema and statistical distribution by default. It effectively replaces the need to see any record, and data scientists can still do feature engineering, test and debug code with it. Of course, synthetic data is not something you would want to build insights or ML models on, you’d use the API to do that on the original data.

How it works: the app is deployed in the cloud infrastructure (any cloud vendor is compatible). The data admin lists relevant data sources from the UI or the API, and grants learning access to practitioners by applying a privacy policy among predefined templates. The synthetic data sample is automatically generated. From there, data scientists can run their analyses with their usual tools (pandas, numpy, TF, scikit-learn, Metabase, Redash, Tableau…), whether from a python SDK or a hiveSQL connector.

Curious? We have released a self-serve demo for you to try it out. It lets you make a dataset available from the Sarus proxy, set up access policies and then, as a data practitioner, use it for analytics and machine learning. It is limited to a handful of datasets but should give you a good understanding of Sarus. You can sign up at https://demo.sarus.tech/signup and begin using Sarus for free, no credit card required (tutorial on https://www.sarus.tech/post/we-just-released-an-open-demo-tr...).

Our model is a software license to run on our clients’ cloud. Our pricing is on a per-dataset per-month basis and starts at $600/month.

Please let us know what you think! We look forward to hearing your questions, feedback, ideas, and experience!

14 comments

So, I work in an org that has truly sensitive data and this has been a barrier for us more times than I can count, so this is obviously very interesting to us and something we've thought about a lot. A couple questions I have are:

1. How well does Sarus work with data that is not in a database, like unstructured data such as documents/text?

2. How does Sarus handle 'legacy' DB's, where the schema for a table might not be quite right, but due to operational constraints these schemas can't be easily corrected? The canonical example I'm thinking of is date times that have been specified as strings and no one bothered to change them.

3. What kind of language support exists for interacting with the Sarus proxy? Obviously, you have Python support but for large enterprises that might need Sarus oftentimes there are a few languages that are popular internally and all need equal support. I think the comprehensive list of analytics languages in use in large orgs would look something like, [python, R, Julia, Matlab, SAS]. Rust/C++ support would be ideal as well as they're commonly used in Python/R to accelerate hot code. Do you have plans to develop SDK's? Would they be hand crafted or do you plan to develop generated SDK's similar to how GCP does it?

4. Are you moving to get any security certs? Of course you're a startup right now, but I know from experience enterprise orgs will still blindly ask questions like, "Are you FedRamp Moderate/High certified?" (This doesn't even make sense for your sales model and I'm certain you'll still have to answer this question and explain why over and over.) or "Do you have a Soc 2 Type 2 report we can look at?". The orgs that actually need something like this are going to be asking these questions pretty quickly.

5. When I use Sarus, do I have to use your IDE/interface? One of the things I noticed when looking at your demo gifs is there is a lot of use of notebooks, which of course are popular, but you'll be met with a lot of resistance if your users can't use the tooling they prefer (PyCharm Pro / DataGrip plugin to interact with DB's in my team's case).

6. How exactly is Sarus deployed? Terraform? Is it a containerized application? Does it scale vertically or horizontally? Can its logging mechanism integrate with StackDriver, Splunk, or Cloudtrail?

7. Have you proved out the technology with more complex time series data? I'm thinking of sensitive trading data.

8. Do you provide benchmarks showing that a model trained on a real dataset is equivalent in performance to a model trained on the synthetic dataset?

Super cool product and you're in a great position to make a ton of money if you nail the execution and get some large customers!

1. Sarus works on data that is organized in records. The intuition is that no single record should be discernible in the results (hence protecting the individual's privacy), but studying all records jointly should be possible. The data may be in flat files, Parquet files, etc., but we do need this record-level organization. Within a given record, there may be columns that are text or images, and Sarus will work fine. We have never worked on PDF documents. Conceptually it could work, but this is quite far down the road.

2. Sarus has connectors to the main databases, and we add more as we encounter them. The basic assumption is that the experience should be the same as working on the data in its original form. For instance, if your data is in a CSV with a weird date format, you will be able to (i) get synthetic data with this same weird date format, and (ii) apply Python code that transforms this weird date format into something more conventional and use that reformatted version. When running your data job, Sarus will apply your preprocessing code and take it from there.

3. Today we have a python SDK and a SQL connector. Both leverage the same low-level API. We may build other SDKs for other languages but haven't started doing so.

4. Indeed, we don't have any cert yet but we are looking into getting some soon. We are about to start Soc2 for instance. This is somewhat less of a requirement as we never host any of our clients' data. Of course, everything that helps get the green light of the ITSec team is useful.

5. The Python SDK is standard Python code, so you can use it in any Python environment. The notebook is just there to make it more user-friendly in demos. Same for SQL: you can use any SQL querying tool; we did the demo with Metabase.

6. The easiest way is to deploy a Docker image with Docker Compose. It does not scale across multiple machines yet (stay tuned). In that sense, big data sources are only partially supported: if the source is Redshift and you submit a SQL query to the API, we'll rewrite it and send it to Redshift (which scales), but if you want to do ML on the same data, we won't be able to scale the same way.

7. Complex time series is not a problem for the remote execution part provided it is stored in a traditional format. That being said, we don't have a specific synthetic data model for time series yet, so that part of the experience will be a bit different.

8. This is a debate we leave to researchers because there is no single answer. It depends directly on the number of records in your dataset and the dimensionality of your data. However, you can set up privacy policies so that the weights of an ML model trained without DP are allowed to be shared. This is considered acceptable by 99% of compliance teams in the world today, so it's not a huge compromise. If you use Sarus this way, you are guaranteed to have exactly the same performance.
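
As an illustration of point 2 above, here is a minimal sketch of the kind of preprocessing code a practitioner might write and debug against the synthetic sample before submitting it. The field name `created_at`, the list of formats, and both function names are hypothetical, and nothing here is taken from the actual Sarus SDK.

```python
from datetime import datetime

# Assumed legacy formats; a real table would need its own list.
LEGACY_FORMATS = ("%d/%m/%Y %H:%M", "%Y%m%d", "%Y-%m-%dT%H:%M:%S")


def parse_legacy_date(raw: str) -> datetime:
    """Try each assumed format in turn and return the first that parses."""
    for fmt in LEGACY_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def preprocess(record: dict) -> dict:
    """Return a copy of the record with 'created_at' converted to a datetime."""
    cleaned = dict(record)
    cleaned["created_at"] = parse_legacy_date(record["created_at"])
    return cleaned
```

Because the synthetic sample keeps the same odd formats as the source, a function like this can be tested locally and then applied unchanged to the original records.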

Would love to continue the conversation offline of course!

If my model is used to profile a given user such as to maximize revenue from them (my objective is generally increasing with a more accurate classification of a user to the degree that such categories are revenue relevant), does this model still work?

If so, how is it privacy compliant, i.e. suffice the intent of the law in say, EU countries, or will not be identified as "privacy theater" in the US? If not, what do you do in these cases?

Cool to get your take on this.

If the model training is designed to profile just one user, no, the model won't work by design. What you describe is an attack on the privacy of that user and we do want to make sure they fail.

The way differential privacy works with machine learning is that it guarantees that one given record cannot have a significant impact on the weights of the models and therefore on its performance. In the particular case of SGD-based models, the guarantee holds for every step of the descent. A good place to start on the topic is Abadi 2016 (https://arxiv.org/pdf/1607.00133.pdf).
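
For readers who want to see the mechanics, below is a minimal sketch of one DP-SGD step in the style of Abadi 2016, written in plain Python for a least-squares linear model. It only shows the per-example clipping and Gaussian noise; it does no privacy accounting, and the function name and parameters are assumptions for illustration rather than anything from Sarus or tensorflow-privacy.

```python
import math
import random


def dp_sgd_step(weights, batch, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD step; batch is a list of (features, target) pairs."""
    summed = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        grad = [2.0 * error * xi for xi in x]          # per-example gradient
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down so its L2 norm is at most clip_norm.
        factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * factor
    # Gaussian noise calibrated to the clipping bound, added to the sum.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / len(batch) for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

Clipping is what bounds the influence of any single record on the update, and the noise is scaled to that bound, which is why the guarantee holds at every step regardless of what the loss function is trying to do.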

What is important in the approach is that we don't need to detect that there is something funny in the loss function of the model. Sarus uses the exact same approach whether the model or the loss function is malevolent or not. The guarantees still hold. This is important because a lot of models can extract personal information even with no intention of doing so and no real way to detect it.

A good way to think about model performance is that we are looking for models that perform well irrespective of one record. If there are many users that have the same pattern of the user you are trying to spy on, the model may still be good but you won't know whether it's because of that user or not.

Sarus would typically fit in organizations that legitimately collect personal data and can make decisions based on these data. In these cases you don't want to (and most of the time legally cannot) let anyone in the organisation have access to the full personal data records. Using Sarus, anyone, even untrusted parties, can run analyses on your data safely. These analyses can include fitting classification models. You can then accurately classify users to maximize your revenue, as long as you can observe the values to feed into your classifier.
Much needed. Great opportunity. Happy hunting.

> ...the API provides synthetic data samples with the same schema and statistical distribution by default

Neat. Repurposing test data generators. I like it.

> Our model is a software license to run on our clients’ cloud.

Just to confirm my understanding: Sarus never sees the client's data? Cool.

--

I'm fine with differential privacy. I haven't read those most recent papers, so I'm a little out of date.

That said...

Encrypting data at rest at the field level is an important missing piece from the future perfect privacy stack.

Just like how proper password vaults work. Salt, hash, encrypt. Never store the actual password.

The book Translucent Databases details clever examples of this strategy for misc use cases. Never store PII as plaintext.

Translucent databases and differential privacy are orthogonal. I have no ideas on how to productize (or SaaS-atize) translucent strategies.

Absolutely, Sarus never sees the client's data, because the software runs on their infrastructure.

And indeed, data encryption and privacy-preserving analysis (what Sarus does) are quite orthogonal. You may combine them for some use cases (so that data are protected on the machine, but also cannot be re-identified from queries). For example, on our infra (used only for demo, not for clients), by default all data are encrypted at rest by the cloud provider. You could even try to add FHE (fully homomorphic encryption), but that's quite complex (and probably wouldn't support many types of data analysis).

You're correct, Sarus never sees the data. The software runs directly on the data infrastructure of the client. It's typically deployed on the public cloud for instance.

And here, of course, differential privacy only guarantees the data protection in the flow of data between the data source and the data practitioner. It should not be a replacement for other best practices like the ones you mention.

Do you have Linkedin?
Yes, Sarus is on https://www.linkedin.com/company/sarus-technologies, feel free to follow us or add the founders directly.
It's unclear for me from the landing page hero section what the product is/does or what problem does it solve:

"PRIVACY-BY-DESIGN

Time-to-data: from months to minutes

Organizations that use Sarus outperform their peers at execution speed for machine learning and analytics while being more secure "

The product solves the problem of the time it takes to access sensitive data for analytics and machine learning. When you work in a large healthcare or financial organization, each dataset is highly protected. Each time a data practitioner needs to work on it, they may have to wait for months for compliance processes to opine on a data masking strategy and for engineering teams to prepare a data lab and implement this strategy. With Sarus, data practitioners no longer need to access data to do analytics or machine learning on sensitive data assets.

When internal access to personal data is not a concern within an organization, data sharing with external partners certainly is. This process can be avoided just the same.

Hence the promise of taking time-to-data from months to minutes.

Hope that helps clarify.

Thanks, this makes it a lot more clear!

Maybe the hero text could be more clear, explaining in summary what it does (similar to this comment). "Get instant access to sensitive data for analytics and machine learning."

That's a great suggestion actually! We'll definitely work on it and thanks for your help.
This sounds like a genuinely useful product with terrible marketing/copywriting on the landing page. Make your next hire in product marketing.
This person just joined last month! ;)
Very interesting indeed. I read that Sarus has been designed with the data scientist persona in mind. Would it also easily solve internal access for other engineering teams? Basically, allowing engineering teams to create new features, including sensitive data masking across all the local/staging/production environments? Is the Sarus approach also validated by privacy or security authorities?
Sarus is designed for all data use cases, provided that access to a given user's information is not the objective. This is the case for all of BI, analytics, or machine learning. It also works for testing or debugging, building APIs, etc. It resonates with organizations' aspiration for the democratization of data.

Differential privacy provides much better protection than data masking, but most importantly, it does not require any manual decision (which column to mask, how, etc.). This is what makes it easy to apply at scale to all datasets in the data warehouse or data lake instead of having dataset per dataset decision making involved.

Differential privacy is used by Apple, Google, Microsoft, or the US Census. When used properly, the data protection it provides does not need to be proven to regulators or security teams anymore. That being said, regulators do not require DP protection per se. They require organizations to put in place the best practices in terms of data governance, data minimization, or data security as a whole. This is part of the answer.

I think this is interesting but I'm having trouble seeing how it would apply to the sorts of machine learning tasks that are drawing heavy interest in a radiology department. How does it apply to, say, development or testing of image segmentation tools? Quite often vendors want to sell us software and we would very much like to test it at scale on our own data to see whether it's trash or not because procurement is a beast. Does this sort of tool provide that sort of an interface somehow? I can see how it works for tabular data, I'm just not sure how you can guarantee PHI is fuzzed sufficiently in images.
Here is how it would work in theory (not including the scalability question of working with heavy DICOM files and huge DNN). I'm assuming your data is made of records composed by an image and some information about the image or the patient.

The system will generate a fake dataset with the exact same structure and schema (the information on patients is realistic, the images look reasonable and, importantly, have the right encoding, size, etc.). The purpose of this fake data is for the vendor to adjust their algorithm to be able to consume your data as it is. The vendor builds up the preprocessing on the fake data and then submits their data job to the API (say a preprocessing function to be applied on each record and a TensorFlow model to be fitted on the data, or just to measure the performance on the data). The preprocessing code runs on the original records, and the model is trained or validated against the real data. In the end they can prove the value of their model without having to get their hands on the real data.

The problem we generally have is that plugging the vendor's [insert tensorflow model component] into our network seems to always become an operational no-go prior to purchase due to a variety of reasons including intrusiveness and questions about privacy and the vendor's ability to manipulate the process to get access to datasets. So it's actually the preprocessing step that we keep hitting as the pain point. In some cases we generate de-identified datasets for demonstration and testing but it can be very labor intensive.

I've not encountered differential privacy in my work before now, but at least for dealing with metadata in the DICOM it could probably be helpful for some datasets. But it could still be challenging to ensure the IODs are correct (or that known quirks are preserved). Anyway this is very interesting. I have a colleague who is working on some utilization/value research using billing records and I'll show him this.

Thanks! Our goal is that no matter what preprocessing function they pass, they only end up accessing outputs that comply with the privacy policies. The code gets access to the real data but it is shielded from the vendor, who can only see protected outputs. It should address the risk of private information being exposed to them, but for sure, the more sophisticated the preprocessing code is, the more challenging it becomes. Deep learning on DICOM data is pushing the system to the edge a bit.
This looks exciting, thank you for sharing! Implementing differential privacy correctly is famously difficult and easy to get wrong. Will you be making the privacy-critical part of your code available publicly for people to audit?
You are absolutely right, we are leveraging many open-source bricks to build our product, so that they can be reviewed, mainly:

- https://github.com/google/differential-privacy (for basic mechanisms and PLD accounting)

- https://github.com/tensorflow/privacy (for DP-SGD and RDP accounting)

- https://github.com/opendp (for our SQL module)

We actively contribute to some of them.

We also open-sourced some tech bricks we are using:

- https://github.com/sarus-tech/dp-xgboost (see also https://arxiv.org/pdf/2110.12770.pdf)

We plan to continue building trust in the tools we are using by publishing some of them.

Interesting. Is the API hosted with Sarus supposed to be used by in-house analysts? Or 3rd party?

> Our key contribution is to bundle everything into an API that can be queried without seeing the data in the first place.

Without being seen by who?

The API is designed to be hosted by our clients so that the software runs directly on their data infrastructure and no sensitive data leaves their systems. In this demo, it is obviously hosted by us.

A big innovation is that, with Sarus, the data practitioner does not need to see the data and can still manipulate it. Most DP libraries are designed for researchers that have access to the data. They can prepare the data however they like, tune the libraries all they want, and eventually use the library to produce protected outputs from the data. With Sarus, someone who never saw the data, can achieve the same.

Pretty impressive. What do you use for synthetic data generation? Also, you say in the blog post that it works with any type of data. Can you tell a bit more? Does it work for text and images?
We developed our own generative model for synthetic data generation. It is an autoregressive model where each variable is derived from previously generated ones using Transformers networks. If you are interested, you have more details in: https://arxiv.org/pdf/2202.02145.pdf When we say it works on any types of data, we mean: numerical, categorical, text, images and compositions of those types (see the paper).
For synthetic data generation, what methods are they using to sample data from the distribution? What assumptions about the distribution are being made? Does it model correlations between sample attributes that could adversely affect some ML methods (multicollinearity can cause problems)?
We developed our own generative model for synthetic data generation. It is an autoregressive model where each variable/attribute is derived from previously generated ones using Transformer networks (more details here: https://arxiv.org/pdf/2202.02145.pdf). So yes, correlations are modelled, although exact multicollinearity (when there is a linear relationship between a bunch of attributes) would be a bit blurry in the synthetic data.

This being said, the goal of Sarus is to enable analysis on the original data with privacy guarantee on the result (synthetic data is merely used as a tool and a fallback when there is no better solution) so you can write a statistical test to detect multicollinearity and run it on the original data within Sarus.
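
As a rough idea of what such a check could look like, here is a small pairwise-correlation screen in plain Python. It is only a sketch: it catches two-column linear dependence, not general multicollinearity (a variance inflation factor would be needed for that), it assumes no column is constant, and the function names, threshold, and example columns are made up. Run through a system like Sarus, the statistic returned to the analyst would be a protected output rather than the exact value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def collinear_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for column pairs with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical columns: the same height recorded in two units, plus age.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30, 22, 41, 35],
}
suspects = collinear_pairs(columns)
```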

Very interesting! Looking forward to hearing more from you guys. What distinguishes you from other differential privacy companies such as Privitar or Leapyear?
Privitar and Leapyear are indeed part of competition on the more mature side of the spectrum. Even if all three of us use differential privacy, I would say that each company's core value prop is a bit different:

- Sarus: replaces the manual governance of data access by "no-access". Analysts or data scientists can manipulate data without accessing it. The absence of access means that the process is considerably simplified and no longer relies on many manual decisions and controls. Differential privacy is here as a way to automate protection.

- Privitar: it is a more traditional data governance solution. It is all about controls and manual decisions. In their own words, they feature an "unbeatable breadth of privacy techniques". Differential privacy is one of them. They leave it to the privacy professional to make their own implementation decisions, which is exactly what Sarus offers to disrupt.

- Leapyear: it is a data analysis solution powered by differential privacy. It does not seek to replace existing data governance processes. This is why they don't focus on blending into existing data workflows and only offer differential privacy as a way to access data, whereas Sarus can disappear into existing operations without requiring a learning curve on the part of analysts and data scientists.

Thanks a lot, very clear!
FYI: The headings on your Careers page are all in French, e.g. "Qui sommes-nous ? ", but the actual content is all in English.
Thanks for catching it! will fix it.
The demo signup form has a required field "Token" that's blocking me from signing up
You can use Google SSO without a token. If you don't have a Google account, can you contact us with the contact form? we'll send you a token.
The blog post, demo, and website are incredibly uninformative (maybe informative, but not on own product's details). Eventually, pressing on "getting started" goes to a sign up for updates page.
The first link is the corporate website, it may not include all the product details you expected, sorry about that. You should get a lot more details on how it works if you try the tutorial and play with it yourself. This is at the bottom of the post, hopefully it satisfies your curiosity but happy to answer outstanding questions here of course.
Is this a productized Duet[1]? Are you using it under the hood?

(As far as I'm concerned, if the answer is yes to both, this has much potential. I'm just trying to figure out what I'm looking at)

Thank you!

[1]: https://blog.openmined.org/duet-demo-how-to-do-data-science-...

Yes, there are many parallels with Duet; you can look at Sarus as a productized version of it.

There are some differences though:

- We designed for the trusted curator model, whereas Duet is built mostly with federated learning tasks in mind.

- The privacy policies are based on principles (such as: "DP-outputs with epsilon < 2 can be shared", "DP-synthetic data can be shared", or "weights of ML models can be shared"), and the gateway then applies the principles to any query, whether it is a SQL query, an ML model, or something else. In Duet, it's all about manual validation of given queries.
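
To make the idea of principle-based policies concrete, here is a hypothetical sketch of a gateway check that applies the three example principles quoted above to any output. The `Policy` and `Output` classes, the `kind` labels, and `can_share` are invented for illustration and are not the Sarus API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    max_epsilon: Optional[float] = None   # share DP outputs with epsilon below this
    allow_synthetic: bool = False         # share DP-synthetic data
    allow_model_weights: bool = False     # share ML model weights


@dataclass(frozen=True)
class Output:
    kind: str                             # "dp_result", "synthetic_data", "model_weights"
    epsilon: Optional[float] = None       # privacy cost, if the output is DP


def can_share(policy: Policy, output: Output) -> bool:
    """Apply the policy's principles to an output; unknown kinds are refused."""
    if output.kind == "dp_result":
        return (policy.max_epsilon is not None
                and output.epsilon is not None
                and output.epsilon < policy.max_epsilon)
    if output.kind == "synthetic_data":
        return policy.allow_synthetic
    if output.kind == "model_weights":
        return policy.allow_model_weights
    return False


policy = Policy(max_epsilon=2.0, allow_synthetic=True)
```

The point of the design is that the rule is written once and evaluated for every job, instead of a person approving each query by hand.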

The PySyft project is well-documented and researched if you want to learn about the technology.
I'm very familiar with pysyft and tensorflow federated (and Duet which may be the open source basis for this kind of product). I have much interest on the topic and that's why I was seriously scanning the website and tried to understand what the product is exactly. I failed.
Yes, this is a very rich resource. Thx
Is PySyft used under the hood?
No, it is not. PySyft was originally designed mostly for federated learning. Sarus targets organizations that have their data in one central repository in a trusted curator model. It lets external data practitioners query that data with all sorts of data jobs (not just ML, but also SQL analysis, and Spark soon).
side question: Is federated learning used in production? if not why?
No, we don't do federated learning at Sarus today. We operate in the trusted curator model: a party has a centralized database and lets external practitioners leverage it. This is the most common setup in the industry (think hospitals, health insurance companies, banks, streaming services...).

That being said, Sarus can be used to protect one node of a federated learning network. For instance each hospital could have a Sarus instance. The data scientist would need to take care of the orchestration of the nodes themselves but the Sarus API would make their life easy to interact with each data source, especially if all the sources are not identical.

It looks like someone asked an AI model to generate a website for the next YC company

this page specifically https://www.sarus.tech/solutions just screams "UX is an afterthought"