

by maximeago 1561 days ago

1. Sarus works on data that is organized in records. The intuition is that no single record should show through in the results (which protects the privacy of the person behind it), while studying all records together should remain possible. The data may be flat files, Parquet files, etc., but we do need this record-level organization. A record may contain text or image columns; Sarus will work fine with those. We have never worked on PDF documents. Conceptually it could work, but that is quite far down the road.

2. Sarus has connectors to the main databases, and we add more as we encounter new ones. The basic assumption is that the experience should be the same as working on the data in its original form. For instance, if your data is in a CSV with a weird date format, you will be able to (i) get synthetic data with this same weird date format, and (ii) apply Python code that transforms this weird date format into something more conventional and use that reformatted version. When running your data job, Sarus will apply your preprocessing code and take it from there.

3. Today we have a Python SDK and a SQL connector. Both leverage the same low-level API. We may build SDKs for other languages but haven't started doing so.

4. Indeed, we don't have any certification yet, but we are looking into getting some soon. We are about to start SOC 2, for instance. This is somewhat less of a requirement since we never host any of our clients' data. Of course, everything that helps get the green light from the IT security team is useful.

5. The Python SDK is standard Python code, so you can use it in any Python environment. The notebook is just there to make it more user-friendly in demos. Same for SQL: you can use any SQL querying tool; we did the demo with Metabase.

6. The easiest way is to deploy a Docker image with Docker Compose. It does not scale across multiple machines yet (stay tuned). In that sense, big data sources are only partially supported: if the source is Redshift and you submit a SQL query to the API, we'll rewrite it and send it to Redshift (which scales), but if you want to do ML on the same data, we won't be able to scale in the same way.

7. Complex time series are not a problem for the remote execution part, provided they are stored in a traditional format. That being said, we don't have a specific synthetic data model for time series yet, so that part of the experience will be a bit different.

8. This is a debate we leave to researchers because there is no single answer. It depends directly on the number of records in your dataset and the dimensionality of your data. However, you can set up privacy policies so that the weights of an ML model trained without DP (differential privacy) are allowed to be shared. This is considered acceptable by 99% of compliance teams in the world today, so it's not a huge compromise. If you use Sarus this way, you are guaranteed to get exactly the same performance.
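
To make point 2 a bit more concrete, here is a minimal sketch of the kind of preprocessing code it describes, in plain Python with only the standard library. The date format, the `visit_date` column name, and both function names are made up for illustration; this is not the Sarus SDK API, just the sort of transformation you might hand over to it.

```python
from datetime import datetime

# Hypothetical "weird" source format, e.g. "07~03~2021 14h05" (day~month~year).
WEIRD_FORMAT = "%d~%m~%Y %Hh%M"


def normalize_date(raw: str) -> str:
    """Parse a date in the hypothetical source format and return ISO 8601."""
    return datetime.strptime(raw.strip(), WEIRD_FORMAT).isoformat()


def preprocess(records: list[dict]) -> list[dict]:
    """Return copies of the records with the 'visit_date' column reformatted.

    The input records are left untouched, so the original weird format
    is still available if needed.
    """
    return [{**r, "visit_date": normalize_date(r["visit_date"])} for r in records]
```

The same function works whether the rows it receives are synthetic or real, which is the point of keeping the weird format identical in both.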

Would love to continue the conversation offline of course!