by ZeroCool2u 1553 days ago
So, I work in an org that has truly sensitive data and this has been a barrier for us more times than I can count, so this is obviously very interesting to us and something we've thought about a lot. A couple questions I have are:

1. How well does Sarus work with data that is not in a database, like unstructured data such as documents/text?

2. How does Sarus handle 'legacy' DBs, where the schema for a table might not be quite right, but due to operational constraints these schemas can't be easily corrected? The canonical example I'm thinking of is datetimes that have been specified as strings and no one bothered to change them.

3. What kind of language support exists for interacting with the Sarus proxy? Obviously, you have Python support, but large enterprises that might need Sarus often have a few languages that are popular internally and all need equal support. I think the comprehensive list of analytics languages in use in large orgs would look something like [Python, R, Julia, MATLAB, SAS]. Rust/C++ support would be ideal as well, as they're commonly used in Python/R to accelerate hot code. Do you have plans to develop SDKs? Would they be handcrafted, or do you plan to develop generated SDKs similar to how GCP does it?

4. Are you moving to get any security certs? Of course you're a startup right now, but I know from experience enterprise orgs will still blindly ask questions like, "Are you FedRAMP Moderate/High certified?" (This doesn't even make sense for your sales model and I'm certain you'll still have to answer this question and explain why over and over.) or "Do you have a SOC 2 Type 2 report we can look at?". The orgs that actually need something like this are going to be asking these questions pretty quickly.

5. When I use Sarus, do I have to use your IDE/interface? One of the things I noticed when looking at your demo gifs is that there is a lot of use of notebooks, which of course are popular, but you'll be met with a lot of resistance if your users can't use the tooling they prefer (PyCharm Pro / DataGrip plugin to interact with DBs in my team's case).

6. How exactly is Sarus deployed? Terraform? Is it a containerized application? Does it scale vertically or horizontally? Can its logging mechanism integrate with Stackdriver, Splunk, or CloudTrail?

7. Have you proved out the technology with more complex time series data? I'm thinking of sensitive trading data.

8. Do you provide benchmarks showing that a model trained on a real dataset is equivalent in performance to a model trained on the synthetic dataset?

Super cool product and you're in a great position to make a ton of money if you nail the execution and get some large customers!

1 comment

1. Sarus works on data that is organized in records. The intuition is that no single record should be discernible in the results (which protects each record's privacy), while studying all records together should still be possible. The data may be in flat files, Parquet files, etc., but we do need this record-level organization. Within a given record, some columns may contain text or images, and Sarus will work fine. We have never worked on PDF documents. Conceptually it could work, but this is quite far down the road.

2. Sarus has connectors to the main databases and we add more as we encounter them. The basic assumption is that the experience should be the same as working on the data in its original form. For instance, if your data is in a CSV with a weird date format, you will be able to (i) get synthetic data with this same weird date format, and (ii) apply Python code that transforms this weird date format into something more conventional and use that reformatted version. When running your data job, Sarus will apply your preprocessing code and take it from there.

3. Today we have a Python SDK and a SQL connector. Both leverage the same low-level API. We may build SDKs for other languages but haven't started doing so.

4. Indeed, we don't have any certifications yet, but we are looking into getting some soon. We are about to start SOC 2, for instance. This is somewhat less of a requirement since we never host any of our clients' data. Of course, everything that helps get the green light from the ITSec team is useful.

5. The Python SDK is standard Python code, so you can use it in any Python environment. The notebook is just there to make the demos more user-friendly. The same goes for SQL: you can use any SQL querying tool; we did the demo with Metabase.

6. The easiest way is to deploy a Docker image with Docker Compose. It does not scale across multiple machines yet (stay tuned). In that sense, big data sources are only partially supported: if the source is Redshift and you submit a SQL query to the API, we'll rewrite it and send it to Redshift (which scales), but if you want to do ML on the same data, we won't be able to scale in the same way.

7. Complex time series is not a problem for the remote execution part provided it is stored in a traditional format. That being said, we don't have a specific synthetic data model for time series yet, so that part of the experience will be a bit different.

8. This is a debate we leave to researchers because there is no single answer. It depends directly on the number of records in your dataset and the dimensionality of your data. However, you can set up privacy policies so that the weights of an ML model trained without DP (differential privacy) are allowed to be shared. This is considered acceptable by 99% of compliance teams in the world today, so it's not a huge compromise. If you use Sarus this way, you are guaranteed to have exactly the same performance.
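
To make point 2 a bit more concrete, here is a minimal sketch of the kind of preprocessing code described there, written in plain Python with only the standard library. It does not use the Sarus SDK and says nothing about how Sarus applies such code internally. The function names, the `created` column, and the format strings are all hypothetical and would need to be adapted to whatever the legacy column actually contains.

```python
from datetime import datetime

# Hypothetical legacy formats (assumption): replace with the formats
# actually found in your string-typed datetime column.
LEGACY_FORMATS = ("%d/%m/%Y %H:%M", "%Y%m%d-%H%M%S", "%Y-%m-%d")


def parse_legacy_datetime(raw):
    """Return a datetime for a legacy string, or None if no format matches."""
    if raw is None:
        return None
    text = raw.strip()
    for fmt in LEGACY_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


def normalize_records(records, column):
    """Return copies of the records with `column` rewritten as ISO 8601.

    Values that cannot be parsed become None instead of raising, so one
    bad row does not abort the whole job.
    """
    cleaned = []
    for record in records:
        parsed = parse_legacy_datetime(record.get(column))
        new_record = dict(record)  # leave the input record untouched
        new_record[column] = parsed.isoformat() if parsed else None
        cleaned.append(new_record)
    return cleaned


# Illustrative rows (made up for this example).
rows = [
    {"id": 1, "created": "31/12/2019 23:59"},
    {"id": 2, "created": "20200102-030405"},
    {"id": 3, "created": "not a date"},
]
normalized = normalize_records(rows, "created")
```

Because the function only depends on the shape of each record, the same code can be tried on synthetic data that keeps the odd date format before being submitted against the real data.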

Would love to continue the conversation offline of course!