by NumberCruncher 1525 days ago
> That doesn't mean an experienced full stack developer would do Data Science better, because he might lack a lot of skills that matter more in that domain.

This resonates with my experience. I had the opportunity to work on a DS codebase written entirely in Scala, with all the typing, parallelism, actor model, and whatnot. I basically joined the company because of this technical factor. It was fun until I figured out that DS there was "typed IF-THEN-ELSE written by Java devs in Scala, returning stuff the users complain about with high reliability within milliseconds". Now I am happy to be back in the single-threaded, untyped Python world. Still no bugs in production, because we validate all requests to death and have unit tests and integration tests running on real data, not on mock-ups. Basically we follow the principle: if the integration test passes, our typing is just right, or at least good enough. Funnily enough, all the typing errors we catch are caused by wrongly typed data coming from the production system, which is written in a typed programming language... what a strange world.
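To make "validate all requests to death" concrete, here is a minimal sketch of runtime request validation in plain Python. It is not the commenter's actual code: the `ScoreRequest` shape, its field names, and the accepted ranges are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRequest:
    # Hypothetical request shape; the field names are assumptions.
    customer_id: str
    age: int
    balance: float


def validate_request(payload: dict) -> ScoreRequest:
    """Reject wrongly typed or out-of-range input before it reaches the model."""
    customer_id = payload.get("customer_id")
    if not isinstance(customer_id, str) or not customer_id:
        raise ValueError("customer_id must be a non-empty string")

    age = payload.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not 0 <= age <= 130:
        raise ValueError("age must be an integer between 0 and 130")

    balance = payload.get("balance")
    if isinstance(balance, bool) or not isinstance(balance, (int, float)):
        raise ValueError("balance must be a number")

    # Normalize to the declared types once validation has passed.
    return ScoreRequest(customer_id, age, float(balance))
```

With a check like this at the boundary, a payload such as `{"customer_id": "c1", "age": "40", "balance": 10}` fails loudly with a `ValueError` instead of silently flowing into the model as a string.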


> integration tests running on real data, not on mock-ups.

I can see you are enjoying life outside of a highly regulated industry. Having certain kinds of production data in tests (or feeding it into a test environment) would be a major audit finding in any finance or healthcare company.

Makes for both a blessing and a curse.

You can anonymize such data or get the necessary agreements for a small subset. All of which is tricky, but not impossible.

> You can anonymize such data

...thereby destroying the production-like features you wanted it for in the first place, which you then need to recreate and reintroduce. So you might as well just synthesize test data from the start, since that's what you end up doing anyway, in effect.

I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data. Without going into specifics, anonymizing is just the operation of making deanonymization a lot harder. "Hard enough" is usually specified in some form by the regulator. You can identify an individual by their ECG data, for example; it's just really hard...
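As a rough illustration of the difference from fully synthetic data, the sketch below replaces direct identifiers with keyed-hash tokens and leaves the measurements untouched. This is pseudonymization only, and it is an assumption that it would count as "hard enough" for any given regulator; the field names (`patient_id`, `name`, `heart_rate`) and the key handling are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Map an identifier to a stable token using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub_record(
    record: dict,
    secret_key: bytes,
    id_fields: tuple = ("patient_id",),  # hypothetical identifier fields
    drop_fields: tuple = ("name",),      # hypothetical direct identifiers to remove
) -> dict:
    """Drop direct identifiers, tokenize join keys, keep measurements as they are."""
    scrubbed = {}
    for field, value in record.items():
        if field in drop_fields:
            continue
        if field in id_fields:
            scrubbed[field] = pseudonymize(str(value), secret_key)
        else:
            scrubbed[field] = value
    return scrubbed
```

Because the same key always yields the same token, records belonging to one person still join across tables, and the measured values keep their real distribution. Whether the remaining fields can still identify someone is exactly the question this sketch does not answer.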

No, in actual practice you don't scrub the stuff you actually need to test.

> I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data.

No, I’ve spent ~20 years in healthcare, and this has been a frequently recurring issue.

> No, in actual practice you don't scrub the stuff you actually need to test.

In actual practice, the stuff you really need to test often overlaps with the stuff you minimally have to scrub to legally deidentify the data. The most common outcome I’ve seen when people try this is that they both take on most of the work of generating synthetic data and fail to legally deidentify the source data.

What I wrote was meant in the context of data science. You cannot do ML without having access to real data, not even in highly regulated industries. Obviously you won't touch PII. But whether the real data is sitting in your train/test/validation data sets or you use it for integration tests doesn't make any difference from the perspective of an audit.