by bostik 1525 days ago
> integration tests running on real data, not on mockups.

I can see you are enjoying life outside of a highly regulated industry. Having certain kinds of production data in tests (or feeding it to a test environment) would be a major audit finding in any finance or healthcare company.

Makes for both a blessing and a curse.

2 comments

You can anonymize such data or get the necessary agreements for a small subset. All of which is tricky, but not impossible.

> You can anonymize such data

...thereby destroying the production-like features you wanted it for in testing, which you then need to recreate and reintroduce, so you might as well just synthesize test data in the first place, since that's what you end up doing anyway, in effect.

I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data. Without talking about specifics, anonymizing is just the operation of making deanonymization a lot harder. "Hard enough" is usually specified in some form by the regulator. You can identify an individual by their ECG data, for example, it's just really hard...

No, in actual practice you don't scrub the stuff you actually need to test.

> I know you probably just want to be annoying, but really, there is a world of difference between completely synthetic and anonymized data.

No, I’ve spent ~20 years in healthcare, and this has been a frequently recurring issue.

> No, in actual practice you don't scrub the stuff you actually need to test.

In actual practice, the stuff you really need to test often overlaps with the stuff you are minimally required to scrub to legally deidentify the data. The most common outcome I’ve seen from trying to do this is both doing most of the work of generating synthetic data and failing to legally deidentify the source data.

What I wrote was meant in the context of data science. You cannot do ML without having access to real data, not even in highly regulated industries. Obviously you won't touch PII. But whether the real data is sitting in your train/test/validation data set or you use it for integration tests doesn't make any difference from the perspective of an audit.
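
For what it's worth, here is a minimal sketch of what "not touching PII" could look like before real records reach a test or training set. Everything in it is an assumption for illustration: the field names (`name`, `email`, `phone`, `patient_id`) and the helpers `pseudonymize` and `scrub_record` are hypothetical, and a keyed hash like this is pseudonymization, not something a regulator would necessarily accept as deidentification.

```python
import hashlib
import hmac

# Hypothetical field names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}
LINK_KEYS = {"patient_id"}


def pseudonymize(value: str, secret: bytes) -> str:
    """Keyed hash: the same input and key always give the same token,
    so joins across tables still work without exposing the raw ID."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub_record(record: dict, secret: bytes) -> dict:
    """Drop direct identifiers, tokenize link keys, pass the rest through."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # never copied into the test data
        if field in LINK_KEYS:
            out[field] = pseudonymize(str(value), secret)
        else:
            out[field] = value  # measurement data is kept as-is
    return out


if __name__ == "__main__":
    secret = b"example-key-kept-outside-the-test-environment"
    record = {
        "patient_id": "12345",
        "name": "Jane Doe",
        "email": "jane@example.com",
        "heart_rate": 72,
    }
    print(scrub_record(record, secret))
```

Note that whatever passes through untouched (the `heart_rate` field here, or something like ECG traces) can still be identifying in principle, which is the overlap the replies above are arguing about.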