Hacker News
by onion2k 1516 days ago
This project needs a giant heading box in the README stating three things:

- staging databases that hold data generated from production databases should be considered production data, with the same level of consideration for security and access as production.

- staging databases that hold production data are a GDPR violation waiting to happen. Make sure your data controller / lawyers know exactly what you're doing with production data.

- ask yourself why you need production data in staging in the first place. What are you gaining over a script that generates data? If you want data at scale you can generate it randomly. If you want data that covers all edge cases you can generate it non-randomly. If you want "real-looking" data then maybe this tool is useful.

People copying data from production to staging and then failing to look after it properly is a nightmare. It shouldn't be encouraged except in very unusual circumstances. In my experience of dev, your development and staging data should be covering the weird edge cases that you need to handle far more than the nice "happy path" data you get in production.
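To make the "script that generates data" option concrete, here is a minimal sketch in Python. It assumes a hypothetical `users` table with `id`, `name` and `email` columns; the function names and the edge-case list are illustrative only and have nothing to do with replibyte itself. The hand-picked edge cases go in first, then seeded random rows fill up the requested count.

```python
import random
import string

# Hand-picked awkward values; extend with whatever has bitten you before.
EDGE_CASE_NAMES = [
    "",                               # empty string
    " ",                              # whitespace only
    "O'Brien",                        # embedded quote
    "Zoë Müller",                     # non-ASCII characters
    "a" * 255,                        # at a typical VARCHAR limit
    "Robert'); DROP TABLE users;--",  # injection-shaped input
]


def random_user(rng: random.Random, user_id: int) -> dict:
    """Build one plausible-looking but entirely fake user row."""
    name = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 12)))
    return {"id": user_id, "name": name, "email": f"{name}{user_id}@example.com"}


def generate_users(count: int, seed: int = 0) -> list[dict]:
    """Edge cases first (non-random), then random rows to reach `count`."""
    rng = random.Random(seed)  # seeded, so every run produces the same data
    rows = [
        {"id": i, "name": name, "email": f"user{i}@example.com"}
        for i, name in enumerate(EDGE_CASE_NAMES[:count], start=1)
    ]
    rows += [random_user(rng, i) for i in range(len(rows) + 1, count + 1)]
    return rows
```

Calling `generate_users(1_000_000)` gives you scale, and because the generator is seeded a failing test can be reproduced exactly, which a fresh production copy can't promise.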


Do you consider transformed data in staging harmful? (Transformed data = where all the sensitive data have been hidden)
I consider it potentially harmful. Anonymizing data is a hard problem, and what is considered sensitive is not settled. For example, an IP address counts as personal data under the GDPR. Most people don't mask that in their logs though. If you copy records from production that have network information in them (last known IP for example) then your data controller should be very concerned.
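As a rough illustration of masking a "last known IP" column, here is a sketch using Python's standard `ipaddress` module. The function name `mask_ip` and the /24 and /48 prefix lengths are my own assumptions, not anything replibyte does, and a truncated address may still count as personal data depending on context, so treat this as a starting point to discuss with your data controller rather than a guarantee of anonymity.

```python
import ipaddress


def mask_ip(value: str) -> str:
    """Zero the host part of an address: keep /24 for IPv4, /48 for IPv6."""
    addr = ipaddress.ip_address(value)
    prefix = 24 if addr.version == 4 else 48  # assumed prefix lengths
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network.network_address)


print(mask_ip("203.0.113.57"))                  # 203.0.113.0
print(mask_ip("2001:db8:abcd:12:34:56:78:9a"))  # 2001:db8:abcd::
```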

Another major problem with tools like replibyte is that people use them properly at first, and then the database schema changes but nobody updates the script to anonymize the new tables or columns. A few months later someone notices that sensitive data has made its way into staging, into the backups, and into the database dumps devs made to debug things because "it's only staging data, who cares!"
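One way to guard against that drift is to fail the copy whenever the live schema contains a column nobody has explicitly reviewed. The sketch below is hypothetical: the `schema` and `reviewed` dictionaries stand in for whatever you would read from the database catalog and from your anonymization config, and none of it is a replibyte feature.

```python
def unreviewed_columns(
    schema: dict[str, set[str]],
    reviewed: dict[str, set[str]],
) -> set[tuple[str, str]]:
    """Return (table, column) pairs that exist in the schema but were never reviewed."""
    return {
        (table, column)
        for table, columns in schema.items()
        for column in columns - reviewed.get(table, set())
    }


# Hypothetical example: `last_ip` and the whole `sessions` table were added
# after the anonymization rules were last updated.
schema = {"users": {"id", "email", "last_ip"}, "sessions": {"id", "token"}}
reviewed = {"users": {"id", "email"}}

missing = unreviewed_columns(schema, reviewed)
if missing:
    print("Refusing to copy; unreviewed columns:", sorted(missing))
```

The point of the design is that new columns are treated as sensitive by default, so forgetting to update the script blocks the copy instead of leaking the data.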

Protecting user data is something that you need to be extremely vigilant about. In my experience, the less access I have to production data the happier I am. Copying it and using it in staging, even if you're careful about it, fills me with dread.

It makes sense to me and that's why:

  1. Auto-detection of sensitive data is planned

  2. Detecting database schema changes is also planned, to prevent leaking sensitive data.
RepliByte responds to a very common need: a tool that almost every company ends up building internally. The idea is to work collaboratively on a tool that anyone can use and that can be improved to avoid leaking data.
Those are great features to have but they're not in the app yet. This is why there should be a warning in the README to tell users to be careful when they use it now.