by gregwebs 1519 days ago
Using a customer's production data outside of production probably violates their expectations of your data security practices. I couldn't see myself using this unless there was a mode where only allowed fields are copied and non-id fields are first transformed in a lossy way.
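To make the "allowlist plus lossy transform" mode concrete, here is a rough sketch in plain Python. It is not RepliByte's API: the field names, `ALLOWED_FIELDS`, `ID_FIELDS`, `lossy`, and `sanitize` are all made up for illustration, and the transforms are only examples of throwing information away.

```python
# Hypothetical allowlist: only these columns are ever copied out of production.
ALLOWED_FIELDS = {"id", "email", "signup_date", "country"}
# Identifier columns are copied unchanged so relations between rows still line up.
ID_FIELDS = {"id"}


def lossy(field, value):
    """Reduce a value so the original cannot be rebuilt from the output."""
    text = str(value)
    if field == "signup_date":
        return text[:7]  # keep year and month, drop the day
    if field == "email":
        return "user@" + text.rpartition("@")[2]  # drop the local part, keep the domain
    return text[:1] + "***"  # default: keep the first character only


def sanitize(row):
    """Drop every field that is not allowlisted; lossy-transform the non-id rest."""
    return {
        field: value if field in ID_FIELDS else lossy(field, value)
        for field, value in row.items()
        if field in ALLOWED_FIELDS
    }


row = {
    "id": 7,
    "email": "alice@example.com",
    "signup_date": "2021-03-14",
    "country": "France",
    "ssn": "123-45-6789",  # not allowlisted, so it never leaves
}
print(sanitize(row))
```

The point of the allowlist is that a newly added column is excluded by default, rather than leaking until someone remembers to mask it.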
3 comments

The first thing I checked was whether it supports data masking. It does, via transforms, as evoxmusic already pointed out.

However, there may be times when data masking must be more nuanced. Suppose a random email/domain pair is a bad fit: you would rather replace every instance of the "example.com" domain with the same "fake.com" than with "random1.com", "random2.com", etc. (for ML or third-party analysis, say). I don't see that provided out of the box, HOWEVER I see that you can write a custom transformer: https://github.com/Qovery/replibyte/tree/main/examples/wasm and fulfill your needs.
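For what it's worth, the consistent mapping described above could look roughly like this. It is a sketch in plain Python, not the RepliByte transformer interface (the linked wasm example is the place to look for that); `SECRET`, `_token`, and `mask_email` are hypothetical names. It uses a keyed hash so the same real domain always maps to the same fake domain, and `.test` because that TLD is reserved and never resolves.

```python
import hashlib
import hmac

# Hypothetical secret; it would stay inside the production environment.
SECRET = b"replace-with-a-real-secret"


def _token(value):
    """Stable 10-hex-character pseudonym for a value, keyed by SECRET."""
    digest = hmac.new(SECRET, value.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:10]


def mask_email(email):
    """Mask an address so that equal domains map to equal fake domains."""
    local, sep, domain = email.rpartition("@")
    if not sep:
        # No "@" found: treat the whole string as an opaque value.
        return "u" + _token(email)
    return f"u{_token(email)}@d{_token(domain)}.test"


for address in ["alice@example.com", "bob@example.com", "carol@other.org"]:
    print(address, "->", mask_email(address))
```

Grouping by domain still works on the masked data, since alice and bob end up on the same fake domain while carol does not. The trade-off is that anyone holding the secret can confirm a guessed address, so the key needs the same care as the data.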

Excellent :)

That's why RepliByte uses S3 to store transformed data. Then the real production data never leave the production environment. https://github.com/Qovery/replibyte/blob/main/docs/DESIGN.md
The doc refers to S3 as an intermediary: the data is then going to be loaded into a different DB, where all these issues still exist. The intermediary adds one more place where data could be leaked.
And not using the chaos of the real world to harden your software before it hits production probably violates their expectations of correctness. It's a real and interesting tension! Software that helps people walk this line well is valuable.
Cannot agree more. There are tons of edge cases that only happen with the chaotic blob that we call production data. Tools like this, or any other that helps anonymize data, are so useful for debugging issues.