HN Mirror



	by gregwebs 1519 days ago
	Using a customer's production data outside of production probably violates their expectations of your data security practices. I couldn't see myself using this unless there was a mode where only allowed fields are copied and non-id fields are first transformed in a lossy way.
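A minimal sketch of the kind of mode described above, assuming rows are plain dictionaries. This is not RepliByte's API: `ALLOWED_FIELDS`, `ID_FIELDS`, `SECRET_KEY`, `lossy`, and `sanitize_row` are hypothetical names chosen for illustration. Only allowlisted fields are copied, and every non-id field is replaced by a truncated keyed hash, so the original value is not carried over.

```python
import hashlib
import hmac

# Hypothetical allowlist: only these columns are copied at all.
ALLOWED_FIELDS = {"id", "email", "signup_year"}
# Hypothetical identifier columns that are kept verbatim.
ID_FIELDS = {"id"}
# Assumption: a secret that stays inside the production environment.
SECRET_KEY = b"replace-with-a-real-secret"


def lossy(value):
    """One-way transform: keep a short keyed-hash prefix, drop the original."""
    digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def sanitize_row(row):
    """Drop fields that are not allowlisted; lossy-transform non-id fields."""
    out = {}
    for key, value in row.items():
        if key not in ALLOWED_FIELDS:
            continue  # never copied out of production
        out[key] = value if key in ID_FIELDS else lossy(value)
    return out
```

The same input always maps to the same output, so joins on transformed columns still line up, while fields missing from the allowlist never leave the source database.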

3 comments

jve 1519 days ago

The first thing I checked was whether it supports data masking. And it does, via transforms, as evoxmusic already pointed out.

However, there may be times when data masking needs to be more nuanced. Suppose randomizing each email/domain pair is undesirable, and you would rather replace every instance of the "example.com" domain with "fake.com" instead of "random1.com", "random2.com", etc. (for ML or third-party analysis). I don't see that provided out of the box, HOWEVER I see that you can write a custom transformer to fulfill your needs: https://github.com/Qovery/replibyte/tree/main/examples/wasm

Excellent :)
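The consistent-domain idea can be sketched in plain Python. This is only an illustration of the masking logic, not the interface of the WASM transformer linked above; `mask_email` and its `overrides` parameter are hypothetical. Every address on the same source domain gets the same replacement domain, and explicit overrides such as "example.com" to "fake.com" take priority.

```python
import hashlib


def mask_email(email, overrides=None):
    """Mask an email so that equal domains always map to equal fake domains."""
    overrides = overrides or {}
    local, sep, domain = email.rpartition("@")
    if not sep:
        # Not an email address; this sketch leaves it unchanged.
        return email
    domain = domain.lower()
    if domain in overrides:
        # Explicit mapping, e.g. {"example.com": "fake.com"}.
        new_domain = overrides[domain]
    else:
        # Deterministic placeholder derived from the original domain.
        tag = hashlib.sha256(domain.encode("utf-8")).hexdigest()[:8]
        new_domain = "d" + tag + ".test"
    # The local part is also replaced deterministically.
    user = hashlib.sha256(local.encode("utf-8")).hexdigest()[:8]
    return "u" + user + "@" + new_domain
```

For example, `mask_email("alice@example.com", {"example.com": "fake.com"})` and the same call for `bob@example.com` both end in `@fake.com`, so per-domain grouping survives the masking. An unsalted hash is used here only for brevity; a keyed hash would be harder to reverse by guessing.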


evoxmusic 1519 days ago

That's why RepliByte uses S3 to store the transformed data. That way the real production data never leaves the production environment. https://github.com/Qovery/replibyte/blob/main/docs/DESIGN.md


gregwebs 1519 days ago

The doc refers to S3 as an intermediary: the data is then loaded into a different DB, where all of these issues still exist. The intermediary adds one more place where data could be leaked.


closeparen 1519 days ago

And not using the chaos of the real world to harden your software before it hits production probably violates their expectations of correctness. It's a real and interesting tension! Software that helps people walk this line well is valuable.


pyr0hu 1519 days ago

Cannot agree more. There are tons of edge cases that only happen with the chaotic blob that we call production data. Tools like this, or any other that helps anonymize data, are so useful for debugging issues.
