| HN Mirror



	by exdsq 1518 days ago
	You’d imagine Postgres or whatever would have a built-in function to populate a DB based on types, as a sort of fuzzing tool. Tbh, I worked on a gov app years ago that required anonymized databases, and I remember thinking that then: why isn’t it available out of the box? Everyone must need this from time to time.


carterschonwald 1518 days ago

I built a dataset anonymizer at JPMorgan years ago. There’s a surprising amount of nuance needed just to do a decent job generating schematically valid fake data, let alone stuff that’s statistically faithful to the true data.


exdsq 1518 days ago

Does it need to be statistically faithful or can it just fuzz the data type? Maybe that’s better for testing anyway?


closeparen 1518 days ago

* Data that is tightly clustered on certain keys and widely dispersed on other keys can hit some "fun" interactions with sharding regimes, indexes, etc. that random data doesn't.

* Brute forcing a whole bunch of invalid values can be a lot less interesting than lighting up unconventional combinations of valid values.

* Sometimes you're wrong about the validation rules, i.e. you think you know the allowable enum values here but in fact production systems that really exist and have customers behind them are setting other values. Rejecting those would itself be a bug.
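
To make the first point concrete, here is a minimal sketch of how clustered keys load shards differently from uniformly random ones. Everything in it is an assumption for illustration: the function names, the Zipf-like weighting, the skew value, and the modulo sharding scheme are all hypothetical and not taken from any particular system.

```python
import random
from collections import Counter


def uniform_keys(n, num_keys, rng):
    """Draw n keys uniformly at random from range(num_keys)."""
    return [rng.randrange(num_keys) for _ in range(n)]


def clustered_keys(n, num_keys, rng, skew=1.2):
    """Draw n keys with Zipf-like weights, so a few keys dominate.

    The weight of key k is 1 / (k + 1) ** skew; the skew value is
    an arbitrary choice for this sketch.
    """
    weights = [1.0 / (k + 1) ** skew for k in range(num_keys)]
    return rng.choices(range(num_keys), weights=weights, k=n)


def shard_load(keys, num_shards):
    """Count rows per shard under a naive key % num_shards scheme."""
    counts = Counter(k % num_shards for k in keys)
    return [counts.get(s, 0) for s in range(num_shards)]


rng = random.Random(0)  # fixed seed so runs are repeatable
uniform_load = shard_load(uniform_keys(10_000, 1000, rng), 8)
clustered_load = shard_load(clustered_keys(10_000, 1000, rng), 8)

print("uniform:  ", uniform_load)
print("clustered:", clustered_load)
```

With uniform keys every shard ends up with roughly the same row count, while the clustered keys pile onto whichever shard owns the hottest key. That kind of hot spot never shows up if the test data is purely random.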


carterschonwald 1507 days ago

Exactly!

A fun example is city, state, and country fields in a row!

We tend to see banks or businesses with locations in NYC, NY, USA, and we're quite unlikely to see a business HQ in New York, Hawaii, USA, if that place even exists.
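
A rough sketch of the difference, assuming a small hand-written lookup table: sampling each field independently produces rows like the "New York, Hawaii" case, while sampling whole (city, state, country) tuples keeps the fields consistent. The table, the weights, and the function names here are made up for illustration.

```python
import random

# Hypothetical reference table of combinations that really occur.
PLACES = [
    ("New York", "NY", "USA"),
    ("Honolulu", "HI", "USA"),
    ("Chicago", "IL", "USA"),
    ("Toronto", "ON", "Canada"),
]
# Made-up relative frequencies, one per entry in PLACES.
WEIGHTS = [5, 1, 3, 2]


def independent_row(rng):
    """Pick city, state, and country separately (type-valid only)."""
    city = rng.choice([p[0] for p in PLACES])
    state = rng.choice([p[1] for p in PLACES])
    country = rng.choice([p[2] for p in PLACES])
    return (city, state, country)


def correlated_row(rng):
    """Pick one whole tuple, so the three fields always agree."""
    return rng.choices(PLACES, weights=WEIGHTS, k=1)[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
valid = set(PLACES)
independent = [independent_row(rng) for _ in range(200)]
correlated = [correlated_row(rng) for _ in range(200)]

bad = [row for row in independent if row not in valid]
print(f"{len(bad)} of 200 independent rows are impossible places")
print("example:", bad[0] if bad else None)
```

Every value in the independent rows is individually valid for its column, which is exactly why this kind of error slips past type-level fuzzing.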


cpeterso 1518 days ago

Since generating valid values is difficult, it sounds like an opportunity for an open-source fuzzy data generator with rules for different database types. Encode all the lessons shared in those "falsehoods programmers believe" articles about names, addresses, phone numbers, email addresses, credit card numbers, dates, time zones, etc. into one testing tool that everyone can use.

I don't work with databases, so maybe this already exists. :)
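
For a sense of what the simplest version of such a tool might look like, here is a toy sketch that maps column types to generator functions. It is not an existing library: the `GENERATORS` registry, the type names, and `fake_row` are all hypothetical, and it only fuzzes by type, with none of the name, address, or time zone rules mentioned above.

```python
import random
import string
from datetime import date, timedelta

# Hypothetical registry: column type name -> function producing a value.
# Domain rules (names, addresses, phone numbers) would be added as
# further entries; none are implemented in this sketch.
GENERATORS = {
    "integer": lambda rng: rng.randint(-2**31, 2**31 - 1),
    "boolean": lambda rng: rng.random() < 0.5,
    "text": lambda rng: "".join(
        rng.choice(string.ascii_letters) for _ in range(rng.randint(0, 12))
    ),
    "date": lambda rng: date(2000, 1, 1) + timedelta(days=rng.randrange(10_000)),
}


def fake_row(schema, rng):
    """Build one row for a {column_name: type_name} schema."""
    return {col: GENERATORS[typ](rng) for col, typ in schema.items()}


rng = random.Random(0)  # fixed seed so runs are repeatable
schema = {"id": "integer", "name": "text", "active": "boolean", "joined": "date"}
row = fake_row(schema, rng)
print(row)
```

A registry like this covers the "just fuzz the data type" case. The nuance described upthread (clustered keys, correlated fields, surprising real-world values) is what a shared tool would have to layer on top.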
