Hacker News
by exdsq 1518 days ago
You’d imagine Postgres or whatever would have a built-in function to populate a DB based on types, as a sort of fuzzing tool, tbh.

I worked on a gov app years ago that required anonymized databases, and I remember thinking the same thing then: why isn’t it available out of the box? Everyone must need this from time to time.


I built a dataset anonymizer at JPMorgan years ago. There’s a surprising amount of nuance needed just to do a decent job generating fake data that is valid against the schema, let alone data that’s statistically faithful to the real thing.
Does it need to be statistically faithful or can it just fuzz the data type? Maybe that’s better for testing anyway?
* Data that is tightly clustered on certain keys and widely dispersed on other keys can hit some "fun" interactions with sharding regimes, indexes, etc. that random data doesn't.

* Brute forcing a whole bunch of invalid values can be a lot less interesting than lighting up unconventional combinations of valid values.

* Sometimes you're wrong about the validation rules, i.e. you think you know the allowable enum values here but in fact production systems that really exist and have customers behind them are setting other values. Rejecting those would itself be a bug.
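
To make the first point concrete, here is a rough sketch (not anyone's production tool) comparing uniformly random keys with keys that cluster on a few hot values. The Zipf-like weighting, the modulo sharding, and every name in it (`skewed_keys`, `uniform_keys`, `shard_loads`) are assumptions made up for illustration.

```python
import random
from collections import Counter

def skewed_keys(n_rows, n_keys, skew=1.2, seed=0):
    """Draw keys 0..n_keys-1 with Zipf-like weights (rank ** -skew).

    Assumption: real data clusters on a handful of hot keys; the
    exponent here is arbitrary and only meant to show the effect.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_rows)

def uniform_keys(n_rows, n_keys, seed=0):
    """Draw keys uniformly, the way a naive type-based fuzzer would."""
    rng = random.Random(seed)
    return [rng.randrange(n_keys) for _ in range(n_rows)]

def shard_loads(keys, n_shards):
    """Count rows per shard under simple modulo sharding (hypothetical scheme)."""
    counts = Counter(k % n_shards for k in keys)
    return [counts.get(s, 0) for s in range(n_shards)]

skewed = shard_loads(skewed_keys(10_000, 1_000), 8)
uniform = shard_loads(uniform_keys(10_000, 1_000), 8)

print("skewed :", skewed)   # one shard carries far more rows than the rest
print("uniform:", uniform)  # roughly even across shards
```

With uniform keys every shard looks about equally loaded, so a test run on that data would never exercise the hot-shard case that the clustered keys produce.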

Exactly!

A fun example is city and state and country fields in a row!

We tend to see banks or businesses with locations in NYC, NY, USA, and we're quite unlikely to see a business HQ in New York, Hawaii, USA, if such a place even exists.
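
A minimal sketch of the difference, assuming a tiny hand-written lookup table (`VALID_LOCATIONS` is hypothetical; a real tool would load a proper gazetteer): picking each column independently produces mostly impossible rows, while sampling whole tuples keeps the fields consistent.

```python
import random

# Hypothetical table of (city, state/province, country) combinations
# that actually occur together.
VALID_LOCATIONS = [
    ("New York", "NY", "USA"),
    ("Honolulu", "HI", "USA"),
    ("Toronto", "ON", "Canada"),
    ("Austin", "TX", "USA"),
]

def independent_row(rng):
    """Pick each field on its own, as a per-column fuzzer would."""
    cities, states, countries = zip(*VALID_LOCATIONS)
    return (rng.choice(cities), rng.choice(states), rng.choice(countries))

def joint_row(rng):
    """Pick a whole tuple, so the fields stay consistent with each other."""
    return rng.choice(VALID_LOCATIONS)

rng = random.Random(42)
valid = set(VALID_LOCATIONS)

joint = [joint_row(rng) for _ in range(1000)]
independent = [independent_row(rng) for _ in range(1000)]

# Count rows whose combination never appears in the lookup table.
joint_bad = sum(row not in valid for row in joint)
independent_bad = sum(row not in valid for row in independent)

print("impossible rows, joint sampling:      ", joint_bad)
print("impossible rows, independent sampling:", independent_bad)
```

Every independently sampled value is individually valid, which is exactly why column-level validation would not catch a row like ("New York", "HI", "USA").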

Since generating valid values is difficult, it sounds like an opportunity for an open-source fuzzy data generator with rules for different database types. Encode all the lessons shared in those "falsehoods programmers believe" articles about names, addresses, phone numbers, email addresses, credit card numbers, dates, time zones, etc. into one testing tool that everyone can use.

I don't work with databases, so maybe this already exists. :)