| HN Mirror


by nartz 2133 days ago

Hey guys - here's just some critical feedback from a fellow dev - here's my n of 1 perspective - of course this could be a very different perspective for e.g. large enterprise companies struggling with this.

Feedback:

It seems overly complicated. You lost me when you said I have to train models. Are you assuming that software developers want to train machine learning models to do something as simple as creating some test data? In reality - I reach for tools that make things easier for me: ones that don't require reading a ton of documentation or downloading new external tools, and that 'just work'.

It is 100% easier for me to export a little production data to test on (and maybe sanitize), or to write a small script to generate a few users and those things I need to test. Plus - then I know exactly what I'm going to get. A lot of times, after I've done this once, it will work for a good while as well - if I do change the schema, I can add some additional data for that column, and go from there, or otherwise.
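For anyone reading along, the "small script" approach described above might look something like the sketch below. The `users` table, its columns, and the three sample rows are made up for illustration, and SQLite in memory is just a stand-in for whatever database the tests actually use.

```python
import sqlite3

# Hand-written rows, so the test data is exactly what was typed here.
# (Hypothetical schema and values, chosen for illustration only.)
USERS = [
    ("alice", "alice@example.com", "admin"),
    ("bob", "bob@example.com", "member"),
    ("carol", "carol@example.com", "member"),
]


def seed_users(conn):
    """Create the users table if needed, insert the fixed rows, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "email TEXT NOT NULL UNIQUE, "
        "role TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO users (name, email, role) VALUES (?, ?, ?)", USERS
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(seed_users(conn))
```

If the schema later gains a column, the change is one extra field per tuple plus the matching column in the two SQL statements.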

For those companies who have 'messy' fixture data - is the tool the issue? My take is that the difficulty with maintaining the data could contribute to this issue, but is also more an issue of simply bad housekeeping - e.g. rushing and not tending the garden. While your system might handle this, your system also seems to require a different skillset (e.g. specific training/knowledge) than the standard QA developer might have.

If I did use it, I'd prefer it to be much easier to use - if I could include a ruby gem, and incorporate it into the testing process, e.g. an 'after' hook after migrating the db, that would be ideal. Then, I don't really need to know much. However, I would still be concerned about whether this is deterministically creating data or if it's random?

Good luck!

2 comments

openquery 2133 days ago

Thanks for the feedback. This is exactly what we're looking for.

> It is 100% easier for me to export a little production data to test on (and maybe sanitize), or to write a small script to generate a few users and those things I need to test.

In your case it may very well be. But when you are an organization with a schema that has 100+ tables, and these tables have sensitive information scattered across them, this can become a nightmare to manage. I've seen this first hand. Furthermore, if you are trying to generate more than 'a little' data, this can get more complex, as you have to create factories and write a lot of code to make the whole thing coherent and tell a story. I think taking on the added complexity of Synth is a trade-off one should consider depending on the sophistication of the testing data they require.

> If I did use it, I'd prefer it to be much easier to use

I think this misconception may be attributed to the fact that we use machine learning under the hood. We've spent a lot of time abstracting this away from the developer. In fact, you can run the whole lifecycle with 1 line of code:

`synth model new --from-database <database-uri> --train --deploy`

> I would still be concerned about whether this is deterministically creating data or if it's random?

At this point you can choose. You can either pick a seed with which the whole generation process starts (this may not be in production yet) or elect to randomly seed it.
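To illustrate the seeded-versus-random choice in general terms, here is a minimal sketch using Python's standard `random` module. It is not Synth's API: `generate_orders` and its fields are hypothetical names invented for this example.

```python
import random


def generate_orders(n, seed=None):
    """Return n fake order rows.

    With an integer seed the output is reproducible; with seed=None the
    generator is seeded from system entropy and differs between runs.
    (Hypothetical helper for illustration, not part of Synth.)
    """
    rng = random.Random(seed)
    return [
        {"order_id": i + 1, "amount_cents": rng.randint(100, 100_000)}
        for i in range(n)
    ]


if __name__ == "__main__":
    # Same seed, same data: useful when a test must be repeatable.
    print(generate_orders(3, seed=42) == generate_orders(3, seed=42))
    # No seed: fresh data on each run.
    print(generate_orders(3))
```

Using a private `random.Random` instance rather than the module-level functions keeps the generated data independent of any other code that touches the global random state.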

Thanks for the great questions :)


treis 2133 days ago

>things I need to test

I think this is the biggest problem. I don't need a lot of random data in my database. I need a lot of specific scenarios set up. And a way to get those scenarios back after I test something.

I've definitely been in a lot of situations where test data is a problem. A particularly egregious one that comes to mind is the poor developer that had to develop the fraud functionality. Marking an account as fraud nuked it in the back end. Lots of angry testers/developers when their favorite test account got marked as fraud.
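One common way to get a scenario back after a destructive test is to run the test inside a savepoint and roll back afterwards. The sketch below shows the idea with SQLite; the `accounts` table, the `restorable` helper, and the 'fraud' status value are assumptions made up for illustration, and a real system with side effects outside the database would need more than this.

```python
import sqlite3
from contextlib import contextmanager


def setup_scenario(conn):
    """Create one specific scenario: a single active test account (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, status TEXT)"
    )
    conn.execute(
        "INSERT INTO accounts (owner, status) VALUES ('favorite-test-account', 'active')"
    )


@contextmanager
def restorable(conn, name="scenario"):
    """Run a block inside a savepoint and undo its changes on exit."""
    conn.execute(f"SAVEPOINT {name}")
    try:
        yield conn
    finally:
        conn.execute(f"ROLLBACK TO {name}")  # discard changes made in the block
        conn.execute(f"RELEASE {name}")      # close the savepoint


def account_status(conn):
    return conn.execute("SELECT status FROM accounts WHERE id = 1").fetchone()[0]


if __name__ == "__main__":
    # isolation_level=None puts the connection in autocommit mode, so the
    # savepoint statements above control the transaction explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    setup_scenario(conn)
    with restorable(conn):
        conn.execute("UPDATE accounts SET status = 'fraud' WHERE id = 1")
        print(account_status(conn))  # fraud
    print(account_status(conn))  # active
```

This only protects data touched through the same connection, so it fits per-test isolation better than a shared environment where several testers use the same account.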


openquery 2132 days ago

> I need a lot of specific scenarios set up

Yes, we've seen this quite a lot in the wild. The truth is this is not very well defined - how you get your data to tell a story depends on the story you are trying to tell.

We are trying to come up with a more rigorous framework for abstract representations of 'scenarios'. It's on our roadmap so keep an eye out for this :)
