

by hodgesrm 2138 days ago

I'm unconvinced your approach works beyond a narrow range of use cases. The weakness is the "is this a problem" issue. You have a diff. Is it really significant? If it's significant, how did it arise? You can spend an inordinate amount of time answering those two questions, and you may have to do it again with every run. Diffs are cheap to implement but costly to use over time. That inversion of costs means users may end up bogged down maintaining the existing mechanism and unable to invest in other approaches.

If I were going after the same problem I would try to do a couple of things.

1. Reframe the QA problem to make it smaller. Reducing the number and size of pipelines is a good start. That has a bunch of knock-on benefits beyond correctness.

2. Look at data cleaning technologies. QA on datasets is a variation on this problem. For example, if you can develop predicates that check for common safety conditions on data, such as detecting bad addresses or SSANs, you give users immediately usable quality information. There's a lot more you can do here.
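
To make #2 concrete, here is a rough sketch of what such predicates might look like. The field names `ssn` and `zip`, the `check` helper, and the row-as-dict layout are all made up for illustration and do not come from any particular tool. The SSN rule only rejects numbers that are never issued (area 000, 666, or 900-999, group 00, serial 0000), and the ZIP rule is a format check only, so treat both as a starting point rather than real validation.

```python
import re
from typing import Callable, Dict, Iterable, List

# Hypothetical row shape: a dict with string fields "ssn" and "zip".
Row = Dict[str, str]
Predicate = Callable[[Row], bool]

_SSN_RE = re.compile(r"(\d{3})-(\d{2})-(\d{4})")
_ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def valid_ssn(row: Row) -> bool:
    """True if the SSN is well formed and not in a never-issued range."""
    m = _SSN_RE.fullmatch(row.get("ssn") or "")
    if m is None:
        return False
    area, group, serial = m.groups()
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def valid_zip(row: Row) -> bool:
    """True if the ZIP looks like 12345 or 12345-6789 (format check only)."""
    return _ZIP_RE.fullmatch(row.get("zip") or "") is not None


def check(rows: Iterable[Row], predicates: Dict[str, Predicate]) -> Dict[str, List[int]]:
    """Return, per predicate name, the indexes of rows that fail it."""
    failures: Dict[str, List[int]] = {name: [] for name in predicates}
    for i, row in enumerate(rows):
        for name, pred in predicates.items():
            if not pred(row):
                failures[name].append(i)
    return failures


if __name__ == "__main__":
    sample = [
        {"ssn": "123-45-6789", "zip": "94105"},
        {"ssn": "000-12-3456", "zip": "9410"},
        {"ssn": "912-34-5678", "zip": "94105-1234"},
    ]
    print(check(sample, {"valid_ssn": valid_ssn, "valid_zip": valid_zip}))
```

The point of this shape is that each failure names a specific row and a specific rule, so the "is this a problem" question is answered by the predicate itself instead of by someone staring at a diff.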

Assuming you are working on this project, I wish you good luck. You can contact me at rhodges at altinity dot com if you want to discuss further. I've been dealing with QA problems on data for a long time.

1 comment

hodgesrm 2138 days ago

p.s., to expand on #2: if you can "discover" useful safety conditions on data, you change the economics of testing, much as #1 does.
