Wow, this could be a total game changer. You have to be really observant about the bugs, though: I would have totally missed the one with the price discount without executing it.
Similar problem to automated driving - as long as it's better than most humans, occasional bugs will be ok. Virtually no software is bug free.
It's a much more difficult problem than automated driving, though: for software, the space of user intents is orders of magnitude larger. It's the job of the model to determine the intent of the "programmer". Perhaps we could meet the model halfway and come up with a heavily structured natural language to communicate intent.
Programming skill is directly correlated with a programmer's ability to debug. I will go as far as stating that programming is not about writing code, but about the ability to find bugs and figure out how to resolve them.
I've always found it easier to debug code I wrote, mostly because I find it easier to read, since I understand the author's intent.
But as the programmer finds bugs and corrects them, they are simultaneously generating additional, accurate data for training. So over time this should, in theory, improve.
I notice that the bug was in the user's failure to communicate the intent of the scalar. Presumably with regular use users would learn to be more clear and/or anticipate the likely fixes to ambiguous labels.
Also, since it would be used to build tests as well, I'd expect such misunderstandings to be pretty obvious. I would be willing to bet you'd see a net reduction in bugs, and a substantial reduction in typo-related bugs.
But if by lowered barrier to entry you mean the population of programmers would be less competent, then yes, bugs in the design might increase. However, being able to get to the point of evaluating a design more quickly is a great way to learn better design.
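To make the scalar ambiguity and the point about tests concrete, here is a rough sketch of what I have in mind. It is not the code from the demo: the function names, the price, and the scalar value are all made up for illustration. The idea is just that "apply a discount of 0.25" can be read two ways, and that running it with a known input makes the mismatch obvious immediately.

```python
# Hypothetical illustration only; none of these names come from the demo.

def discount_as_multiplier(price: float, scalar: float) -> float:
    """Reading 1: the scalar is the fraction of the price you still pay."""
    return price * scalar


def discount_as_fraction_off(price: float, scalar: float) -> float:
    """Reading 2: the scalar is the fraction taken off the price."""
    return price * (1 - scalar)


price, scalar = 100.0, 0.25

# Same inputs, very different results depending on the assumed intent.
print(discount_as_multiplier(price, scalar))    # 25.0
print(discount_as_fraction_off(price, scalar))  # 75.0

# A one-line check stating the intent ("25% off a 100.0 item costs 75.0")
# passes for reading 2 and would fail for reading 1.
assert discount_as_fraction_off(price, scalar) == 75.0
```

Naming the parameter something like fraction_off instead of a bare scalar would probably have steered the model (or a human reader) to the right reading in the first place.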