Is it expected to fix arbitrary (simple) bugs, or only the bugs in the benchmark set?