The results/numbers aren't interesting because the number of samples is woefully insufficient to draw any conclusions beyond "that's a nice looking dashboard" or maybe "this is a cool idea"
You right, results and numbers are mainly for entertainment purposes. This sample size would allow to analyze main reasoning failure modes and how often they occur.
You right, results and numbers are mainly for entertainment purposes. This sample size would allow to analyze main reasoning failure modes and how often they occur.