| Aider has had an Exercism benchmarking suite for quite some time. Interestingly, my benchmark results of GPT 4 Turbo show an opposite result: the new gpt-4-1106-preview did significantly better on the first try than the March and June models. https://aider.chat/docs/benchmarks-1106.html Aider benchmarks against the 133 Exercism python exercises, not js exercises that mentat's benchmark uses. So this is not an apples-to-apples comparison, but there doesn't seem to be a strong reason to expect qualitatively different results. I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches. https://github.com/AbanteAI/mentat/blob/main/tests/benchmark... https://github.com/paul-gauthier/aider/blob/main/benchmark/p... Edit: Not sure if the mentat authors are in this thread? After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated. It might even be required under aider's Apache 2.0 license? |
> I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.
We were inspired by you to use Exercism as a benchmark, thank you! We will add attribution for that. We switched our original instruction prompts for that benchmark to be similar to Aiders to allow for fair comparison.
> After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated.
We have an unused implementation of your output response format (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/...), but I don't know what else you are seeing? We implemented that to compare with our response formats and didn't find much difference in performance.