by anotherpaulg 954 days ago
Aider has had an Exercism benchmarking suite for quite some time.

Interestingly, my benchmark results for GPT-4 Turbo show the opposite result: the new gpt-4-1106-preview did significantly better on the first try than the March and June models.

https://aider.chat/docs/benchmarks-1106.html

Aider benchmarks against the 133 Exercism Python exercises, not the JS exercises that mentat's benchmark uses. So this is not an apples-to-apples comparison, but there doesn't seem to be a strong reason to expect qualitatively different results.

I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.

https://github.com/AbanteAI/mentat/blob/main/tests/benchmark...

https://github.com/paul-gauthier/aider/blob/main/benchmark/p...

Edit: Not sure if the mentat authors are in this thread? After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated. It might even be required under aider's Apache 2.0 license?

7 comments

Hey Paul, I'm a Mentat author.

> I also notice that the instructions prompt that mentat uses seems to be inspired by the aider benchmark? Glad to see others adopting similar benchmarking approaches.

We were inspired by you to use Exercism as a benchmark, thank you! We will add attribution for that. We switched our original instruction prompts for that benchmark to be similar to aider's to allow for a fair comparison.

> After looking around a bit, there seems to be a bunch of aider code in your repo. Some attribution would be appreciated.

We have an unused implementation of your output response format (https://github.com/AbanteAI/mentat/blob/main/mentat/parsers/...), but I don't know what else you are seeing? We implemented that to compare with our response formats and didn't find much difference in performance.

I didn't spend much time looking, but your benchmark prompting inspired me to search your repo for "aider". The results were 3 PRs where aider was mentioned in the conversations [0].

The "code map" PR in particular mentions being "inspired by aider", links to aider and seems to include a bunch of code from aider's old ctags-based "repo map" implementation. This isn't an insignificant component of an AI coding tool.

Aider is open source and I try to share my learnings as I'm building it. So it's great when other projects get inspiration from aider! But it is polite to provide attribution for such inspiration, especially if you crib from code with an attribution license.

[0] https://github.com/search?q=repo%3AAbanteAI%2Fmentat+aider&t...

I’ve been using the new model with Aider since it was released, and my anecdata agrees: the “edits applied successfully” failure rate is much lower than with classic GPT-4.

Also THANK YOU for Aider! I talk it up to all my programmer friends; it really feels like a glimpse into the future of coding.

Isn't it a good thing that, of the benchmarks they ran, the newer model has fewer of the answers memorized (aka, it's parroting less)?

Wouldn't this actually be exactly proof that the model has improved over its predecessor by having to solve the problem itself rather than rely on memory?

What use is a model that memorizes the answers to all the benchmarks? (See the 7B models on the Open LLM Leaderboard for more info on that.)

I feel like I see this A LOT these days. If you do a Show HN (for example) and your project is directly inspired by somebody else's who came before you, the least you can do is give nominal attribution.

What is it about software development in particular that makes people so seemingly ethically unfettered by blatant plagiarism?

I am also noticing a massive improvement over the old model.

Sorry about that. We updated the blog with attribution and put an attributing comment in our code base where we use your benchmarking prompts. We'll probably delete our implementation of your response format later today since we just had it for benchmarking.

Does aider work with C# at all?

Yes!

Thanks for asking. I've been meaning to address these kinds of questions in the aider FAQ [0]. Here's the entry I just added:

Aider supports pretty much all the popular coding languages. This is partly because GPT-4 is fluent in most mainstream languages, and familiar with popular libraries, packages and frameworks.

In fact, coding with aider is sometimes the most magical when you're working in a language that you are less familiar with. GPT often knows the language better than you, and can generate all the boilerplate to get to the heart of your problem. GPT will often solve your problem in an elegant way using a library or package that you weren't even aware of.

Aider uses tree-sitter to do code analysis and help GPT navigate larger code bases by producing a repository map [1].

Aider can currently produce repository maps for most mainstream languages, listed below. But aider should work quite well for other languages, even without repo map support.

  - C
  - C#
  - C++
  - Emacs Lisp
  - Elixir
  - Elm
  - Go
  - Java
  - JavaScript
  - OCaml
  - PHP
  - Python
  - QL
  - Ruby
  - Rust
  - TypeScript

[0] https://aider.chat/docs/faq.html#what-code-languages-does-ai...

[1] https://aider.chat/docs/repomap.html

I've just started playing with aider this week, and I find it extremely fun and exciting. But I will say that I've had middling results with an Elixir / Phoenix app. I don't think this has anything to do with aider - rather, I think that the GPT models haven't quite internalized the new approaches in Phoenix 1.7, since up until Turbo their training data was fairly old and probably still contains more pre-1.7 Phoenix examples than post-1.7 ones.

In spite of these frustrations, I have had some genuinely amazing moments coding with GPT-4 lately. I recently upgraded to ChatGPT Plus and it's just mindblowing how helpful it can be in the right contexts. I'm hoping that as I get better with aider I might just drop the ChatGPT sub and stick to API usage.

I totally understand the skepticism many have, because this stuff is still a bit finicky - but I'm overwhelmed by a sense of how fucking _cool_ this stuff is quite often.

I was actually wondering this myself yesterday. So it's not possible to plug a different tree-sitter implementation in for a niche language?

It should be possible, but not currently. Aider would need a bit more configurability to be able to load up arbitrary tree-sitter language implementations at runtime.

There's an open issue you might want to follow for updates:

https://github.com/paul-gauthier/aider/issues/321
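
For anyone curious what that kind of configurability might look like, here is a hypothetical sketch of a runtime registry keyed by file extension. None of these names (`register_language`, `tags_for`, `toy_tagger`) exist in aider, and the regex "tagger" stands in for a real tree-sitter grammar. It only illustrates the plug-in shape.

```python
import re
from pathlib import Path
from typing import Callable, Dict, List

# A "tagger" takes source text and returns the names of the definitions it finds.
Tagger = Callable[[str], List[str]]

# Hypothetical registry: file extension -> tagger for that language.
_TAGGERS: Dict[str, Tagger] = {}


def register_language(extension: str, tagger: Tagger) -> None:
    """Register a tagger for files with the given extension (e.g. '.toy')."""
    _TAGGERS[extension.lower()] = tagger


def tags_for(path: str, source: str) -> List[str]:
    """Return definition names for a file, or [] if its language is unregistered."""
    tagger = _TAGGERS.get(Path(path).suffix.lower())
    return tagger(source) if tagger else []


def toy_tagger(source: str) -> List[str]:
    """Stand-in for a real parser: finds 'fn name' at the start of a line."""
    return re.findall(r"^fn\s+(\w+)", source, flags=re.MULTILINE)


register_language(".toy", toy_tagger)
```

With this in place, `tags_for("main.toy", source)` uses the registered tagger, while files in unregistered languages fall back to an empty list, which matches the point above that a tool can still work without repo map support.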