by yeldarb 673 days ago
Tried it a few weeks ago for a task: I had a few dozen files in an open source repo that I wanted to write similar tests for.

I gave it one example and then asked it to do the work for the other files.

It was able to do about half the files correctly. But it ended up taking an hour and costing >$50 in OpenAI credits, and debugging, fixing, and verifying the work took me longer than doing it manually would have.

My take: good glimpse of the future after a few more Moore’s Law doublings and model improvement cycles make it 10x better, 10x faster, and 10x cheaper. But probably not yet worth trying to use for real work vs playing with it for curiosity, learning, and understanding.

Edit: writing the tests in this PR given the code + one test as an example was the task: https://github.com/roboflow/inference/pull/533

This commit was the manual example: https://github.com/roboflow/inference/pull/533/commits/93165...

This commit adds the ones partially written by OpenDevin: https://github.com/roboflow/inference/pull/533/commits/65f51...


OpenDevin maintainer here. This is a reasonable take.

I have found it immensely useful for a handful of one-off tasks, but it's not yet a mission-critical part of my workflow (the way e.g. Copilot is).

Core model improvements (better, faster, cheaper) will definitely be a tailwind for us. But there are also many things we can do in the abstraction layer _above_ the LLM to drive these things forward. And there's also a lot we can do from a UX perspective (e.g. IDE integrations, better human-in-the-loop experiences, etc.).

So even if models never get better (doubtful!) I'd continue to watch this space--it's getting better every day.

As a comparison, I use aider every day to develop aider.

Aider wrote 61% of the new code in its last release. It’s been averaging about 50% since the new Sonnet came out.

Data and graphs about aider’s contribution to its own code base:

https://aider.chat/HISTORY.html

It’d be really great to see a video or cast of you using aider to work on aider.

I can’t get anything useful out of these AI tools for my tasks and I’d really like to see what someone who can does.

I’d like to know if it’s me or my tasks that aren’t working for the llm.

Can I ask what language/stack you’re using for your project? More specifically, is it in Python? I’ve had mediocre (though at least partly usable) results on JavaScript repos, and relatively poor ones on anything less popular.
Aider is written in Python (they have a great Discord community, btw). My experience matches yours: for Python, aider/Sonnet seems to do much better than for JavaScript so far. Despite current LLM limitations, I strongly recommend aider to anyone interested in this space.

It's also very sensitive, unsurprisingly, to development documentation that is moving quickly, e.g., most AI APIs right now. A lot of manual intervention is still required here because of out-of-date references to imports, etc.

How heavy are the API costs for that?

For a project like yours I guess you should be given free credits. I hope that happens, but so far nobody has even given Karpathy a good standalone mic.

If you use DeepSeek Coder V2 0724 (that is #2 after Claude 3.5 Sonnet on the Aider leaderboard), the costs are very, very small. https://aider.chat/2024/07/25/new-models.html
Not much. I spent $25 on Anthropic in July.
Using sonnet?
I'm an active aider user, I spent ~$120 last month on a combo of Sonnet and Opus. It was much more expensive, as you probably know, with Opus. Now it's rather reasonably priced and more sustainable, IMO.
aider is great; I also use it almost daily. Thanks for writing it, Paul!
> 10x better, 10x faster, and 10x cheaper

Which is the elephant in the room.

There is no roadmap for any of these to happen and a strong possibility that we will start to see diminishing returns with the current LLM implementation and available datasets. At which point all of the hype and money will come out of the industry. Which in turn will cause a lull in research until the next big breakthrough and the cycle repeats.

While we have started seeing diminishing returns on rote data ingestion, especially with synthetic data leading to collapse, there is plenty of other work being done to suggest that the field will continue to thrive. Moore’s law isn’t going anywhere for at least a decade - so as we get more computing power, faster memory interconnects, and purpose built processors, there is no reason to suspect AI is going to stagnate. Right now the bottleneck is arguably more algorithmic than compute bound anyways. No one will ever need more than 640kb of RAM, right?
I feel like the GP and this response are a common exchange right before the next AI Winter hits.
a) It's been widely acknowledged that we are approaching a limit on useful datasets.

b) Synthetic data sets have been shown to not be a substitute.

c) I have no idea why you are linking Moore's Law with AI. Especially when it has never applied to GPUs and we are in a situation where we have a single vendor not subject to normal competition.

Synthetic data absolutely does work well for code.

While Moore's Law probably doesn't strictly apply to GPUs, it's not far off. See [1], which reports: "We find that FLOP/s per dollar for ML GPUs double every 2.07 years (95% CI: 1.54 to 3.13 years) compared to 2.46 years for all GPUs." (Moore's law predicts doubling every 2 years.)

[1] https://epochai.org/blog/trends-in-gpu-price-performance#tre...
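
For a rough sense of what those doubling times mean for the "10x" hopes upthread, here is a back-of-the-envelope sketch. It assumes the trend continues at a constant exponential rate, which the quoted figures alone don't guarantee, and the function name is just mine for illustration.

```python
import math

def years_to_improve(factor: float, doubling_years: float) -> float:
    """Years needed to gain `factor` at a constant doubling time."""
    return math.log2(factor) * doubling_years

# Doubling times quoted above; 2.0 is the usual Moore's-law figure.
for label, doubling in [("ML GPUs", 2.07), ("all GPUs", 2.46), ("Moore's law", 2.0)]:
    print(f"{label}: 10x in about {years_to_improve(10, doubling):.1f} years")
```

Under that assumption, 10x in FLOP/s per dollar from hardware alone takes about 3.3 doublings, so somewhere around seven to eight years at the quoted rates.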

It’d be really nice to see research in this area from somewhere without a financial interest in hyping AI.

That incentive doesn’t invalidate research, but AI results are so easy to nudge in any direction that it’s hard to ignore.

I wonder, when people mention Moore's law, whether they mean it literally or figuratively: literally, as the shrinking of transistors, or figuratively, as any and all efforts to increase overall computational speed.
In this context it’s the latter, but practically speaking they’re the same thing.
b is made up. They have absolutely not been shown to not be a substitute. It's just a big flood of bad research which people treat as summing up to a good argument.
Maybe not 10x yet, but deepcoder has done some impressive things recently. Instead of a generic LLM, they have a relatively smaller one which is coding-specific and GPT-4-class in quality. This makes it cheaper. In addition, they can do caching, which reduces the cost of follow-up requests by ~10x. And there are still improvements around Star, which reduces the need for learning datasets (models can self-reflect and improve without additional data).

So while we're not 10x-ing everything, it's not like there are no significant improvements in many places.

I meant deepseek coder. Can't edit anymore.
Unfortunately the smaller model is not anywhere near GPT-4 in quality and no one seems to want to host the bigger model (it was even removed from Fireworks AI this week). And no one in their right mind wants to send their code to DeepSeek's Chinese API hosting.
I'm perfectly fine sending my open source code to them. I'm also happy to send 95% of my private repos. Let's be honest, it's just boilerplate code not doing anything fancy, just routing/validating data for the remaining 5%. Nobody cares about that and it's exactly why I want AI to handle it. But I wouldn't send that remaining 5% to OpenAI either.
Much of Nvidia's marketing material covers this, if you want to believe it. At a minimum, they claim that there will be a million-fold increase in compute available specifically to ML over the next decade.
You don't know where it will go, just as people didn't know the development of LLMs at all would happen. There are no real oracles to this level of detail (more vaguely in broad lines and over decades some Sci-Fi authors do a reasonable job, and they get a lot wrong).

There have been a lot of people making these sorts of claims for years, and they nearly never end up accurately predicting what will actually happen. That's what makes observing what happens exciting.

Actually the improvement graphs are still scaling exponentially with training/compute being the bottleneck. So there isn't yet any evidence of diminishing returns.

source: https://youtu.be/zjkBMFhNj_g?feature=shared&t=1545

I just watched an Andrew Ng video (he's the person I tend to learn the latest prompting, agentic, and visual-agentic best practices from) saying that hardware companies as well as software companies are working on making these happen, especially at the inference stage.
Can you include a link to Andrew Ng's video, please?
I think this was the relevant video, but I'm not 100% sure: https://www.youtube.com/watch?v=8lH1mUcxODw&t=2013s
Guessing you used 4o and not 4o-mini. For stuff like this you are better off letting it use mini which is practically free, and then have it double and triple check everything.
This assumes that the model knows it is wrong. It doesn't.

It only knows statistically what is the most likely sequence of words to match your query.

For rarer datasets, e.g. when I had Claude/OpenAI help out with an IntelliJ plugin, it would continually invent methods for classes that never existed, and it could never articulate why.

This is where supporting machinery & RAG are very useful.

You can auto-lint and test code before you set eyes on it, then re-run the prompt with either more context or an altered prompt. With local models there are options like steering vectors, fine-tuning, and constrained decoding as well.
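
A minimal sketch of that check-then-re-prompt loop, to make it concrete. The `generate` callable is a hypothetical stand-in for whatever LLM call you use, and `ast.parse` is only a placeholder for a real linter or test run; this is an illustration of the idea, not how any particular tool does it.

```python
import ast
from typing import Callable, Optional

def check_python(source: str) -> Optional[str]:
    """Return None if `source` parses as Python, else an error message."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None

def generate_until_valid(
    generate: Callable[[str], str],  # hypothetical: prompt in, code out
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Re-prompt with the checker's error appended until the code passes."""
    current_prompt = prompt
    for _ in range(max_attempts):
        code = generate(current_prompt)
        error = check_python(code)
        if error is None:
            return code
        # Feed the failure back as extra context for the next attempt.
        current_prompt = f"{prompt}\n\nPrevious attempt failed: {error}\n{code}"
    return None  # give up; a human has to look at it
```

The point is that the user only sees output that already passed the automated checks, and the attempt cap bounds how many tokens get spent on retries.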

There's also evidence that multiple models of different lineages, when their outputs are rated and you take the best one at each input step, can surpass the performance of better models. So if one model knows something the others don't you can automatically fail over to the one that can actually handle the problem, and typically once the knowledge is in the chat the other models will pick it up.
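
Roughly, the "rate the outputs and take the best one" idea looks like the sketch below for a single step. Both the model callables and the `rate` function are hypothetical placeholders (in practice the rater might be a test suite or another model), so treat this as an outline rather than a working ensemble.

```python
from typing import Callable, Dict, Tuple

def best_of_models(
    models: Dict[str, Callable[[str], str]],  # hypothetical: name -> prompt-to-answer
    rate: Callable[[str], float],             # hypothetical rater: higher is better
    prompt: str,
) -> Tuple[str, str]:
    """Ask every model, rate each answer, return (model_name, best_answer)."""
    answers = {name: model(prompt) for name, model in models.items()}
    winner = max(answers, key=lambda name: rate(answers[name]))
    return winner, answers[winner]
```

Appending the winning answer to the shared conversation before the next step is what would let the other models pick up the knowledge, as described above.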

Not saying we have the solution to your specific problem in any readily available software, but that there are approaches specific to your problem that go beyond current methods.

It doesn't make sense that the solution here is to put more load on the user to continually adjust the prompt or try different models.

I asked Claude and OpenAI models over 30 times to generate code. Both failed every time.

If Claude and OpenAI are so useless why does every company ban it during interviews?
Managers make most of those decisions and they have no idea what is achievable, reasonable or even particularly likely.
Do you think that says more about the tools or the interview process?
This is a really complicated (and more expensive) setup that doesn't fundamentally fix any of the problems with these systems.
Yep when I read stuff like this I think, "nah I'll just write the damn code." Looking forward to being replaced by a robot, myself.
Popular programming in a nutshell.

It’s the new pop psych.

4o-mini is cheap, but is not practically free. At scale it will still rack up a cost, although I acknowledge that we are currently in the honeymoon phase with it. Computing is the kind of thing that we just do more of when it becomes cheaper, with the budget being constant.
It doesn't work like that. By "double" or "triple checking everything" you're more likely to end up with a fractal pattern of token waste, potentially veering off into hallucinations, than with actual progress.
Strong chance Moore's law stops this decade due to the physical limits on the size of atoms lol.
I’m hopeful that there are some possible model topologies that don’t just stack matmuls.

Maybe there are some wins to be had on the software side still.

I've heard variations on this argument for the past two decades, and it's amusing every time.
I’ve been hearing that for at least a decade.
And now it's here.
I’ll check back in 2030
Instead of using the OpenAI API, can it use a locally hosted Ollama HTTP API?
Yes. It's not really "open" if it depends on a non-libre service. To be legit, they must at least enable this experimentally.
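
For anyone curious what talking to a local Ollama server looks like, here is a minimal sketch using only the standard library. It calls Ollama's `/api/generate` endpoint on its default port directly; it says nothing about how OpenDevin itself is configured, and the model name `llama3` is an assumption (use whatever model you have pulled locally).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def build_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_ollama(prompt: str, model: str = "llama3") -> str:
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Nothing leaves your machine with this setup, which also addresses the concern upthread about sending private code to a hosted API.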