| I keep seeing this. The "for now" comments, and how much better it's getting with each model. I don't see it in practice though. The fundamental problem hasn't changed: these things are not reasoning. They aren't problem solving. They're pattern matching. That gives the illusion of usefulness for coding when your problem is very similar to others, but falls apart as soon as you need any sort of depth or novelty. I haven't seen any research or theories on how to address this fundamental limitation. The pattern matching thing turns out to be very useful for many classes of problems, such as translating speech to a structured JSON format, or OCR, etc., but isn't particularly useful for reasoning problems like math or coding (non-trivial problems, of course). I'm pretty excited about the applications for AI overall and its potential to reduce human drudgery across many fields; I just think generating code in response to prompts is a poor choice of LLM application. |
Have you actually tried the latest agentic coding models?
Yesterday I asked Claude to implement a working web-based email client from scratch in Rust which can interact with a JMAP-based mail server. It did. It took about 20 minutes. The first version had a few bugs - like it was polling for mail instead of streaming emails in. But after prompting it to fix some obvious bugs, I now have a working email client.
It's missing lots of important features - like, it doesn't render HTML emails correctly. And the UI looks incredibly basic. But it wrote the whole thing in 2.5k lines of Rust from scratch and it works.
This wasn't possible at all a couple of years ago. A couple of years ago I couldn't get ChatGPT to port a single source file from Rust to TypeScript without it running out of context space and introducing subtle bugs in my code. And it was rubbish at Rust - it would introduce borrow checker problems and then get stuck, trying and failing to get the code to compile. Now Claude can write a whole web-based email client in Rust from scratch, no worries. I did need to manually point out some bugs in the program - Claude didn't test its email client on its own. There's room for improvement for sure. But the progress is shocking.
I don't know how anyone who's actually pushed these models can claim they haven't improved much. They're light-years ahead of where they were a few years ago. Have you actually tried them?