Anecdotal but I was always shocked to see Claude 3.5 perform so poorly in the benchmarks, when it generates 80% of my code in Cursor (and in cases it fails, no other model succeeds)
Different people seem to get wildly different results here, and I'm not sure what percentage is down to the type of software being built vs the usage patterns.
In my case, I would guess less than 10% of the code I get out of AIs is useful.
What sort of code are you getting those results with? Is it yet-another-react-frontend-button? Is it ebpf programs? Is it a parser in rust?
For the latter two, I've found AI to have pretty low rates, and for the former I haven't had the desire to try.
Almost every time someone says "but most of my code nowadays is LLM generated" it's usually one of three things:
1. Very greenfield work where the LLM doesn't really have a lot of constraints to deal with and can fully control the setup + doesn't have to ingest a lot of existing context
2. Very small projects that largely follow established patterns (CRUD, frontends, etc.)
3. Well established implementation work (the kind of feature that's a simple JIRA ticket).
In my experience they're painfully bad at:
- Novel/niche work where there aren't really answers online to what you're trying to do
- Complex refactoring
- Architecting within existing constraints (other systems, etc.)
I'm pretty confident in my ability to write any code in my main language. But AI is still very useful in just filling out boilerplate, or noticing a pattern and filling out the rest of some repetitive code. Or say I need to write a wrapper around a common command-line utility. It's pretty good at generating the code for that.
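For a sense of scale, here is a minimal sketch of the kind of command-line wrapper meant here, in Python. The names `CommandResult` and `run_command` are made up for illustration, and the 30-second timeout is an arbitrary assumption; only the standard `subprocess` module is used.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CommandResult:
    """Outcome of one invocation (hypothetical helper type)."""
    returncode: int
    stdout: str
    stderr: str


def run_command(executable: str, *args: str, timeout: float = 30.0) -> CommandResult:
    """Run an external program and capture its output as text.

    No shell is involved, so arguments are passed through unquoted and
    unexpanded. A non-zero exit code is returned, not raised.
    """
    completed = subprocess.run(
        [executable, *args],
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,      # raises subprocess.TimeoutExpired if exceeded
        check=False,          # leave exit-code handling to the caller
    )
    return CommandResult(completed.returncode, completed.stdout, completed.stderr)
```

Assuming `git` is installed, `run_command("git", "status", "--short")` would return the exit code and captured output for the caller to inspect.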
What I mostly enjoy using it for is just writing bash scripts for me. I hate writing bash but Claude is excellent at writing the scripts I need.
AI isn't writing software features or anything close to that for me at the moment. But what it is great at is just being a really excellent intellisense. Knowing what you're likely to want to do in the next ~5 lines and just filling it out in one button press. Things like intellisense and automatic refactoring tools were big productivity improvements when they became ubiquitous. AI will be the same for most people, an intellisense on steroids.
Also, writing tests. Writing tests can be quite mundane and boring. But I can just type out what I want tested, give it some files as context and it can be pretty good at generating some tests.
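As a rough illustration of that workflow, each plain-language description below becomes one small test. The `slugify` function is a hypothetical stand-in for whatever file would be given as context, and the tests are written in the style a runner such as pytest would collect.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase, hyphen-separated slug."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


# Description: "Spaces become hyphens and letters are lowercased."
def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


# Description: "Punctuation is dropped and splits words."
def test_drops_punctuation():
    assert slugify("What's new, in 3.5?") == "what-s-new-in-3-5"


# Description: "An empty title gives an empty slug."
def test_empty_title():
    assert slugify("") == ""
```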
Does AI get it right every time? No way. But, as a developer, I'd rather spend 10 minutes trying to coax an AI into generating me 90% usable code for some boring task than spend 20 minutes typing it out myself. Often, I probably could write the code faster than I could prompt an AI, but being lazy and telling something else to do the work feels pretty good and relaxing.
> AI is still very useful in just filling out boilerplate
That's what I tend to find with English writing as well. It's not great. But sometimes you just need decent generic prose for an introduction or an explanation of something. If you know enough to adjust as needed, it can save time for something that readers are probably just skimming anyway. As I've written previously, about a year ago I was working on cleaning up a bunch of reference architectures and I used Google's Bard in that case to give me a rough draft of background intros for some of them which I modified as needed. Nothing miraculous but saved me a bit of time.
> For the latter two, I've found AI to have pretty low rates, and for the former I haven't had the desire to try.
Similar. I've got a joke language project on the back burner; doing it properly requires going back over my 23-year-old university notes on yacc etc., so I tried AI… the AI just makes a mess of it*.
For anything front end, even the original ChatGPT-3.5 model is basically magic (i.e. sufficiently advanced technology).
* I think the last time I touched it was just before o1 was announced; as o3 is now in the free tier of ChatGPT, I should try again…
My gut tells me the AIs will be best for small web projects that are greenfield. The kind a 1-3 person team could maintain.
And my gut tells me they are the worst for the kinds of long-established software conglomerates many professionals work at, which have tons of internal services, integrated acquisitions, etc. etc.
Ultimately the AI is good at what the average developer online is good at, probably full-stack web dev of projects from scratch.
but that kind of code is so easy to write, and code is already way more terse than natural language! it's literally more typing to explain to an LLM how to write some greenfield web CRUD than it is to just type out the code, and if there's a lot of boilerplate it's faster to generate the repetitive parts with keyboard macros!
where's the value everyone on this site and on LinkedIn (but NONE in my real or professional life) seems to get?
I feel like I'm being gaslit when people say Cursor writes 80% of their code, and honestly, it's the conclusion that makes the most sense to me -- the people making these posts must be well-invested in the startups that stand to profit if AI is actually as good as they say. You know, shills.
I work on web crawlers and data mining at scale and well over 50% of my code output is written by AI. I use mostly o1 (copying and pasting isolated snippets) or JetBrains' AI service.
I also have access to a full-service "junior developer" AI that can take in an entire git repo at once, and its code outputs are significantly less useful -- maybe 10%.
I think a lot of people's success rate with AI boils down to their choices in language/toolkit (AI does much better the more common it is) and how they prompt it.
Note that you still need an experienced set of eyes supervising; the thought of an LLM committing to a git repo without a human in the loop scares me.
Have you tried the AI intellisense models like Copilot?
I don't understand the notion that it is faster to generate repetitive code with keyboard macros. I use Vim-mode exclusively, and while I'm not a Vim master, I don't think there's any set of macros that will do what Copilot can do.
It's not that Copilot is smart. It's that 60% of what I do doesn't require much intelligence to anticipate. It is the 40% that matters; the remainder can be trivially guessed, and this is exactly what Copilot does.
Maybe this will help: you need to imagine with an AI intellisense that with each keystroke, you are collapsing the possibility space down to a smaller, finite number of outcomes. You write exactly what code you need for the dumb AI to predict the rest of it.
There are a LOT of reasons why AI intellisense is not all there yet; it can be distracting; it can try to generate too much at once; none of the tools have LSP integrated, so it will provide bullshit suggestions of library methods that don't exist. This is all true, and yet it is still highly valuable in some domains, for some people.
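To make the "methods that don't exist" complaint concrete, here is a minimal sketch of checking a suggested name against the real module. This is a runtime lookup in Python, not what a language server does (that would be static analysis), and `attribute_exists` is a hypothetical helper name.

```python
import importlib


def attribute_exists(module_name: str, attribute: str) -> bool:
    """Return True if the named module imports and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is not installed or does not exist.
        return False
    return hasattr(module, attribute)


# A plausible-looking but nonexistent suggestion next to the real function.
print(attribute_exists("json", "parse"))  # json has loads(), not parse()
print(attribute_exists("json", "loads"))
```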
That said, if you write x86 assembly for a living, you are probably out of luck.
(I write Kotlin, Java for Android apps and services, C++ that is tightly integrated with the SoC. Python and Bash for command-line tools that invoke REST APIs. Copilot is useful for these domains.)
I’ve sat through some interviews recently with candidates who started their careers in the last 6 years or so… during the boom cycle. Some were quite good but a troubling number were clearly over-leveled at their current/previous employers.
For example, last month we interviewed someone for a Staff Engineering role (current role: L5 Senior II engineer), for Python. This person was unable to explain what a set was in Python, didn’t seem to grok the basic HTTP request/response pattern etc. This wasn’t a leetcode interview; it was an engineering conversation. They were the same questions we’d given dozens and dozens of engineers in the past. It wasn’t a language barrier issue (guy was American, interviewer was American). Dude just seemed to have a very very narrow set of skills.
For people like this I imagine AI feels like a superpower.
I'm pretty sure that's what's going on too. The quality of junior -> midlevel engineers has plummeted and these AI tools have been a major crutch to help them appear productive/competent again.
Problem is they don't know enough to really assess if what the LLM is spitting out is any good or not so they claim amazing wins.
> but that kind of code is so easy to write, and code is already way more terse than natural language! it's literally more typing to explain to an LLM how to write some greenfield web CRUD than it is to just type out the code, and if there's a lot of boilerplate it's faster to generate the repetitive parts with keyboard macros!
> where's the value everyone on this site and on LinkedIn (but NONE in my real or professional life) seems to get?
I can remember how to describe that every time I need to make a button. I can’t remember the new flavor-of-the-month's special-snowflake way of expressing that. I’ve had decent traction just listing the pieces in my stack and then subbing those out whenever it changes.
> it's literally more typing to explain to an LLM how to write some greenfield web CRUD than it is to just type out the code, and if there's a lot of boilerplate it's faster to generate the repetitive parts with keyboard macros!
I mostly agree with you, but I do think it's faster than searching for and finding the boilerplate you need. I also think AI code completions and the ability to use it to generate the small blocks you will put together into the main app are helpful. Idk, it's not a nothing burger. It's not going to start working at AWS either.
I work in machine learning research: training loops and loss functions are incredibly repetitive and pattern filled, highly represented in the code the LLMs are trained on, and typically short. They are exactly my intuition of simple code that LLMs would work well on.
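For readers outside ML, the repeated shape is roughly: predict, compute a loss, compute gradients, update, loop. Below is a minimal sketch of that shape in plain Python with no framework; the linear model, learning rate, epoch count, and toy data are all assumptions chosen for illustration, not anything from a real codebase.

```python
def mse_loss(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by full-batch gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: predictions from the current parameters.
        preds = [w * x + b for x in xs]
        # Backward pass: analytic gradients of the MSE with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Update step.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = train_linear(xs, ys)
final_loss = mse_loss([w * x + b for x in xs], ys)
```

Swapping the model or the loss changes a few lines while the loop keeps the same structure, which is the repetitiveness being described.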
With respect, having trialed these tools on pretty large ML codebases, most folks' experience is that they're not very good across the board.
Training loops, sure... those are pretty much straight pattern recognition w/ well-represented APIs. But more broadly? Not so much.
I didn't say it cannot work well on anything other than greenfield web projects. I said it would probably be best at those, as those have the most training data available. It can work well for your use case and still fit the pattern I laid out.
I think it's frontend javascript versus everything else.
There's a few languages/tools I use often but am not an expert in and have been using Claude 3.5 to help me work with existing code. On paper this is a perfect use case. In practice it's like working with an intern that has google in front of them and enough jargon to convince me what they're saying isn't bullshit. Eventually, I'll be able to coax the answers I need out of it.
I'll say, though, that the fact AI can't say "I don't know" (or the closely related "that is not possible in the context you've given me"), combined with the inability to reason, is what gives you results that look OK but are subtly trash.
I've been using LLMs for tab autocomplete for a while and just recently started trying out agentic coding AI (Copilot Edits and Cline). I think the disappointing shortfall of agentic AIs (at least for me) comes from the feedback loop being so much looser than the autocomplete style. With autocomplete, I don't have to actively think about what context to feed it, and I can gently correct it if it goes in the wrong direction on a line-by-line basis. With AI agents, they have a lot more leeway to generate a ton of code and reason themselves off the rails before you're able to step in and correct them. Now granted, I am also not very good yet at managing context and crafting prompts, but it feels a lot harder to get good at than simply dropping an AI autocompleter into an existing programming workflow. It's a new paradigm.
I think the big thing overlooked is how much the human steering the models matters. If you know what you’re doing and what changes you need, Cursor and other tools make you so productive.
If you don’t know what you’re doing, these things can sometimes produce good code, and sometimes produce things that don’t work at all.
That's been my experience too, but I would guess the problem of "here is a ton of context, produce a small amount of code" is significantly better suited for LLMs than "here is a problem, produce a ton of code".
I write a lot of Python and personally I find Claude significantly worse than OpenAI’s reasoning models. I really feel like this varies a ton language to language.