by coltonv 312 days ago
Appreciate the comment!

> I mentioned in another comment the major flaw in your productivity calculation, is that you aren’t accounting for the work that wouldn’t have gotten done otherwise. That’s where my improvements are almost universally coming from. I can improve the codebase in ways that weren’t justifiable before in places that do not suffer from the coordination costs you rightly point out.

I'm a bit confused by this. There is work that apparently is unlocking big productivity boosts but was somehow not justified before? Are you referring to places like my ESLint rule example, where eliminating the startup costs of learning how to write one allows you to do things you wouldn't have previously bothered with? If so, I feel like I covered this pretty well in the article and we probably largely agree on the value of that productivity boost. My point still stands that it doesn't scale. If this is not what you mean, feel free to correct me.

Appreciate your thoughts on hallucinations. My guess is the difference between what we're experiencing is that in your code hallucinations are still happening but getting corrected after tests are run, whereas my agents typically get stuck in these write-and-test loops and can't figure out how to solve the problem, or they "solve" it by deleting the tests or something like that. I've seen videos and viewed open source AI PRs which end up in loops similar to what I've experienced, so I think what I see is common.

Perhaps that's an indication that we're trying to solve different problems with agents, or using different languages/libraries, and that explains the divergence of experiences. Either way, I still contend that this kind of productivity boost is likely going to be hard to scale and will get tougher to realize as time goes on. If you keep seeing it, I'd really love to hear more about your methods to see what I'm missing. One thing that has been frustrating me is that people rarely share their workflows after making big claims. This is unlike previous hype cycles where people would share descriptions of exactly what they did ("we rewrote in Rust, here's how we did it", etc.) Feel free to email me at the address in my about page[1] or send me a request on LinkedIn or whatever. I'm being 100% genuine that I'd love to learn from you!

[1] https://colton.dev/about/


> but getting corrected after tests are run, whereas my agents typically get stuck in these write-and-test loops

This may be a definition problem then. I don’t think “the agent did a dumb thing that it can’t reason out of” is a hallucination. To me a hallucination is a pretty specific failure mode: it invents something that doesn’t exist. Models still do that for me but the build test loop sets them aright on that nearly perfectly. So I guess the model is still hallucinating but the agent isn’t so the output is unimpacted. So I don’t care.

For the “agent is dumb” scenario, I aggressively delete and reprompt. This is something I’ve actually gotten much better at with time and experience, both so it doesn’t happen often and so I can course correct quickly. I find it works nearly as well for teaching me about the problem domain as my own mistakes do but is much faster to get to.

But if I were going to be pithy: aggressively deleting work output from an agent is part of their value proposition. They don’t get offended and they don’t need explanations why. Of course they don’t learn well either, that’s on you.

What I'm saying is that the model will get into one of these loops where it needs to be killed, and I'll look at some of the intermediate states and the reasons for failure and they are because it hallucinated things, ran tests, got an error. Does that make sense?

Deleting and re-prompting is fine. I do that too. But even one cycle of that often means the whole prompting exercise takes me longer than if I just wrote the code myself.

I think maybe this is another disconnect. A lot of the advantage I get does not come from the agent doing things faster than me, though for most tasks it certainly can.

A lot of the advantage is that it can make forward progress when I can’t. I can check to see if an agent is stuck, and sometimes reprompt it, in the downtime between meetings or after lunch before I start whatever deep thinking session I need to do. That’s pure time recovered for me. I wouldn’t have finished _any_ work with that time previously.

I don’t need to optimize my time around babysitting the agent. I can do that in the margins. Watching the agents is low context work. That adds the capability to generate working solutions during times that were previously barred from that.

I've done a few of these hands-off, go-to-a-meeting style interactions. It has worked a few times, but I tend to just find that they overdo it or cause issues. Like you ask them to fix an error and they add a try catch, swallow the error, and call it a day. Or the PR has 1000 line changes when it should have two.

Either way, I'm happy that you are getting so much out of the tools. Perhaps I need to prompt harder, or the codebase I work on has just deviated too much from the stuff the LLMs like and simply isn't a good candidate. Either way, appreciate talking to you!

> One thing that has been frustrating me is that people rarely share their workflows after making big claims

Good luck ever getting that. I've asked that about a dozen times on here from people making these claims and have never received a response. And I'm genuinely curious as well, so I will continue asking.

People share this stuff all the time. Kenton Varda published a whole walkthrough[1], prompts and all. Stories about people's personal LLM workflows have been on the front page here repeatedly over the last few months.

What people aren't doing is proving to you that their workflows work as well as they say they do. You want proof, you can DM people for their rate card and see what that costs.

[1] https://news.ycombinator.com/item?id=44159166

Thanks for sharing and that is interesting to read through. But it's still just a demo, not live production code. From the readme:

> As of March, 2025, this library is very new, prerelease software.

I'm not looking for personal proof that their workflows work as well as they say they do.

I just want an example of a project in production with active users depending on the service for business functions that has been written 1.5/2/5/10/whatever x faster than it otherwise would have been without AI.

Anyone can vibe code a side project with 10 users or a demo meant to generate hype/sales interest. But I want someone to actually have put their money where their mouth is and give an example of a project that would have legal, security, or monetary consequences if bad code was put in production. Because those are the types of projects that matter to me when trying to evaluate people's claims (since those are what my paycheck actually depends on).

Do you have any examples like that?

Dude.

That code tptacek linked you to? It's part of our (Cloudflare's) MCP framework. Which means all of the companies mentioned in this blog post are using this code in production today: https://blog.cloudflare.com/mcp-demo-day/

There you go. This is what you are looking for. Why are you refusing to believe it?

(OK fine. I guess I should probably update the readme to remove that "prerelease" line.)

Lol misunderstanding a disclaimer in a readme is not refusing to believe something. But my apologies and appreciate the clarification.

Yeah OK fair that line in the readme is more prominent than I remember it being.

I never look at my own readmes so they tend to get outdated. :/

Fixing: https://github.com/cloudflare/workers-oauth-provider/pull/59

See, I just shared Kenton Varda describing his entire workflow, and you came back asking that I please show you a workflow that you would find more credible. Do you want to learn about people's workflows, or do you want to argue with them that their workflows don't work? Nobody is interested in doing the latter with you.

I don't think you understood me at all. I don't care about the actual workflow. I just want an example of a project that:

1. Would have legal, security, or monetary consequences if bad code was put in production

2. Was developed using an AI/LLM/Agent/etc that made the development many times faster than it otherwise would have been (as so many people claim)

I would love to hear an example where "I used Claude to develop this hosting/ecommerce/analytics/inventory management service that is used in production by 50 paying companies. Using an LLM we deployed the project in 4 weeks where it would normally take us 4 months." Or "We updated an out-of-date code base for a client in half the time it would normally take and have not seen any issues since launch"

At the end of the day I code to get paid. And it would really help to be able to point to actual cases where both money and negative consequences of failure are on the line.

So if you have any examples please share. But the more people deflect the more skeptical I get about their claims.

Seems like I understand you pretty well! If you wanted to talk about workflows in a curious and open way, your best bet would have been finishing that comment with something other than "the more people deflect the more skeptical I get". Stay skeptical! You do you.

It almost feels like sealioning. People say nobody shares their workflow, so I share it. They say well that's not production code, so I point to PRs in active projects I'm using, and they say well that doesn't demonstrate your interactive flow. I point out the design documents and prompts and they say yes but what kind of setup do you do, which MCP servers are you running, and I point them at my MCP repo.

At some point you have to accept that no amount of proof will convince someone who refuses to be swayed. It's very frustrating because, while these are wonderful tools already, it's clear that the biggest thing that makes a positive difference is people using and improving them. They're still in relative infancy.

I want to have the kind of conversations we had back at the beginning of web development, when people were delighted at what was possible despite everything being relatively awful.

I don't care about your workflow, that can be figured out from the 10,000 blog posts all describing the same thing. My issue is with people claiming this huge boost in productivity only to find out that they are working on code bases that have no real consequence if something fails, breaks, or doesn't work as intended.

Since my day job is creating systems that need to be operational and predictable for paying clients, examples of front end mockups, demos, apps with no users, etc. don't really matter that much at the end of the day. It's like the difference between being a great speaker in a group of 3 friends vs standing up in front of a 30 person audience with your job on the line.

If you have some examples, I'd love to hear about them because I am genuinely curious.