| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by abathur 222 days ago

I think both of those experiments do a good job of demonstrating utility on a certain kind of task.

But this is cherry-picking.

In the grand scheme of the work we all collectively do, very few programming projects entail something even vaguely like generating an Nth HTML parser in a language that already has several wildly popular HTML parsers--or porting that parser into another language that has several wildly popular HTML parsers.

Even fewer tasks come with a library of 9k+ tests to sharpen our solutions against. (Which itself wouldn't exist without experts trodding this ground thoroughly enough to accrue them.)

The experiments are incredibly interesting and illuminating, but I feel like it's verging on gaslighting to frame them as proof of how useful the technology is when it's hard to imagine a more favorable situation.

1 comments

jodrellblank 222 days ago

> "it's hard to imagine a more favorable situation"

Granted, but this reads a bit like a headline from The Onion: "'Hard to imagine a more favourable situation than pressing nails into wood' said local man unimpressed with neighbour's new hammer".

I think it's a strong enough example to disprove "they're an interesting phenomenon that people have convinced themselves MUST BE USEFUL ... either through ignorance or a sense of desperation". Not enough to claim they are always useful in all situations or to all people, but I wasn't trying for that. You (or the person I was replying to) basically have to make the case that Simon Willison is ignorant about LLMs and programming, is desperate about something, or is deluding himself that the port worked when it actually didn't, to keep the original claim. And I don't think you can. He isn't hyping an AI startup, he has no profit motive to delude him. He isn't a non-technical business leader who can't code being baffled by buzzwords. He isn't new to LLMs and wowed by the first thing. He gave a conference talk showing that LLMs cannot draw pelicans on bicycles so he is able to admit their flaws and limitations.

> "But this is cherry-picking."

Is it? I can't use an example where they weren't useful or failed. It makes no sense to try and argue how many successes vs. failures, even if I had any way to know that; any number of people failing at plumbing a bathroom sink don't prove that plumbing is impossible or not useful. One success at plumbing a bathroom sink is enough to demonstrate that it is possible and useful - it doesn't need dozens of examples - even if the task is narrowly scoped and well-trodden. If a Tesla humanoid robot could plumb in a bathroom sink, it might not be good value for money, but it would be a useful task. If it could do it for $30 it might be good value for money as well even if it couldn't do any other tasks at all, right?

link

abathur 222 days ago

> Granted, but this reads a bit like a headline from The Onion: "'Hard to imagine a more favourable situation than pressing nails into wood' said local man unimpressed with neighbour's new hammer".

Chuffed you picked this example to ~sneer about.

There's a near-infinite list of problems one can solve with a hammer, but there are vanishingly few things one can build with just a hammer.

> You (or the person I was replying to) basically have to make the case that Simon Willison is ignorant about LLMs and programming, is desperate about something, or is deluding himself that the port worked when it actually didn't, to keep the original claim.

I don't have to do any such thing.

I said the experiments were both interesting and illuminating and I meant it. But that doesn't mean they will generalize to less-favorable problems. (Simon's doing great work to help stake out what does and doesn't work for him. I have seen every single one of the posts you're alluding to as they were posted, and I hesitated to reply here because I was leery someone would try to frame it as an attack on him or his work.)

> Is it? I can't use an example where they weren't useful or failed.

  https://en.wiktionary.org/wiki/cherry-pick

  (idiomatic) To pick out the best or most desirable items
  from a list or group, especially to obtain some advantage
  or to present something in the best possible light. 

  (rhetoric, logic, by extension) To select only evidence which supports an argument, 
  and reject or ignore contradictory evidence.

> any number of people failing at plumbing a bathroom sink don't prove that plumbing is impossible or not useful. One success at plumbing a bathroom sink is enough to demonstrate that it is possible and useful - it doesn't need dozens of examples - even if the task is narrowly scoped and well-trodden.

This smells like sleight of hand.

I'm happy to grant this (with a caveat^) if your point is that this success proves LLMs can build an HTML parser in a language with several popular source-available examples and thousands of tests (and probably many near-identical copies of the underlying HTML specs as they evolve) with months of human guidance^ and (with much less guidance) rapidly translate that parser into another language with many popular source-available answers and the same test suite. Yes--sure--one example of each is proof they can do both tasks.

But I take your GP to be suggesting something more like: this success at plumbing a sink inside the framework an existing house with plumbing provides is proof that these things can (or will) build average fully-plumbed houses.

^Simon, who you noted is not ignorant about LLMs and programming, was clear that the initial task of getting an LLM to write the first codebase that passed this test suite took Emil months of work.

> If a Tesla humanoid robot could plumb in a bathroom sink, it might not be good value for money, but it would be a useful task. If it could do it for $30 it might be good value for money as well even if it couldn't do any other tasks at all, right?

The only part of this that appears to have been done for about $30 was the translation of the existing codebase. I wouldn't argue that accomplishing this task for $30 isn't impressive.

But, again, this smells like sleight of hand.

We have probably plumbed billions of sinks (and hopefully have billions or even trillions more to go), so any automation that can do one for $30 has clear value.

A world with a billion well-tested HTML parsers in need of translation is likely one kind of hell or another. Proof an LLM-based workflow can translate a well-tested HTML parser for $30 is interesting and illuminating (I'm particularly interested in whether it'll upend how hard some of us have to fight to justify the time and effort that goes into high-quality test suites), but translating them obviously isn't going to pay the bills by itself.

(If the success doesn't generalize to less favorable situations that do pay the bills, this clearly valuable capability may be repriced to better reflect how much labor and risk it saves relative to a human rewrite.)

link

jodrellblank 222 days ago

> "Yes--sure--one example of each is proof they can do both tasks."

Therefore LLMs are useful. Q.E.D. The claim "people who say LLMs are useful are deluded" is refuted. Readers can stop here, there is no disagreement to argue about.

> "But I take your GP to be suggesting something more like: this success at plumbing a sink inside the framework an existing house with plumbing provides is proof that these things can (or will) build average fully-plumbed houses."

Not exactly; it's common to see people dismiss internet claims of LLMs being useful. Here[1] is a specific dismissal that I am thinking of where various people are claiming that LLMs are useful and the HN commenter investigated and says the LLMs are useless, the people are incompetent, and others are hand-writing a lot of the code. No data is provided for use the readers to make any judgement one way or the other. Emil taking months to create the Python version could be dismissed this way as well, assuming a lot of hand-writing of code in that time. Small scripts can be dismissed with "I could have written that quickly" or "it's basically regurgitating from StackOverflow".

Simon Willison's experiment is a more concrete example. The task is clearly specified, not vague architecture design. The task has a clear success condition (the tests). It's clear how big the task is and it's not a tiny trivial toy. It's clear how long the whole project took and how long GPT ran for, there isn't a lot of human work hiding in it. It ran for multiple hours generating a non-trivial amount of work/code which is not likely to be a literal example regurgitated from its training data. The author is known (Django, Datasette) to be a competent programmer. The LLM code can be clearly separated from any human involvement.

Where my GP was going is that the experiment is not just another vague anecdote, it's specific enough that there's no room left for dismissing it how the commenter in [1] does. It's untenable to hold the view that "LLMs are useless" in light of this example.

> (repeat) "But I take your GP to be suggesting something more like: this success at plumbing a sink inside the framework an existing house with plumbing provides is proof that these things can (or will) build average fully-plumbed houses."

The example is not proof that these things can do anything else, but why would you assume they can't do tasks of similar complexity? Through time we've gone from "LLMs don't exist" to "LLMs exist as novelties and toys (GPT-1 2018)" to "LLMs might be useful but might not be". If things keep progressing we will get to "LLMs are useful". I am taking the position that we are past that point, and I am arguing that position. We are definitely into the time "they are useful". Other people have believed that for a long time. Not just useful for that task, but for tasks of that kind of complexity.

Sometime between GPT-1 babbling (2018) and today (Q4 2025) the GPTs and the tooling improved from not being able to do this task to yes being able to do this task. Some refinement, some chain of thought, some enlarged context, some API features, some CLI tools.

Since one can't argue that LLMs are useless by giving a single example of a failure, to hold the view that LLMs are useless, one would need to broadly dismiss whole classes of examples by the techniques in [1]. This specific example can't be dismissed in those ways.

> "If the success doesn't generalize to less favorable situations that do pay the bills"

Most bill-paying code in the world is CRUD, web front end, business logic, not intricate parsing and computer science fundamentals. I'm expecting that "AI slop" is going to be good enough for managers no matter how objectionable programmers find it. If I order something online and it arrives, I don't care if the order form was Ruby on Rails emailing someone who copied the order docs into a Google Spreadsheet using an AI generated If This Then That workflow. and as long as the error rate and credit card chargeback rate are low enough, nor will the company owners. Even though there are tons of examples of companies having very poor systems and still being in business, I don't have any specific examples so I wouldn't argue this vehemently - but the world isn't waiting for LLMs to be as 'useful' as HN commenters are waiting for, before throwing spaghetti at the wall and letting 'Darwinian Natural Selection' find the maximum level of slop the markets will tolerate.

----

On that note, a pedantic bit about cherry-picking: there's a difference between cherry-picking as a thing, and cherry-picking as a logical fallacy / bad-faith argument. e.g. if someone claims "Plants are inedible" and I point to cabbage and say it proves the claim is false, you say I'm cherry-picking cabbage and ignoring poisonous foxgloves. However, foxgloves existing - and a thousand other inedible plants existing - does not make edible cabbage stop existing. Seeing the ignored examples does not change the conclusion "plants are inedible" is false, so ignoring those things was not bad. Similarly "I asked GPT5 to port the Linux kernel to Rust and it failed" does not invalidate the html5 parser port.

Definition 2 is bad form; e.g. saying "smoking is good for you, here is a study which proves it" is a cherry-picking fallacy because if the ignored-studies were seen, they would counter the claim "smoking is good for you". Hiding them is part of the argument, deceptively.

"LLMs are useless and only a deluded person would say otherwise" is an example of the former; it's countered by a single example of a non-deluded person showing an LLM doing something useful. It isn't a cherry-picking fallacy to pick one example because no amount of "I asked ChatGPT to port Linux to Rust and it failed" makes the HTML parser stop existing and doesn't change the conclusion.

[1] https://news.ycombinator.com/item?id=45560885

link