by satisfice 1058 days ago
If I understand this correctly, the source code of a Marsha program does not fully determine the running code. And we aren’t talking about immaterial optimizations: the LLM could do vastly different things with the same Marsha source.

A programmer is a human who connects the world of humans with the world of machines. To do this, he is required to sufficiently understand both worlds. On the human side this requires social competence and professional accountability, which machines don’t have. On the computing side, it requires at least that machines behave in predictable and comprehensible ways. Marsha appears to fall short on both counts.

Using an LLM for programming is inherently irresponsible. The people arguing in favor of doing so have not subjected LLMs to any kind of rigorous testing. They simply have unshakeable faith.

I am in the midst of a careful review and surgical takedown of a 9000 word demonstration of ChatGPT’s supposed ability to help testers test. It took maybe 20 minutes for some drooling consultant fan-boy to produce the demo. It has so far been about 30 hours of work to carefully pore through each sentence and show how it is wrong. I am doing the testing and critical thinking that the original consultant failed to do.

The Marsha site has a brief line about how it produces “tested” Python code. The one thing you can bank on with LLMs is that none of you big-eyed enthusiasts have a serious attitude about testing. It’s all simplistic demonstration.

I’m frustrated by this culture of fawning adoration of unproven and unprovable tools. I hope this trend peaks and becomes a generally acknowledged joke soon! Then we can resume with craftsmanship and responsible engineering.

1 comment

I would say that you do not quite understand it. As part of generating working code, Marsha also generates a test suite that uses the examples you provide as test cases. It then executes that test suite against the generated code and iterates with the LLM until the suite passes.

This is where the claim that it's tested code comes from, because it is literally tested.
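To make the loop described above concrete, here is a rough sketch of a generate, test, and retry cycle. It is not Marsha's actual implementation: `run_examples`, `generate_until_passing`, and the `fake_llm` stand-in are hypothetical names invented for illustration, and a real run would call an LLM where `fake_llm` returns canned source.

```python
from typing import Callable, List, Optional, Tuple

# An example is (positional arguments, expected return value).
Example = Tuple[tuple, object]


def run_examples(source: str, func_name: str, examples: List[Example]) -> List[str]:
    """Execute candidate source and return one message per failing example."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # load the generated code
        func = namespace[func_name]
    except Exception as exc:
        return [f"could not load {func_name}: {exc!r}"]

    failures = []
    for args, expected in examples:
        try:
            actual = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
            continue
        if actual != expected:
            failures.append(
                f"{func_name}{args} returned {actual!r}, expected {expected!r}"
            )
    return failures


def generate_until_passing(
    generate: Callable[[List[str]], str],
    func_name: str,
    examples: List[Example],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the generator for code, feeding back failures, until all examples pass."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        source = generate(feedback)
        feedback = run_examples(source, func_name, examples)
        if not feedback:
            return source  # every example passed
    return None  # gave up: no candidate passed within max_attempts


def fake_llm(feedback: List[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong on the first try, right after feedback."""
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


EXAMPLES: List[Example] = [((1, 2), 3), ((0, 0), 0)]
result = generate_until_passing(fake_llm, "add", EXAMPLES)
```

In this sketch the first candidate fails the `(1, 2)` example, the failure message is passed back, and the second candidate is accepted. Note that a passing run only shows that the code agrees with the supplied examples.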

One of the examples we added is a simple tool to get headlines from CNN.com[1]. We don't commit the generated Python to the repository because we're treating it as a compiler artifact, but here's a gist[2] of one of the runs, including the test suite it created to validate proper behavior. It doesn't rely purely on the LLM's ability to string tokens together; it goes through a validation phase to make sure what it built is real.

[1]: https://github.com/alantech/marsha/blob/main/examples/web/cn...
[2]: https://gist.github.com/dfellis/a758a7321b4f62f820ddbad57aac...