| HN Mirror



	by SR2Z 85 days ago
	I mean, having unit tests and not allowing PRs in unless they all pass (or requiring human review to remove a test!) is pretty easy. A software engineer takes a spec which "shifts the distribution of acceptable responses" for their output. If they're 100% accurate (snort), how good does an LLM have to be for you to accept its review as reasonable?

59nadir 85 days ago

We've seen public examples of LLMs literally disabling or removing tests in order to pass. I'm not sure it matters much that having tests and telling LLMs not to merge until they pass is "easy" when the failure modes here are so plentiful and broad in nature.

ElFitz 85 days ago

My favourite so far was Claude "fixing" deployment checks with `continue-on-error: true`
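
For illustration, a guard against this particular trick could be a presubmit script that flags any `continue-on-error: true` in a workflow file. The sketch below is a line-based heuristic rather than a real YAML parse, and the function name `find_continue_on_error` is made up for this example. It would miss values written as expressions or quoted strings.

```python
import re

# Heuristic: a line whose key is `continue-on-error` and whose value is the
# literal `true`, optionally followed by a comment. Not a full YAML parser.
CONTINUE_ON_ERROR = re.compile(
    r"^\s*(?:-\s*)?continue-on-error:\s*true\s*(?:#.*)?$",
    re.IGNORECASE,
)


def find_continue_on_error(workflow_text):
    """Return the 1-based line numbers that set continue-on-error to true."""
    return [
        number
        for number, line in enumerate(workflow_text.splitlines(), start=1)
        if CONTINUE_ON_ERROR.match(line)
    ]
```

A presubmit step could run this over each changed workflow file and fail, or request human review, whenever the returned list is non-empty.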

jawiggins 84 days ago

You'd want to have the tests run as a GitHub Action and then fail the check if the tests don't pass. Optio will resume agents when the actions fail and tell them to fix the failures.

SR2Z 84 days ago

So... add another presubmit test that fails when a test is removed. Require human reviews.

It's not like a human being always pushes correct code. My risk assessment for an LLM reading a small bug and just making a PR is that thinking too hard is a waste of time. My risk assessment for a human is very similar, because actually catching issues during code review is best done by tests anyway. If the tests can't tell you whether your code is good, then it really doesn't matter if it's a human or an LLM: you're mostly just guessing whether things are going to work, and you WILL push bad code that gets caught in prod.
