| HN Mirror



	by incorrecthorse 686 days ago
	Unless you want an empty test suite or a test suite full of `assert True`, the reward function is more complicated than you think.

4 comments

gizmo 686 days ago

It's easy to imagine why something could never work.

It's more interesting to imagine what just might work. One thing that has plagued programmers for decades is the difficulty of writing correct multi-threaded software. You need fine-grained locking, otherwise your threads will waste time waiting for mutexes. But color-coding your program to constrain which parts of your code can touch which data, and when, is tedious and error-prone. If LLMs can annotate code sufficiently for a SAT solver to prove thread safety, that's a huge win. And that's just one example.

imtringued 686 days ago

Rust is that way.

rafaelmn 686 days ago

Code coverage exists. Shouldn't be hard at all to tune the parameters to get what you want. We have really good tools to reason about code programmatically - linters, analyzers, coverage, etc.

SkiFire13 686 days ago

In my experience they are ok (not excellent) for checking whether some code will crash or not. But checking whether the code logic is correct with respect to the requirements is far from being automated.

rafaelmn 686 days ago

But for writing tests that's less of an issue. You start with known good/bad code and ask it to write tests against a spec for some code X. The evaluation criterion is then something like: did the test cover the expected lines and produce the expected outcome (success/fail)? Pepper in lint rules for preferred style etc.
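
A minimal sketch of what such an evaluation could look like, assuming one known-good and one known-bad implementation are available. All names here (`good_add`, `bad_add`, `reward`, and the three candidate tests) are hypothetical and only illustrate the idea; coverage and lint checks are left out:

```python
from typing import Callable

Impl = Callable[[int, int], int]
Test = Callable[[Impl], None]  # a test raises AssertionError when it fails


def good_add(a: int, b: int) -> int:
    return a + b


def bad_add(a: int, b: int) -> int:
    return a - b  # deliberately seeded bug


def passes(test: Test, impl: Impl) -> bool:
    """Run a candidate test against an implementation."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def reward(test: Test, good: Impl, bad: Impl) -> float:
    """1.0 only if the test accepts the good code and rejects the bad code."""
    return 1.0 if passes(test, good) and not passes(test, bad) else 0.0


def trivial_test(impl: Impl) -> None:
    assert True  # never fails, so it cannot reject the bad code


def weak_test(impl: Impl) -> None:
    assert impl(0, 0) == 0  # holds for both a + b and a - b


def real_test(impl: Impl) -> None:
    assert impl(2, 3) == 5


for candidate in (trivial_test, weak_test, real_test):
    print(candidate.__name__, reward(candidate, good_add, bad_add))
```

Under these assumptions the `assert True` test and the weak test both score 0.0, and only `real_test` scores 1.0. The score is only as good as the seeded bad code, though: a test that happens to reject `bad_add` says nothing about bugs nobody thought to seed.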

SkiFire13 686 days ago

But this will lead you to the same problem the tweet is talking about! You are training a reward model based on human feedback (whether the code satisfies the specification or not). This time the human feedback may seem more objective, but in the end it's still non-exhaustive human feedback, which will lead to the reward model being vulnerable to some adversarial inputs that the other model will likely pick up pretty quickly.

rafaelmn 686 days ago

It's based on automated tools and evaluation (test runner, coverage, lint)?

SkiFire13 686 days ago

The input data is still human produced. Who decides what is code that follows the specification and what is code that doesn't? And who produces that code? Are you sure that the code that another model produces will look like that? If not then nothing will prevent you from running into adversarial inputs.

And sure, coverage and lints are objective metrics, but they don't directly imply the correctness of a test. Some tests can reach a high coverage and pass all the lint checks but still be incorrect or test the wrong thing!

Whether the test passes or not is what's mostly correlated with whether it's correct or not. But similarly, for an image recognizer, the label saying whether an image is a flower or not is also objective and correlated, and yet researchers continue to find adversarial inputs for image recognizers due to the bias in their training data. What makes you think this won't happen here too?

layer8 686 days ago

Who writes the spec to write tests against?

In the end, you are replacing the application code with a spec, which needs to have a comparable level of detail in order for the AI to not invent its own criteria.

incorrecthorse 686 days ago

Code coverage proves that the code runs, not that it does what it should do.
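
A small illustration of that gap, as a sketch with made-up names (`buggy_abs`, `shallow_test`, `strict_test`). The shallow test executes both branches of the function, so a line-coverage tool would count every line as covered, yet it never notices the bug:

```python
def buggy_abs(x: int) -> int:
    if x >= 0:
        return x
    return x  # bug: should be -x


def shallow_test() -> bool:
    # Runs both branches, but only checks the return type.
    results = [buggy_abs(3), buggy_abs(-3)]
    return all(isinstance(r, int) for r in results)


def strict_test() -> bool:
    # Checks the actual values, so the negative branch is caught.
    return buggy_abs(3) == 3 and buggy_abs(-3) == 3


print("shallow test passes:", shallow_test())
print("strict test passes:", strict_test())
```

Both tests exercise exactly the same lines; only the assertions differ, and that part is not something coverage measures.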

rafaelmn 686 days ago

If you have a test that completes with the expected outcome and hits the expected code paths you have a working test - I'd say that heuristic will get you really close with some tweaks.

WithinReason 686 days ago

Adversarial networks are a straightforward solution to this. The reward for generating and solving tests is different.

imtringued 686 days ago

That's a good point. A model that is capable of implementing a nonsense test is still better than a model that can't. The implementer model only needs a good variety of tests. They don't actually have to translate a prompt into a test.

littlestymaar 686 days ago

It's not trivial to get right but it sounds within reach, unlike “hallucinations” with general purpose LLM usage.
