by miki123211 251 days ago
This feels like RLVR, not RLHF.

With RLVR, the LLM is trained to pursue "verifiable rewards." On coding tasks, the reward is usually something like the percentage of passing tests.

Let's say you have some code that iterates over a set of files and does processing on them. The way a normal dev would write it, an exception in that code would crash the entire program. If you swallow and log the exception, however, you can continue processing the remaining files. This is an easy way to get "number of files successfully processed" up, without actually making your code any better.


> This is an easy way to get "number of files successfully processed" up, without actually making your code any better.

Well, it depends a bit on what your goal is.

Sometimes the user wants to, e.g., back up as many files as possible from a failing hard drive, and doesn't want the whole process to fail just because one item is broken.

You're right, but the way to achieve this is to allow the error to propagate at the file level, then catch it one function above and continue to the next one.
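A minimal sketch of that shape, assuming the per-file work is just reading the file. The names `process_file` and `process_all` are made up for illustration, and catching only `OSError` is one possible choice of what counts as a per-file failure:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def process_file(path: Path) -> int:
    """Hypothetical per-file step: any error simply propagates to the caller."""
    data = path.read_bytes()  # raises OSError if the file is unreadable
    return len(data)


def process_all(paths):
    """Catch per-file errors one level up, record them, and move on."""
    done, failed = {}, {}
    for path in paths:
        try:
            done[path] = process_file(path)
        except OSError as exc:
            # Assumption: only environment errors are recoverable per file.
            # Anything else (e.g. a TypeError from a bug) is not caught here.
            logger.error("failed to process %s: %s", path, exc)
            failed[path] = exc
    return done, failed
```

The per-file function stays simple and honest about failure, the loop decides the policy, and a genuine bug still stops the whole run.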

However, LLM-generated code will often, at least in my experience, avoid raising any errors at all. This is undesirable, because some errors should result in a complete failure: for example, errors that are not transient or environment-related but stem from a bug. And in any case, an LLM will prefer turning these single-file errors into warnings, though the way I see it, they are errors. They don't need to abort the process, but they are errors nonetheless.

Yes, that's cleaner.

> And in any case, a LLM will prefer turning these single file errors into warnings, though the way I see it, they are errors.

Well, in general they are something that the caller should have the opportunity to deal with.

In some cases, aborting back to the caller at the first problem is the best course of action. In some other cases, going forward and taking note of the problems is best.

In some systems, you might even want to tell the caller about failures (and successes) as they occur, instead of waiting until the end.
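One way to leave that choice to the caller is to hand back outcomes one at a time. This is only a rough sketch: `Outcome`, `run_each`, and the `recoverable` parameter are hypothetical names, not anything from a real library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple, Type


@dataclass
class Outcome:
    item: str
    error: Optional[Exception] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def run_each(
    items: Iterable[str],
    action: Callable[[str], None],
    recoverable: Tuple[Type[Exception], ...] = (OSError,),
) -> Iterator[Outcome]:
    """Yield one Outcome per item as soon as that item finishes."""
    for item in items:
        try:
            action(item)
        except recoverable as exc:
            # Reported to the caller as a failure; the loop keeps going.
            yield Outcome(item, exc)
        else:
            yield Outcome(item)
        # Exceptions outside `recoverable` propagate and end the iteration.
```

A caller that wants to abort at the first problem can `break` out of the loop on the first outcome with `ok == False`; one that wants to keep going can collect every outcome and summarize at the end. Both see each result as it happens.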

It's all very similar to the different options people have available when their boss sends them on an errand and something goes wrong. A good underling uses their best judgement to pick the right way to cope with problems; but computer programs don't have that, so we need to be explicit.

See https://en.wikipedia.org/wiki/Mission-type_tactics for a related concept in the military.