HN Mirror


by niea_11 66 days ago

I'm confused by the last part, which says that if "weak" models (like gpt oss) find the OpenBSD bug they are just hallucinating, and that stronger models fail to find it because they don't hallucinate but are not strong enough.

AISLE demonstrated in the last few weeks that small (weak, per the author) models can find the OpenBSD bug (when pointed at the code). And they apparently did several runs with the same results. Was gpt oss hallucinating on all those runs?

And what separates a strong model from a weak one? Is qwen3.5 27b weak?

Don't trust anyone who says that weak models can find the OpenBSD SACK bug. I tried it myself. What happens is that weak models hallucinate (sometimes hitting a real problem by chance) that there is a lack of validation of the start of the window (which is in theory harmless because of the start < end validation) and an integer overflow problem, without understanding why the two, put together, create an issue. It's just pattern matching of bug classes on code that looks like it may have a problem, totally lacking the true ability to understand the issue and write an exploit. Test it yourself: GPT 120B OSS is cheap and available.
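To make the bug class concrete, here is a rough sketch of how a start < end check and 32-bit wraparound can interact. This is NOT the OpenBSD code and not the actual bug; it only emulates a modular sequence comparison in the style of the BSD `SEQ_LT` macro. The helpers `naive_sack_ok` and `bounded_sack_ok` and all the numbers are made up for illustration.

```python
MASK = 0xFFFFFFFF

def to_int32(x):
    """Reinterpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x

def seq_lt(a, b):
    """Modular 'a comes before b', in the style of BSD's SEQ_LT."""
    return to_int32(a - b) < 0

def naive_sack_ok(start, end):
    # Hypothetical check: only requires start < end.
    return seq_lt(start, end)

def bounded_sack_ok(start, end, snd_una, snd_max):
    # Hypothetical stricter check: also bounds both edges by the send window.
    return (seq_lt(start, end)
            and not seq_lt(start, snd_una)
            and not seq_lt(snd_max, end))

# Illustrative values only.
snd_una, snd_max = 0x1000, 0x2000
end = 0x1800            # inside the window
start = 0x80002800      # roughly 2**31 away from the window

print(naive_sack_ok(start, end))                      # start < end alone passes
print(bounded_sack_ok(start, end, snd_una, snd_max))  # the bounded check rejects

# Modular ordering is not transitive once values are far apart:
a, b, c = 0x00000000, 0x60000000, 0xC0000000
print(seq_lt(a, b), seq_lt(b, c), seq_lt(c, a))
```

The point is only that each observation (unvalidated start, wraparound arithmetic) looks harmless in isolation, and flagging either one is not the same as understanding how they combine.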

BTW, this is why, with this bug, the stronger the model you pick (but not strong enough to discover the true bug), the less likely it is to claim there is a bug. Stronger models hallucinate less, so they don't see the problem from either end of the spectrum: neither the hallucination end of small models nor the real-understanding end of Mythos.