

by tezza 428 days ago

Precisely. The tools often hallucinate, including in the instructions higher up, even before your portion of the prompt, and in the behind-the-scenes stuff that is not shown to the user during reasoning.

You see binary failures all the time when doing function calls or JSON outputs.

That is… “please call this function” … does not call function

“calling JSON endpoint”… does not emit JSON

So, from the article: the tool hallucinated that it had used external tools, but that was entirely fictitious. It does not know that this tool usage was fictitious, and then it sticks to its guns.

The workaround is to have verification steps and throw away “bad” answers. Instead of expecting one true output, expect a stream of results with a certain yield (in the agricultural sense): say 95% work and 5% are garbage. Never consider the results truly accurate, just “accurate enough”. Always verify.
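A minimal sketch of that verify-and-discard loop, assuming the model is asked for a JSON function call. The names `parse_if_valid`, `collect_verified`, and the `name`/`arguments` keys are hypothetical, and `generate` stands in for whatever actually calls the model:

```python
import json
from typing import Callable, Optional


def parse_if_valid(raw: str, required_keys: set[str]) -> Optional[dict]:
    """Return the parsed object if raw is a JSON object with all required keys, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the "does not emit JSON" case
    if not isinstance(obj, dict) or not required_keys.issubset(obj):
        return None  # valid JSON, but not the shape that was asked for
    return obj


def collect_verified(
    generate: Callable[[], str], required_keys: set[str], attempts: int
) -> tuple[list[dict], float]:
    """Call generate() `attempts` times, keep only verified outputs, and report the yield."""
    good = []
    for _ in range(attempts):
        parsed = parse_if_valid(generate(), required_keys)
        if parsed is not None:
            good.append(parsed)
    yield_rate = len(good) / attempts if attempts else 0.0
    return good, yield_rate


if __name__ == "__main__":
    # Canned outputs stand in for real model responses (hypothetical data).
    canned = iter([
        '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
        "Sure! Calling the function now.",
        '{"name": "get_weather"}',
    ])
    good, rate = collect_verified(lambda: next(canned), {"name", "arguments"}, attempts=3)
    print(len(good), "verified results, yield", round(rate, 2))
```

The check here is only structural. Whether the arguments make sense still needs its own verification step.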


atoav 428 days ago

As an electrical engineer, I find it absolutely amazing how much LLMs suck at describing electrical circuits. They are somewhat OK with natural language, which works for the simplest circuits. For more complex stuff, ChatGPT (regardless of model) seems to default to absolutely nonsensical ASCII circuit diagrams. You can ask it to list each part with each terminal and describe the connections to other parts and terminals, and it will fail spectacularly: missing parts, missing terminals, parts no one has ever heard of, short circuits, dangling nodes with no use...

If you ask it to draw a schematic, things somehow get even worse.

But what it is good at is proposing ideas. So if you want to do a thing that could be solved by using a Gilbert cell, there is a realistic chance that it will mention a Gilbert cell.

But I am already having students come by with LLM slop circuits, asking why they don't work...
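Some of the failure modes listed above (unknown parts, missing terminals, dangling nodes, shorted parts) can be caught mechanically before anyone builds the circuit. A rough sketch, assuming the LLM's description has already been transcribed into a simple netlist; the data layout and the name `check_netlist` are made up for illustration and are not any real EDA tool's format:

```python
def check_netlist(
    parts: dict[str, set[str]],
    nets: dict[str, list[tuple[str, str]]],
) -> list[str]:
    """Return a list of human-readable problems found in a netlist.

    parts maps a part name to its terminal names.
    nets maps a net name to the (part, terminal) pins connected to it.
    """
    problems = []
    terminal_net: dict[tuple[str, str], str] = {}
    for net, pins in nets.items():
        if len(pins) < 2:
            problems.append(f"dangling net {net}: fewer than two pins")
        for part, terminal in pins:
            if part not in parts:
                problems.append(f"unknown part {part} on net {net}")
            elif terminal not in parts[part]:
                problems.append(f"unknown terminal {part}.{terminal} on net {net}")
            elif (part, terminal) in terminal_net:
                problems.append(f"terminal {part}.{terminal} appears on more than one net")
            else:
                terminal_net[(part, terminal)] = net
    for part, terminals in parts.items():
        missing = [t for t in sorted(terminals) if (part, t) not in terminal_net]
        for terminal in missing:
            problems.append(f"unconnected terminal {part}.{terminal}")
        # A part whose terminals all sit on one net is shorted out.
        part_nets = {terminal_net[(part, t)] for t in terminals if (part, t) in terminal_net}
        if len(terminals) >= 2 and not missing and len(part_nets) == 1:
            problems.append(f"part {part} is shorted: all terminals on one net")
    return problems


if __name__ == "__main__":
    parts = {"V1": {"+", "-"}, "R1": {"1", "2"}}
    nets = {"VCC": [("V1", "+"), ("R1", "1")], "GND": [("V1", "-")]}
    for problem in check_netlist(parts, nets):
        print(problem)
```

This only checks connectivity. It says nothing about whether the circuit does what it is supposed to do.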


mcv 428 days ago

Makes sense. It's not trained on complex electrical circuits; it's trained on natural language. And code, sure. And other stuff it comes across while training on those, no doubt including simple circuitry. But ultimately, all it does is produce plausible conversations, plausible responses, stuff that looks and sounds good. Whether it's actually correct, whether it works, I don't think that's even a concept in these systems. If it gets it correct by accident, that's mostly because correct responses also look plausible.

It claims to have run code on a MacBook because that's a plausible response from a human in this situation. It's basically trying to beat the Turing test, but if you know it's a computer, it's obvious it's lying to you.


code_biologist 428 days ago

> Whether it's actually correct, whether it works, I don't think that's even a concept in these systems.

I'm not an expert, but it is a concept in these systems. Check out some videos on DeepSeek's R1 paper. In particular, there's a lot they did to incentivize the chain-of-thought reasoning process towards correct answers in "coding, mathematics, science, and logic reasoning" during reinforcement learning. I presume basically all the state-of-the-art CoT reasoning models have some similar "correct and useful reasoning" portion in their RL tuning. This explains why models are getting better at math and code, but not as much at creative writing. As I understand it, everybody is pretty data-limited, but it's much easier to generate synthetic training data where there is a right answer than it is to make good synthetic creative writing. It's also much easier to check that the model is answering those problems correctly during training, rather than waiting for human feedback via RLHF.

It seems that OpenAI forgot to make sure their critic model punished o3 for being wrong when it claimed it had a laptop, lol.
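To make the "easy to check during training" point concrete, here is a toy sketch of a rule-based correctness reward for problems with a known answer. It is not DeepSeek's or OpenAI's code. The `\boxed{...}` final-answer convention and the names `extract_final_answer` and `accuracy_reward` are assumptions for illustration:

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the completion, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted final answer exactly matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0


if __name__ == "__main__":
    print(accuracy_reward("6 * 7 = 42, so the answer is \\boxed{42}", "42"))
    print(accuracy_reward("I ran it on my laptop and got 42", "42"))
```

A reward like this only looks at the final answer. A completion that reaches the right number while making a false claim along the way would still score 1.0, which is the kind of gap the o3 remark points at.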
