| HN Mirror


	by bhu8 543 days ago
	Yeah, the more pages I read, the more disappointed I became. Here is the reason they cite for the low performance (which is even more worrying): "The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."

2 comments

tippytippytango 543 days ago

Good to know OpenAI knows the frustration of trying to argue with their RL-based models as well.

eightysixfour 543 days ago

Aider found that the best performance with R1 came from using R1 to think through the solution and Claude to implement it. I suspect that, in the near term, we'll need combinations of reasoning models and instruction-following coding models for excellent code output.

My experience with most of the models focused on reasoning improvements has been that they tend to be a bit worse at following specific instructions. It is also notable that a lot of third-party fine-tunes of Llama and other models gain on knowledge-based benchmarks while losing on instruction-following scores.

I wonder why that seems to be some sort of continuum?

arresin 543 days ago

Kind of like an AI “thinking fast and thinking slow”.

eightysixfour 543 days ago

Sort of? I don't see why thinking slow should inhibit the ability to follow instructions.

Arcuru 543 days ago

I think they're referencing "Thinking, Fast and Slow" - https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

"The book's main thesis is a differentiation between two modes of thought: 'System 1' is fast, instinctive and emotional; 'System 2' is slower, more deliberative, and more logical."

eightysixfour 543 days ago

Yes, I understand the reference. I don't understand their argument that this is a good example of that common mental model for LLMs.

In this case "fast, instinctive, and emotional" models are better at instruction following than "slower, more deliberative, and more logical" models.
