HN Mirror

	by mediaman 242 days ago
	That's already happening. It started when they incorporated reinforcement learning into the training process. It's been some time since LLMs were purely stochastic average-token predictors; their later RL fine-tuning stages make them quite goal-directed, and this is what has produced some big leaps in verifiable domains like math and programming. It doesn't work that well in non-verifiable domains, though, since verifiability is what gives us the reward function.
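
	To make the "verifiability gives us the reward function" point concrete, here is a minimal sketch of what such a reward might look like. The names `math_reward` and `code_reward` are hypothetical and this is not any lab's actual training pipeline; it only illustrates that in math and programming the score can be computed mechanically, with no human judgment in the loop.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Hypothetical reward: 1.0 if the answer equals the reference exactly, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable output earns no reward
    return 1.0 if same else 0.0


def code_reward(candidate, test_cases) -> float:
    """Hypothetical reward: fraction of (args, expected) test cases the candidate passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(test_cases)
```

	For something like "write a moving essay" there is no equivalent checker to call, which is the gap in non-verifiable domains.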

1 comment

santadays 242 days ago

That would explain why they are so much better at writing code than at actually following the steps that same code specifies.

Curious, is anyone training in adversarial simulations? In open world simulations?

I think what humans do is align their own survival instinct with surrogate activities and then rewrite their internal schema to be successful in those activities.
