Huh?

I thought the latest advance in computing (spring 2025, so last year) was self-play / reinforcement learning. After all, we ran out of training data a few years ago.

https://github.com/OpenPipe/ART

That's reinforcement learning where the large language model devises its own puzzles and then solves them, with the attempts scored via LLM-as-judge.

LLM-as-judge means your LLM generates 8-12 trajectories and a different LLM judges the results. For a problem like generating assembly for a given ISA, I'd use an oracle as the judge: actually execute the output on a Windows or Linux operating system.

The winning trajectories are then used to train the large language model.
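
Roughly, the generate / judge / keep-the-winners loop looks like the sketch below. To be clear, this is not ART's API: `generate_trajectories`, `oracle_judge`, and `select_winners` are hypothetical names, the "policy" is just a random number generator standing in for an LLM, and the judge is a toy oracle that compares an answer against a known result (where a real setup would use a second LLM or actual program execution).

```python
import random
from typing import Callable, List, Tuple


def generate_trajectories(prompt: str, n: int, rng: random.Random) -> List[str]:
    """Hypothetical stand-in for sampling n candidate answers from a policy LLM."""
    return [f"{prompt} -> {rng.randint(0, 20)}" for _ in range(n)]


def oracle_judge(trajectory: str, expected: int) -> float:
    """Toy oracle: score is higher (closer to 0) the nearer the answer is to expected."""
    answer = int(trajectory.rsplit("->", 1)[1])
    return -float(abs(answer - expected))


def select_winners(
    trajectories: List[str],
    judge: Callable[[str], float],
    keep: int = 2,
) -> List[Tuple[str, float]]:
    """Score every trajectory with the judge and keep the top `keep` of them."""
    scored = [(t, judge(t)) for t in trajectories]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = generate_trajectories("7 + 5", 8, rng)
    winners = select_winners(candidates, lambda t: oracle_judge(t, 12), keep=2)
    # In a real pipeline these (trajectory, score) pairs would feed the training step.
    for trajectory, score in winners:
        print(score, trajectory)
```

Swapping the judge is the whole point of passing it in as a callable: the same loop works whether the score comes from another model or from running the output and checking what happened.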