

	by phh 383 days ago
	Cool cool. I'm a bit put off by calling it "reasoning"/"thought". These RL targets can be achieved without a "thinking" model, but still cool. Gotta love the brainfuck task. I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands of RL tasks (without any proof whatsoever, so rather a feeling). So I've been wanting an "RL Zoo" for quite a while. I hope this project won't be a one-off and will be maintained long term with many external contributions to add new targets!

3 comments

CuriouslyC 383 days ago

Gemini 2.5 Pro's superiority is IMO largely driven by their long context support and training methodology. Compare Gemini as a beta reader for a 100k token book with GPT4.1 or Claude 4, and it becomes quite clear how much more effectively it can reason across its context than other comparable models. This also makes it much better for architecting new features into a system, since you can load a lot of the current system into the context and it'll conform to existing styles and architecture patterns more closely.


jacob019 382 days ago

Agreed, 2.5 Flash too. I analyze a large JSON document of metrics for pricing decisions. It is typically around 200k, occasionally up to 1M, and Gemini 2.5 significantly outperforms for my task. It isn't 100%, but role playing gets close. I suppose that's a form of inference time compute.


t55 382 days ago

For a 100k token context window, all those models are comparable though

gemini 2.5 pro shines for 200k+ tokens


CuriouslyC 382 days ago

I can confirm from first hand experience that even at 100k they are most definitely not comparable for the task of beta reading.


throwaway314155 382 days ago

splitting hairs much?


t55 383 days ago

> I personally think that Gemini 2.5 Pro's superiority comes from having hundreds or thousands of RL tasks (without any proof whatsoever, so rather a feeling).

Given that GDM pioneered RL, that's a reasonable assumption


flowerthoughts 383 days ago

Assuming that by GDM you mean Google DeepMind: they pioneered RL with deep nets as the policy function estimator, the deep nets being a result of CNNs and massive improvements in hardware parallelization at the time.

RL was established, at the latest, with Q-learning in 1989: https://en.wikipedia.org/wiki/Q-learning
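
For anyone who hasn't seen it, here is roughly what tabular Q-learning looks like, with no deep net involved. This is only a minimal sketch: the 5-state chain environment, the function name `q_learning`, and all hyperparameter values are made up for illustration and do not come from any particular paper or library.

```python
import random

def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical 1-D chain.

    The agent starts at state 0 and gets reward 1 when it reaches the
    last state, which ends the episode. Action 0 moves left, action 1
    moves right; moving left at state 0 leaves the agent in place.
    """
    rng = random.Random(seed)
    moves = (-1, +1)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]

    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(q[s])
                a = rng.choice([i for i in (0, 1) if q[s][i] == best])

            s_next = min(max(s + moves[a], 0), goal)
            r = 1.0 if s_next == goal else 0.0

            # Q-learning update: bootstrap from the best action in the
            # next state. The terminal state's values are never updated,
            # so they stay at 0.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

if __name__ == "__main__":
    for state, values in enumerate(q_learning()):
        print(state, [round(v, 3) for v in values])
```

The deep RL work swaps the table `q` for a neural network that estimates the same values, which is what makes it usable when the state space is too large to enumerate.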


t55 382 days ago

i didn't say they invented everything; in science you always stand on the shoulders of giants

i still think my original statement is fair


lechatonnoir 382 days ago

"gdm pioneered rl" is definitely not actually right, but it's correct to assert that they were huge players.

people who knew from context that your statement was broadly not actually right would know what you mean and agree on vibes. people who didn't could reasonably be misled, i think.


olliestanley 383 days ago

We definitely plan to maintain the project for as long as there is interest in it. If you have ideas for new tasks, we'd always welcome contributions!


phh 383 days ago

Thanks for the answer! As a toy project I implemented wikiracing with trl. I'll probably try to PR that to your gym. (Can't say that I managed to improve the score with it though.)
