| HN Mirror

by spaceman_2020 650 days ago

1673 ELO is wild

If it's actually true in practice, I sincerely cannot imagine a scenario where it would be cheaper to hire actual junior or mid-tier developers (keyword: "developers", not architects or engineers).

1,673 ELO should be able to build very complex, scalable apps with some guidance.

2 comments

usaar333 650 days ago

I'm not sure how well Codeforces percentiles correlate with software engineering ability. Looking at all the data, it still isn't a game changer. Key notes:

1. AlphaCode 2 was already at 1650 last year.

2. SWE-bench Verified under an agent has jumped from 33.2% to 35.8% under this model (which doesn't really matter). The full model is at 41.4%, which still isn't a game changer either.

3. It's not handling open-ended questions much better than GPT-4o.

deisteve 650 days ago

I think you are right, actually. Initially I got excited, but now I think OpenAI pulled the hype card again to seem relevant as they struggle to be profitable.

Claude, on the other hand, has been fantastic and seems to do similar reasoning behind the scenes with RL.

usaar333 650 days ago

The model is really impressive, to be fair. The question is just how economically relevant it is.

deisteve 650 days ago

Currently my workflow is: generate some code, run it, and if it doesn't run, tell the LLM what I expected. It then produces new code, and I frequently tell it how to reason about the problem.
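A minimal sketch of that generate, run, and report-back loop, assuming a hypothetical `generate` callable that wraps whatever LLM is in use (it is not a real API; swap in your own client). The names `run_snippet` and `refine` are also made up for illustration.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional, Tuple


def run_snippet(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Run code in a fresh Python process; return (succeeded, combined output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    finally:
        os.unlink(path)  # always clean up the temporary file
    return proc.returncode == 0, proc.stdout + proc.stderr


def refine(
    generate: Callable[[str], str],  # hypothetical: prompt in, code out
    task: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for code, run it, and feed failures back until it runs or rounds run out."""
    prompt = task
    for _ in range(max_rounds):
        code = generate(prompt)
        ok, output = run_snippet(code)
        if ok:
            return code
        # This is the manual copy-and-paste step, automated: show the model
        # its previous attempt and what went wrong.
        prompt = (
            f"{task}\n\nPrevious attempt:\n{code}\n\n"
            f"It failed with:\n{output}\nPlease fix it."
        )
    return None
```

This only checks that the code exits cleanly, not that it does what was expected, so the "tell it what I expected" part still needs a human or a test suite.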

O1 being in the 89th percentile would mean it should be able to think at a junior to intermediate level with very strong consistency.

I don't think people in the comments realize the implication of this. Previously LLMs were only able to "pattern match", but now it's able to evaluate itself (with some guidance ofc), essentially steering the software into the depth of edge cases and reasoning about it in a way that feels natural to us.

Currently I'm copying and pasting stuff and telling the LLM the results, but once O1 is available it's going to significantly lower that frequency.

For example, I expect it to self-evaluate the code it generates and think at higher levels.

ex) oooh, looks like this user shouldn't be able to escalate privileges in this case because it would lead to security issues, or it could conflict with the code I generated 3 steps ago, I'll fix it myself.
