	by chad1n 377 days ago
	The guys in the other thread who said that OpenAI might have quantized o3, and that's how they reduced the price, might be right. This o3-pro might be the actual o3-preview from the beginning, and the o3 might be just a quantized version. I wish someone would benchmark all of these models to check for drops in quality.

5 comments

simonw 377 days ago

That's definitely not the case here. The new o3-pro is slow - it took two minutes just to draw me an SVG of a pelican riding a bicycle. o3-preview was much faster than that.

https://simonwillison.net/2025/Jun/10/o3-pro/

teruakohatu 376 days ago

Do you think a cycling pelican is still a valid cursory benchmark? By now surely discussions about it are in the training set.

There are quite a few on Google Image search.

On the other hand they still seem to struggle!

FergusArgyll 377 days ago

Wow! pelican benchmark is now saturated

esperent 377 days ago

Not until I can count the feathers, ask for a front view of the same pelican, then ask for it to be animated, all still using SVG.

dtech 376 days ago

I wonder how much of that is because it's getting more and more included in training data.

We now need to start using walruses riding rickshaws.

CamperBob2 377 days ago

Would you say this is the best cycling pelican to date? I don't remember any of the others looking better than this.

Of course by now it'll be in-distribution. Time for a new benchmark...

jstummbillig 377 days ago

I love that we are in the timeline where we are somewhat seriously evaluating probably super human intelligence by its ability to draw an SVG of a cycling pelican.

CamperBob2 377 days ago

I still remember my jaw hitting the floor when the first DALL-E paper came out, with the baby daikon radish walking a dog. How the actual fuck...? Now we're probably all too jaded to fully appreciate the next advance of that magnitude, whatever that turns out to be.

E.g., the pelicans all look pretty cruddy including this one, but the fact that they are being delivered in .SVG is a bigger deal than the quality of the artwork itself, IMHO. This isn't a diffusion model, it's an autoregressive transformer imitating one. The wonder isn't that it's done badly, it's that it's happening at all.

datameta 376 days ago

This makes me think of a reduction gear as a metaphor. At a high enough ratio, the torque is enormous but is being put toward barely perceptible movement. There is a huge amount of computation happening to result in an SVG that resembles a pelican on a bicycle.

Gerardo1 377 days ago

I don't love that this is the conversation, and that when these models bake in these silly scenarios with training data, everyone goes "see, pelican bike! super human intelligence!"

The point is never the pelican. The point is that if a thing has information about pelicans, and has information about bicycles, then why can't it combine those ideas? Is it because it's not intelligent?

CamperBob2 377 days ago

"I'm taking this talking dog right back to the pound. It told me to go long on AAPL. Totally overhyped"

radlad 376 days ago

Just because it's impressive doesn't mean it has "super human intelligence" though.

simonw 377 days ago

I like the Gemini 2.5 Pro ones a little more: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

AstroBen 377 days ago

That's one good looking pelican

torginus 376 days ago

This made me think of the 'draw a bike' experiment, where people were asked to draw a bike from memory and were surprisingly bad at recreating how the parts fit together in a sensible manner:

https://road.cc/content/blog/90885-science-cycology-can-you-...

ChatGPT seems to perform better than most, but with notable missing elements (where's the chain or the handlebars?). I'm not sure if those are due to a lack of understanding, or artistic liberties taken by the model?

k2xl 377 days ago

Not distilled, same model. https://x.com/therealadamg/status/1932534244774957121?s=46&t...

eru 376 days ago

Well, that might be more of a function of how long they let it 'reason' than anything intrinsic to the model?

Terretta 377 days ago

> It's only available via the newer Responses API

And in ChatGPT Pro.

torginus 376 days ago

I've wondered if some kind of smart pruning is possible during evaluation.

What I mean by that is: if a neuron implements a sigmoid function and its input weights are 10, 1, 2, 3, then if the first input is active, evaluating the other ones is mathematically pointless, since it doesn't meaningfully change the result (the sigmoid is already saturated), which recursively means that evaluating the inputs of the neurons that feed those other inputs is pointless as well.

I have no idea how feasible or practical it is to implement such an optimization at full network scale, but I think it's interesting to think about.
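To make the single-neuron case above concrete, here is a minimal sketch, not a claim about how any real inference engine works. It assumes inputs lie in [0, 1], there is no bias term, and the weights are the 10, 1, 2, 3 from the comment. The helper names (`output_range_given_first`, `can_skip`) and the tolerance are hypothetical, picked only for illustration. The idea is to bound the output using just the first input and skip the rest when the bound is tight enough.

```python
import math

# Weights from the comment above; everything else here is an assumption.
WEIGHTS = [10.0, 1.0, 2.0, 3.0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output_range_given_first(first, weights=WEIGHTS):
    """Bounds on the neuron output when only the first input is known.

    Assumes the remaining inputs lie in [0, 1] and there is no bias.
    """
    base = weights[0] * first
    # Each remaining input contributes between min(w, 0) and max(w, 0).
    lowest = base + sum(min(w, 0.0) for w in weights[1:])
    highest = base + sum(max(w, 0.0) for w in weights[1:])
    return sigmoid(lowest), sigmoid(highest)


def can_skip(first, tol=1e-3, weights=WEIGHTS):
    """True if the other inputs cannot move the output by more than tol."""
    low, high = output_range_given_first(first, weights)
    return high - low < tol


if __name__ == "__main__":
    print(can_skip(1.0))  # first input active: sigmoid is saturated
    print(can_skip(0.0))  # first input inactive: the others still matter
```

Under these assumptions the result is not strictly unchanged when the first input is active, only unchanged within a small tolerance, so any such pruning would be an approximation rather than an exact shortcut.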

gkamradt 377 days ago

o3-pro is not the same as the o3-preview that was shown in Dec '24. OpenAI confirmed this for us. More on that here: https://x.com/arcprize/status/1932535380865347585

weinzierl 377 days ago

Is there a way to figure out likely quantization from the output? I mean, does quantization degrade output quality in certain ways that are different from modifications of other model properties (e.g. size or distillation)?
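For readers unfamiliar with what quantization does to weights, here is a rough sketch of symmetric int8 rounding on a made-up weight vector. It does not answer the question of whether quantization can be detected from outputs alone, and nothing in it reflects how OpenAI serves its models. The function names and the example weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floats."""
    return [v * scale for v in quantized]


# Made-up weights, purely for illustration.
weights = [0.82, -0.31, 0.07, -1.27, 0.004]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Round-trip error per weight; bounded by half the step size (scale / 2).
errors = [abs(a - b) for a, b in zip(weights, restored)]

if __name__ == "__main__":
    print(quantized, scale)
    print(errors)
```

Each weight moves by at most half a quantization step, so the damage is small per weight but spread across the whole model, which is one reason it may be hard to tell apart from other changes by looking at outputs.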

hapticmonkey 376 days ago

What a great future we are building. If AI is supposed to run everything, everywhere....then there will be 2, maybe 3, AI companies. And nobody outside those companies knows how they work.

eru 375 days ago

What makes you think so? So far, many new AI companies are sprouting and many of them seem to be able to roughly match the state-of-the-art very quickly. (But pushing the frontier seems to be harder.)

From the evidence we have so far, it does not look like there's any natural monopoly (or even natural oligopoly) in AI companies. Just the opposite. Especially with open weight models, or even more so with completely open source models.

jsjohnst 376 days ago

> And nobody outside those companies knows how they work.

I think you meant to say:

And nobody knows how they work.
