by foundry27 359 days ago

Model cards, for the people interested in the guts: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...

In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:

- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.

- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They’re using some kind of Gated SwiGLU activation, which the card talks about as being "unconventional" because of clamping and whatever residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.

- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58bit quants :)
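
For a rough sense of the memory arithmetic in the last bullet, here is a back-of-the-envelope sketch. It uses the 116.8B total parameters and 4.25 bits/parameter quoted above; the assumption that the remaining parameters are stored at 16 bits is mine, not something stated in the comment, and the function name is made up for illustration. It counts weights only (no activations, no KV cache).

```python
def checkpoint_gigabytes(total_params, quantized_fraction,
                         quantized_bits=4.25, other_bits=16):
    """Approximate weight storage in GB (10^9 bytes) for a mixed-precision model.

    other_bits=16 is an assumption (e.g. BF16 for the non-quantized weights).
    """
    quantized = total_params * quantized_fraction * quantized_bits
    remaining = total_params * (1.0 - quantized_fraction) * other_bits
    return (quantized + remaining) / 8 / 1e9


if __name__ == "__main__":
    # The comment only says ">90%", so try a few fractions.
    for fraction in (0.90, 0.95, 0.99):
        size = checkpoint_gigabytes(116.8e9, fraction)
        print(f"{fraction:.0%} quantized: {size:.1f} GB")
```

Under these assumptions the estimate is quite sensitive to the exact quantized fraction, which is why ">90%" alone does not pin down how much headroom is left on an 80GB card.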

All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.

7 comments

highfrequency 359 days ago

I would guess the “secret sauce” here is distillation: pretraining on an extremely high quality synthetic dataset from the prompted output of their state of the art models like o3 rather than generic internet text. A number of research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.

asadm 359 days ago

> research results have shown that highly curated technical problem solving data is unreasonably effective at boosting smaller models’ performance.

same seems to be true for humans

throw310822 359 days ago

Yes, if I understand correctly, what it means is "a very smart teacher can do wonders for their pupils' education".

tempaccount420 359 days ago

Wish they gave us access to learn from those grandmother models instead of distilled slop.

ashdksnndck 359 days ago

It behooves them to keep the best stuff internal, or at least greatly limit any API usage to avoid giving the goods away to other labs they are racing with.

saurik 358 days ago

Which, presumably, is the reason they removed 4.5 from the API... mostly the only people willing to pay that much for that model were their competitors. (I mean, I would pay even more than they were charging, but I imagine even if I scale out my use cases--which, for just me, are mostly satisfied by being trapped in their UI--it would be a pittance vs. the simpler stuff people keep using.)

rfoo 359 days ago

Or, you can say, OpenAI has some real technical advancements on stuff besides attn architecture. GQA8, alternating SWA 128 / full attn do all seem conventional. Basically they are showing us that "no secret sauce in model arch, you guys just suck at mid/post-training", or they want us to believe this.

The model is pretty sparse tho, 32:1.
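
To make "GQA8, alternating SWA 128 / full attn" concrete, here is a minimal sketch of the two mask types and the query-to-KV head mapping. The 128-token window and the 64 query / 8 KV heads come from the thread; whether the window counts the current token, and which layers get which mask, are assumptions for illustration, and all function names are hypothetical.

```python
def causal_mask(seq_len, window=None):
    """mask[i][j] is True if query position i may attend to key position j.

    window=None gives dense causal attention. An integer gives a banded causal
    window covering the current token and the (window - 1) tokens before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(num_layers, seq_len, window=128):
    """Alternate banded and dense masks (the layer parity is an assumption)."""
    return [causal_mask(seq_len, window if layer % 2 == 0 else None)
            for layer in range(num_layers)]


def kv_head_for(query_head, num_query_heads=64, num_kv_heads=8):
    """GQA: each consecutive group of query heads shares one KV head."""
    return query_head // (num_query_heads // num_kv_heads)
```

With 64 query heads and 8 KV heads, every group of 8 query heads reads the same keys and values, so the KV cache is 8x smaller than with one KV head per query head.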

liuliu 359 days ago

Kimi K2 paper said that the model sparsity scales up with parameters pretty well (MoE sparsity scaling law, as they call it, basically calling Llama 4 MoE "done wrong"). Hence K2 has 128:1 sparsity.

throwdbaaway 359 days ago

I thought Kimi K2 uses 8 active experts out of 384? Sparsity should be 48:1. Indeed Llama4 Maverick is the only one that has 128:1 sparsity.
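
The ratios being compared here are just total experts divided by active experts per token. A small sketch using only the figures quoted in this thread (128 experts with Top-4 routing for the 120B model, 8 of 384 for Kimi K2); it counts routed experts only and ignores any shared experts, and the function name is made up for illustration.

```python
from fractions import Fraction


def sparsity_ratio(total_experts, active_experts):
    """Total experts per active expert, kept exact as a fraction."""
    return Fraction(total_experts, active_experts)


if __name__ == "__main__":
    # (total experts, active experts per token), as quoted in the thread.
    configs = {
        "gpt-oss-120b": (128, 4),
        "Kimi K2": (384, 8),
    }
    for name, (total, active) in configs.items():
        print(f"{name}: {sparsity_ratio(total, active)}:1")
```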

liuliu 358 days ago

You are right. I mis-remembered the sparsity part of K2. The "done wrong" part I was thinking of is how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse).

throwdbaaway 357 days ago

> how the scout -> maverick -> behemoth doesn't scale sparsity according to any formula (less sparse -> sparse -> less sparse)

Ah I see. I didn't notice that behemoth has the same sparsity as scout. That seems quite random indeed.

nxobject 359 days ago

It's convenient to be able to attribute success to things only OpenAI could've done with the combo of their early start and VC money – licensing content, hiring subject matter experts, etc. Essentially the "soft" stuff that a mature organization can do.

tgtweak 359 days ago

I think their MXFP4 release is a bit of a gift since they obviously used and tuned this extensively as a result of cost-optimization at scale - something the open source model providers aren't doing too much, and also somewhat of a competitive advantage.

Unsloth's special quants are amazing but I've found there to be lots of trade-offs vs full quantization, particularly when striving for best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models on reduced quantization. If OpenAI is doing this in production, that is interesting.

logicchains 359 days ago

>They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model to fit on a single 80GB GPU, which is pretty cool

They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.

rushingcreek 359 days ago

The native FP4 is one of the most interesting architectural aspects here IMO, as going below FP8 is known to come with accuracy tradeoffs. I'm curious how they navigated this and how the FP8 weights (if they exist) would perform.

buildbot 359 days ago

One thing to note is that MXFP4 is a block scaled format, with 4.25 bits per weight. This lets it represent a lot more numbers than just raw FP4 would with say 1 mantissa and 2 exponent bits.
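
A small sketch of where the 4.25 figure comes from and how a block decodes, following my reading of the OCP Microscaling (MX) format: blocks of 32 four-bit E2M1 elements sharing one 8-bit power-of-two scale. The constant and function names are made up for illustration, and special values such as the NaN scale encoding are ignored.

```python
# FP4 (E2M1) magnitudes: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

BLOCK_SIZE = 32    # elements sharing one scale
ELEMENT_BITS = 4   # bits per E2M1 element
SCALE_BITS = 8     # E8M0 scale: an exponent only, no sign or mantissa
SCALE_BIAS = 127


def bits_per_weight():
    """Amortize the shared scale over the block: (32 * 4 + 8) / 32."""
    return (BLOCK_SIZE * ELEMENT_BITS + SCALE_BITS) / BLOCK_SIZE


def decode_element(code):
    """Decode a 4-bit code (0..15); the top bit is the sign."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]


def decode_block(scale_exponent, codes):
    """Dequantize one block: 2**(scale_exponent - bias) times each element."""
    scale = 2.0 ** (scale_exponent - SCALE_BIAS)
    return [scale * decode_element(code) for code in codes]
```

Raw FP4 can only express the 15 distinct values in that table (with sign), whereas the per-block scale lets each group of 32 weights slide that small grid up or down by powers of two.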

mclau157 359 days ago

You can get similar insights looking at the github repo https://github.com/openai/gpt-oss

unethical_ban 359 days ago

I don't know how to ask this without being direct and dumb: Where do I get a layman's introduction to LLMs that could work me up to understanding every term and concept you just discussed? Either specific videos, or if nothing else, a reliable Youtube channel?

tkgally 359 days ago

What I’ve sometimes done when trying to make sense of recent LLM research is give the paper and related documents to ChatGPT, Claude, or Gemini and ask them to explain the specific terms I don’t understand. If I don’t understand their explanations or want to know more, I ask follow-ups. Doing this in voice mode works better for me than text chat does.

When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.

tshannon 358 days ago

So probably another stupid question, but how do you know what it's spitting out is accurate?

tkgally 358 days ago

One has to be aware of the possibility of hallucinations, of course. But I have not encountered any hallucinations in these sorts of interactions with the current leading models. Questions like "what does 'embedding space' mean in the abstract of this paper?" yield answers that, in my experience, make sense in the context and check out when compared with other sources. I would be more cautious if I were using smaller models or if I were asking questions about obscure information without supporting context.

Also, most of my questions are not about specific facts but about higher-level concepts. For ML-related topics, at least, the responses check out.

umgefahren 359 days ago

There is a great 3blue1brown video, but it’s pretty much impossible by now to cover the entire landscape of research. I bet gpt-oss has some great explanations though ;)

nonfamous 359 days ago

Try Microsoft's "Generative AI for Beginners" repo on GitHub. The early chapters in particular give a good grounding of LLM architecture without too many assumptions of background knowledge. The video version of the series is good too.

cwyers 358 days ago

This is a great book (parts of it are available as blog posts from the author if you want to get a taste of it):

https://www.manning.com/books/build-a-large-language-model-f...

CanuckPro 359 days ago

Try Andrej Karpathy's YouTube videos. I also really liked the Dive into Deep Learning book at d2l.ai

srigi 359 days ago

Start with the YT series on neural nets and LLMs from 3blue1brown

reilly3000 359 days ago

Ask Gemini. Give it a link here in fact.

microtonal 359 days ago

Also: attention sinks (although implemented as extra trained logits used in attention softmax rather than attending to e.g. a prepended special token).
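
A minimal sketch of the idea as described in this comment: an extra logit joins the softmax denominator but has no value vector, so its share of the probability mass is simply dropped. In a real model the sink would be a trained parameter (presumably per head); here it is a plain float, and the function name is hypothetical.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over scores plus one extra logit whose mass is discarded.

    The returned weights sum to less than 1 whenever the sink logit is finite,
    which lets a head attend to "nothing" without a special prepended token.
    """
    peak = max(max(scores), sink_logit)  # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [e / denom for e in exps]
```

Setting the sink logit to negative infinity recovers the ordinary softmax, since the extra term in the denominator becomes zero.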