by kir-gadjello 1233 days ago

This document doesn't contain the architecture and training details of GPT-4. As an engineer, these details would be the most interesting part of it!

Driven by interest in GPT-4 and cutting-edge LLMs, I studied the research literature and compiled a short list of architectural and training details that very likely underpin GPT-4 in this blog post: https://kir-gadjello.github.io/posts/gpt4-some-technical-hyp...

While this is a work in progress, the most important part is already in place and thus I decided to publish it in its current draft state.

Have fun following the TLDR and Arxiv links, fellow HNers!

1 comments

adt 1233 days ago

This is great, thanks, Kirill!

I've added your hypothesis to these ones:

https://lifearchitect.ai/gpt-4/

There's quite a broad range of guesses going on. I lean towards 80B language + 20B vision params trained across 3T collected tokens (could repeat to 10T+), but one of the other (strong) hypotheses is a dense 7T param model. That's absurd...

kir-gadjello 1233 days ago

That's cool, thanks for noting, Alan!

Would you mind adding a reference link to the source, so that other people could visit my blog? I'm just starting out with blogging, it would help me to get more readers and feedback on this draft. I hope to get it in much better shape in just a few days.

More posts are in the pipeline too!

BTW, I'm 99% sure the model uses some form of sparsity, because the competitive pressure for inference efficiency is just too strong. The real question here, of course, is the precise engineering details of the chosen sparsity method. I suggest two promising methods as the most likely; it could be either one of them or both together.

adt 1233 days ago

Great!

I always cite my sources, and you'll find a link to your page as usual.

I wanted to point you towards OpenAI's FIM 6.9B as well. Trained on 100B tokens (Chinchilla-aligned), it was announced just before GPT-4 allegedly started training. I didn't see anyone else talking about it, but maybe you could follow the rabbit trail even further, so to speak!

https://arxiv.org/abs/2207.14255

MacsHeadroom 1233 days ago

Your page supposes that the 32k context window fits 24k words and 48 pages of text. This is incorrect, as OpenAI has already stated that 32k tokens hold about 90 pages of context.

GPT-4 tokens are likely larger and average around 7 characters per token in practice, as is the case with OpenAI Codex (as opposed to GPT-3's four characters per token).

This would result in 224,000 characters of context (at 7 characters per token) vs 128,000 characters (at 4 characters per token), for a total of about 84 pages of context. This is closer to OpenAI's own reporting of "about 90 pages of context."
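
The arithmetic above can be checked with a short sketch. It assumes that "32k" means 32,000 tokens and that a page holds the number of characters implied by the blog's own estimate (48 pages at 4 characters per token). The name `context_pages` and both constants are illustrative assumptions, not figures published by OpenAI.

```python
def context_pages(tokens: int, chars_per_token: float, chars_per_page: float) -> float:
    """Estimate how many pages of text fit in a context window."""
    return tokens * chars_per_token / chars_per_page


# Assumption: "32k" is taken as 32,000 tokens, which matches the
# 224,000 and 128,000 character figures quoted in the comment.
TOKENS = 32_000

# Assumption: page length is derived from the blog's estimate of
# 48 pages at 4 characters per token (about 2,667 characters per page).
CHARS_PER_PAGE = TOKENS * 4 / 48

pages_at_4 = context_pages(TOKENS, 4, CHARS_PER_PAGE)  # GPT-3-like tokens
pages_at_7 = context_pages(TOKENS, 7, CHARS_PER_PAGE)  # hypothesized larger tokens

print(f"4 chars/token: {TOKENS * 4:,} characters, about {round(pages_at_4)} pages")
print(f"7 chars/token: {TOKENS * 7:,} characters, about {round(pages_at_7)} pages")
```

Under these assumptions the 7-character case works out to about 84 pages, so the gap to "about 90 pages" depends mostly on the assumed page length and on the true average token size.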

adt 1233 days ago

Nope
