GPT-4 architecture: what we can deduce from research literature


	GPT-4 architecture: what we can deduce from research literature (kir-gadjello.github.io)
	10 points by kir-gadjello 1194 days ago

3 comments

kir-gadjello 1194 days ago

As the discussion of GPT-4 heats up, the absence of details on its technical implementation only becomes more glaring. As an engineer, I learned nothing applicable from the newest OpenAI publication that I didn't already know yesterday!

I have been investigating issues of LLM training and inference for quite some time, and have developed a number of hypotheses about future SoTA models, which I believe very likely apply to GPT-4.


kir-gadjello 1194 days ago

If you have questions about my rationale for this or that technique included in the list, please ask!

For example, I think Google's paper "Sparse is Enough in Scaling Transformers" was very underrated, as it provided more than an order of magnitude improvement in inference economy, and it included one OpenAI researcher among its authors.


amrb 1194 days ago

I'd like to know how it can support a 32k context when all the other models I've seen are 2-4k. Does this mean it has a bigger attention layer, or that it has 4x as many billions of parameters?


kir-gadjello 1194 days ago

It could be done in a dozen ways. One beautiful method is to use the xPos positional embedding pioneered by Microsoft and scale the context window size at runtime (even better if your attention is subquadratic; again, there are a dozen varieties to pick from). See:

"A Length-Extrapolatable Transformer"

https://arxiv.org/abs/2212.10554

"Language Is Not All You Need: Aligning Perception with Language Models"

https://arxiv.org/abs/2302.14045

Notably, this positional embedding has been implemented by lucidrains in his x-transformers package: https://github.com/lucidrains/x-transformers/blob/main/x_tra...
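To make the idea concrete, here is a minimal pure-Python sketch of an xPos-style embedding: each pair of dimensions is rotated by an angle proportional to the token position and scaled by a position-dependent decay, with keys receiving the inverse scale. The function names (`xpos_rotate`, `score`) and the constants (`theta_base`, `gamma`, `scale_base`) are my own illustrative assumptions, not the code from x-transformers or from the papers above.

```python
import math

def xpos_rotate(vec, pos, theta_base=10000.0, gamma=0.4, scale_base=512.0, is_key=False):
    """Apply an xPos-style rotation and scaling to one query or key vector.

    vec: list of floats with even length d. pos: integer token position.
    All default constants are assumptions chosen for illustration.
    """
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = theta_base ** (-2.0 * i / d)          # rotary frequency for this pair
        zeta = (2.0 * i / d + gamma) / (1.0 + gamma)  # per-pair decay base, between 0 and 1
        scale = zeta ** (pos / scale_base)
        if is_key:
            scale = 1.0 / scale                       # keys get the inverse scale
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(scale * (x * c - y * s))
        out.append(scale * (x * s + y * c))
    return out

def score(q, k, n, m):
    """Attention logit between a query at position n and a key at position m."""
    qr = xpos_rotate(q, n)
    kr = xpos_rotate(k, m, is_key=True)
    return sum(a * b for a, b in zip(qr, kr))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
near = score(q, k, 5, 2)          # offset 3, early in the sequence
far = score(q, k, 32005, 32002)   # same offset 3, deep into a 32k window
print(near, far)
```

The two printed scores agree up to floating-point error, because the logit depends only on the offset n - m. Nothing here is a learned per-position table, so the window length is a runtime choice rather than something tied to parameter count.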


seydor 1194 days ago

Well, if the model is so smart, could it be that it is actually aware of its layers and parameters?


kir-gadjello 1194 days ago

It's no problem to put the model's architecture and even some Python code into the generous 32k context window. The real problem seems to be, as you say, "awareness", or at least the facet of it that would allow the model to answer complex novel questions.

For me, the slightly positive takeaway from OpenAI's paper is the good uncertainty calibration of the base pretrained GPT-4. It could be interpreted as one example of "awareness" of its inner workings.
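For anyone unfamiliar with the term, calibration means that a model's stated confidence matches how often it is right. One common way to quantify it is expected calibration error (ECE); below is a minimal sketch, where the function name, the equal-width binning, and the toy inputs are my own assumptions, not OpenAI's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1]; correct: matching booleans or 0/1 values.
    """
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Toy data: 75% confident and right 3 times out of 4 (well calibrated),
# versus 100% confident and right 1 time out of 4 (overconfident).
print(expected_calibration_error([0.75] * 4, [1, 1, 1, 0]))
print(expected_calibration_error([1.0] * 4, [1, 0, 0, 0]))
```

A lower value means the stated confidences track actual accuracy more closely.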

Of course, it's hard to say much about a model whose architecture we don't even know, not to mention such luxuries as access to weights... Meta's LLaMA release did more to democratize deep learning than OpenAI's GPT-4, that's for sure.
