

	by tuned 231 days ago
	Model Architecture (gpt.rs)
	- Multi-layer Transformer: N stacked decoder blocks with pre-norm residual connections
	- Rotary Position Embeddings (RoPE): Replaces learned positional encodings with rotary embeddings for better length generalization
	- Multi-Query Attention (MQA): Reduces KV cache size by sharing key/value heads across query heads
	- RMSNorm: Parameter-free normalization for stability (instead of LayerNorm)
	- QK-norm: Normalizes queries and keys before attention to prevent numerical instability
	- ReLU² MLP: Uses ReLU(x)² activation for better gradient flow on GPUs
	- Softcap Logits: Bounds output logits using tanh(x/15)*15 to prevent extreme values
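
	For anyone who wants to see the element-wise pieces concretely, here is a minimal sketch in plain Rust of ReLU², the tanh softcap, and a parameter-free RMSNorm. It is not taken from gpt.rs: the function names, the slice-based signatures, and the `eps` parameter are my own assumptions for illustration, and a real implementation would operate on tensors rather than `Vec<f32>`.

```rust
// Illustrative sketch only. Names, signatures, and eps are assumptions,
// not code from gpt.rs.

/// ReLU squared: max(x, 0)^2. Zero for negative inputs, quadratic otherwise.
pub fn relu_squared(x: f32) -> f32 {
    let r = x.max(0.0);
    r * r
}

/// Softcap: cap * tanh(x / cap). Roughly the identity near zero and
/// bounded to [-cap, cap] for large |x|. The comment above uses cap = 15.
pub fn softcap(x: f32, cap: f32) -> f32 {
    cap * (x / cap).tanh()
}

/// Parameter-free RMSNorm: divide each element by the root-mean-square of
/// the vector. There is no learned gain or bias. Assumes `x` is non-empty.
pub fn rms_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().map(|v| v * inv_rms).collect()
}
```

	Calling `softcap(logit, 15.0)` on each output logit reproduces the tanh(x/15)*15 bound from the list, and `rms_norm` leaves a vector whose mean square is approximately 1.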