by tuned
231 days ago
Model Architecture (gpt.rs)
Multi-layer Transformer: N stacked decoder blocks with pre-norm residual connections
Rotary Position Embeddings (RoPE): Replaces learned positional encodings with rotary embeddings for better length generalization
Multi-Query Attention (MQA): Reduces KV cache size by sharing key/value heads across query heads
RMSNorm: Parameter-free normalization for stability (instead of LayerNorm)
QK-norm: Normalizes queries and keys before attention to prevent numerical instability
ReLU² MLP: Uses ReLU(x)² activation for better gradient flow on GPUs
Softcap Logits: Bounds output logits using tanh(x/15)*15 to prevent extreme values
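
Several of the components listed above are small enough to sketch in plain Rust. The snippet below is an illustration only, not the code in gpt.rs: the function names, the slice-based signatures, the `eps` parameter, the RoPE base, and the split-half pairing convention for RoPE are all assumptions made for the example.

```rust
/// Parameter-free RMSNorm: divide by the root mean square of the vector.
/// `eps` is an assumed small constant that guards against division by zero.
fn rms_norm(x: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter().map(|v| v * scale).collect()
}

/// ReLU² activation: max(x, 0) squared.
fn relu_squared(x: f32) -> f32 {
    let r = x.max(0.0);
    r * r
}

/// Logit softcap: cap * tanh(x / cap), so the output stays within [-cap, cap].
/// The list above uses cap = 15.
fn softcap(x: f32, cap: f32) -> f32 {
    cap * (x / cap).tanh()
}

/// Rotary position embedding for one head vector at position `pos`.
/// Assumes an even-length vector and pairs element i with element i + half
/// (the interleaved pairing used by some implementations would differ).
fn apply_rope(x: &mut [f32], pos: usize, base: f32) {
    let half = x.len() / 2;
    for i in 0..half {
        // Per-pair rotation frequency: base^(-i / half).
        let freq = 1.0 / base.powf(i as f32 / half as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[i], x[i + half]);
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}
```

In a block following this design, `rms_norm` would be applied to the block input (pre-norm) and, for QK-norm, to each query and key head after `apply_rope`; `softcap(logit, 15.0)` would be applied to each output logit. Since each RoPE step is a pure rotation, it leaves the norm of the head vector unchanged.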