Multi-head Latent Attention (MLA), Multi-Token Prediction, and the Mixture-of-Experts (MoE) architecture are some of the best-known examples.