| HN Mirror

my thoughts too, based on limited understanding of GPT. but the more pressure you apply towards compressing the neural network during training, the more circuitry these paths are likely to share. it would be interesting to see just how much and which parts could be folded together before you start to lose significant fidelity (though unfortunately the fidelity seems too low today to even try that).