
When it comes to compute cost, the choice of activation function makes little difference nowadays (it can often be fused with whatever operation comes before it, which makes it effectively free).

The real reason is simple: it was inherited.

relu^2 was used in the nanogpt speedrun[1] because it produced the best empirical results there. Andrej then based his nanochat on the nanogpt speedrun without changing the activation function, and this project was in turn based on nanochat.
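
For anyone unfamiliar with it, relu^2 is, as far as I understand it, just a plain ReLU with its output squared. A minimal pure-Python sketch of that reading follows; the function names are my own and are not taken from any of the repos mentioned:

```python
def relu_squared(x: float) -> float:
    """Squared ReLU: max(0, x) ** 2 (assumed definition of relu^2)."""
    r = max(0.0, x)  # ordinary ReLU
    return r * r     # square the result


def relu_squared_grad(x: float) -> float:
    """Derivative of the function above: 2 * max(0, x)."""
    return 2.0 * max(0.0, x)


# Negative inputs are zeroed out; positive inputs grow quadratically.
values = [-2.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu_squared(v) for v in values]
print(outputs)
```

Both the function and its gradient are a clamp followed by a single multiply per element, which is why this kind of activation is cheap and easy to fold into the preceding operation.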

[1] -- https://github.com/KellerJordan/modded-nanogpt