Hacker News
by diwank 775 days ago
It’d be really cool to see a transformer with the MLP layers swapped for KANs and then compare its scaling properties with vanilla transformers
3 comments

After trying this out with the Fourier implementation above, swapping the MLP/attention Linear layers for KANs (all of them, or even just a few) produces a diverging loss. KANs don't require normalization for good forward-pass dynamics, but they may be trickier to train in a deep net.
Note that KANs use L-BFGS, which is a quasi-Newton (approximate second-order) optimization method. My experience with second-order features suggests that simple gradient descent often leads to divergence.
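A toy illustration of the general point, not of KANs or of L-BFGS specifically: on an ill-conditioned quadratic, plain gradient descent diverges as soon as the learning rate exceeds 2 divided by the largest curvature, while a step that uses curvature information does not. The sketch below swaps in an exact Newton step on a diagonal quadratic as a stand-in for a curvature-aware optimizer; the function names and the numbers are made up for the example.

```python
def grad(x, curv):
    # Gradient of f(x) = 0.5 * sum(curv[i] * x[i]**2).
    return [c * xi for c, xi in zip(curv, x)]


def gradient_descent(x, curv, lr, steps):
    # First-order update: the same step size is applied in every direction.
    for _ in range(steps):
        g = grad(x, curv)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def newton_step(x, curv):
    # Curvature-aware update: each direction is rescaled by its own curvature
    # (the Hessian here is diagonal, so inverting it is a per-coordinate divide).
    g = grad(x, curv)
    return [xi - gi / c for xi, gi, c in zip(x, g, curv)]


curv = [1.0, 100.0]   # condition number 100
start = [1.0, 1.0]

# lr = 0.03 is stable for the flat direction but exceeds 2 / 100 for the steep one.
gd_result = gradient_descent(start, curv, lr=0.03, steps=20)
newton_result = newton_step(start, curv)
```

Here the steep coordinate of `gd_result` blows up while `newton_result` lands on the minimum in one step. Real networks are not quadratics, so treat this as intuition only.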
This was the first thought that came to my mind too.

Given that it's sparse, will this just be a replacement for MoE?

MoE is mostly used to enable load balancing, since it makes it possible to put experts on different GPUs. This isn't so easy to do with a monolithic but sparse layer.
Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was in the renormalization part.
At small input sizes, yes, the MLP dominates compute. At large input sizes, attention matters more.
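Rough back-of-the-envelope numbers for this, as a sketch under the usual assumptions: one standard transformer layer, forward pass only, a multiply-add counted as 2 FLOPs, full (non-causal) attention, and an MLP hidden width of 4x the model width. The function name and the example sizes are mine.

```python
def per_token_flops(d_model, seq_len, mlp_ratio=4):
    """Approximate forward FLOPs per token for one transformer layer."""
    # MLP: two matmuls, d_model -> mlp_ratio*d_model -> d_model.
    mlp = 2 * 2 * d_model * (mlp_ratio * d_model)
    # Attention projections: Q, K, V and output, each d_model x d_model.
    attn_proj = 4 * 2 * d_model * d_model
    # Attention mixing: Q.K^T scores plus the weighted sum over V,
    # each costing about 2 * seq_len * d_model per token.
    attn_mix = 2 * 2 * seq_len * d_model
    return {"mlp": mlp, "attn_proj": attn_proj, "attn_mix": attn_mix}


short = per_token_flops(d_model=1024, seq_len=512)
long = per_token_flops(d_model=1024, seq_len=8192)

# Sequence-dependent attention cost overtakes the MLP once seq_len > mlp_ratio * d_model.
crossover_seq_len = 4 * 1024
```

Under these assumptions the MLP costs 16·d² per token and the sequence-dependent part of attention costs 4·n·d, so the MLP is the larger term until the sequence length n reaches about 4·d. This counts FLOPs only and ignores memory bandwidth and KV-cache effects.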