


	by diwank 822 days ago
	It’d be really cool to see a transformer with the MLP layers swapped for KANs and then compare its scaling properties with vanilla transformers

3 comments

bart1ett 822 days ago

I tried this out with the Fourier implementation above: swapping the MLP/attention Linear layers for KANs (all of them, or even just a few) produces a diverging loss. KANs don't require normalization for good forward-pass dynamics, but they may be trickier to train in a deep net.
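For readers unfamiliar with the idea, here is a minimal sketch of what a Fourier-style KAN layer could look like. It is not the implementation referenced in the comment; the class name `FourierKANLayer`, the `grid_size` parameter, and the initialization scale are all assumptions made for illustration. Each input-output edge carries a learnable 1-D function expressed as a truncated Fourier series, in place of the single scalar weight a Linear layer would use.

```python
import math
import random


class FourierKANLayer:
    """Hypothetical sketch of a Fourier-feature KAN layer (forward pass only).

    Each edge (input i -> output j) applies
        phi_ji(x) = sum_{k=1..grid_size} a_jik * cos(k x) + b_jik * sin(k x)
    and output j is the sum of phi_ji(x_i) over all inputs, plus a bias.
    """

    def __init__(self, in_dim, out_dim, grid_size=4, seed=0):
        rng = random.Random(seed)
        # Assumed init scale: keeps the output variance roughly O(1).
        scale = 1.0 / math.sqrt(in_dim * grid_size)
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.grid_size = grid_size
        self.cos_coeffs = [
            [[rng.gauss(0.0, scale) for _ in range(grid_size)]
             for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.sin_coeffs = [
            [[rng.gauss(0.0, scale) for _ in range(grid_size)]
             for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def forward(self, x):
        if len(x) != self.in_dim:
            raise ValueError("expected %d inputs, got %d" % (self.in_dim, len(x)))
        out = []
        for j in range(self.out_dim):
            total = self.bias[j]
            for i, xi in enumerate(x):
                for k in range(self.grid_size):
                    freq = k + 1  # frequencies 1..grid_size
                    total += self.cos_coeffs[j][i][k] * math.cos(freq * xi)
                    total += self.sin_coeffs[j][i][k] * math.sin(freq * xi)
            out.append(total)
        return out


layer = FourierKANLayer(in_dim=3, out_dim=2)
print(layer.forward([0.1, 0.2, 0.3]))
```

Because sine and cosine are bounded, each output of this sketch is bounded by the sum of the absolute coefficients no matter how large the input is, which may be part of why the forward pass behaves well without normalization. It says nothing about how well gradients behave in a deep stack.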


thesz 818 days ago

Note that KANs use LBFGS, which is a second-order (quasi-Newton) optimization method. My experience with the use of second-order features suggests that simple gradient descent often leads to divergence.


gautam5669 822 days ago

This was the first thought that came to my mind too.

Given that it's sparse, will this just be a replacement for MoE?


samus 821 days ago

MoE is mostly used to enable load balancing, since it makes it possible to put experts on different GPUs. This isn't so easy to do with a monolithic but sparse layer.


mp187 822 days ago

Why was this your first thought? Is the MLP layer a limiting factor for transformers? I thought the bottleneck was in the renormalization part.


brrrrrm 822 days ago

At small input sizes, yes, the MLP dominates compute. At large input sizes, attention matters more.
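A rough back-of-the-envelope sketch of that trade-off, under stated assumptions: a standard transformer block with a 4x MLP expansion, full (non-causal) attention, and one multiply-add counted as 2 FLOPs. The function names and the example `d_model` are illustrative, not taken from any particular model.

```python
def mlp_flops(n_tokens, d_model, expansion=4):
    # Two matmuls per token: d -> expansion*d and expansion*d -> d.
    return 2 * 2 * n_tokens * d_model * (expansion * d_model)


def attention_flops(n_tokens, d_model):
    # Q, K, V and output projections: four d x d matmuls per token.
    projections = 4 * 2 * n_tokens * d_model * d_model
    # Q K^T scores plus the attention-weighted sum of V: both scale as n^2 * d.
    scores_and_mix = 2 * 2 * n_tokens * n_tokens * d_model
    return projections + scores_and_mix


d_model = 1024  # assumed width, for illustration only
for n_tokens in (128, 1024, 2048, 8192, 65536):
    ratio = attention_flops(n_tokens, d_model) / mlp_flops(n_tokens, d_model)
    print(f"n={n_tokens:>6}  attention/MLP FLOPs = {ratio:.2f}")
```

Under these assumptions the MLP costs 16·n·d² FLOPs and attention costs 8·n·d² + 4·n²·d, so the two are equal when the sequence length is twice the model width; below that the MLP dominates, above it attention does.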
