After trying this out with the Fourier implementation above, I found that swapping the MLP/attention Linear layers for KANs (all of them, or even just a few) produces a diverging loss. KANs don't require normalization for good forward-pass dynamics, but they may be trickier to train in a deep net.
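For readers who want to reproduce this kind of swap, here is a minimal sketch of what it could look like in PyTorch. This is not the implementation referenced above: `FourierKANLayer`, `swap_linear_for_kan`, the `num_freqs` parameter, and the initialization scale are all hypothetical names and choices made for illustration. Each input-output edge is modeled as a truncated Fourier series rather than a single scalar weight.

```python
import torch
import torch.nn as nn


class FourierKANLayer(nn.Module):
    """Hypothetical stand-in for nn.Linear: every input-output edge is a
    learned truncated Fourier series instead of one scalar weight."""

    def __init__(self, in_features: int, out_features: int, num_freqs: int = 4):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # Integer frequencies 1..num_freqs, kept as a non-trainable buffer.
        self.register_buffer(
            "freqs", torch.arange(1, num_freqs + 1, dtype=torch.float32)
        )
        # Cos and sin coefficients, shape (2, out, in, num_freqs).
        # The scale is an assumption chosen to keep initial outputs small.
        scale = 1.0 / (in_features * num_freqs) ** 0.5
        self.coeffs = nn.Parameter(
            torch.randn(2, out_features, in_features, num_freqs) * scale
        )
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., in_features) -> angles: (..., in_features, num_freqs)
        angles = x.unsqueeze(-1) * self.freqs
        # Sum over input features (i) and frequencies (k) for each output (o).
        out = torch.einsum("...ik,oik->...o", torch.cos(angles), self.coeffs[0])
        out = out + torch.einsum("...ik,oik->...o", torch.sin(angles), self.coeffs[1])
        return out + self.bias


def swap_linear_for_kan(module: nn.Module, num_freqs: int = 4) -> nn.Module:
    """Recursively replace every nn.Linear child with a FourierKANLayer.
    The replacement layers are freshly initialized; no weights are copied."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear):
            setattr(
                module,
                name,
                FourierKANLayer(child.in_features, child.out_features, num_freqs),
            )
        else:
            swap_linear_for_kan(child, num_freqs)
    return module
```

Calling `swap_linear_for_kan(model)` on an existing model replaces all of its Linear layers in place. To swap only a few layers, the same `setattr` pattern can be applied to selected submodules by hand.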
Note that KANs use LBFGS, which is a quasi-Newton (approximately second-order) optimization method. In my experience with methods that rely on second-order information, switching to simple gradient descent often leads to divergence.
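To make the optimizer difference concrete, here is a rough sketch of how the two training loops differ in PyTorch. The toy data, the small MLP, and the hyperparameters are assumptions picked for illustration, not anything taken from a KAN codebase. The key point is that `torch.optim.LBFGS` needs a closure that re-evaluates the loss, because it may evaluate the model several times per step, while SGD takes one gradient step per call.

```python
import torch
import torch.nn as nn

# Toy 1-D regression problem; the data is purely illustrative.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = torch.sin(3.0 * x)


def fit(optimizer_name: str, steps: int = 20):
    """Train a small MLP and return (initial_loss, final_loss)."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))

    if optimizer_name == "lbfgs":
        opt = torch.optim.LBFGS(
            model.parameters(),
            lr=1.0,
            max_iter=20,                     # inner iterations per step() call
            history_size=10,                 # gradient pairs kept for the Hessian estimate
            line_search_fn="strong_wolfe",
        )
    else:
        opt = torch.optim.SGD(model.parameters(), lr=0.1)

    def closure():
        # LBFGS calls this repeatedly within a single step().
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        return loss

    with torch.no_grad():
        initial = nn.functional.mse_loss(model(x), y).item()

    for _ in range(steps):
        # SGD also accepts a closure, so the loop is identical for both.
        opt.step(closure)

    with torch.no_grad():
        final = nn.functional.mse_loss(model(x), y).item()
    return initial, final


lbfgs_initial, lbfgs_final = fit("lbfgs")
sgd_initial, sgd_final = fit("sgd")
print(f"LBFGS: {lbfgs_initial:.4f} -> {lbfgs_final:.4f}")
print(f"SGD:   {sgd_initial:.4f} -> {sgd_final:.4f}")
```

A toy problem like this won't reproduce the divergence seen in a deep net, but it gives a starting point for comparing the two optimizers on the same model and seed.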
MoE is mostly used to enable load balancing, since it makes it possible to put experts on different GPUs. This isn't so easy to do with a monolithic but sparse layer.