This feels aggressively wrong. I'll bite in case you responded to the wrong person or something:

> “Increasing control points” hides a lot under the covers here.

Maybe. Like what exactly?

> Your answer and the paper provide virtually no reason to believe one type of continuous function approximation is better than another.

Even if the paper offered nothing, my answer is immediately above yours. Which part is either unclear or never better than what an MLP provides: being faster to compute, or having gradient updates that don't destroy information globally?
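To make the "no global information destruction" point concrete, here's a minimal sketch. It assumes a piecewise-linear spline on a uniform grid over [-1, 1] as a stand-in for the paper's B-splines, and a toy two-unit tanh MLP. All of the names are hypothetical and none of this comes from the paper's library. One SGD step on a sample at x = 0.8 only touches the two spline coefficients next to that point, while the same step on the MLP moves the function everywhere.

```python
import math

def spline_eval(coeffs, x, lo=-1.0, hi=1.0):
    """Piecewise-linear spline with uniformly spaced knots on [lo, hi].

    Returns (value, left knot index, fractional position in that interval).
    """
    g = len(coeffs) - 1                      # number of intervals
    t = (min(max(x, lo), hi) - lo) / (hi - lo) * g
    k = min(int(t), g - 1)
    frac = t - k
    return (1 - frac) * coeffs[k] + frac * coeffs[k + 1], k, frac

def spline_sgd_step(coeffs, x, target, lr=0.5):
    """One SGD step on 0.5 * (f(x) - target)^2; only two coefficients change."""
    y, k, frac = spline_eval(coeffs, x)
    err = y - target
    new = list(coeffs)
    new[k] -= lr * err * (1 - frac)
    new[k + 1] -= lr * err * frac
    return new

def mlp_eval(params, x):
    """Toy 1D MLP: f(x) = sum of a * tanh(w * x + b) over hidden units."""
    return sum(a * math.tanh(w * x + b) for a, w, b in params)

def mlp_sgd_step(params, x, target, lr=0.5):
    """One SGD step on the same loss; every parameter gets a nonzero update."""
    err = mlp_eval(params, x) - target
    new = []
    for a, w, b in params:
        h = math.tanh(w * x + b)
        dpre = err * a * (1 - h * h)         # gradient w.r.t. the pre-activation
        new.append((a - lr * err * h, w - lr * dpre * x, b - lr * dpre))
    return new

coeffs = [0.0] * 9
params = [(0.5, 1.0, 0.0), (-0.3, 2.0, 0.5)]
new_coeffs = spline_sgd_step(coeffs, 0.8, 1.0)
new_params = mlp_sgd_step(params, 0.8, 1.0)

# How much did each function move far away from the training point?
spline_shift = abs(spline_eval(new_coeffs, -0.8)[0] - spline_eval(coeffs, -0.8)[0])
mlp_shift = abs(mlp_eval(new_params, -0.8) - mlp_eval(params, -0.8))
```

Here `spline_shift` is exactly zero and `mlp_shift` is not. Higher-degree B-splines spread each update over a few more neighboring coefficients, but the support is still local.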

> The comparisons made are superficial and only serve to address contrived issues like representing sinusoidal function families concisely.

I don't care about that at all, and the paper barely cares about it; their same algorithm for reifying splines into known function families would work about as well with MLPs.

> It’s weird to just ignore MLPs when approximating a continuous univariate function.

Maybe. MLPs are particularly well suited to high input+output dimensionality, and while they _can_ approximate arbitrary 1D continuous functions, they (1) can't do so efficiently, (2) can't be trained via gradient descent to find some of those, and (3) can't approximate topologically interesting 1D functions without many layers and training complexity. The authors ignored infinitely many other things too. That they ignored MLPs is probably just some combination of their reference material not using MLPs (KANs have been around in some form for a while), a hunch that MLPs would be less efficient (and perhaps harder to train) in an already slow library, and the fact that splines empirically sufficed.

> But if the paper did use MLPs theyd have ended up with something that looks a lot more like conventional neural networks, so maybe thats why?

See above, I don't think that would be the most important reason, even if it were true.

I don't think it's true though. Even in its current state, a KAN already looks a lot like an MLP. Each layer does an O(d^2) computation to transform one d-dimensional vector into another. Instead of each output being sum_j(w_ij * v_j), it is sum_j(spline_ij(v_j)). Aside from the sparsification (which is (1) optional, (2) available for MLPs, and (3) not important to most of the paper's ideas other than interpretability), the core computational kernel of these KANs is almost identical to an MLP.
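A rough side-by-side of the two kernels, as a sketch only. It assumes piecewise-linear splines on a uniform grid over [-1, 1] instead of the paper's B-splines, and the function names are mine. The loop structure is the same O(d^2) pass over edges in both cases; the only difference is whether each edge applies a scalar multiply or a learned 1D function.

```python
def mlp_layer(W, v):
    """Linear part of an MLP layer: out[i] = sum_j W[i][j] * v[j]."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def pl_spline(coeffs, x, lo=-1.0, hi=1.0):
    """Piecewise-linear spline through `coeffs` at uniform knots on [lo, hi]."""
    g = len(coeffs) - 1                      # number of intervals
    t = (min(max(x, lo), hi) - lo) / (hi - lo) * g
    k = min(int(t), g - 1)
    frac = t - k
    return (1 - frac) * coeffs[k] + frac * coeffs[k + 1]

def kan_layer(C, v):
    """KAN-style layer: out[i] = sum_j spline_ij(v[j]), with C[i][j] the coefficients."""
    return [sum(pl_spline(c_ij, v_j) for c_ij, v_j in zip(row, v)) for row in C]

# Sanity check: if every edge's spline is the straight line w_ij * x,
# the KAN-style layer reproduces the linear layer on the grid's range.
knots = [-1.0, -0.5, 0.0, 0.5, 1.0]
W = [[0.5, -1.0], [2.0, 0.25]]
C = [[[w * k for k in knots] for w in row] for row in W]
v = [0.3, -0.7]

linear_out = mlp_layer(W, v)
spline_out = kan_layer(C, v)
```

`linear_out` and `spline_out` agree up to floating-point error, which is one way to see that the linear layer is the special case where every learned edge function happens to be a line through the origin.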

What they showed is that if you use an O(grid) factor of extra weights for the same amount of computation, then you can get an MLP with better scaling properties: for the same amount of model volume and model compute you get lower training times and better test errors, by a very healthy margin. How far that holds is always hard to say when the focus is on physics computations, because a carefully placed cos/sin/exp can so easily improve test+training error, and more specialized models that take advantage of that property tend not to do as well in more consumer-focused ML.
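Back-of-the-envelope version of the "O(grid) extra weights, same computation" claim, as a sketch under assumptions: each edge is a B-spline on `grid` uniform intervals, which has `grid + degree` coefficients, and at any input only `degree + 1` basis functions are nonzero. The helper names and the default `degree=3` are arbitrary choices for illustration.

```python
def dense_layer_cost(d_in, d_out):
    """Weights stored and weights read per input for a plain linear layer."""
    n = d_in * d_out
    return {"weights": n, "weights_touched_per_input": n}

def spline_layer_cost(d_in, d_out, grid, degree=3):
    """Same accounting for a layer with one B-spline per edge."""
    edges = d_in * d_out
    return {
        "weights": edges * (grid + degree),                   # grows with grid
        "weights_touched_per_input": edges * (degree + 1),    # independent of grid
    }

dense = dense_layer_cost(64, 64)
coarse = spline_layer_cost(64, 64, grid=5)
fine = spline_layer_cost(64, 64, grid=50)
```

Going from `coarse` to `fine` multiplies the stored weights but leaves the per-input work unchanged, since locating the active interval on a uniform grid is O(1). An implementation that evaluates every basis function densely would not see this saving.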

I'd be interested in seeing how an MLP would fit in there, but if the learned splines are usually complicated then you would expect a huge multiplicative slowdown, and regardless of spline complexity you would expect to re-introduce most of the training issues of deep networks. Please let me know if you give MLP sub-units a shot and they actually work better. I'd love to not have to do that experiment myself.