by Lichtso 780 days ago
1. Interestingly the foundations of this approach and MLP were invented / discovered around the same time about 66 years ago:

1957: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...

1958: https://en.wikipedia.org/wiki/Multilayer_perceptron

2. Another advantage of this approach is that it has only one class of parameters (the coefficients of the local activation functions), as opposed to an MLP, which has three (weights, biases, and the globally uniform activation function).

3. Everybody is talking about transformers. I want to see diffusion models with this approach.

4 comments

Biases are just weights on an always-on input.

There isn't much difference between weights of a linear sum and coefficients of a spline.
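
To make the bias point concrete, here is a minimal sketch in plain Python. The function names are made up for illustration and the numbers are arbitrary; it only shows that appending a constant 1 to the input and moving the bias into the weight vector gives the same result.

```python
# Sketch: a bias is equivalent to one extra weight on a constant-one input.
# affine and affine_folded are hypothetical names, not from any library.

def affine(weights, bias, x):
    """Usual neuron pre-activation: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def affine_folded(weights_with_bias, x):
    """Same computation, with the bias stored as the last weight
    and applied to an input that is always 1."""
    x_aug = list(x) + [1.0]
    return sum(w * xi for w, xi in zip(weights_with_bias, x_aug))

weights = [0.5, -2.0, 1.5]
bias = 0.25
x = [1.0, 2.0, 3.0]

plain = affine(weights, bias, x)
folded = affine_folded(weights + [bias], x)
print(plain, folded)
```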

> Biases are just weights on an always-on input.

Granted, however this approach does not require that constant-one input either.

> There isn't much difference between weights of a linear sum and coefficients of a function.

Yes, the trained function coefficients of this approach are equivalent to the trained weights of an MLP. Still, this approach does not require the globally uniform activation function of an MLP.
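
A rough sketch of the contrast, under loud assumptions: the learnable per-input function here is a piecewise-linear (degree-1 spline) sum of hat functions on a fixed grid over [-1, 1], chosen only as a simple stand-in for whatever spline family a real implementation uses. All names (mlp_neuron, hat, edge_function, kan_node) are hypothetical. The MLP neuron carries weights, a bias, and one fixed activation; the other node carries only per-input coefficient lists.

```python
import math

def mlp_neuron(weights, bias, x, activation=math.tanh):
    """MLP style: weights + bias, then one fixed, shared activation."""
    return activation(sum(w * xi for w, xi in zip(weights, x)) + bias)

def hat(x, center, width):
    """Piecewise-linear basis function peaking at `center` (degree-1 spline)."""
    return max(0.0, 1.0 - abs(x - center) / width)

def edge_function(coeffs, x, lo=-1.0, hi=1.0):
    """Learnable univariate function: weighted sum of hat functions on a
    uniform grid over [lo, hi]. Inputs outside the grid evaluate to 0."""
    width = (hi - lo) / (len(coeffs) - 1)
    return sum(c * hat(x, lo + k * width, width) for k, c in enumerate(coeffs))

def kan_node(coeffs_per_input, x):
    """Spline style: sum of per-input learned functions.
    No separate weights, bias, or global activation."""
    return sum(edge_function(c, xi) for c, xi in zip(coeffs_per_input, x))

x = [0.0, 1.0]
coeffs = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]  # arbitrary example values
kan_out = kan_node(coeffs, x)
mlp_out = mlp_neuron([0.5, -0.5], 0.0, x)
print(kan_out, mlp_out)
```

At a grid point the per-input function simply returns that point's coefficient, which is the sense in which the coefficients play the role of trained weights.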

At this point this is a distinction without a difference.

The only question is whether splines are more efficient than lines at describing general functions at the billion-to-trillion parameter scale.

To your 3rd point, most diffusion models already use a transformer-based architecture (U-Net with self-attention and cross-attention, Vision Transformer, Diffusion Transformer, etc.).

Yes, #2 is a difference. But what makes it an advantage?

One might argue this via parsimony (Occam’s razor). Is this your thinking? / Anything else?

I may be wrong, but with modern LLMs biases aren’t really used any more.

From what I remember, larger LLMs like PaLM don't use biases for training stability, but smaller ones tend to still use them.