

by Lichtso 780 days ago

1. Interestingly, the foundations of this approach and of the MLP were invented / discovered at around the same time, about 66 years ago:

1957: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_repr...

1958: https://en.wikipedia.org/wiki/Multilayer_perceptron

2. Another advantage of this approach is that it has only one class of parameters (the coefficients of the local activation functions), as opposed to an MLP, which has three classes of parameters (weights, biases, and the globally uniform activation function).

3. Everybody is talking about transformers. I want to see diffusion models with this approach.
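
To make point 2 concrete, here is a minimal sketch of the two layer types. It is a simplification, not the paper's exact parametrization: the names `mlp_layer`, `kan_layer` and `piecewise_linear` are made up for illustration, and piecewise-linear interpolation stands in for the learned spline on each edge.

```python
import math

def mlp_layer(x, weights, biases, activation=math.tanh):
    """MLP layer: learned weights and biases, then one fixed activation."""
    return [
        activation(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]

def piecewise_linear(coeffs, x, lo=-1.0, hi=1.0):
    """Learnable 1-D function: interpolate `coeffs` on a uniform grid over [lo, hi].

    Assumption: stands in for the spline; inputs outside [lo, hi] are clamped.
    """
    x = min(max(x, lo), hi)
    t = (x - lo) / (hi - lo) * (len(coeffs) - 1)
    i = min(int(t), len(coeffs) - 2)
    frac = t - i
    return (1 - frac) * coeffs[i] + frac * coeffs[i + 1]

def kan_layer(x, edge_coeffs):
    """KAN-style layer: every edge has its own learned function, nodes only sum."""
    return [
        sum(piecewise_linear(c, xi) for c, xi in zip(row, x))
        for row in edge_coeffs
    ]

x = [0.0, 0.5]
mlp_out = mlp_layer(x, weights=[[1.0, 2.0]], biases=[-1.0])
kan_out = kan_layer(x, edge_coeffs=[[[1.0, 0.0, 1.0], [0.0, 0.0, 2.0]]])
```

In the sketch, the only trainable values in `kan_layer` are the per-edge coefficient lists, while `mlp_layer` takes weights, biases, and a separately chosen activation.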


trwm 780 days ago

Biases are just weights on an always on input.
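
A small sketch of that equivalence, using made-up numbers and hypothetical helper names (`affine`, `affine_folded`): appending a constant 1.0 to the input lets the bias ride along as one more weight.

```python
def affine(weights, bias, x):
    """Weighted sum with an explicit bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def affine_folded(weights_aug, x):
    """Same map, with the bias stored as the weight of an always-on input."""
    x_aug = list(x) + [1.0]
    return sum(w * xi for w, xi in zip(weights_aug, x_aug))

weights = [0.5, -2.0, 1.5]
bias = 0.25
x = [1.0, 2.0, 3.0]

y_explicit = affine(weights, bias, x)
y_folded = affine_folded(weights + [bias], x)
```

Both calls compute the same value, so whether "bias" counts as a separate parameter class is a bookkeeping choice.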

There isn't much difference between weights of a linear sum and coefficients of a spline.
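
One way to see this, sketched under the assumption of a degree-1 B-spline (hat function) basis and with the hypothetical names `hat` and `spline`: a spline is a dot product of its coefficients with basis functions of the input, so it is linear in the coefficients even though it is nonlinear in x.

```python
def hat(i, x, knots):
    """Degree-1 B-spline (hat function) centred on knots[i]."""
    k = knots[i]
    if x == k:
        return 1.0
    if i > 0 and knots[i - 1] < x < k:
        return (x - knots[i - 1]) / (k - knots[i - 1])
    if i < len(knots) - 1 and k < x < knots[i + 1]:
        return (knots[i + 1] - x) / (knots[i + 1] - k)
    return 0.0

def spline(coeffs, x, knots):
    """Weighted sum of basis functions: the coefficients act like weights."""
    return sum(c * hat(i, x, knots) for i, c in enumerate(coeffs))

knots = [0.0, 1.0, 2.0]
a = [1.0, 3.0, 2.0]
b = [0.5, -1.0, 4.0]

value = spline(a, 0.5, knots)
# Linearity in the coefficients: spline(a + b) equals spline(a) + spline(b).
summed = spline([ai + bi for ai, bi in zip(a, b)], 0.5, knots)
separate = spline(a, 0.5, knots) + spline(b, 0.5, knots)
```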


Lichtso 780 days ago

> Biases are just weights on an always on input.

Granted; however, this approach does not require that constant-one input either.

> There isn't much difference between weights of a linear sum and coefficients of a function.

Yes, the trained function coefficients of this approach are equivalent to the trained weights of an MLP. Still, this approach does not require the globally uniform activation function of an MLP.


trwm 780 days ago

At this point this is a distinction without a difference.

The only question is whether splines are more efficient than lines at describing general functions at the billion to trillion parameter count.


tripplyons 779 days ago

To your 3rd point, most diffusion models already use a transformer-based architecture (U-Net with self attention and cross attention, Vision Transformer, Diffusion Transformer, etc.).


xpe 780 days ago

Yes, #2 is a difference. But what makes it an advantage?

One might argue this via parsimony (Occam’s razor). Is this your thinking? / Anything else?


kolinko 780 days ago

I may be wrong, but with modern LLMs biases aren't really used any more.


tripplyons 779 days ago

From what I remember, larger LLMs like PaLM leave out biases to improve training stability, but smaller ones tend to still use them.
