| HN Mirror


by jcarreiro 138 days ago

The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

i.e., the claim is that this method reproduces the results of conventional attention up to float16 numerical precision.
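For a sense of scale, here is a minimal sketch of what "Float16 resolution" means numerically. It only uses the standard library's half-precision `struct` format; the helper names `to_float16` and `rounding_error` are made up for illustration and are not from the paper.

```python
import struct

# Gap between 1.0 and the next representable float16 value
# (float16 stores 10 explicit mantissa bits).
FLOAT16_EPS = 2.0 ** -10


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def rounding_error(x):
    """Absolute error introduced by storing x as float16."""
    return abs(to_float16(x) - x)


if __name__ == "__main__":
    for x in (0.1, 1.0, 3.14159):
        print(x, to_float16(x), rounding_error(x))
```

So for values near 1, an elementwise error around 0.001 is roughly one float16 step; for smaller values the float16 grid is finer than that.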

3 comments

kristjansson 137 days ago

> approximately the same magnitude

And they really do mean that: their results show +/- 1 on log10 plots.


cptroot 137 days ago

I don't think this is an accurate characterization of the error magnitude? Their error plots (from appendix 3) all show `log_10(|Y - \dot{Y}|)` with a median of ~-3 (a difference of 0.001) and a max of ~-1.5 (a difference of 0.035), and this is with only 3 Taylor terms.


kristjansson 136 days ago

Oh, you're right, that's a misread on my part; the appendix charts don't say that. I think they're just useless then, though? Since they report absolute error (on a log10 scale), we can't assess the relative error to compare against the 'within an order of magnitude' claim in the text.
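To make the absolute-versus-relative point concrete, a small sketch: the same log10 absolute error reads very differently depending on the magnitude of the reference output. The reference magnitudes below are hypothetical, chosen only for illustration; the paper's actual output scales are not given in this thread.

```python
def relative_error(abs_error, reference):
    """Relative error of a value whose absolute error is abs_error."""
    return abs_error / abs(reference)


def log10_abs_to_relative(log10_abs_error, reference):
    """Convert a log10 absolute error (as read off a plot) to a relative error."""
    return relative_error(10.0 ** log10_abs_error, reference)


if __name__ == "__main__":
    # Hypothetical output magnitudes, not taken from the paper.
    for reference in (1.0, 0.1, 0.01):
        print(reference, log10_abs_to_relative(-3, reference))
```

Without knowing the typical size of the outputs, a log10 absolute error of -3 could be anywhere from negligible to a sizable fraction of the value.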


energy123 138 days ago

It converges to conventional attention as P goes up.
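A toy sketch of that convergence, not the paper's algorithm: it simply replaces `exp` in softmax attention with a truncated Taylor series and measures the largest elementwise difference from the exact result. Assumptions here: `num_terms` counts terms starting from the constant term (the paper's exact convention for P may differ), the inputs are small random Gaussians, and all function names are made up for illustration.

```python
import math
import random


def taylor_exp(x, num_terms):
    """Sum of the first num_terms terms of the Taylor series of exp around 0."""
    total, term = 0.0, 1.0
    for p in range(num_terms):
        total += term
        term *= x / (p + 1)  # next term: x^(p+1) / (p+1)!
    return total


def attention(queries, keys, values, weight_fn):
    """Single-head attention with weight_fn applied to the scaled dot products."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = [weight_fn(s) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / norm
            for j in range(len(values[0]))
        ])
    return outputs


def max_abs_error(num_terms, seed=0, n=8, d=4, scale=0.5):
    """Largest elementwise gap between exact and Taylor-approximated attention."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n)]

    q, k, v = rand_matrix(), rand_matrix(), rand_matrix()
    exact = attention(q, k, v, math.exp)
    approx = attention(q, k, v, lambda s: taylor_exp(s, num_terms))
    return max(abs(a - b) for ra, rb in zip(exact, approx) for a, b in zip(ra, rb))


if __name__ == "__main__":
    for num_terms in (2, 4, 8, 12):
        print(num_terms, max_abs_error(num_terms))
```

How fast the error shrinks depends on the magnitude of the scaled scores, which this toy keeps small; larger scores need more terms.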


fheinsen 138 days ago

The method is more general. The GitHub repository's first example uses eight Taylor terms (P = 8).


torginus 137 days ago

I'm clueless about this whole thing, but from my EE education I remember that in general:

Taylor approximations converge slowly in terms of error if the function they're representing is discontinuous (the error disappears quadratically if continuous, linearly if not), and they tend to create highly energetic swings near discontinuities (similar to Fourier series with Gibbs oscillations).

Moreover, Taylor series are inherently nonlinear, and much of the mathematical toolset around AI assumes general linearity (cue linear algebra), with the exception of sigmoids, and going beyond cubic approximations tends to make errors worse (as expressed in SNR).
