by jcarreiro
138 days ago
The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

i.e., the claim is that this method reproduces the results of conventional attention, up to float16 numerical precision.

And they really do mean that: their results show +/- 1 on log10 plots.
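
To get a feel for what "four Taylor terms vs. float16 resolution" means, here is a rough sketch. It is not the paper's algorithm. It only truncates the Taylor series of exp around 0 and compares the worst elementwise error with float16 machine epsilon. The assumptions are mine: that the function being expanded is exp, that P counts the terms of degree 0 through P-1, and that the inputs lie in [-0.25, 0.25]. The names `taylor_exp` and `max_abs_error` are made up for this example.

```python
import math
import numpy as np

def taylor_exp(x, num_terms):
    """Truncated Taylor series of exp(x) around 0, using degrees 0..num_terms-1."""
    x = np.asarray(x, dtype=np.float64)
    result = np.zeros_like(x)
    for p in range(num_terms):
        result += x**p / math.factorial(p)
    return result

def max_abs_error(num_terms, half_width=0.25, num_points=1001):
    """Largest elementwise gap to np.exp on [-half_width, half_width] (assumed range)."""
    x = np.linspace(-half_width, half_width, num_points)
    return float(np.max(np.abs(taylor_exp(x, num_terms) - np.exp(x))))

# Spacing between adjacent float16 values near 1.0.
float16_eps = float(np.finfo(np.float16).eps)

errors = {p: max_abs_error(p) for p in range(1, 7)}
for p, err in errors.items():
    print(f"P={p}: max abs error {err:.2e} (float16 eps {float16_eps:.2e})")
```

The truncation error grows quickly with the size of the input, so whether a given P lands under float16 resolution depends heavily on the range you assume. The paper's actual error numbers come from its own experiments, not from a toy like this.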