by phkahler 206 days ago
>> Non-AI workloads prefer vector units and not matrix units

FEA and other "scientific" workloads are all matrix math. This is why supercomputers have been benchmarked using BLAS and LAPACK for the past 40 years. OTOH, are those matrix * vector, where AI is matrix * matrix?

Either way, it's a regression, which seems strange.


Nvidia B200 did the same. A lot of FEA codes go explicit (matrix-free) because scaling is better.

Also look up Ozaki algorithms.

Matrix-free generally refers to using "X-vector product" operators, where X is something like the Jacobian or Hessian, but you do not materialize the final Jacobian or Hessian matrix. A big X operator is split into smaller X operators, and you apply the big operator by computing the smaller X-vector products sequentially. This doesn't necessarily mean there are no matrices in the individual X-vector products. The smaller X operators could still be matrix-vector products.
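To make "do not materialize the matrix" concrete, here is a minimal sketch. It assumes the operator is the 1D Laplacian stencil (-1, 2, -1), which is my choice for illustration and not something from the thread; the function names are hypothetical. The stencil version never stores the matrix, yet gives the same product as the explicit one.

```python
def apply_laplacian(v):
    """Return A @ v for A = tridiag(-1, 2, -1) without ever storing A."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0       # zero boundary on the left
        right = v[i + 1] if i < n - 1 else 0.0  # zero boundary on the right
        out.append(2.0 * v[i] - left - right)
    return out


def dense_laplacian(n):
    """Build the same operator as an explicit n x n matrix (for comparison only)."""
    return [
        [2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
        for i in range(n)
    ]


def matvec(A, v):
    """Plain dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


v = [1.0, 2.0, 3.0, 4.0]
matrix_free = apply_laplacian(v)          # O(n) storage
explicit = matvec(dense_laplacian(4), v)  # O(n^2) storage
print(matrix_free, explicit)
```

Both paths compute the same vector; only the storage and memory traffic differ.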

In fact, one of the big benefits of splitting your big matrix into a series of small matrix-vector products is that some of the matrix-vector products are parameterized and some are not, or at least they share the same parameters over multiple matrix-vector products. This means you can perform matrix-matrix multiplication against some of the operators. This is particularly evident in batched training of neural networks.
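A small sketch of that last point, with made-up numbers: when several matrix-vector products share the same matrix W, the input vectors can be stacked as columns and the whole batch becomes one matrix-matrix product. The helper names and the values are assumptions for illustration.

```python
def matvec(W, x):
    """Dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(W, X):
    """Dense matrix-matrix product; X holds one input vector per column."""
    cols = list(zip(*X))
    return [[sum(w * c for w, c in zip(row, col)) for col in cols] for row in W]


W = [[1.0, 2.0], [3.0, 4.0]]                   # shared parameters
batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three input vectors

# One matrix-vector product per input vector.
one_by_one = [matvec(W, x) for x in batch]

# Stack the inputs as columns and do a single matrix-matrix product.
X = [list(col) for col in zip(*batch)]
Y = matmul(W, X)
batched = [list(col) for col in zip(*Y)]       # back to one result per input

print(one_by_one)
print(batched)
```

The two results are identical; the batched form is the one that maps onto a matrix unit.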

I do not see what the relationship is between Ozaki algorithms and algorithms that are supposedly "matrix free".

The Ozaki scheme and its variants improve the precision of matrix-matrix multiplications, allowing a matrix-matrix multiplication done with lower-precision operations to approach the precision of the same multiplication done with higher-precision operations.

So it is an improvement for matrix-matrix operations, which are better done in matrix units. It is not any kind of "matrix free" algorithm.
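A heavily simplified illustration of the slicing idea, not the actual Ozaki scheme: each FP64 value is split into two FP32 "slices", the slice products are formed separately, and the results are summed in a wider accumulator. As far as I understand it, the real scheme chooses slice widths so that the accumulated slice products stay exact for a given inner dimension, which this sketch does not attempt. Python floats (FP64) stand in for the wide accumulator, and `to_f32`, `split` and the test vectors are hypothetical.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split(x):
    """Split x into a leading FP32 slice and an FP32 slice of the remainder."""
    hi = to_f32(x)
    lo = to_f32(x - hi)
    return hi, lo


def dot_single_slice(a, b):
    """Dot product using only one FP32 slice per input."""
    return math.fsum(to_f32(x) * to_f32(y) for x, y in zip(a, b))


def dot_two_slices(a, b):
    """Dot product using two FP32 slices per input, summed in FP64."""
    terms = []
    for x, y in zip(a, b):
        xh, xl = split(x)
        yh, yl = split(y)
        # Each product of two FP32 values fits exactly in FP64 (48 <= 53 bits).
        terms += [xh * yh, xh * yl, xl * yh, xl * yl]
    return math.fsum(terms)


a = [1 / 3, 0.1, 0.7]
b = [0.7, 1 / 3, 0.1]
reference = math.fsum(x * y for x, y in zip(a, b))
err_single = abs(dot_single_slice(a, b) - reference)
err_split = abs(dot_two_slices(a, b) - reference)
print(err_single, err_split)
```

The two-slice result lands much closer to the FP64 reference than the single-slice one, at the cost of four times as many low-precision products.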

The Ozaki scheme is not good enough for emulating FP64 in a GPU with poor FP64 throughput, but good FP32 throughput. The reason is that FP64 matters not only for its greater precision, but also for its much greater dynamic range in comparison with FP32. In computations with FP64, overflows and underflows are extremely rare events and easy to avoid. On the other hand, in complex physical simulations it is impossible to avoid overflows and underflows in FP32, unless one uses extremely cumbersome frequent rescalings, which eliminate all the advantages of using floating-point numbers instead of fixed-point numbers.

I do not know which kind of "matrix free" algorithms for FEA you are referring to.
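A quick sketch of the dynamic range point, using Python's `struct` module to round FP64 values to FP32. The `to_f32` helper and the magnitudes are my own picks for illustration: intermediate values that are unremarkable in FP64 overflow to infinity or flush to zero once they have to fit in FP32.

```python
import math
import struct


def to_f32(x):
    """Round an FP64 value to FP32, returning +/-inf on overflow."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct refuses to pack finite values beyond the FP32 range.
        return math.copysign(math.inf, x)


big = 1e30     # representable in both FP32 and FP64
small = 1e-30  # representable in both FP32 and FP64

big_sq = big * big        # 1e60: fine in FP64 (max is about 1.8e308)
small_sq = small * small  # 1e-60: fine in FP64

print(big_sq, to_f32(big_sq))      # finite in FP64, inf in FP32 (max about 3.4e38)
print(small_sq, to_f32(small_sq))  # nonzero in FP64, 0.0 in FP32
```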

Nevertheless, the problem of any "matrix free" algorithm is exactly its poor scaling, because any "matrix free" algorithm must do similar amounts of computational operations and memory transfers. This limits the performance to that of the memory, which prevents scaling.

The advantage of the algorithms based on matrices is exactly the better scaling, because only such algorithms can do more computational operations than memory transfers, so their scaling is no longer limited by the memory interface.
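A back-of-the-envelope sketch of that argument, assuming dense n x n operands and ideal caching (every matrix or vector element crosses the memory interface exactly once). These are simplifying assumptions of mine, and the function names are hypothetical.

```python
def matvec_intensity(n):
    """Flops per word moved for y = A @ x with an n x n matrix A."""
    flops = 2 * n * n      # one multiply and one add per matrix element
    words = n * n + 2 * n  # read A and x, write y
    return flops / words


def matmul_intensity(n):
    """Flops per word moved for C = A @ B with n x n matrices."""
    flops = 2 * n ** 3     # n multiply-adds for each of the n^2 outputs
    words = 3 * n * n      # read A and B, write C
    return flops / words


for n in (100, 1000, 10000):
    print(n, round(matvec_intensity(n), 2), round(matmul_intensity(n), 2))
```

Under these assumptions the matrix-vector ratio stays just below 2 flops per word regardless of n, while the matrix-matrix ratio grows as 2n/3, which is why only the latter can stop being limited by the memory interface.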

For implementing matrix-matrix operations, the matrix units introduced initially by NVIDIA and then by AMD, Apple, Intel and, starting next year, also by Arm, are preferable, because they reduce even further the number of memory transfers that prevent scaling, in comparison with implementing the same matrix-matrix operations in vector units.