Hi, I addressed the comparisons in other comments, but in general this is for C++ users and not Python. It's more of a comparison to NumPy/CuPy, and we do have a table showing the comparison in the docs.
Actually just started looking into MatX yesterday to accelerate our radar pipeline. Really interesting to see that this use-case is heavily featured in the documentation.
Is the UCLA/NVIDIA/Raytheon collaboration (as presented in a recent GTC talk) a major force behind the development of MatX?
Hi, yes, the original development was started for radar users who did not know CUDA but needed to write in C++. Many of our examples and code are radar-related for that reason.
Thanks, looks really interesting. Do functions like matmul support inputs of differing types, say A=int8 and B=float? It would be nice if you could get memory-efficient quantized matmul with operator fusion.
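To make the question concrete, here is a minimal CPU reference sketch of what a fused mixed-type matmul would compute. This is not MatX, cuBLASLt, or CUTLASS code; the function name `matmul_int8_float` and the per-tensor `scale` factor are assumptions made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference implementation: D = (scale * A) * B.
// A is m x k stored as int8, B is k x n stored as float, D is m x n.
// All matrices are row-major. `scale` is an assumed per-tensor
// dequantization factor.
std::vector<float> matmul_int8_float(const std::vector<std::int8_t>& A,
                                     const std::vector<float>& B,
                                     std::size_t m, std::size_t k, std::size_t n,
                                     float scale) {
    std::vector<float> D(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            // The int8 -> float conversion is fused into the multiply loop,
            // so a float copy of A is never written to memory.
            const float a = scale * static_cast<float>(A[i * k + p]);
            for (std::size_t j = 0; j < n; ++j) {
                D[i * n + j] += a * B[p * n + j];
            }
        }
    }
    return D;
}
```

The point is that A stays in its 1-byte form in memory and is only widened in registers, which is what a fused GPU kernel would ideally do as well.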
CUTLASS, which is NVIDIA’s C++ template library for writing matrix multiply and convolution kernels parametrized over input/output types, operators, and algorithm block sizes, theoretically supports this. But each element of the (k, n) shaped matrix B will be read from global memory ceil(m / block dimension) times, where m is the number of rows of the output matrix D, in an algorithm that computes one (block dimension, block dimension) submatrix of D per thread block. It will probably be more efficient to cast your B matrix to a lower precision such as FP16 or INT8 in a preprocessing kernel to reduce memory traffic in the matrix multiply kernel.
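A back-of-the-envelope sketch of that traffic argument, assuming no cache reuse between thread blocks. The helper `b_traffic_bytes` and its `rereads` parameter (how many times each element of B is fetched from global memory) are hypothetical names, not part of CUTLASS.

```cpp
#include <cstdint>

// Estimated bytes of B fetched from global memory over the whole kernel.
// k, n:           dimensions of B
// rereads:        assumed number of times each element of B is loaded
// bytes_per_elem: storage size of one element (4 for FP32, 2 for FP16, 1 for INT8)
constexpr std::uint64_t b_traffic_bytes(std::uint64_t k, std::uint64_t n,
                                        std::uint64_t rereads,
                                        std::uint64_t bytes_per_elem) {
    return k * n * rereads * bytes_per_elem;
}
```

With k = n = 4096 and 32 re-reads per element, FP32 storage moves 2 GiB of B while INT8 storage moves 512 MiB, so the one-time cost of a preprocessing cast is amortized over every re-read.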
On newer GPUs, though, we have this huge L2 cache, which makes the calculus a little different if your working set fits into it. For example, the Ampere A100 has a 40 MB L2 cache.
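A rough way to check the "fits into L2" condition for a single matrix, as a sketch only. The names `kA100L2Bytes` and `fits_in_l2` are made up here, the 40 MB figure is treated as 40 * 1024 * 1024 bytes, and real kernels share the cache with the other operands, so this is an optimistic bound.

```cpp
#include <cstdint>

// Assumed L2 capacity for an A100, taking "40 MB" as 40 * 1024 * 1024 bytes.
constexpr std::uint64_t kA100L2Bytes = 40ull * 1024 * 1024;

// True if a rows x cols matrix with the given element size fits entirely
// in a cache of l2_bytes. Ignores every other buffer competing for the cache.
constexpr bool fits_in_l2(std::uint64_t rows, std::uint64_t cols,
                          std::uint64_t bytes_per_elem,
                          std::uint64_t l2_bytes) {
    return rows * cols * bytes_per_elem <= l2_bytes;
}
```

For a 4096 x 4096 matrix, the FP32 version (64 MiB) does not fit under this assumption, while the INT8 version (16 MiB) does.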
We typically support whatever the underlying library supports. For int8, the support would currently come from cuBLASLt. I don't believe either cuBLASLt or CUTLASS supports mixed-precision inputs, but I can check.
And then maybe also a comparison to Flashlight (https://github.com/flashlight/flashlight), xtensor (https://github.com/xtensor-stack/xtensor) or other C/C++ based ML/computing libraries?
Also, there is no mention of it, so I suppose this does not support automatic differentiation?