by cburdick13 987 days ago
Hi all, I'm one of the maintainers of MatX. I didn't expect it to hit HN this soon, but happy to answer any questions.
3 comments

I think a comparison to PyTorch, TensorFlow and/or JAX is more relevant than a comparison to CuPy/NumPy.

And then maybe also a comparison to Flashlight (https://github.com/flashlight/flashlight), xtensor (https://github.com/xtensor-stack/xtensor) or other C/C++ based ML/computing libraries?

Also, there is no mention of it, so I suppose this does not support automatic differentiation?

Hi, I addressed the comparisons in other comments, but in general this is for C++ users and not Python users. It's more of a comparison to NumPy/CuPy, and we do have a table showing the comparison in the docs.

We don't support automatic differentiation (yet).

I actually just started looking into MatX yesterday to accelerate our radar pipeline. Really interesting to see that this use case is heavily featured in the documentation.

Is the UCLA/Nvidia/Raytheon collaboration (as presented in a recent GTC talk) a major force behind the development of MatX?

Hi, yes, the original development was started for radar users who did not know CUDA but needed to write in C++. Many of our examples and code are radar-related for that reason.

Thanks, looks really interesting. Do functions like matmul support inputs of differing types, say A=int8 and B=float? It would be nice if you could get memory-efficient quantized matmul with operator fusion.

CUTLASS, NVIDIA’s C++ template library for writing matrix multiply and convolution kernels parametrized over input/output types, operators, and algorithm block sizes, theoretically supports this. However, in an algorithm where each thread block computes one (block dimension, block dimension) submatrix of the output matrix D, each element of the (k, n) shaped matrix B is read from global memory ceil(m / block dimension) times, where m is the number of rows of A and D. It will probably be more efficient to cast your B matrix to a lower precision such as FP16 or INT8 in a preprocessing kernel to reduce memory traffic in the matrix multiply kernel.
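
To put rough numbers on that, here is a minimal back-of-the-envelope sketch in plain C++. It assumes D = A * B with A of shape (m, k) and B of shape (k, n), one square output tile per thread block, and no cache absorbing the repeated reads. The name m and the helpers `ceil_div` and `b_traffic_bytes` are made up for illustration and are not part of CUTLASS or MatX. Under those assumptions every element of B is needed once per tile row of D.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for unsigned integers; b must be non-zero.
inline std::uint64_t ceil_div(std::uint64_t a, std::uint64_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Estimated bytes of B fetched from global memory by a tiled GEMM D = A * B.
// Assumptions (hypothetical model, not a measured number):
//   - A is (m, k), B is (k, n), D is (m, n)
//   - each thread block computes one (tile, tile) block of D
//   - no cache serves the repeated reads
inline std::uint64_t b_traffic_bytes(std::uint64_t m, std::uint64_t k,
                                     std::uint64_t n, std::uint64_t tile,
                                     std::uint64_t bytes_per_element) {
    // Each element of B is read once for every tile row of D.
    const std::uint64_t rereads = ceil_div(m, tile);
    return rereads * k * n * bytes_per_element;
}
```

For m = k = n = 4096 and a tile size of 128, `b_traffic_bytes(4096, 4096, 4096, 128, 4)` models FP32 and `b_traffic_bytes(4096, 4096, 4096, 128, 1)` models INT8. The estimate scales linearly with the element size, which is the argument for casting B down before the multiply.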

On newer GPUs, though, there is a huge L2 cache, which makes the calculus a little different if your working set fits into it. For example, the Ampere A100 has a 40 MB L2 cache.
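
A quick way to sanity-check the "fits into L2" condition is sketched below in plain C++. It only looks at the B matrix and ignores A, D, and anything else competing for the cache, so it is optimistic. Reading the 40 MB figure as 40 * 2^20 bytes is an assumption, and `kA100L2Bytes` and `b_fits_in_l2` are hypothetical names.

```cpp
#include <cstdint>

// 40 MB L2 as quoted for the A100; interpreted here as 40 * 2^20 bytes (assumption).
constexpr std::uint64_t kA100L2Bytes = 40ULL * 1024 * 1024;

// True if a (k, n) matrix with the given element size fits in l2_bytes.
// Optimistic: ignores A, D, and any other data sharing the cache.
constexpr bool b_fits_in_l2(std::uint64_t k, std::uint64_t n,
                            std::uint64_t bytes_per_element,
                            std::uint64_t l2_bytes = kA100L2Bytes) {
    return k * n * bytes_per_element <= l2_bytes;
}
```

With k = n = 4096, an FP32 B is 64 MiB and does not fit under this model, while FP16 (32 MiB) and INT8 (16 MiB) do, so the element type can decide which side of the line you land on.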

We typically support whatever the underlying library supports. For int8, the support currently comes from cuBLASLt. I don't believe cuBLASLt or CUTLASS supports mixed-precision inputs, but I can check.
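
For anyone unsure what the mixed-input case being asked about looks like, here is a minimal CPU reference in plain C++. It is not MatX, cuBLASLt, or CUTLASS code. The function name `matmul_int8_float`, the single per-tensor scale, and the row-major layout are all assumptions made for illustration. The int8 matrix is dequantized on the fly inside the multiply, which is the "fusion" part: A is never expanded into a float matrix in memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference D = (a_scale * A_q) * B on the CPU, all matrices row-major.
// A_q: (m, k) int8, B: (k, n) float, D: (m, n) float.
// Hypothetical helper for illustration only; not a library API.
std::vector<float> matmul_int8_float(const std::vector<std::int8_t>& a_q,
                                     float a_scale,
                                     const std::vector<float>& b,
                                     std::size_t m, std::size_t k,
                                     std::size_t n) {
    std::vector<float> d(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t p = 0; p < k; ++p) {
            // Dequantize one element of A as it is read (one byte per element).
            const float a = a_scale * static_cast<float>(a_q[i * k + p]);
            for (std::size_t j = 0; j < n; ++j) {
                d[i * n + j] += a * b[p * n + j];
            }
        }
    }
    return d;
}
```

Calling it with a 2x2 int8 matrix, a scale of 0.5f, and a 2x2 identity for B returns the dequantized A, which is a handy check that the indexing is right.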