|
|
|
|
|
by emacs28
456 days ago
|
|
First author here. The hardware architectures are realistic - we developed & evaluated real example hardware implementations for them, validated on FPGA, and they achieved state-of-the-art ResNet performance in a deep learning accelerator system implementation compared to prior accelerators evaluated on similar FPGAs. See the associated accelerator system source code here: https://github.com/trevorpogue/algebraic-nnhw The hardware architectures focused on in the paper are systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs. It may be possible to extend the proposed KMM algorithm to other types of hardware architectures also in future work. Regarding floating point - this work is applicable for integer matrix multiplication acceleration, it may be possible to extend the concept to floating point data types in future work also. |
|
this is wrong. TPUv4 has tensor cores just like NVIDIA has tensor cores just like AMD has tensor cores. no one uses a systolic array because bandwidth/connectivity is much scarcer than compute. the only people that keep talking about them are academics that don't actually fab/sell chips.
https://cloud.google.com/tpu/docs/v4
https://www.nvidia.com/en-us/data-center/tensor-cores/
https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...
ninja edit: before you gotcha me with "a tensor core is a systolic array!!!" - most tensor cores are actually outerproduct engines not riffle shuffle engines (or whatever you wanna call the topology corresponding to a systolic array).