by emacs28 456 days ago
First author here. The hardware architectures are realistic - we developed & evaluated real example hardware implementations for them, validated on FPGA, and they achieved state-of-the-art ResNet performance in a deep learning accelerator system implementation compared to prior accelerators evaluated on similar FPGAs. See the associated accelerator system source code here:

https://github.com/trevorpogue/algebraic-nnhw

The hardware architectures the paper focuses on are systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs. It may be possible to extend the proposed KMM algorithm to other types of hardware architectures in future work. Regarding floating point: this work applies to integer matrix multiplication acceleration, though it may be possible to extend the concept to floating-point data types in future work as well.
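For readers who want a feel for the idea, here is a minimal software sketch, assuming KMM means applying Karatsuba's trick at the matrix level to unsigned integer inputs. The function names (`matmul`, `split`, `karatsuba_matmul`) are made up for illustration; the formulation and hardware mapping in the paper and the linked repository may differ.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split(m, half):
    """Split every element into (high, low) parts at bit position `half`."""
    mask = (1 << half) - 1
    hi = [[v >> half for v in row] for row in m]
    lo = [[v & mask for v in row] for row in m]
    return hi, lo


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Element-wise difference of two equally shaped matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def karatsuba_matmul(a, b, width):
    """Multiply matrices of unsigned `width`-bit integers using three
    matrix products on narrower operands instead of four."""
    half = width // 2
    a_hi, a_lo = split(a, half)
    b_hi, b_lo = split(b, half)

    hh = matmul(a_hi, b_hi)                      # high x high
    ll = matmul(a_lo, b_lo)                      # low x low
    # One product of the summed halves yields both cross terms.
    mid = sub(sub(matmul(add(a_hi, a_lo), add(b_hi, b_lo)), hh), ll)

    return [[(h << (2 * half)) + (m << half) + l
             for h, m, l in zip(rh, rm, rl)]
            for rh, rm, rl in zip(hh, mid, ll)]
```

The point of doing this on whole matrices rather than inside each scalar multiply is that the extra additions are applied to the input and output matrices, not repeated for every individual multiplication.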

3 comments

> systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs

this is wrong. TPUv4 has tensor cores just like NVIDIA has tensor cores just like AMD has tensor cores. no one uses a systolic array because bandwidth/connectivity is much scarcer than compute. the only people that keep talking about them are academics that don't actually fab/sell chips.

https://cloud.google.com/tpu/docs/v4

https://www.nvidia.com/en-us/data-center/tensor-cores/

https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...

ninja edit: before you gotcha me with "a tensor core is a systolic array!!!" - most tensor cores are actually outerproduct engines not riffle shuffle engines (or whatever you wanna call the topology corresponding to a systolic array).

https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...

>The primary task for TPUs is matrix processing, which is a combination of multiply and accumulate operations. TPUs contain thousands of multiply-accumulators that are directly connected to each other to form a large physical matrix. This is called a systolic array architecture. Cloud TPU v3, contain two systolic arrays of 128 x 128 ALUs, on a single processor.
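To make "multiply-accumulators that are directly connected to each other" concrete, here is a small cycle-by-cycle software model of a systolic array. It is only a sketch: it assumes an output-stationary layout chosen for simplicity, the name `systolic_matmul` is made up, and it is not a claim about how any TPU generation is actually wired.

```python
def systolic_matmul(a, b):
    """Simulate an n x m grid of processing elements (PEs).

    PE (i, j) accumulates C[i][j]. Values of A enter at the left edge and
    move one PE to the right per cycle; values of B enter at the top edge
    and move one PE down per cycle. Inputs are skewed so matching operands
    meet in the right PE.
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # accumulator in each PE
    a_reg = [[0] * m for _ in range(n)]  # A value currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B value currently held by each PE

    for t in range(k + n + m - 2):
        # Walk from bottom-right to top-left so each PE still sees the
        # value its left/upper neighbour held in the previous cycle.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:
                    idx = t - i  # row i is delayed by i cycles
                    a_in = a[i][idx] if 0 <= idx < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    idx = t - j  # column j is delayed by j cycles
                    b_in = b[idx][j] if 0 <= idx < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in
    return acc
```

Each PE only ever talks to its immediate neighbours, which is the defining property: operands are fetched from memory once at the edges and then reused as they pass through the grid.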

I don't see any contradiction between your claim that TPU v3 uses systolic arrays and the parent post's claim that TPU v4 does not.

The TPU obviously uses a systolic array: https://jax-ml.github.io/scaling-book/tpus/

Fair enough - my understanding was they moved away from systolic arrays. I stand corrected. I will also say it is well known that they're basically impossible to program/build a compiler for.

This is why Google has 500 people working on the TPU compiler team.

I have used Karatsuba's algorithm and Winograd's inner product algorithm [0] in my work on wide multi-SIMD integer multipliers and matrix multiplication HW for DSPs. The latter cuts the multiplications roughly in half - about n^3/2 instead of n^3, at the cost of extra additions. I think the paper talks about its derivative, FFIP.

The issue is memory bandwidth. These techniques do help you save multiplier area; however, the performance is still bandwidth limited - you'd need to be able to feed more data per cycle to increase performance.

One thing the paper doesn't talk about is energy. For DNNs, at the network level the energy consumed by integer MACs is not that high. Localizing data computation would have much more impact on energy reduction than optimizing MACs.

[0] https://ieeexplore.ieee.org/document/1687427
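A rough sketch of the inner product identity mentioned above, as I understand it, for an even inner dimension. The name `winograd_matmul` is made up for illustration, and this is plain software, not the FFIP hardware design from the paper.

```python
def winograd_matmul(a, b):
    """Matrix product using about half the multiplications in the main loop.

    For each pair of positions (k, k+1):
        (x[k] + y[k+1]) * (x[k+1] + y[k])
            = x[k]*y[k] + x[k+1]*y[k+1] + x[k]*x[k+1] + y[k]*y[k+1]
    The last two terms depend on only one operand, so they are computed
    once per row of A and once per column of B, then subtracted.
    """
    n = len(a[0])
    if n % 2:
        raise ValueError("inner dimension must be even (pad with a zero)")
    cols = list(zip(*b))

    # Correction terms: O(n^2) multiplications in total, reused everywhere.
    row_corr = [sum(r[k] * r[k + 1] for k in range(0, n, 2)) for r in a]
    col_corr = [sum(c[k] * c[k + 1] for k in range(0, n, 2)) for c in cols]

    # Main loop: n/2 multiplications per output element instead of n.
    return [[sum((r[k] + c[k + 1]) * (r[k + 1] + c[k])
                 for k in range(0, n, 2)) - row_corr[i] - col_corr[j]
             for j, c in enumerate(cols)]
            for i, r in enumerate(a)]
```

Note that every multiplication in the main loop now needs two additions on its inputs first, which is the adder-for-multiplier trade the comment describes. It does nothing for the number of operands that must be fetched.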

On an FPGA, integer adders are much more abundant than integer multipliers, so this algorithm definitely helps get more utilization out of the FPGA. Once the multiplier is small enough, say 3 bits by 3 bits, it can fit into several LUT6s.
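To see why a 3x3 multiplier fits in a handful of LUT6s: the two operands together are 6 input bits, and the product (at most 7 * 7 = 49) needs 6 output bits, so each output bit is a 6-input Boolean function, which is exactly what one LUT6 stores. The behavioral model below is a sketch with made-up names (`build_lut6_tables`, `lut_multiply`); a real synthesis tool may well use fewer LUTs, since some output bits depend on fewer inputs.

```python
def build_lut6_tables():
    """Return six 64-bit truth tables, one per product bit.

    The LUT address is formed as (b << 3) | a for 3-bit operands a and b.
    """
    tables = [0] * 6
    for addr in range(64):
        a = addr & 0b111
        b = addr >> 3
        product = a * b
        for bit in range(6):
            if (product >> bit) & 1:
                tables[bit] |= 1 << addr
    return tables


def lut_multiply(a, b, tables):
    """Multiply two 3-bit unsigned values using only table lookups."""
    addr = (b << 3) | a
    return sum(((table >> addr) & 1) << bit
               for bit, table in enumerate(tables))
```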