by bacon_blood 2009 days ago
Most machine learning runs on _NVIDIA_ GPUs, which have themselves had a neural engine (tensor cores) for the last two generations. An NVIDIA A100 delivers around 19.5 FP32 teraflops but 156 "tensor flops" in TF32 (312 if you use sparse matrices).

In addition to being useful for training and inference, the consumer cards use tensor cores for things like mic filtering (RTX Voice) and neural upscaling (DLSS) in games.

General purpose GPU hardware is way more wasteful for matrix math, like maybe >10x waste on power and equally worse performance, than tensor cores.

1 comments

> General purpose GPU hardware is way more wasteful for matrix math, like maybe >10x waste on power and equally worse performance, than tensor cores.

I didn't realize there was that distinction; I thought GPUs were just optimized for vector arithmetic across the board. What is the difference between general purpose GPU hardware and tensor cores? What does general purpose GPU hardware do that tensor cores do not?

A GPU's compute core, to my limited understanding, is sort of like a CPU with a lot of threads that execute the same instruction at the same time (aka SIMT). It has a decode frontend, local memory cache, addressing, registers, and a lot of instructions available to dispatch - each cycle can do something wildly different from the previous cycle. Each distinct instruction requires some dedicated hardware.
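To make the SIMT idea concrete, here is a toy sketch in Python. It is not how any real GPU is implemented: the warp size of 32, the two-instruction "ISA", and the `run_warp` helper are all assumptions made up for illustration. The point is only that each instruction is decoded once and then applied by every lane to its own data.

```python
import operator

# Toy model of SIMT. WARP_SIZE, OPS and run_warp are illustrative assumptions,
# not a description of real GPU hardware.
WARP_SIZE = 32
OPS = {"add": operator.add, "mul": operator.mul}

def run_warp(program, registers):
    """Run `program` in lockstep; registers[name] holds one value per lane."""
    for op, dst, a, b in program:
        fn = OPS[op]  # one decode per instruction...
        # ...then every lane applies that same instruction to its own data
        registers[dst] = [fn(x, y) for x, y in zip(registers[a], registers[b])]
    return registers

regs = {
    "tid": list(range(WARP_SIZE)),  # each lane's thread index
    "two": [2] * WARP_SIZE,
    "one": [1] * WARP_SIZE,
}
# out = tid * 2 + 1 for all 32 lanes, using only two decoded instructions
program = [("mul", "tmp", "tid", "two"), ("add", "out", "tmp", "one")]
out = run_warp(program, regs)["out"]
```

Two instructions produce 32 results here, which is where the throughput comes from; the cost is that all lanes must want the same instruction at the same time.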

To my knowledge, "tensor cores" and neural accelerators are modeled on something like a coprocessor with a very fast memory bus and a bajillion copies of the same small execution unit, each doing a single operation in parallel (like a small matrix multiply; NVIDIA's Volta tensor cores work on 4x4 tiles) on behalf of the main processor. Like if AVX512 was actually AVX 1,000,000 and only had one instruction, and that instruction did some kind of small-matrix math.
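A rough way to picture "only one instruction" is a matrix multiply written so that the only arithmetic it ever performs is a small tile multiply-accumulate. The sketch below assumes a 4x4 tile chosen for illustration; `tile_mma`, `get_tile` and `matmul_tiled` are hypothetical names, and real hardware would run many of these tile operations in parallel rather than in a Python loop.

```python
TILE = 4  # tile size assumed for illustration

def tile_mma(a, b, c):
    """The single 'instruction': return a @ b + c for TILE x TILE tiles."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)] for i in range(TILE)]

def get_tile(m, r, c):
    """Copy the TILE x TILE block of m whose top-left corner is (r, c)."""
    return [row[c:c + TILE] for row in m[r:r + TILE]]

def matmul_tiled(A, B):
    """Multiply square matrices (size a multiple of TILE) using only tile_mma."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    ops = 0  # number of tile instructions issued
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]
            for k in range(0, n, TILE):
                acc = tile_mma(get_tile(A, i, k), get_tile(B, k, j), acc)
                ops += 1
            for r in range(TILE):
                C[i + r][j:j + TILE] = acc[r]  # write the finished tile back
    return C, ops

n = 8
A = [[i + j for j in range(n)] for i in range(n)]
identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
C, ops = matmul_tiled(A, identity)
```

For the 8x8 case this issues (8/4)^3 = 8 tile operations instead of 512 separate multiply-adds, and every one of those operations is identical, which is what lets the hardware be so uniform.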

Imagine you had a very large specialized house made entirely of kitchens (instead of bedrooms, bathrooms, etc.), and your roommates were all cooks. You could cook significantly more food at once than in the typical one-kitchen household. You would also save power per meal cooked, because all of the lights and electricity go to kitchens instead of other rooms.

So on a pipelined CPU, the processor has different pipeline stages that execute in order. An instruction may for example move from fetch, to decode, to load, to execute, to store. The execute step may run on a different part of the CPU depending on what kind of instruction it is. Basic arithmetic, floating point, and vector math (such as AVX) can be dispatched through different execution ports and run on different parts of the processor. So a processor may (per core) have a pipeline, an integer math unit, a floating point math unit, and two vector units. Operations running on an execution unit also take some variable amount of time to complete. Having ports to two execution units of the same type means the pipeline can dispatch a second long instruction of that type before stalling.
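The effect of a second execution unit can be sketched with a tiny cycle counter. This is a deliberately simplified model, not a real microarchitecture: it assumes in-order issue of at most one instruction per cycle, non-pipelined units, and made-up latencies, and `cycles_to_finish` is a hypothetical helper.

```python
def cycles_to_finish(instructions, units):
    """Simplified in-order issue model, at most one instruction per cycle.

    instructions: list of (unit_type, latency) tuples.
    units: dict mapping unit_type -> number of identical, non-pipelined units.
    Returns the cycle at which the last instruction completes.
    """
    # For each unit type, the cycle at which each of its units becomes free
    busy_until = {kind: [0] * count for kind, count in units.items()}
    cycle = 0
    finish = 0
    for kind, latency in instructions:
        free_at = busy_until[kind]
        idx = min(range(len(free_at)), key=free_at.__getitem__)
        start = max(cycle, free_at[idx])  # stall here if every unit is busy
        free_at[idx] = start + latency
        finish = max(finish, start + latency)
        cycle = start + 1                 # the next instruction issues later
    return finish

# Three long vector operations with one short integer operation in between
stream = [("vec", 4), ("vec", 4), ("int", 1), ("vec", 4)]
one_unit = cycles_to_finish(stream, {"vec": 1, "int": 1})   # 12 cycles
two_units = cycles_to_finish(stream, {"vec": 2, "int": 1})  # 8 cycles
```

With one vector unit the second vector instruction has to wait for the first to finish; with two, it issues on the very next cycle, so the same stream completes in 8 cycles instead of 12 in this toy model.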

I don't actually know how the hardware of a tensor accelerator works, but I would imagine a "tensor core" to be thousands of identical execution units that can only do basic matrix math, fed by a basic pipeline that is much simpler than a typical CPU's.

CPUs and GPUs have highly variable workloads and need a lot of specialized hardware on chip that may not always be in use. This wastes power and means you can't run any one task as densely or efficiently as a dedicated chip would. If you're designing a dedicated chip that has direct access to the main CPU memory (as the Apple Neural Engine has), you can design the chip to directly (streaming) read a large matrix from memory, perform an operation on it, and store the result back into memory.

Normal CPUs and GPUs don't have this capability. They build matrix math out of lots of individual instructions that each go through a pipeline, stall, cache miss, etc., just to do a lot of floating point vector math and store the result back to memory. A dedicated chip can skip all that overhead and just tile efficient matrix math across thousands of execution units. That's why an NVIDIA A100 is rated at about 19.5 teraflops when doing normal FP32 vector math, and about 156 teraflops when doing TF32 matrix math on its tensor cores (312 for FP16). It has a section of chip dedicated to efficiently doing the required floating point operations en masse, without the overhead, extra cycles, and cache needed for fetching instructions.
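As a back-of-the-envelope illustration of that overhead, the sketch below counts how many instructions have to be fetched and decoded for the same matrix multiply under three issue models. The models are assumptions for illustration (a 16-lane FP32 vector unit, as in AVX-512, and a hypothetical 4x4x4 tile instruction), and `instruction_counts` is a made-up helper. The last line just divides NVIDIA's published A100 figures, 156 TF32 tensor teraflops by 19.5 FP32 teraflops.

```python
def instruction_counts(n, vector_width=16, tile=4):
    """Instructions issued for an n x n x n matmul under three toy models."""
    macs = n ** 3  # total multiply-accumulates required
    return {
        "scalar_fma": macs,                  # one multiply-accumulate each
        "vector_fma": macs // vector_width,  # 16 FP32 lanes per instruction
        "tile_mma": macs // tile ** 3,       # one 4x4x4 tile op per instruction
    }

counts = instruction_counts(1024)

# Ratio of the published A100 peak figures: TF32 tensor vs. FP32 vector
a100_ratio = 156 / 19.5  # 8.0
```

Fewer, fatter instructions mean less fetch, decode, and scheduling work per useful multiply-add. This count is only meant to give a feel for where the savings come from, not to explain the 8x figure exactly.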