|
A GPU's compute core, to my limited understanding, is sort of like a CPU with a lot of threads that execute the same instruction at the same time (aka SIMT). It has a decode frontend, a local memory cache, addressing, registers, and a lot of instructions available to dispatch, so each cycle can do something wildly different from the previous cycle. Each distinct instruction requires some dedicated hardware. To my knowledge, "tensor cores" and neural accelerators are modeled on something like a coprocessor with a very fast memory bus and a bajillion copies of the same small execution unit, each of which can do a single operation in parallel on behalf of the main processor (a small matrix multiply-accumulate, e.g. 4x4 on NVIDIA's first tensor cores). Like if AVX512 was actually AVX 1,000,000 and only had one instruction, and that instruction did some kind of small matrix math.
Imagine a very large specialized house made entirely of kitchens (instead of bedrooms, bathrooms, etc.) where your roommates are all cooks: you could cook significantly more food at once than in the typical one-kitchen household. You also save power per meal cooked, because all of the lights and electricity go to kitchens instead of other rooms.
On a pipelined CPU, the processor has different pipeline stages that execute in order. An instruction may, for example, move from fetch, to decode, to load, to execute, to store. The execute step may run on a different part of the CPU depending on what kind of instruction it is. Basic arithmetic, floating point, and vector math (such as AVX) can be dispatched through different execution ports and run on different parts of the processor. So a processor may (per core) have a pipeline, an integer math unit, a floating point math unit, and two vector units. Operations running on an execution unit also take a variable number of cycles to complete, depending on the instruction.
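As a rough illustration of how execution units and their latencies interact with the pipeline, here is a toy model in Python. Everything in it is an assumption made for the sketch: the `run` function, the unit names `"vec"` and `"int"`, the latencies, and the simplification that a unit stays busy for the whole latency of an instruction (real units are usually pipelined themselves). It issues at most one instruction per cycle, in order, and counts how many cycles the front end has to wait for a free unit.

```python
def run(instructions, units):
    """Toy in-order issue model (hypothetical, not any real CPU).

    instructions: list of (unit_type, latency_in_cycles)
    units: dict mapping unit_type -> number of identical units
    Returns (finish_cycle, stall_cycles).
    """
    # free_at[kind][i] is the first cycle at which unit i of that kind is idle.
    free_at = {kind: [0] * count for kind, count in units.items()}
    cycle = 0    # cycle at which the next instruction may issue
    stalls = 0   # cycles spent waiting for a busy unit
    finish = 0   # cycle at which the last result is ready
    for kind, latency in instructions:
        pool = free_at[kind]
        # Pick the unit of the right kind that becomes free soonest.
        idx = min(range(len(pool)), key=pool.__getitem__)
        if pool[idx] > cycle:
            stalls += pool[idx] - cycle
            cycle = pool[idx]
        pool[idx] = cycle + latency
        finish = max(finish, pool[idx])
        cycle += 1  # at most one instruction issues per cycle
    return finish, stalls


# Two long vector instructions followed by one short integer instruction.
program = [("vec", 4), ("vec", 4), ("int", 1)]
one_vec = run(program, {"vec": 1, "int": 1})
two_vec = run(program, {"vec": 2, "int": 1})
print(one_vec, two_vec)
```

With a single vector unit the second vector instruction has to wait for the first to finish; with two, it issues on the next cycle and nothing stalls.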
Having ports to two execution units of the same type lets the processor pipeline dispatch a second long-running instruction of that type before stalling. I don't actually know how the hardware of a tensor accelerator works, but what I would imagine a "tensor core" to be is thousands of identical execution units that can only do basic matrix math, fed by a basic pipeline that is much simpler than a typical CPU's.
CPUs and GPUs have highly variable workloads and need a lot of specialized hardware on chip that may not always be in use. This wastes power and means you can't run any one task as densely or efficiently as a dedicated chip could. If you're designing a dedicated chip that has direct access to main CPU memory (as the Apple Neural Engine has), you can design it to stream a large matrix directly from memory, perform an operation on it, and store the result back into memory. General-purpose CPU and GPU cores don't work this way. They approximate matrix math with lots of individual instructions that go through a pipeline, stall, miss cache, and so on, just to do a lot of floating point vector math and store the result back to memory. A dedicated chip can skip that overhead and just tile efficient matrix math across thousands of execution units.
That's why an NVIDIA A100 is rated at about 19.5 teraflops for normal FP32 vector math, but about 312 teraflops for dense FP16 matrix math on its tensor cores, roughly 16 times as much. It has a section of the chip dedicated to doing the required floating point operations en masse, without the overhead of extra cycles and cache for fetching instructions. |
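To make the "tile the matrix math across identical units" idea concrete, here is a minimal Python sketch. It is only a software analogy, not how any real accelerator is programmed: `TILE`, `mma_tile`, `tile`, and `tiled_matmul` are names invented for this example, and the 4x4 tile size is an assumption. Each call to `mma_tile` stands in for one fixed-function unit doing a single small multiply-accumulate, and the outer loops stand in for the simple control logic that feeds tiles to those units.

```python
TILE = 4  # assumed tile size, chosen for illustration only


def mma_tile(a, b, c):
    """One 'execution unit' operation: return c + a @ b for TILE x TILE tiles."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)]
            for i in range(TILE)]


def tile(m, r, c):
    """Copy the TILE x TILE block of m whose top-left corner is (r, c)."""
    return [row[c:c + TILE] for row in m[r:r + TILE]]


def tiled_matmul(a, b):
    """Multiply square matrices whose size is a multiple of TILE, tile by tile.

    Returns (product, number_of_tile_operations).
    """
    n = len(a)
    out = [[0] * n for _ in range(n)]
    tile_ops = 0
    for r in range(0, n, TILE):
        for c in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]
            for k in range(0, n, TILE):
                # In hardware, independent output tiles could run in parallel.
                acc = mma_tile(tile(a, r, k), tile(b, k, c), acc)
                tile_ops += 1
            for i in range(TILE):
                out[r + i][c:c + TILE] = acc[i]
    return out, tile_ops


def naive_matmul(a, b):
    """Reference result: one scalar multiply-add at a time."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
product, tile_ops = tiled_matmul(a, b)
print(tile_ops, n ** 3)
```

For the 8x8 case the whole product takes 8 tile operations instead of 512 individual scalar multiply-adds, which is the kind of instruction-count saving the paragraph above is gesturing at.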