by Thorondor 22 days ago
Hardware engineer here. Special casing the multiply by 0 and multiply by 1 paths is harder than it sounds. In software, the cost of adding special cases is simply performance. You're adding more instructions that execute in sequence on a CPU that already physically exists. Doing this for your multiply case is worthwhile because the speedup is large for 0 and 1 while the cost is not that large (relative to the time taken for the whole multiplication operation) for other values.
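
To make the software side concrete, here is a minimal Python sketch of what special-casing looks like. The name `multiply_with_fast_paths` is made up for illustration, and it assumes integer operands (with IEEE floats, 0 * inf is NaN, so the zero shortcut would not be exact).

```python
def multiply_with_fast_paths(a: int, b: int) -> int:
    """Multiply two integers, checking the trivial cases first.

    Hypothetical helper for illustration only. Each check is an extra
    compare-and-branch that runs before the general multiply.
    """
    if a == 0 or b == 0:
        return 0      # fast path: anything times 0 is 0
    if a == 1:
        return b      # fast path: 1 * b
    if b == 1:
        return a      # fast path: a * 1
    return a * b      # general path: pays for the checks above, then multiplies


product_zero = multiply_with_fast_paths(0, 12345)
product_one = multiply_with_fast_paths(1, 12345)
product_general = multiply_with_fast_paths(6, 7)
```

The only cost of the extra branches is time on hardware that already exists, which is why the trade tends to be easy in software when the general multiply is expensive.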

Hardware is different. Every operation that can be performed in hardware by a chip needs dedicated circuitry. Special casing 0 and 1 means adding at least an OR reduction on each operand and a dedicated multiplexer for every bit of the output. Those transistors use power even when they're not in use (leakage power is a huge issue on modern semiconductor processes). They also degrade timing by adding more gates on critical paths through the multipliers. (The timing issue here is that all operations that happen between one flip-flop and another flip-flop need to finish within one clock cycle.) And unless there are whole blocks of 0s and 1s (this does happen in certain neural networks), you typically won't see a direct speedup anyway. In software terms, the matrix multiply is scheduled as many parallel operations that cannot be accelerated much overall by skipping a few operations in some "threads."
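
A toy cycle-count model may help with that last point. This is only a sketch under loud assumptions: every lane does one multiply per step, all lanes advance in lockstep, and a step can be dropped only when every lane has a zero operand at that step. The function names and the example data are invented for illustration and do not describe any real chip.

```python
def cycles_without_skipping(lanes):
    """Every step is executed, whatever the operands are."""
    return len(lanes[0])


def cycles_lockstep(lanes):
    """Lanes advance together, so a step is skipped only if ALL lanes hold a zero."""
    steps = len(lanes[0])
    return sum(1 for i in range(steps) if any(lane[i] != 0 for lane in lanes))


# Eight zeros scattered across the lanes: every step still has work in some lane.
scattered = [
    [0, 0, 2, 3],
    [4, 5, 0, 0],
    [0, 7, 8, 0],
    [1, 0, 3, 0],
]

# The same number of zeros, but lined up as a whole block across all lanes.
blocked = [
    [1, 0, 0, 3],
    [4, 0, 0, 6],
    [7, 0, 0, 9],
    [1, 0, 0, 2],
]

scattered_cycles = cycles_lockstep(scattered)  # no step is zero in every lane
blocked_cycles = cycles_lockstep(blocked)      # two steps are zero in every lane
```

Under these assumptions the scattered case takes as many steps as doing no skipping at all, while the blocked case halves the step count, even though both inputs contain the same number of zeros.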

All of this makes zero skipping a nontrivial topic. People do still try to do it, but it needs serious consideration because, depending on the application, the case is rarely one-sided.


You didn't touch on the most important aspect for cost: die area!

How much die space ($) will that circuitry add, when there's probably a statistically near zero chance for your main customers' workload to use it (who has model weights of 0 or 1!?)? And, if you can stomach the cost, what else could you put there instead?

Weights should not be 0 (at least not frequently) but in a ReLU-based neural network, activations are 0 pretty often. You're absolutely right about die area though.
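
As a rough illustration of why ReLU activations are often zero, here is a small Python sketch. It assumes pre-activations drawn from a zero-mean Gaussian, which is an assumption made for the example and not a measurement of any real network; actual zero fractions depend on the model and the data.

```python
import random


def relu(x: float) -> float:
    """ReLU clamps every negative input to exactly zero."""
    return x if x > 0 else 0.0


random.seed(0)  # fixed seed so the sketch is repeatable

# Assumed pre-activations: zero-mean, unit-variance Gaussian samples.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(v) for v in pre_activations]

# Fraction of outputs that are exactly zero after ReLU.
zero_fraction = sum(1 for v in activations if v == 0.0) / len(activations)
```

With symmetric inputs like these, roughly half of the outputs come out as exact zeros, since every negative pre-activation is clamped.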
> near zero chance for your main customers' workload

What percent of this hardware is running inference for ReLU models? ;)

Nvidia has added structural sparsity to their GPUs and every time they pull out a flops or tops number, they assume you will use structural sparsity.
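
For readers who have not seen it, the pattern Nvidia's sparse tensor cores support is 2:4 fine-grained structured sparsity: in every group of four weights, at most two are nonzero. The Python sketch below shows the general idea of pruning to that pattern and then multiplying with only the kept values. The function names are made up, and keeping the two largest magnitudes per group is just one simple pruning choice assumed for the example; this is not Nvidia's actual storage format or API.

```python
def prune_2_of_4(weights):
    """Keep the two largest-magnitude weights in each group of four.

    Returns (values, indices): the kept weights and their original positions.
    Hypothetical helper for illustration only.
    """
    assert len(weights) % 4 == 0, "length must be a multiple of 4"
    values, indices = [], []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Rank positions by magnitude, keep the top two, restore original order.
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        for i in sorted(ranked[:2]):
            values.append(group[i])
            indices.append(start + i)
    return values, indices


def sparse_dot(values, indices, activations):
    """Dot product that touches only the kept weights: half the multiplies."""
    return sum(v * activations[i] for v, i in zip(values, indices))


weights = [0.1, -2.0, 0.0, 3.0, 1.0, 0.0, 0.0, -1.0]
activations = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]

kept_values, kept_indices = prune_2_of_4(weights)
result = sparse_dot(kept_values, kept_indices, activations)
```

Because the zeros follow a fixed pattern, the number of multiplies per group is known ahead of time, which is different from testing each operand for zero at run time.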

The die area argument here makes no sense. Supporting structural sparsity can be done either by duplicating the multipliers with and without the support or you have a single general purpose multiplier that does both, in which case you can have twice as many of them.

Also, in ReLU^2 networks, 90%+ parameters are zero.

> The die area argument here makes no sense.

Any logic you add to the GPU is physical silicon and metal that take up physical space.

> duplicating the multipliers with and without the support or you have a single general purpose multiplier that does both

That would be extra physical logic, which would be extra physical space on the die. "Can be done" isn't my point; my point is that doing it requires surface area.

I expect the degraded critical path will most likely be worse than a bit of die area. On modern processes you have A LOT of transistors to play with.
Thanks for the detailed explanation, I had no idea about any of this.