| HN Mirror



	by dagenix 2900 days ago
	That's not really accurate. Even in cases where 32-bit and 64-bit operations are equally fast on the CPU, 32-bit values still take up half the memory. For many workloads, the limiting factor is cache space. So if you can use 32-bit values, you can get much better performance for those workloads.
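As a rough illustration of the cache-space point, here is a minimal Python sketch that counts how many 32-bit versus 64-bit integers fit in a cache of a given size. The 32 KiB cache size and the helper name `values_per_cache` are assumptions made for the example, not figures from the comment.

```python
import struct

# Standard (platform-independent) sizes in bytes for fixed-width integers.
INT32_BYTES = struct.calcsize("<i")  # 32-bit signed integer
INT64_BYTES = struct.calcsize("<q")  # 64-bit signed integer


def values_per_cache(cache_bytes: int, value_bytes: int) -> int:
    """Return how many values of the given width fit in cache_bytes."""
    return cache_bytes // value_bytes


# Hypothetical 32 KiB L1 data cache; the size is an assumption for illustration.
L1_BYTES = 32 * 1024

n32 = values_per_cache(L1_BYTES, INT32_BYTES)
n64 = values_per_cache(L1_BYTES, INT64_BYTES)
print(n32, n64)
```

Under these assumptions, twice as many 32-bit values stay resident in the same cache, which is where the benefit for cache-bound workloads would come from.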


stochastic_monk 2900 days ago

And if you’re doing heavy floating-point work, you can fit twice as many elements into a 32-bit float vector as into an equally sized double vector, and the vectorized operations run roughly as fast for both forms, yielding an approximate doubling of speed.
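The lane arithmetic behind that claim can be sketched as follows. The 256-bit register width and the helper name `lanes` are assumptions for illustration; actual widths depend on the hardware.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one vector register."""
    return register_bits // element_bits


# Assumed 256-bit vector register; this width is illustrative only.
REGISTER_BITS = 256

f32_lanes = lanes(REGISTER_BITS, 32)  # single-precision floats per register
f64_lanes = lanes(REGISTER_BITS, 64)  # double-precision floats per register
print(f32_lanes, f64_lanes)
```

If one vector instruction takes about the same time for either element type, processing twice as many lanes per instruction gives roughly the doubling described above.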


dnautics 2900 days ago

For rank-2 tensor work you can do 4x as many operations; for rank-3 tensor work it's 8x, assuming memory bandwidth is the bottleneck.


stochastic_monk 2900 days ago

Does that mean it’s 64x as fast for 16-bit floating point vs 64-bit for a rank 3 tensor?


dnautics 2900 days ago

Assuming 1) memory bandwidth is the bottleneck and 2) you can keep the tensor values in cache or registers.

I think that GPUs are still vector processing engines, so they should scale with 4x... But assuming Google architected the TPU correctly, it should be 16x as fast (I think the architecture is actually that of a rank-2 tensor).
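The figures quoted in this exchange (4x, 8x, 64x, 16x) all follow from one rule of thumb: the width ratio raised to the tensor rank. The sketch below only reproduces that arithmetic; it is not a measured or verified performance model, and the function name `claimed_speedup` is hypothetical.

```python
def claimed_speedup(wide_bits: int, narrow_bits: int, rank: int) -> int:
    """Width ratio raised to the tensor rank, as argued in the thread.

    Assumes memory bandwidth is the bottleneck and the values stay in
    cache or registers; real hardware may behave differently.
    """
    ratio = wide_bits // narrow_bits
    return ratio ** rank


print(claimed_speedup(64, 32, 2))  # rank-2, 64-bit vs 32-bit
print(claimed_speedup(64, 32, 3))  # rank-3, 64-bit vs 32-bit
print(claimed_speedup(64, 16, 3))  # rank-3, 64-bit vs 16-bit
print(claimed_speedup(64, 16, 1))  # vector (rank-1) engine, 64-bit vs 16-bit
print(claimed_speedup(64, 16, 2))  # rank-2 engine, 64-bit vs 16-bit
```

The last two calls correspond to the GPU-as-vector-engine and rank-2 TPU guesses in the comment above.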
