by azinman2 3689 days ago
Nvidia has to be general purpose. This is not, and thus can be better optimized.

"General purpose" isn't that general, if you look at the actual operations they support and their threading model. It's already fairly optimized for these sorts of operations, and this amount of claimed headroom makes me suspicious.
Google has a lot of potential options that NVidia doesn't have. They can size their cache hierarchy to the task at hand. They can partition their memory space. They can drop scatter/gather. They can gang ALUs into dataflows that they know are the majority of machine learning workloads. They can partition their register file at the ISA level or maybe even drop it entirely. They can drop the parts of the IEEE 754 floating point spec they don't need and they can size their numbers to the precision they need.
The fact that I can compile arbitrary programs for the GPGPU means it is general purpose. NVIDIA isn't writing softmax or backprop into silicon as a CPU instruction.

Look at how much faster ASICs for bitcoin mining are than the GPU... orders of magnitude.

"Backprop" isn't even close to something that would be a "CPU instruction", it's an entire class of algorithm. It's like saying "calculus" should be a CPU instruction. Matrix multiplication & other operations, on the other hand, do neatly decompose into such instructions, which have been implemented by NVidia et al., since that's the core set of functionality they've been pushing for like a decade now.
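To make the decomposition point concrete, here is a minimal sketch (plain Python, not anyone's actual kernel) of matrix multiplication written as nothing but repeated multiply-accumulate steps. The function name `matmul` and the list-of-rows layout are assumptions chosen for illustration; real hardware runs many of these multiply-adds in parallel.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows.

    Every output element is built from one repeated primitive:
    acc = acc + x * y, i.e. a multiply-accumulate.
    """
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # the single multiply-add primitive
            out[i][j] = acc
    return out
```

Something like `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` exercises only that one inner operation, which is why it maps onto instructions in a way that "backprop" as a whole does not.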

Additional die space on additional functionality might hurt the power envelope (which is where the focus on performance / watt rather than performance kicks in) but it doesn't make your chips slower per se.

That was my impression too. ML under the hood is a lot of linear algebra, not very different from most shaders. But maybe Google decided to hardcode a few important ML primitives because the ROI was that good in terms of grabbing customers. Also they might have very large scale applications not found elsewhere that motivate this.
OK, I was obviously oversimplifying, but since we can only speculate, my point is this: when you know the specific algorithms, math operations, memory layouts, and applications you want to optimize for, you can build dedicated chips that do exactly that, and do it quickly. That bitcoin miners are all dedicated chips and run circles around GPUs demonstrates exactly this.

Furthermore the fact that ML can be error tolerant means you also get to optimize certain floating point operations for speed or energy efficiency at the cost of accuracy. NVIDIA doesn't get to do this in their linear algebra support.
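A rough sketch of the precision-for-efficiency trade, assuming simple symmetric integer quantization (nothing here is known to be what the TPU actually does). The names `quantize` and `dot_quantized` and the 8-bit default are illustrative choices, not taken from any real library.

```python
def quantize(xs, bits=8):
    """Map floats to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    scale = (max(abs(x) for x in xs) / qmax) or 1.0  # avoid a zero scale
    return [round(x / scale) for x in xs], scale


def dot_quantized(xs, ys, bits=8):
    """Dot product done in integer arithmetic, rescaled once at the end."""
    qx, sx = quantize(xs, bits)
    qy, sy = quantize(ys, bits)
    acc = sum(a * b for a, b in zip(qx, qy))      # integer multiply-accumulate
    return acc * sx * sy                          # single floating-point rescale
```

The result differs slightly from the exact floating-point dot product. The error-tolerance argument is that many ML workloads can absorb that difference, which is what would let narrower, cheaper arithmetic units stand in for full floating point.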

Bitcoin mining is an extremely well-defined task compared to machine learning. It remains to be seen how general these TPUs are in practice - whether they will support the neural network architectures common two years from now.
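For a sense of how narrow the mining task is, here is a toy sketch of the core loop: hash a header plus a nonce with double SHA-256 until the result falls below a target. The function names, the 4-byte nonce layout, and the `difficulty_bits` parameter are simplifications for illustration, not the real Bitcoin block format.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, the fixed function that mining repeats."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header: bytes, difficulty_bits: int, max_nonce: int = 1_000_000):
    """Try nonces until the hash, read as an integer, is below the target.

    Returns (nonce, digest), or None if no nonce below max_nonce works.
    """
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        digest = double_sha256(header + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None
```

The whole workload is that one fixed function in a loop, so there is nothing to keep flexible. A neural network accelerator has to guess which operations will still matter in a couple of years.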
tbh I realized what you meant by the time I got to the end of my comment. I should have added a P.S.
If they balance compute to memory better than GPUs, you could definitely see a 10x. GPUs have large off-chip memory and small caches (like 256 KB). The cost of going to off-chip memory can be 1-2 orders of magnitude higher than going to on-chip memory. You can certainly fit 4+ MB on modern processors, but they likely bought designs from a company like Samsung, because designing high-performance, low-power memory cells is tricky. I'm surprised they were able to keep things a secret.
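A back-of-the-envelope way to see where a 10x could come from is a roofline-style bound: throughput is capped either by peak compute or by how fast memory can feed it. The function name and every number in the usage note below are made-up assumptions for illustration, not measurements of any GPU or TPU.

```python
def attainable_flops(peak_flops, mem_bandwidth, flops_per_byte):
    """Roofline-style upper bound on throughput, in FLOP/s.

    peak_flops:     peak compute rate (FLOP/s)
    mem_bandwidth:  rate at which operands can be delivered (bytes/s)
    flops_per_byte: arithmetic intensity of the workload
    """
    return min(peak_flops, mem_bandwidth * flops_per_byte)
```

With hypothetical figures of 10e12 FLOP/s peak and a workload doing 2 FLOPs per byte, feeding it at 300e9 bytes/s (off-chip) caps it at 600e9 FLOP/s, while feeding it at 3e12 bytes/s (on-chip) allows 6e12 FLOP/s. Same compute, ten times the delivered throughput, purely from keeping the working set on chip.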