by riku_iki 743 days ago
> I hate the industry for never learning how to get along and make a real CUDA competitor.

There are Google TPUs. Do they provide better performance per dollar, or does Google also charge a high margin, or is Nvidia doing some unique optimizations?

3 comments

The TPU is basically an ASIC as far as I know; it competes against CUDA in a very small subset of its feature set. CUDA is essentially a composition layer on top of multiple GPU features that optimizes them for general-purpose compute. In essence, nothing is stopping Apple or Google from making an open-source CUDA replacement and undermining the demand for specialized GPGPU compute. The problem is that CUDA is massive, and nobody wants to re-implement it (especially not for free).

So now Nvidia is in the privileged position of having both highly flexible GPGPU compute hardware and a highly advanced software layer to use it with. TPUs and NPUs are neat, but fundamentally they are neither of these things; they have an extremely limited processing pipeline exposed by a high-level library, and that's usually it. CUDA is comparatively flexible, to the point that it doesn't even rely on AI to sell its product.

To me, hating on Nvidia feels like being mad that a well-bred horse with great odds beat out the one you were betting on. Why should we hate them, for their "monopoly" on features that Apple and Khronos gave up developing? Because they're blocking out their competitors by... not having working macOS drivers, per Apple's request? This is the natural and obvious outcome of letting businesses commoditize specialized compute. This is what the industry wanted, and it's rich watching the customers protest like they were fooled into thinking everything was fine.

> The TPU is basically an ASIC as far as I know; it competes against CUDA in a very small subset of its feature set. CUDA is essentially a composition layer on top of multiple GPU features that optimizes them for general-purpose compute.

My understanding is that compilers can compile some straightforward JAX, TF, and PyTorch programs to both CUDA and TPU, so they are in direct competition on the current hot topics (LLMs, deep learning).
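To make that concrete, here is a minimal sketch of the kind of "straightforward" program I mean. The function name, shapes, and values are made up for illustration; the point is only that nothing in the source names a device, and `jax.jit` hands the function to XLA, which compiles it for whatever backend the installed JAX build reports (CPU, an Nvidia GPU, or a TPU).

```python
import jax
import jax.numpy as jnp


@jax.jit  # traced once, then compiled by XLA for the active backend
def dense_layer(x, w, b):
    # A plain matmul + bias + ReLU; no device-specific code anywhere.
    return jax.nn.relu(x @ w + b)


# Hypothetical toy inputs, chosen so the result is easy to check by hand.
x = jnp.ones((4, 8))
w = jnp.full((8, 3), 0.5)
b = jnp.zeros((3,))

out = dense_layer(x, w, b)

# Reports "cpu", "gpu", or "tpu" depending on the machine and JAX install.
print("backend:", jax.default_backend())
print("output shape:", out.shape)
```

The same file runs unchanged on each backend; only the installed JAX build and the hardware differ. That says nothing about custom kernels or less common ops, which is where portability tends to break down.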

Right; but you can't cross-compile everything. This is really common in AI libraries, especially multi-target projects like ONNX: https://onnx.ai/

The math probably adds up in Google's favor with the TPUs, even if they end up being less efficient and slower per-unit than Nvidia hardware. They don't need to pay for the margins, and they can run them 24/7 for their intended purpose. The previous-generation TPUs can't be reused or resold for other purposes though, and if/when AI blows over as a trend you probably can't easily start mining crypto or doing HPC calculations like an Nvidia cluster would.

Our company can buy NVIDIA gear. A Google TPU, well, is Google's property that I have no control over. Cloud folks are overrepresented on this site. In reality ~90% of workloads are NOT cloud based. Unless one can buy Google/Amazon/Meta T/N/AI/PUs, those unique optimizations are irrelevant for most workloads.

Why 90%? Why don't you go to the cloud for training at least, if it's not healthcare data?

Is it because you don't need to buy many gpus to do your workload?

The reasons why people don't go to the cloud have been outlined many times; e.g. https://world.hey.com/dhh/we-have-left-the-cloud-251760fb has been discussed on HN repeatedly.

I could have written almost the same reasons for GPU workloads.

Who would go with Google for anything? No support, and the product gets cancelled when they get distracted by something else.