by usernew 1109 days ago
Is there really a market for a response though? Now, I'll be honest that I know very little about this market. What I do know, from doing a decade of presales before COVID hit, is that people who buy GPUs go for maximum aggregate throughput across a big node farm. Now, most of my clients who bought GPU-heavy scale-out nodes were in the financial industry, so maybe deep learning stuff is different. Their workloads were massively parallel, and could scale out instead of needing something singularly fast.

So I guess my question is: what use case is there for a huge truck that goes 200mph and takes 4 trips, when you could just buy 16 regular trucks and move your apartment in the same amount of time at half the cost?

4 comments

The exciting thing about CXL is we can start to find out if peripheral or hopefully close-networked computing fabrics can be useful & interesting, beyond the small circumstances Nvidia will offer. Having an ecosystem that everyone can participate in will let us explore. Money can't buy that. Talent can't buy that. You need to socialize to really find out the possible values.

'The street finds its own uses for things' is the well-known Gibson adage, and typically it's a comment aimed low. But our entire era of amazing computing began with the Gang of Nine enabling lowness to such a degree that it quickly became the highest tech, the best. Sure, you can still buy a mainframe & they have amazing feats, but it's not where the value is, and the value is where it is because possibility was unchained, unleashed from corporate dominion, and spread wide. I think we can find amazing new futures with CXL & mad bandwidth connectivity.

The reason that analogy falls short is that it's easier to drive the huge truck at 200mph than it is to find 16 truck drivers. It's really neat when you figure out how to map/reduce your algorithm so you can parallelize it, but it would be even easier if you didn't have to. And that's assuming it is even parallelizable in the first place. Not all algorithms can be optimized like that, and those need a bigger system to run on.

There are workloads that are data parallel, and scale like the GPU-heavy scale-out nodes that you describe.

The other approach, which you do when models themselves are massive, is model parallelism. You split it into multiple parts that run on different nodes.

In both cases, you need to distribute weight updates through the network although the traffic patterns can be different.

To maximize the performance in both scenarios, systems designers optimize for all-reduce and bisection bandwidth.
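
To make the all-reduce part concrete, here is a toy single-process simulation of a ring all-reduce, the kind of collective used to sum gradients across data-parallel workers. This is only a sketch under my own assumptions: the function name `ring_all_reduce` and the "one number per chunk" layout are made up for illustration, and real systems run this over the interconnect through a communication library rather than Python lists.

```python
def ring_all_reduce(worker_chunks):
    """Simulate a ring all-reduce (sum) in one process.

    worker_chunks[w][c] is chunk c of worker w's gradient. For simplicity
    (an assumption of this sketch) there are exactly as many chunks as
    workers, and each chunk is a single number.
    """
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]  # don't mutate the input

    # Phase 1, reduce-scatter: in n-1 steps each worker passes one chunk to
    # its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        # Collect all sends first so every worker "sends" simultaneously.
        sends = [(w, (w - step) % n, data[w][(w - step) % n]) for w in range(n)]
        for w, c, value in sends:
            data[(w + 1) % n][c] += value
    # Now worker w holds the complete sum for chunk (w + 1) % n.

    # Phase 2, all-gather: in n-1 more steps the completed chunks travel
    # around the ring, overwriting the stale partial sums.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n]) for w in range(n)]
        for w, c, value in sends:
            data[(w + 1) % n][c] = value

    return data


# Three workers, each with a three-chunk gradient.
grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_all_reduce(grads)
```

Each worker only ever talks to its neighbour, and every link carries one chunk per step, which is why the achievable speed is tied so closely to the bandwidth of the fabric underneath.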

There are also other tricks, for example the TPUv4 ICI network is optically switched, and it is configured when a workload starts to maximize bandwidth for the requested topology ("twisting the torus" in the published paper).

One use case: using something like Stable Diffusion to generate all the frames of a video at once, as a single image. For that kind of usage you need enough RAM for the whole image. This setup could generate videos like that in the same time it takes me to generate an image on my home computer.
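
As a rough back-of-the-envelope sketch of why the memory adds up: the raw buffer for the combined image grows linearly with the number of frames. The function name, the 512x512 frame size, and the 4 bytes per value below are my own illustrative assumptions, and this counts only the pixel buffer, not the model's weights or intermediate activations, which would dominate in practice.

```python
def combined_image_bytes(frames, width, height, channels=3, bytes_per_value=4):
    """Bytes needed to hold `frames` frames laid out as one big image.

    Counts only the raw buffer (an assumption of this sketch); a real
    diffusion run also needs memory for weights and activations.
    """
    return frames * width * height * channels * bytes_per_value


one_frame = combined_image_bytes(1, 512, 512)
sixteen_frames = combined_image_bytes(16, 512, 512)
print(one_frame / 2**20, "MiB for one frame")
print(sixteen_frames / 2**20, "MiB for sixteen frames")
```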