by loser777 1392 days ago
The most striking thing about the architecture is that it appears so heterogeneous and complex. Considering the vast amount of software/machine learning engineering behind model/data/pipeline parallelism schemes like Megatron-LM and ZeRO (which target hardware topologies that seem almost simple by comparison) I'm curious what abstractions are in place to make this beast of an architecture friendly to programmers. Can you program a small tile in the same way you would a large tile and like you would in CUDA for a large/small GPU? Are there dedicated kernel teams that implement common blocks like multiheaded attention with the topology in mind so researchers/engineers doing modeling don't have to worry about scaling the model architecture in a hardware-friendly way? Do they have a monstrous fork of PyTorch with "Dojo" supported natively?

I'm not sure I would call the architecture very complex. It's about as simple as you can make a scale-out supercomputer. I assume they essentially do static positioning of the cluster for training jobs, and have a translation layer from the TensorFlow middle-end to their thing. Google did a similar thing with their TPUs, so it makes sense that they would have architected TF to accept exotic supercomputers as backends.
TensorFlow (and PyTorch, via PyTorch/XLA) converts your computation graph (constructed in Python) to XLA, which is then specialized to a specific hardware architecture. XLA is a good intermediate language, and in fact you can convert some memory movement in the graph to network calls, which lets you run on parallel systems (like a cluster of GPUs or TPUs with their own non-host-based networking).
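To make the idea concrete, here is a toy sketch in plain Python of the two steps described above: recording a Python function as a flat graph, then rewriting cross-device copies into network transfers. This is not XLA or TensorFlow code. `Graph`, `Tracer`, `trace`, `lower_copies_to_network`, the op name `net_transfer`, and the device labels `dev0`/`dev1` are all hypothetical names made up for illustration, and a real compiler does far more (shapes, layouts, scheduling).

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy stand-in for an intermediate representation: a flat list of ops."""
    ops: list = field(default_factory=list)

    def emit(self, name, inputs, device):
        # Each op gets a fresh result name such as %0, %1, ...
        out = f"%{len(self.ops)}"
        self.ops.append({"out": out, "name": name,
                         "inputs": list(inputs), "device": device})
        return out


class Tracer:
    """Records operations into a Graph instead of computing values."""

    def __init__(self, graph, ref, device):
        self.graph, self.ref, self.device = graph, ref, device

    def _binary(self, name, other):
        # Assumption: the op runs on the left operand's device.
        ref = self.graph.emit(name, [self.ref, other.ref], self.device)
        return Tracer(self.graph, ref, self.device)

    def __add__(self, other):
        return self._binary("add", other)

    def __mul__(self, other):
        return self._binary("mul", other)

    def to(self, device):
        # Explicit memory movement, recorded as a plain "copy" op.
        ref = self.graph.emit("copy", [self.ref], device)
        return Tracer(self.graph, ref, device)


def trace(fn, arg_devices):
    """Run fn on Tracer arguments and return the recorded Graph."""
    graph = Graph()
    args = [Tracer(graph, graph.emit("param", [], d), d) for d in arg_devices]
    fn(*args)
    return graph


def lower_copies_to_network(graph):
    """Rewrite copies that cross devices into hypothetical net_transfer ops."""
    device_of = {op["out"]: op["device"] for op in graph.ops}
    lowered = []
    for op in graph.ops:
        new_op = dict(op)
        if op["name"] == "copy":
            src = device_of[op["inputs"][0]]
            if src != op["device"]:
                new_op["name"] = "net_transfer"
                new_op["src"] = src
        lowered.append(new_op)
    return lowered


def model(x, w):
    h = x * w           # recorded on dev0
    h = h.to("dev1")    # memory movement across devices
    return h + h        # recorded on dev1


graph = trace(model, ["dev0", "dev0"])
lowered = lower_copies_to_network(graph)
names = [op["name"] for op in lowered]
for op in lowered:
    print(op["out"], "=", op["name"], op["inputs"], "on", op["device"])
```

The point of the sketch is only that the model code never mentions networking: the `copy` becomes a transfer in a later pass, which is roughly the division of labor between the person writing the model and the person writing the backend.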

It still requires many experts: both people to write the XLA-to-hardware translation and ML engineers who know how to write TF Python that executes quickly.

(note: Google has transitioned many projects to JAX, which also compiles to XLA, as TF ended up being a bit of a pig with wings)

> TF ended up being a bit of a pig with wings

Can you say more about this?

Everybody at Google wanted to add their specific feature to TF (gets visibility, users, research cred). Unfortunately, many different teams added multiple incompatible features that didn't compose. TF1 also had some serious problems where it was fundamentally designed around C++, without an understanding that most people wanted to work in Python. With TF2 a lot of stuff got redesigned, which broke many examples on the web. There are too many ways to parallelize your computation (along any of the dimensions), and they change too frequently.

But I think the writing was on the wall when the folks building ML Pathways hit some performance problems and realized that JAX made it much, much easier for them to express the computations they wanted and see them run quickly on TPUs (DeepMind had also concluded this). Once Jeff saw that JAX was making stuff that ran faster than TF (for his pet projects), that settled it.