by pclmulqdq 1392 days ago
I'm not sure I would call the architecture very complex. It's about as simple as you can make a scale-out supercomputer. I assume they essentially do static positioning of the cluster for training jobs, and have a translation layer from the TensorFlow middle-end to their thing. Google did a similar thing with their TPUs, so it makes sense that they would have architected TF to accept exotic supercomputers as backends.

TensorFlow (and PyTorch, via PyTorch/XLA) can convert your computation graph (constructed in Python) to XLA's intermediate representation, which is then specialized to a specific hardware architecture. XLA is a good intermediate language, and in fact you can convert some memory movement in the graph to network calls, allowing you to run on parallel systems (like a cluster of GPUs or TPUs with their own non-host-based networking).
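
To make the "Python graph to XLA" step concrete, here is a minimal sketch using JAX's ahead-of-time lowering API, assuming a reasonably recent JAX install. The function name `affine_relu` and the array shapes are made up for illustration; nothing here is specific to any particular vendor's backend.

```python
import jax
import jax.numpy as jnp


def affine_relu(w, x, b):
    # Ordinary Python code; JAX traces it into a computation graph.
    return jnp.maximum(jnp.dot(w, x) + b, 0.0)


# Illustrative inputs (hypothetical shapes).
w = jnp.ones((4, 3))
x = jnp.ones((3,))
b = jnp.zeros((4,))

# Trace and lower the function without executing it.
# The result holds the compiler IR that gets handed to XLA.
lowered = jax.jit(affine_relu).lower(w, x, b)
ir_text = lowered.as_text()

# Compile for whichever backend is available (CPU, GPU, or TPU) and run.
compiled = lowered.compile()
result = compiled(w, x, b)
```

Printing `ir_text` shows the hardware-independent form of the graph; the `compile()` call is where the specialization to a concrete backend happens.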

It still requires many experts, both people to write the XLA-to-hardware translation and ML engineers who know how to write TF Python that executes quickly.

(note: Google has transitioned many projects to JAX, which also compiles to XLA, as TF ended up being a bit of a pig with wings)

> TF ended up being a bit of a pig with wings

Can you say more about this?

Everybody at Google wanted to add their specific feature to TF (gets visibility, users, research cred). Unfortunately, many different teams added multiple incompatible features that didn't compose. TF1 also had some serious problems where it was fundamentally designed around C++, without an understanding that most people wanted to work in Python. With TF2 a lot of stuff got redesigned, making many examples on the web stop working. There are too many ways to parallelize your computation (along any of the dimensions), and they change too frequently.

But I think the writing was on the wall when the folks building ML Pathways hit some performance problems and realized that JAX made it much, much easier for them to express the computations they wanted and see them run quickly on TPUs (DeepMind had also concluded this). Once Jeff saw that JAX was making stuff that ran faster than TF (for his pet projects), that settled it.