| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by smhx 673 days ago

the author got a couple of things wrong, that are worth pointing out:

1. PyTorch is going all-in on torch.compile -- Dynamo is the frontend, Inductor is the backend -- with a strong default Inductor codegen powered by OpenAI Triton (which now has CPU, NVIDIA GPU and AMD GPU backends). The author's view that PyTorch is building towards a multi-backend future isn't really where things are going. PyTorch supports extensibility of backends (including XLA), but there's disproportionate effort into the default path. torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature. torch.compile will get there (and we have reasonable measures that the compiler is on track to maturity).

2. PyTorch/XLA exists, mainly to drive a TPU backend for PyTorch, as Google gives no other real way to access the TPU. It's not great to try shoe-in XLA as a backend into PyTorch -- as XLA fundamentally doesn't have the flexibility that PyTorch supports by default (especially dynamic shapes). PyTorch on TPUs is unlikely to ever have the experience of JAX on TPUs, almost by definition.

3. JAX was developed at Google, not at Deepmind.

2 comments

n7g 673 days ago

Hey, thanks for actually engaging with the blog's points instead of "Google kills everything it touches" :)

1. I'm well aware of the PyTorch stack, but this point:

> PyTorch is building towards a multi-backend future isn't really where things are going

>PyTorch supports extensibility of backends (including XLA)

Is my problem. Those backends just never integrate well as I mentioned in the blogpost. I'm not sure if you've ever gone into the weeds, but there are so many (often undocumented) sharp edges when using different backends that they never really work well. For example, how bad Torch:XLA is and the nightmare inducing bugs & errors with it.

> torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature

That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work-around.

This is not dynamic, nor flexible - and it flies in the face of torch's core philosophies just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of being an independent, open-source entity.

2. Torch/XLA is indeed primarily meant for TPUs - like the quoted announcement, where they declare to be ditching TF:XLA in favour of OpenXLA. But there's still a very real effort to get it working on GPUs - infact, a lab on twitter declared that they're using Torch/XLA on GPUs and will soon™ release details.

XLA's GPU support is great, its compatible across different hardware, its optimized and mature. In short, its a great alternative to the often buggy torch.compile stack - if you fix the torch integration.

So I won't be surprised if in the long-term they lean on XLA. Whether that's a good direction or not is upto the devs to decide unfortunately - not the community.

3. Thank you for pointing that out. I'm not sure about the history of JAX (maybe might make for a good blogpost for JAX devs to write someday), but it seems that it was indeed developed at Google research, though also heavily supported + maintained by DeepMind.

Appreciate you giving the time to comment here though :)

link

smhx 671 days ago

If you're the author, unfortunately I have to say that the blog is not well-written -- misinformed about some of the claims and has a repugnant click-baity title. you're getting the attention and clicks, but probably losing a lot of trust among people. I didn't engage out of choice, but because of a duty to respond to FUD.

> > torch.compile is 2 years old, XLA is 7 years old. Compilers take a few years to mature

> That was one of my major points - I don't think leaning on torch.compile is the best idea. A compiler would inherently place restrictions that you have to work-around.

There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic". Adding constraints doesn't mean you have to give up on important flexibility.

> This is not dynamic, nor flexible - and it flies in the face of torch's core philosophies just so they can offer more performance to the big labs using PyTorch. For various reasons, I dislike pandering to the rich guy instead of being an independent, open-source entity.

I think this is an assumption you've made largely without evidence. I'm not entirely sure what your point is. The way torch.compile is measured for success publicly (even in the announcement blogpost and Conference Keynote, link https://pytorch.org/get-started/pytorch-2.0/ ) is by measuring on a bunch of popular PyTorch-based github repos in the wild + popular HuggingFace models + the TIMM vision benchmark. They're curated here https://github.com/pytorch/benchmark . Your claim that its to mainly favor large labs is pretty puzzling.

torch.compile is both dynamic and flexible because: 1. it supports dynamic shapes, 2. it allows incremental compilation (you dont need to compile the parts that you wish to keep in uncompilable python -- probably using random arbitrary python packages, etc.). there is a trade-off between dynamic, flexible and performance, i.e. more dynamic and flexible means we don't have enough information to extract better performance, but that's an acceptable trade-off when you need the flexibility to express your ideas more than you need the speed.

> XLA's GPU support is great, its compatible across different hardware, its optimized and mature. In short, its a great alternative to the often buggy torch.compile stack - if you fix the torch integration.

If you are an XLA maximalist, that's fine. I am not. There isn't evidence to prove out either opinions. PyTorch will never be nicely compatible with XLA until XLA has significant constraints that are incompatible with PyTorch's User Experience model. The PyTorch devs have given clear written-down feedback to the XLA project on what it takes for XLA+PyTorch to get better, and its been a few years and the XLA project prioritizes other things.

link

n7g 670 days ago

> There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic"

In the context of scientific computing - this is completely, blatantly false. We're not lowering low-level IR to machine code. We want to perform certain mathematical processes often distributed on a large number of nodes. There's a difference between ensuring optimization (i.e no I/O bottlenecks, adequate synchronization between processes, overlapping computation with comms) vs. simply transforming a program to a different representation.

This is classic [false analogy](https://simple.wikipedia.org/wiki/False_analogy)

Adding constraints does mean that you give up on flexibility precisely because you have to work around them. For example, XLA is constrained intentionally against dynamic-loops because you lose a lot of performance and suffer a huge overhead. So the API forces you to think about it statically (like you can work around it with fancier methods like using checkpointing and leveraging a tree-verse algorithm)

I'll need more clarification regarding this point, because I don't know what dev in which universe will not regard "constraints" as flying against the face of flexibility.

> popular HuggingFace models + the TIMM vision benchmark

Ah yes, benchmark it on models that are entirely static LLMs or convnet-hybrids. Clearly, high requirement on dynamicness and flexibility there.

(I'm sorry but that statement alone has lost you any credibility for me.)

> Your claim that its to mainly favor large labs is pretty puzzling.

Because large labs often play with the safest models, which often involves scaling them up (OAI, FAIR, GDM etc.) and those tend to be self-attention/transformer like workloads. The devs have been pretty transparent about this - you can DM them if you want - but their entire stack is optimized for these usecases.

And ofcourse, that won't involve considering for research workloads which tend to be highly non-standard, dynamic and rather complex and much, much harder to optimize for.

This is where the "favouring big labs" comes from.

> 1. it supports dynamic shapes

I agree that in the specifically narrow respect of dynamic shapes, it's better than XLA.

But then it also misses a lot of the optimization features XLA has such as its new cost model and Latency Hiding Scheduler (LHS) stack which is far better at async overlapping of comms, computations and even IO (as its lazy).

> there is a trade-off between dynamic, flexible and performance

Exactly. Similarly, there's a difference in the features offered by each particular compiler. Torch's compiler's strengths may be XLA's weakness, and vice-versa.

But its not perfect - no software can be, and compilers certainly aren't exceptions. My issue is that the compiler is being considered at all in torch.

There are use-cases where the torch.compile stack fails completely (not sure how much you hang around more research-oriented forums) wherein there are some features that simply do not work with torch.compile. I cited FSDP as the more egregious one because its so common in everyone's workflow.

That's the problem. Torch is optimizing their compiler stack for certain workloads, with a lot of new features relying on them (look at newly proposed DTensor API for example).

If I'm a researcher with a non-standard workload, I should be able to enjoy those new features without relying on the compiler - because otherwise, it'd be painful for me to fix/restrict my code for that stack.

In short, I'm being bottlenecked by the compiler's capabilities preventing me to fully utilize all features. This is what I don't like. This is why torch should never be leaning at a compiler at all.

It 'looks' like a mere tradeoff, but reality is just not as simple as that.

> XLA:GPU

I don't particularly care if torch uses whatever compiler stack the devs choose - that's beside the point. Really, I just don't like the compiler-integrated approach at all. The choice of the specific stack doesn't matter.

link

lunaticd 673 days ago

3. The project started under a Harvard affiliated Github org during the course of PhDs. These same people later joined Google where it continued to be developed and over time adopted more and more in place of TensorFlow.

link