| HN Mirror

> There are plenty of compilers that place restrictions that you barely notice. gcc, clang, nvcc -- they're fairly flexible, and "dynamic"

In the context of scientific computing - this is completely, blatantly false. We're not lowering low-level IR to machine code. We want to perform certain mathematical processes often distributed on a large number of nodes. There's a difference between ensuring optimization (i.e no I/O bottlenecks, adequate synchronization between processes, overlapping computation with comms) vs. simply transforming a program to a different representation.

This is classic [false analogy](https://simple.wikipedia.org/wiki/False_analogy)

Adding constraints does mean that you give up on flexibility precisely because you have to work around them. For example, XLA is constrained intentionally against dynamic-loops because you lose a lot of performance and suffer a huge overhead. So the API forces you to think about it statically (like you can work around it with fancier methods like using checkpointing and leveraging a tree-verse algorithm)

I'll need more clarification regarding this point, because I don't know what dev in which universe will not regard "constraints" as flying against the face of flexibility.

> popular HuggingFace models + the TIMM vision benchmark

Ah yes, benchmark it on models that are entirely static LLMs or convnet-hybrids. Clearly, high requirement on dynamicness and flexibility there.

(I'm sorry but that statement alone has lost you any credibility for me.)

> Your claim that its to mainly favor large labs is pretty puzzling.

Because large labs often play with the safest models, which often involves scaling them up (OAI, FAIR, GDM etc.) and those tend to be self-attention/transformer like workloads. The devs have been pretty transparent about this - you can DM them if you want - but their entire stack is optimized for these usecases.

And ofcourse, that won't involve considering for research workloads which tend to be highly non-standard, dynamic and rather complex and much, much harder to optimize for.

This is where the "favouring big labs" comes from.

> 1. it supports dynamic shapes

I agree that in the specifically narrow respect of dynamic shapes, it's better than XLA.

But then it also misses a lot of the optimization features XLA has such as its new cost model and Latency Hiding Scheduler (LHS) stack which is far better at async overlapping of comms, computations and even IO (as its lazy).

> there is a trade-off between dynamic, flexible and performance

Exactly. Similarly, there's a difference in the features offered by each particular compiler. Torch's compiler's strengths may be XLA's weakness, and vice-versa.

But its not perfect - no software can be, and compilers certainly aren't exceptions. My issue is that the compiler is being considered at all in torch.

There are use-cases where the torch.compile stack fails completely (not sure how much you hang around more research-oriented forums) wherein there are some features that simply do not work with torch.compile. I cited FSDP as the more egregious one because its so common in everyone's workflow.

That's the problem. Torch is optimizing their compiler stack for certain workloads, with a lot of new features relying on them (look at newly proposed DTensor API for example).

If I'm a researcher with a non-standard workload, I should be able to enjoy those new features without relying on the compiler - because otherwise, it'd be painful for me to fix/restrict my code for that stack.

In short, I'm being bottlenecked by the compiler's capabilities preventing me to fully utilize all features. This is what I don't like. This is why torch should never be leaning at a compiler at all.

It 'looks' like a mere tradeoff, but reality is just not as simple as that.

> XLA:GPU

I don't particularly care if torch uses whatever compiler stack the devs choose - that's beside the point. Really, I just don't like the compiler-integrated approach at all. The choice of the specific stack doesn't matter.