by std_badalloc 1933 days ago
PyTorch is the most impressive piece of software engineering that I know of. So yeah, it's a nice interface for writing fast numerical code. And for zero effort you can change between running on CPUs, GPUs and TPUs. There's some compiler functionality in there for kernel fusing and more. Oh, and you can autodiff everything. There's just an incredible amount of complexity being hidden behind a very simple interface there, and it just continues to impress me how they've been able to get this so right.
5 comments

> and TPUs

BS. There's so much effort getting Pytorch working on TPUs, and at the end of it it's incredibly slow compared to what you have in Tensorflow. I hate this myth and wish it would die.

Old thread on this, detailing exactly why this is true: https://news.ycombinator.com/item?id=24721229

OTOH PyTorch seems to be highly explosive if you try to use it outside the mainstream use (i.e. neural networks). There's sadly no performant autodiff system for general purpose Python. Numba is fine for performance, but does not support autodiff. JAX aims to be sort of general purpose, but in practice it is quite explosive when doing something other than neural networks.

A lot of this is probably due to supporting CPUs and GPUs with the same interface. There are quite profound differences in how CPUs and GPUs are programmed, so the interface tends to restrict especially more "CPU-oriented" approaches.

I have nothing against supporting GPUs (although I think their use is overrated and most people would do fine with CPUs), but Python really needs a general purpose, high performance autodiff.
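For readers unsure what "general purpose autodiff" would mean here, the following is a minimal sketch of forward-mode differentiation with dual numbers in plain Python. It is an illustration only, not how PyTorch, JAX or Numba work internally; `Dual`, `derivative` and `piecewise_poly` are hypothetical names, and only `+` and `*` are supported. The point is that the differentiated function uses an ordinary Python loop and a data-dependent branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative (hypothetical helper type)."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input with derivative 1."""
    return f(Dual(float(x), 1.0)).dot


def piecewise_poly(x):
    """x**3, plus 2*x whenever x**3 > 1. Uses a loop and a branch."""
    acc = Dual(1.0)
    for _ in range(3):
        acc = acc * x
    if acc.val > 1.0:  # data-dependent control flow
        acc = acc + 2 * x
    return acc


print(derivative(piecewise_poly, 2.0))
print(derivative(piecewise_poly, 0.5))
```

This handles arbitrary control flow because it simply runs the Python code, but every operation goes through the interpreter, which is exactly the performance problem being described.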

> I have nothing against supporting GPUs (although I think their use is overrated and most people would do fine with CPUs), but Python really needs a general purpose, high performance autodiff.

As someone who works with machine learning models day-to-day (yes, some deep NNs, but also other stuff) - GPUs really seem unbeatable to me for anything gradient-optimization-of-matrices (i.e. like 80% of what I do) related. Even inference in a relatively simple image classification net takes an order of magnitude longer on CPU than GPU on the smallest dataset I'm working with.

Was this a comment about specific models that have a reputation as being more difficult to optimize on the GPU (like tree-based models - although Microsoft is working in this space)? Or am I genuinely missing some optimization techniques that might let me make more use of our CPU compute?

For gradient-optimization-of-matrices for sure. Just make sure that you don't use gradient-optimization-of-matrices just because it runs well on GPUs. There may well be more efficient approaches to your problem that are infeasible on the GPU's wide SIMD architecture, and you may miss them if you tie yourself to GPUs.

In general it's more that some specific models are easy for GPUs. Most models probably are not.
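A toy cost model of the lockstep execution behind the "wide SIMD" remark may help. It is a simplified sketch, not a description of any particular GPU: the function name `warp_cost`, the lane count and the cycle costs are all made up for illustration. The assumption is that the lanes in a group share one instruction stream, so when they disagree on a branch the group walks both sides in turn, with the inactive lanes masked off.

```python
def warp_cost(conditions, cost_then, cost_else):
    """Cycles one lockstep group spends on an if/else (toy model).

    conditions: one boolean per lane, the branch condition for that lane.
    The group pays for a side of the branch if at least one lane takes it.
    """
    runs_then = any(conditions)
    runs_else = not all(conditions)
    return cost_then * runs_then + cost_else * runs_else


uniform = [True] * 32              # every lane takes the cheap side
divergent = [True] * 31 + [False]  # a single lane takes the expensive side

print(warp_cost(uniform, cost_then=10, cost_else=50))
print(warp_cost(divergent, cost_then=10, cost_else=50))
```

Under this model one stray lane makes the whole group pay for both sides, which is why heavily branching algorithms tend to map poorly onto this kind of hardware.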

I really don't understand the "GPUs are overrated" comment. As someone who uses Pytorch a lot and GPU compute almost every day, there is an order of magnitude difference in the speeds involved for most common CUDA / OpenCL accelerated computations.

Pytorch makes it pretty easy to get large GPU accelerated speed-ups with a lot of code we used to traditionally limit to Numpy. And this is for things that have nothing to do with neural-networks.

For a lot of cases you don't really need that much performance. Modern processors are plenty fast. It seems that the current push to use GPUs also pushes people towards GPU-oriented solutions, such as using huge NNs for more or less anything, while other approaches would in many cases be orders of magnitude more efficient and more robust.

GPUs (or "wide SIMDs" more generally) have quite profound limitations. Branching is very limited, recursion is more or less impossible and parallelism is possible only for identical operations. This makes for example many recursion-based time-series methods (e.g. Bayesian filtering) very tricky or practically impossible. From what I gather, running recurrent networks is also tricky and/or hacky on GPU.

GPUs are great for some quite specific, yet quite generally applicable, solutions, like tensor operations etc. But being tied to GPUs' inherent limitations also limits the space of approaches that are feasible to use. And in the long run this can stunt the development of different approaches.
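To make the recursion point concrete, here is a minimal sketch of a one-dimensional Bayesian (Kalman) filter in plain Python. It assumes a random-walk state model with Gaussian noise; the function name `kalman_1d` and its parameters are hypothetical, chosen for illustration. Each iteration reads the `mean` and `var` produced by the previous one, so as written the time steps run strictly one after another.

```python
def kalman_1d(measurements, process_var, meas_var, mean0=0.0, var0=1.0):
    """Scalar Kalman filter for a random-walk state (illustrative sketch).

    Returns the filtered state estimate after each measurement.
    """
    mean, var = mean0, var0
    estimates = []
    for z in measurements:
        # Predict: the state may have drifted, so uncertainty grows.
        var = var + process_var
        # Update: blend prediction and measurement by their certainty.
        gain = var / (var + meas_var)
        mean = mean + gain * (z - mean)
        var = (1.0 - gain) * var
        estimates.append(mean)
    return estimates


print(kalman_1d([1.0, 1.0, 1.0], process_var=0.0, meas_var=1.0))
```

The arithmetic per step is tiny; the cost is in the chain of dependencies across steps, which is the part that wide data-parallel hardware does not help with directly.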

> For a lot of cases you don't really need that much performance. Modern processors are plenty fast. It seems that the current push to use GPUs also pushes people towards GPU-oriented solutions, such as using huge NNs for more or less anything, while other approaches would in many cases be orders of magnitude more efficient and more robust.

for instance?

I still don't get the criticism of Pytorch. If anything, you can get the best of both worlds in many ways, with their API supporting GPU and CPU operations in exactly the same way.
What do you mean by “seems to be highly explosive”? I have used Pytorch to model many non-dnn things and have not experienced highly explosive behavior. (Could be that I have become too familiar with common footguns though)
I get what you mean by the "GPUs are overrated" comment, which is that they're thought of as essential in many cases when they're probably not, but in many domains like NLP, GPUs are a hard requirement for getting anything done.
Have you tried using Enzyme* on Numba IR?

* https://enzyme.mit.edu

Wait wat, JAX and also PyTorch are used in a lot more areas than NNs. JAX is even considered to do better in that department in terms of performance than all of Julia, so wat are u talking about
GP makes a fair point about JAX still requiring a limited subset of Python though (mostly control flow stuff). Also, there's really no in-library way to add new kernels. This doesn't matter for most ML people but is absolutely important in other domains. So Numba/Julia/Fortran are "better in that department in terms of performance" than JAX because the latter doesn't even support said functionality.
> JAX is even considered to do better in that department in terms of performance than all of Julia, so wat are u talking about

Please provide sources for this claim

> There's sadly no performant autodiff system for general purpose Python.

Like there is for general purpose Julia code? (https://github.com/FluxML/Zygote.jl)

> I have nothing against supporting GPUs (although I think their use is overrated and most people would do fine with CPUs),

Do you run much machine learning code? All those matrix multiplications run a good bit faster on the GPU.

> Oh, and you can autodiff everything.

Well, not everything. Julia's Zygote AD system can autodiff most Julia code (currently with the exception of code that mutates arrays/matrices).

and you didn't even talk about data and model parallelism, which often just work out of the box
It's Python wrappers on top of the existing ThTensor library, which was already provided by Torch. But yes, great engineering nonetheless.
I don't think this is a particularly accurate description of PyTorch in 2021. Yeah, the original C++ backend came from Torch, but I think most of that has been replaced. AFAIK, all the development of the C++ backend for PyTorch over the last several years has been done as part of the PyTorch project. It's not just Python wrappers at this point.
What I like about PyTorch is that most of the functionality is actually available through the C++ API as well, which has 'beta API stability' as they call it. So, there are good bindings for some other languages as well. E.g., I have been using the Rust bindings in a larger project [1], and they have been awesome. A precursor to the project was implemented using Tensorflow, which was a world of pain.

Even things like mixed-precision training are fairly easy to do through the API.

[1] https://github.com/tensordot/syntaxdot