| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by brrrrrm 1327 days ago

> tread off the beaten path things get slow

Yea, makes sense. I think there's something to be said for dynamic compilation solving this problem more elegantly than providing tons of hand-tuned kernels (PyTorch is 890MB lmao https://pypi.org/project/torch/#files), but I don't think it's a strict reason for a performance win.

> change the loop order too

Memory layout as well! I'm 100% for dynamic compilation, but I'm claiming that it really finds its stride when you fuse things.

1 comments

georgehotz 1327 days ago

Agreed. For anything at all common, most of the gains will be from fusion, the rest is just free. PyTorch also uses tons of GPU memory after only initializing, I wonder if it's copying all the kernels in?

link

terafo 1327 days ago

Jax preallocates 90% of available GPU memory when first operation is run to minimize allocation overhead. Can PyTorch grab that VRAM for a similar reason?

link

zorgmonkey 1327 days ago

Yes PyTorch uses what they call a caching memory allocator[0], basically seems like are allocating a very chunk of GPU memory and implementing a heap with it. If needed they expose some knobs and functions to allow you to control it and observe the memory usage.

[0]: https://pytorch.org/docs/stable/notes/cuda.html#memory-manag...

link