by raphlinus
1581 days ago
Great, thanks. That answers my questions. I'll read up on the lazily allocated bit; I wasn't aware that it provided functionality similar to dispatchThreadsPerTile[1], but perhaps I'm misunderstanding something. I'm excited about that as a way to stitch 2D graphics rendering operations together without having to hit main memory, but from your explanation I can see that it might not be very useful in AI workloads. Amen on more control over dynamic memory access patterns. It's something I'm struggling with too, and I have a feeling that whatever solution I come up with is going to be a compromise. Keep up the good work, these are exciting times!

[1]: https://developer.apple.com/documentation/metal/mtlrendercom...
Our goal (though still WIP) is to have the interaction between user applications and our compiled code happen at the command buffer boundary: you would submit some work and pass in a VkSemaphore/MTLSharedEvent/cuEvent/futex/etc., we would use that when submitting our own work, and then we'd pass you back a VkSemaphore/etc. that you can continue chaining with. That's one level of granularity coarser than mid-pass interleaving, but hopefully still all pipelined properly with no host/device synchronization required. There will be programs that this doesn't work well with (heavily data-dependent stuff), but at least making it work turns it into an optimization problem rather than today's representation problem!
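
To make the chaining idea concrete, here is a rough sketch in Python. It is not the real API: `submit`, `run_chain`, and the thread pool standing in for a device queue are all hypothetical names chosen for illustration, and `threading.Event` only plays the role that a VkSemaphore or MTLSharedEvent would play on real hardware. The point it tries to show is that each submission takes an event to wait on and hands back an event to chain from, so the host blocks only once, at the very end.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def submit(queue, work, wait_on=None):
    """Enqueue `work` on `queue` (hypothetical stand-in for a device queue).

    The work first waits on `wait_on`, if given, and the returned event is
    signaled when the work finishes, so callers can keep chaining.
    """
    done = threading.Event()

    def run():
        if wait_on is not None:
            wait_on.wait()  # "device-side" wait; the submitting thread never blocks here
        work()
        done.set()          # signal the event handed back to the caller

    queue.submit(run)
    return done


def run_chain():
    """Chain app work -> compiled work -> app work using only events."""
    log = []
    # Three workers so that every submission can be in flight at once.
    with ThreadPoolExecutor(max_workers=3) as device:
        app_done = submit(device, lambda: log.append("app: produce input"))
        model_done = submit(device, lambda: log.append("compiled: run model"),
                            wait_on=app_done)
        final_done = submit(device, lambda: log.append("app: consume output"),
                            wait_on=model_done)
        final_done.wait()   # the only host-side wait in the whole chain
    return log


if __name__ == "__main__":
    for entry in run_chain():
        print(entry)
```

All three submissions are issued back to back without waiting on each other from the host; ordering comes entirely from the events. Heavily data-dependent programs are the awkward case for this shape, since the host would need to read a result before deciding what to submit next.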