by loser777
1392 days ago
The most striking thing about the architecture is how heterogeneous and complex it appears. Considering the vast amount of software and machine learning engineering behind model/data/pipeline parallelism schemes like Megatron-LM and ZeRO (which target hardware topologies that seem almost simple by comparison), I'm curious what abstractions are in place to make this beast of an architecture friendly to programmers. Can you program a small tile the same way you would a large tile, much as you would write CUDA for either a small or a large GPU? Are there dedicated kernel teams that implement common blocks like multi-head attention with the topology in mind, so that researchers and engineers doing modeling don't have to worry about scaling the model architecture in a hardware-friendly way? Do they have a monstrous fork of PyTorch with "Dojo" supported natively?