|
|
|
|
|
by kouteiheika
166 days ago
|
|
> building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer It isn't, because it's a different architecture than the datacenter hardware. They're both called "Blackwell", but that's a lie[1] and you still need "real" datacenter Blackwell card for development work. (For example, you can't configure/tune vLLM on Spark, and then move it into a B200 and even expect it to work, etc.) [1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22 |
|
The nvfuser code doesn't even call it sm_100 vs. sm_120: NVIDIA's internal nomenclature seems to be 2CTA/1CTA, it's a bin. So there are less MMA tilings in the released ISA as of 13.1 / r85 44.
The mnemonic tcgen05.mma doesn't mean anything, it's lowered onto real SASS. FWIW the people I know doing their own drivers say the whole ISA is there, but it doesn't matter.
The family of mnemonics that hits the "Jensen Keynote" path is roughly here: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp....
10x path is hot today on Thor, Spark, 5090, 6000, and data center.
Getting it to trigger reliably on real tilings?
Well that's the game just now. :)
Edit: https://customer-1qh1li9jygphkssl.cloudflarestream.com/1795a...