by philipturner
1112 days ago
For bandwidth-bound problems like large language models, you could also get there with properly written CPU kernels (Mojo), using the AMX accelerators for the compute-intensive parts. I'd be more interested if they had a GPU port of Stable Diffusion, which is really compute-intensive. That's where the GPU has a major advantage over the CPU, on lower-end chips like M1/M2 and M1/M2 Pro.

> specialized for CPU

Mojo's SIMD execution model should map directly to GPUs. Instead of writing shaders while thinking about a single GPU thread, you access the assembly ISA directly and think about an entire warp/simdgroup. That's how I think when writing SIMD-group matrix kernels anyway.
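To put rough numbers on the bandwidth-bound vs. compute-bound distinction, here is a back-of-the-envelope sketch in Python. It is only an illustration: the function name `arithmetic_intensity`, the 4096 sizes, and the 2-byte (fp16) element size are my own assumptions, and a batched matrix multiply is used as a stand-in for a compute-heavy workload, not as a model of Stable Diffusion itself.

```python
def arithmetic_intensity(m, n, batch, bytes_per_element=2):
    """FLOPs per byte moved for an (m x n) weight matrix times an (n x batch) input.

    Hypothetical helper: assumes every operand is read or written exactly once
    and ignores caches, so it is an idealized estimate, not a measurement.
    """
    # One multiply and one add per weight, per input column.
    flops = 2 * m * n * batch
    # Weights + inputs + outputs, each touched once.
    bytes_moved = bytes_per_element * (m * n + n * batch + m * batch)
    return flops / bytes_moved


# Single-token LLM decoding looks like matrix-vector: batch of 1.
gemv = arithmetic_intensity(4096, 4096, batch=1)

# A large batched matmul reuses each weight many times.
gemm = arithmetic_intensity(4096, 4096, batch=4096)

print(f"matrix-vector: {gemv:.2f} FLOPs/byte")
print(f"matrix-matrix: {gemm:.2f} FLOPs/byte")
```

At about one FLOP per byte, the matrix-vector case is limited by how fast weights stream from memory, which the CPU and GPU share on these chips. The matrix-matrix case does over a thousand FLOPs per byte under the same assumptions, so raw ALU throughput is what matters there.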