
For bandwidth-bound problems like large language models, you could also solve it with properly written CPU kernels (e.g. in Mojo), using the AMX accelerators for the compute-intensive parts.

I'd be more interested if they had a GPU port of Stable Diffusion, which is really compute-intensive. That's where the GPU has a major advantage over the CPU on lower-end chips like the M1/M2 and M1/M2 Pro.
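
To make the bandwidth-bound vs. compute-intensive distinction above concrete, here is a rough sketch of arithmetic intensity (FLOPs per byte of memory traffic). It assumes fp16 elements, square n x n matrices, and ideal traffic where each operand is moved exactly once. The function names are made up for illustration, and treating LLM token generation as matrix-vector work and Stable Diffusion as matrix-matrix work is a simplification.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def matvec_intensity(n, bytes_per_elem=2):
    # y = W @ x with an n x n weight matrix: 2*n*n FLOPs.
    # Traffic: read W once (n*n elements), read x and write y (n each).
    flops = 2 * n * n
    bytes_moved = bytes_per_elem * (n * n + 2 * n)
    return arithmetic_intensity(flops, bytes_moved)


def matmul_intensity(n, bytes_per_elem=2):
    # C = A @ B with n x n matrices: 2*n**3 FLOPs.
    # Ideal traffic: read A and B once, write C once (3*n*n elements).
    flops = 2 * n ** 3
    bytes_moved = bytes_per_elem * 3 * n * n
    return arithmetic_intensity(flops, bytes_moved)


if __name__ == "__main__":
    n = 4096  # assumed layer size, for illustration only
    print("matvec FLOPs/byte:", matvec_intensity(n))
    print("matmul FLOPs/byte:", matmul_intensity(n))
```

Under these assumptions the matrix-vector case stays just under 1 FLOP per byte no matter how large n gets, so memory bandwidth is the limit. The matrix-matrix case grows as n/3, so raw compute throughput becomes the limit, which is where the GPU's extra ALUs pay off.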

> specialized for CPU

Mojo's SIMD execution model should map directly to GPUs. Instead of writing shaders thinking about a single GPU thread, you access the assembly ISA directly, thinking about an entire warp/simdgroup. That's how I think when writing SIMD-group matrix kernels anyway.
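
As a rough illustration of thinking about a whole warp/simdgroup at once, here is a plain-Python model of a butterfly sum reduction. It is not Mojo or shader code. The 32-lane width and the helper names `shuffle_xor` and `simd_sum` are assumptions for the sketch. On real hardware the shuffle step would be a single instruction, exposed as `simd_shuffle_xor` in Metal Shading Language or `__shfl_xor_sync` in CUDA.

```python
SIMD_WIDTH = 32  # assumed number of lanes per warp/simdgroup


def shuffle_xor(lanes, mask):
    # Every lane i reads the value currently held by lane i ^ mask.
    return [lanes[i ^ mask] for i in range(len(lanes))]


def simd_sum(lanes):
    # Butterfly reduction: log2(width) steps, with all lanes active
    # in every step. Afterwards every lane holds the full sum.
    width = len(lanes)
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    offset = width // 2
    while offset > 0:
        partner = shuffle_xor(lanes, offset)
        lanes = [a + b for a, b in zip(lanes, partner)]
        offset //= 2
    return lanes


if __name__ == "__main__":
    values = list(range(SIMD_WIDTH))  # lane i starts with the value i
    print(simd_sum(values)[0])
```

The point is that the whole register of 32 values is the unit of work: each step is one operation on the entire group, rather than a branch in per-thread code.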