They're abysmal compared to anything dedicated at any reasonable batch size, because of both bandwidth and compute. Not sure why you're wording this like it disagrees with what I said.
I've run inference workloads on a GH200, which is an entire H100 attached to an ARM processor, and the moment offloading is involved, speeds tank to Mac Mini-like levels. The Mac Mini is similarly mostly a toy when it comes to AI.
Again, prompt processing isn't the major problem here. It's bandwidth. 256 GB/s of memory bandwidth (maybe ~210 GB/s in the real world) caps tokens per second well before prompt processing does.
Not entirely sure how your ARM statement matters here. This is unified memory.
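A rough back-of-envelope for the bandwidth point, as a sketch only: at batch size 1, each generated token has to stream roughly the whole set of weights from memory once, so bandwidth divided by model size gives a ceiling on tokens per second. The 40 GB model footprint below is a made-up example value, not something from this thread, and the function name is hypothetical. It also ignores KV cache reads and assumes a dense model, so real numbers would be lower.

```python
def max_decode_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Ceiling on batch-1 decode speed if every token streams all weights once.

    Simplified model: ignores KV cache traffic, compute time, and any
    sparsity (e.g. MoE), so it is an upper bound, not a prediction.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumption: hypothetical 40 GB of weights (not a figure from the thread).
MODEL_SIZE_GB = 40.0

# 256 GB/s is the quoted spec; ~210 GB/s is the rough real-world guess.
for label, bandwidth in (("spec", 256.0), ("real-world guess", 210.0)):
    ceiling = max_decode_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{label}: at most {ceiling:.2f} tok/s")
```

Under those assumed numbers the ceiling lands at single-digit tokens per second no matter how fast prompt processing is, which is the point about bandwidth being the limit.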