
Well, that performance figure seems consistent with the memory bandwidth of that machine (and of its upcoming successor Gorgon Halo; Medusa Halo is projected to be faster) and even of DGX/RTX Spark. You'd get the same outcome on an Apple Silicon Mn Pro (not Max or Ultra) if there were one with enough memory capacity. It's likely possible to raise aggregate tok/s on Strix Halo or DGX/RTX Spark (not realistically on Apple Silicon, at least not on a single machine) by batching multiple inference flows together, but that's admittedly a bit fiddly to implement and not what you're interested in anyway.
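
For anyone who wants to sanity-check the bandwidth argument, here is a back-of-envelope sketch in Python. It assumes decode is purely memory-bound: every step streams the active weights once, plus some per-sequence KV-cache traffic, and compute is never the limit. The function name and the numbers in the usage lines are hypothetical placeholders, not measurements from any of the machines above.

```python
def decode_tokens_per_second(bandwidth_gb_s, weight_gb_per_step,
                             batch_size=1, kv_gb_per_sequence=0.0):
    """Rough upper bound on aggregate decode tok/s for a memory-bound model.

    bandwidth_gb_s      -- usable memory bandwidth in GB/s (assumed input)
    weight_gb_per_step  -- GB of weights read per decode step (active
                           parameters times bytes per parameter)
    batch_size          -- number of sequences decoded together
    kv_gb_per_sequence  -- GB of KV cache read per sequence per step
    """
    if bandwidth_gb_s <= 0 or weight_gb_per_step <= 0 or batch_size < 1:
        raise ValueError("bandwidth and weight size must be positive, batch_size >= 1")
    # Weights are read once per step and shared by the whole batch;
    # KV-cache reads scale with the number of sequences.
    gb_per_step = weight_gb_per_step + batch_size * kv_gb_per_sequence
    steps_per_second = bandwidth_gb_s / gb_per_step
    # Each step emits one token per sequence in the batch.
    return batch_size * steps_per_second


# Hypothetical inputs: 256 GB/s of bandwidth, 16 GB of weights read per step.
print(decode_tokens_per_second(256, 16))                # 16.0
print(decode_tokens_per_second(256, 16, batch_size=4))  # 64.0
```

The single-stream number is just bandwidth divided by bytes per token, which is why a faster memory bus is the only thing that moves it. The batched number grows because the weight reads are amortized across sequences; in practice the gain flattens once compute or KV-cache traffic takes over, which this sketch deliberately ignores.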

It seems that you'll want either top-of-the-line Apple Silicon (Max/Ultra) or cloud inference, which will always be super competitive if your focus is on low latency.