It is great for single-user, single-session inference. It is not a replacement for a graphics accelerator that runs several concurrent inference sessions in parallel.
Basically the same tradeoff as a Mac mini with unified memory.
The RTX GPU laptops run very hot. Even though they are pound for pound better, they just run too hot for local LLM usage, for me at least. I prefer Macs for this. A lot of AMD cards also run cooler. I wonder if undervolting would help with smaller models and heat.
I mean the GB10 is pretty efficient for the power it has, but imho it is nowhere near the power efficiency of Apple Silicon (it was never intended to be a chip used for mobile devices). I guess this is kind of the move Apple made with the A12Z and the Mini but... the other way around?
I think it's going to be another failure, as we are used to seeing in the PC market these days.
It's probably more that LLM inference speed comes from having a large amount of fast RAM. And fast RAM is brutally expensive right now.
At this point, your cost-efficient options include used 3090s, "frankenrigs" using recycled data center cards, and a handful of "workstation" class cards, where the originally high margins and the long enterprise purchasing cycles have kept prices from going up too fast.
In contrast, a lot of these "personal" AI systems are basically a GPU-like core wired to larger amounts of slow RAM. Which is still semi-affordable. Generally speaking, they make for OK chatbots but extremely slow coding agents. Whereas you can run a modestly useful coding agent at reasonable speed on a 3090.
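To put a rough number on the fast-RAM point, here is a back-of-envelope sketch. It assumes a dense model where generating each token reads every weight once, so memory bandwidth divided by model size gives a ceiling on decode speed. It ignores KV cache traffic, compute limits, and prompt processing, so real numbers will be lower. The function name and the bandwidth figures are illustrative round numbers I picked, not spec-sheet values for any particular product.

```python
def decode_tokens_per_second_ceiling(
    bandwidth_gb_s: float,   # memory bandwidth in GB/s (assumed, illustrative)
    params_billion: float,   # model size in billions of parameters
    bytes_per_param: float,  # e.g. 2.0 for fp16, 0.5 for 4-bit quantization
) -> float:
    """Upper bound on decode speed for a dense model that is memory-bound.

    Assumption: every generated token streams all weights from RAM once.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical comparison: same 30B model at 4-bit on "fast" vs "slow" RAM.
for label, bandwidth in [("fast RAM (~900 GB/s)", 900.0),
                         ("slow RAM (~250 GB/s)", 250.0)]:
    ceiling = decode_tokens_per_second_ceiling(bandwidth, 30.0, 0.5)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

The catch is that the fast-RAM card has to actually fit the model. The slower systems win on capacity, not speed, which is the whole tradeoff.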
So yeah, a lot of these systems are a bit scammy. But not because it's a secret conspiracy to protect data center cards. Rather, there simply isn't enough fast RAM in the entire world. So they'll flog you disappointingly slow RAM instead.
TL;DR: Might be useful for some use cases, but benchmark very carefully.
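If you do benchmark, it helps to measure time to first token separately from decode speed, since a chatbot with short prompts and a coding agent with long prompts stress different things. A minimal sketch, assuming your runtime exposes some streaming call that yields tokens one at a time. The names `benchmark_stream` and `fake_generate` are hypothetical, and the fake generator is only a stand-in for whatever your stack provides.

```python
import time
from typing import Callable, Iterable


def benchmark_stream(
    generate: Callable[[str], Iterable[str]],  # hypothetical streaming API
    prompt: str,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Measure time to first token and steady-state decode rate."""
    start = clock()
    first = last = None
    tokens = 0
    for _ in generate(prompt):
        last = clock()
        if first is None:
            first = last  # prompt processing ends when the first token arrives
        tokens += 1
    if first is None:
        raise ValueError("generator produced no tokens")
    decode_time = last - first
    # Decode rate covers the tokens after the first one.
    rate = (tokens - 1) / decode_time if decode_time > 0 else float("nan")
    return {"tokens": tokens, "ttft_s": first - start, "decode_tok_s": rate}


def fake_generate(prompt: str) -> Iterable[str]:
    """Stand-in for a real model: slow start, then a steady token stream."""
    time.sleep(0.05)  # pretend prompt processing
    for word in ["this", "is", "a", "test"]:
        time.sleep(0.01)  # pretend per-token decode
        yield word


if __name__ == "__main__":
    print(benchmark_stream(fake_generate, "hello"))
```

Run it with prompts as long as the ones your real workload sends, not just a one-line question.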