by zozbot234 3 hours ago
> The cost of hardware still needs to dramatically drop for open-weight models to be viable for local usage. Even with the release of things like Nvidia DGX Spark and Ryzen AI Halo, you'd likely want a few of them to run agents in parallel.

It's more efficient to do the opposite on a constrained platform: run the agents in parallel against a single model, then round-robin among models for cross-checking. (The makers of local inference engines are dropping the ball by not making batched inference a first-class citizen of that workflow. Batching isn't useful only in vLLM and SGLang.)
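
To make the ordering concrete, here's a rough Python sketch of what I mean. Everything in it is hypothetical: `run_round_robin`, `disagreements` and `make_toy_loader` are names I made up, and a "model" is just a callable that takes a batch of prompts and returns one reply per prompt. It isn't the API of any real inference engine, only the scheduling idea: one model resident at a time, all agents batched through it, then on to the next model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a loaded model: a batch of prompts in,
# one reply per prompt out. A real engine would do batched decoding here.
BatchModel = Callable[[List[str]], List[str]]
Loader = Callable[[], BatchModel]


def run_round_robin(loaders: Dict[str, Loader], prompts: List[str]) -> Dict[str, List[str]]:
    """Visit each model in turn; run every agent prompt through it as one batch."""
    results: Dict[str, List[str]] = {}
    for name, load in loaders.items():
        model = load()                  # only one model is resident at a time
        results[name] = model(prompts)  # all agents share a single batched call
        del model                       # drop the reference before loading the next one
    return results


def disagreements(results: Dict[str, List[str]]) -> List[int]:
    """Return the indices of prompts where the models did not all agree."""
    n = min((len(out) for out in results.values()), default=0)
    return [i for i in range(n) if len({out[i] for out in results.values()}) > 1]


def make_toy_loader(transform: Callable[[str], str]) -> Loader:
    """Toy loader (assumption, for illustration): the 'model' just maps a function over the batch."""
    def load() -> BatchModel:
        return lambda batch: [transform(p) for p in batch]
    return load
```

The point of the shape is that the expensive step (swapping weights) happens once per model rather than once per agent, and the batch dimension is where the parallelism comes from. Whether the cross-check is exact-match like here or something fuzzier is a separate question.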