by zozbot234
3 hours ago
> The cost of hardware still needs to dramatically drop for open-weight models to be viable for local usage. Even with the release of things like Nvidia DGX Spark and Ryzen AI Halo, you'd likely want a few of them to run agents in parallel.

It's more efficient to do the opposite on a constrained platform: run agents in parallel using a single model, then round-robin among models for cross-checking purposes. (The makers of local inference engines are dropping the ball by not making batched inference a first-class citizen of that workflow. It's not just useful for vLLM and SGLang.)
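To make the scheduling idea concrete, here's a rough Python sketch. Everything in it is an assumption for illustration: `generate_batch` is a hypothetical stand-in for whatever batched call a local engine exposes (it is not a real API of any engine), and the model names are placeholders. Each round, one model serves all agents in a single batch, and the next model in the rotation reviews the previous round's outputs.

```python
import asyncio


async def generate_batch(model: str, prompts: list[str]) -> list[str]:
    """Hypothetical stand-in for one batched inference call on `model`.

    A real engine would run all prompts through a single batched pass here.
    """
    await asyncio.sleep(0)  # placeholder for the actual inference work
    return [f"[{model}] {p}" for p in prompts]


async def run_round_robin(
    models: list[str], agent_prompts: list[str], rounds: int
) -> list[tuple[str, list[str]]]:
    """Rotate through models, one per round, batching all agents together."""
    history = []
    current = list(agent_prompts)
    for i in range(rounds):
        model = models[i % len(models)]  # round-robin model selection
        outputs = await generate_batch(model, current)  # all agents, one batch
        history.append((model, outputs))
        # The next model in the rotation cross-checks these outputs.
        current = [f"Review: {o}" for o in outputs]
    return history


def main() -> list[tuple[str, list[str]]]:
    # Placeholder model names and tasks, purely illustrative.
    return asyncio.run(
        run_round_robin(["model-a", "model-b"], ["task 1", "task 2", "task 3"], 3)
    )


if __name__ == "__main__":
    for model, outputs in main():
        print(model, outputs)
```

The point of the structure is that only one model needs to be active per round, so the constrained box swaps models between rounds instead of holding several at once.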