| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by sReinwald 406 days ago

That’s a perfectly valid idea in theory, but in practice you’ll run into a few painful trade-offs, especially in multi-user environments. Trust me, I'm currently doing exactly that in our fairly limited exploration of how we can leverage local LLMs at work (SME).

Unless you have sufficient VRAM to keep all potential specialized models loaded simultaneously (which negates some of the "lightweight" benefit for the overall system), you'll be forced into model swapping. Constantly loading and unloading models to and from VRAM is a notoriously slow process.

If you have concurrent users with diverse needs (e.g., a developer requiring code generation and a marketing team member needing creative text), the system would have to swap models in and out if they can't co-exist in VRAM. This drastically increases latency before the selected model even begins processing the actual request.

The latency from model swapping directly translates to a poor user experience. Users, especially in an enterprise context, are unlikely to tolerate waiting for a minute or more just for the system to decide which model to use and then load it. This can quickly lead to dissatisfaction and abandonment.

This external routing mechanism is, in essence, an attempt to implement a sort of Mixture-of-Experts (MoE) architecture manually and at a much coarser grain. True MoE models (like the recently released Qwen3-30B-A3B, for instance) are designed from the ground up to handle this routing internally, often with shared parameter components and highly optimized switching mechanisms that minimize latency and resource contention.

To mitigate the latency from swapping, you'd be pressured to provision significantly more GPU resources (more cards, more VRAM) to keep a larger pool of specialized models active. This increases costs and complexity, potentially outweighing the benefits of specialization if a sufficiently capable generalist model (or a true MoE) could handle the workload with fewer resources. And a lot of those additional resources would likely sit idle for most of the time, too.

1 comments

gustofied 405 days ago

Have you looked into semantic router? It will be a faster way to look up the right model for the right task. I agree that using a llm for routing is not good, takes money, takes time, and can often take the wrong route.

sReinwald 405 days ago

Semantic router is on my radar, but I haven't had a good look at it yet. The primary bottleneck in our current setup, isn't really the routing decision time. The lightweight LLM I chose (Gemma3 4B) handles the task identification fairly well in terms of both speed and accuracy from what I've found.

For some context: this is a fairly limited exploratory deployment which runs alongside other priority projects for me, so I'm not too obsessed with optimizing the decision-making time. Those three seconds are relatively minor when compared with the 20–60 seconds it takes to unload the old and load a new model.

I can see semantic router being really useful in scenarios built around commercial, API-accessed models, though. There, it could yield significant cost savings by, for example, intelligently directing simpler queries to a less capable but cheaper model instead of the latest and greatest (and likely significantly more expensive) model users might feel drawn to. You're basically burning money if you let your employees use Claude 3.7 to format a .csv file.