by smarterclayton
394 days ago
llm-d is intended to be three clean layers:

1. Balance / schedule incoming requests to the right backend
2. Model server replicas that can run on multiple hardware topologies
3. Prefix caching hierarchy with well-tested variants for different use cases

So it's a 3-tier architecture. The biggest difference from Dynamo is that llm-d uses the inference gateway extension - https://github.com/kubernetes-sigs/gateway-api-inference-ext... - which brings Kubernetes-owned APIs for managing model routing, request priority and flow control, LoRA support, etc.
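To make the layering concrete, here is a toy sketch of how the first layer might use information from the other two. It is not llm-d's scheduler or the inference gateway extension's API: `Replica`, `cached_prefix_len`, and `pick_replica` are hypothetical names, and the scoring rule (longest cached prefix first, then shortest queue) is an assumption chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Replica:
    """Layer 2 stand-in: one model server replica (hypothetical type)."""
    name: str
    queue_depth: int = 0
    # Layer 3 stand-in: prompt prefixes this replica already has cached.
    cached_prefixes: set[str] = field(default_factory=set)


def cached_prefix_len(replica: Replica, prompt: str) -> int:
    """Length of the longest cached prefix that `prompt` starts with (0 if none)."""
    return max(
        (len(p) for p in replica.cached_prefixes if prompt.startswith(p)),
        default=0,
    )


def pick_replica(replicas: list[Replica], prompt: str) -> Replica:
    """Layer 1 stand-in: prefer the longest cache hit, then the shortest queue."""
    return max(
        replicas,
        key=lambda r: (cached_prefix_len(r, prompt), -r.queue_depth),
    )


if __name__ == "__main__":
    pool = [
        Replica("a", queue_depth=3, cached_prefixes={"You are a helpful"}),
        Replica("b", queue_depth=0),
    ]
    # A prompt sharing a cached prefix goes to the busier replica that holds it.
    print(pick_replica(pool, "You are a helpful assistant. Hi").name)
    # A prompt with no cache hit anywhere goes to the least loaded replica.
    print(pick_replica(pool, "Translate this sentence").name)
```

The point of the sketch is only that the scheduler can stay a separate layer: it reads cache and load signals from the replicas without needing to know what hardware they run on.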