by rdli
396 days ago
This is really interesting. For SOTA inference systems, I've seen two general approaches:

* The "stack-centric" approach, such as vLLM production stack, AIBrix, etc. These set up an entire inference stack for you, including KV cache, routing, etc.

* The "pipeline-centric" approach, such as NVIDIA Dynamo, Ray, BentoML. These give you more of an SDK, so you can define inference pipelines that you then deploy on your specific hardware.

It seems like llm-d is the former. Is that right? What prompted you to go in that direction instead of the direction Dynamo took?