|
|
|
|
|
by ij23
1035 days ago
|
|
What local/in-K8-cluster models servers would you recommend adding ? Should we add support for llama.cpp and vllm.ai in the proxy server ? Or should we assume you can host them on your own infra and the proxy server requests your hosted model ? |
|
There is a ton of boilerplate around the actual model server that’s just busy work , but if done wrong can be a huge performance suck. Solve that.
Build the proxy that works with the most model servers out there. Do it in a way that once you have mindshare, the model server makers will be find it easy to put up a PR so that they can claim your proxy supports their server.
Don’t take a hard dependency on non-OSS stuff - being able to build an “on-prem” solution (read “deployed into customer’s VPC”) is table stakes for anyone to use your offering for a lot of enterprise use cases.
Edit: another unsolved problem - different models need slightly different prompts to solve the same problem well…