Hacker News
by kkielhofner 957 days ago
Nice!

Any thoughts as to how this would come together with serving frameworks like vLLM, lmdeploy, Triton Inference Server, etc?

Certainly! We'd like our designs to be picked up by serving frameworks so they can benefit all users. Currently, Punica is built on top of the PyTorch and HuggingFace Transformers ecosystems. Therefore, vLLM and LMDeploy, which are also in the PyTorch ecosystem, should be straightforward to adapt. As for Nvidia Triton and TensorRT-LLM, since our kernels are written in CUDA, I believe they will integrate seamlessly as well.

We invite the open-source community to help us integrate Punica with all of these frameworks, so that everyone can benefit from the efficiency improvement!