I want to do zero-shot text classification either with the model [1] (711 MB) or with something similar. I want to achieve high throughput in classification requests per second. Classification will run on low-end hardware: some Hetzner [2] machine without a GPU (Hetzner is great, reliable and cheap, they just do not have GPU machines), something like this:

* CCX13: Dedicated vCPU, 2 vCPU, 8 GB RAM
* CX32: Shared vCPU, 4 vCPU, 8 GB RAM

Now there are multiple options for deploying and serving LLMs:

* lmdeploy
* text-generation-inference
* TensorRT-LLM
* vllm

There are more and more new frameworks for this. I am a bit lost. Would you suggest the best option for deploying the above-listed model (no-GPU hardware)?

[1] https://huggingface.co/MoritzLaurer/roberta-large-zeroshot-v2.0-c
[2] https://www.hetzner.com/cloud/

A few thoughts:
1) TensorRT anything isn’t an option because it requires Nvidia GPUs.
2) The serving frameworks you listed are built mainly for generative LLMs, so they likely don’t support the architecture of this model (a RoBERTa encoder used for classification), and even if they did, their CPU support varies.
3) I’m not terribly familiar with Hetzner but those instance types seem very low-end.
The model you linked has already been converted to ONNX. Your best bet (probably) is to take the ONNX model and load it in Triton Inference Server. Of course Triton is focused on Nvidia/CUDA but if it doesn’t find an Nvidia GPU it will load the model(s) to CPU. You can then do some performance testing in terms of requests/s but prepare to not be impressed…
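For the performance testing step, a minimal sketch of a sequential requests/s measurement might look like the following. It uses only the standard library; `stub_classify` is a hypothetical stand-in that you would replace with the real call to the served model (for example a request to Triton), and the 1 ms sleep is an arbitrary placeholder, not a measured latency.

```python
import time
from typing import Callable, Sequence


def measure_throughput(classify: Callable[[str], str],
                       texts: Sequence[str],
                       warmup: int = 2) -> float:
    """Return classification requests per second for sequential calls."""
    # Warm-up calls so one-time costs (model load, caches) are not timed.
    for text in texts[:warmup]:
        classify(text)

    start = time.perf_counter()
    for text in texts:
        classify(text)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


def stub_classify(text: str) -> str:
    """Hypothetical stand-in for the real model call; not a real classifier."""
    time.sleep(0.001)  # placeholder for inference time (assumption)
    return "positive" if "good" in text else "negative"


if __name__ == "__main__":
    sample = ["good service", "slow and unreliable"] * 50
    print(f"{measure_throughput(stub_classify, sample):.1f} requests/s")
```

This measures one request at a time; on a 2 to 4 vCPU machine you would probably also want to try concurrent clients and batching, since those can change the numbers considerably.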
Then you could look at (probably) int8 quantization of the model via the variety of available approaches (ONNX Runtime itself, Intel Neural Compressor, etc.). With Triton specifically you should also look at its OpenVINO CPU execution accelerator support. You will need to check whether any of these dramatically impacts the quality of the model.
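One simple way to check for a quality drop is to run the original and the quantized model over the same sample of texts and compare the predicted labels. The sketch below assumes you already have both lists of predictions; the label values and the `fp32_labels` / `int8_labels` names are made up for illustration.

```python
from typing import Sequence


def agreement_rate(baseline: Sequence[str], candidate: Sequence[str]) -> float:
    """Fraction of inputs on which two models predict the same label."""
    if len(baseline) != len(candidate):
        raise ValueError("prediction lists must have the same length")
    if not baseline:
        raise ValueError("need at least one prediction")
    matches = sum(b == c for b, c in zip(baseline, candidate))
    return matches / len(baseline)


# Hypothetical predictions on the same five texts (illustrative data only).
fp32_labels = ["sports", "politics", "sports", "tech", "tech"]
int8_labels = ["sports", "politics", "tech", "tech", "tech"]

print(f"agreement: {agreement_rate(fp32_labels, int8_labels):.0%}")
```

Agreement with the unquantized model only tells you how much the outputs moved; if you have labelled data, comparing accuracy for both variants is the more direct test.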
Overall I think “good, fast, cheap: pick two” definitely applies here, and even implementing what I’ve described is a fairly significant amount of development effort.