by nenkoru
1173 days ago
It would be great if you could help me with this PR, as well as with adding support for exporting a model that was quantized using GPTQ, bitsandbytes, or plain torch. This would bring the benefits of both worlds:
- Low memory footprint (thanks to quantization)
- Fast inference (thanks to IO binding)
In the case of Alpaca in particular, I have seen a 5x decrease in latency on an A100 and a 10x decrease on an AMD EPYC.
I believe this is the way for users to have an AI that can generate a response as fast as possible on their hardware.
I have also added a link to my profile on HF, which has small Alpaca models converted to ONNX format. Take a look at them.
[1] https://github.com/huggingface/optimum/pull/922
[2] https://huggingface.co/nenkoru