by nenkoru 1173 days ago
It would be great if you could help me with this PR [1], as well as with adding support for exporting models that were quantized using GPTQ, bitsandbytes, or plain torch. This would bring the best of both worlds:

- Low memory footprint (thanks to quantization)

- Fast inference (thanks to IO binding)
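
To make the memory point concrete, here is a rough sketch of where the savings come from. It uses plain symmetric int8 quantization on a toy weight list; this is only an illustration, not what GPTQ or bitsandbytes actually do (those use more elaborate schemes), and the function names are made up for the example.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp into the int8 range [-127, 127].
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [scale * v for v in q]


# Toy "weights" stored as 32-bit floats (hypothetical values).
weights = array("f", [0.5, -1.0, 0.25, 0.75, -0.125, 0.0, 1.0, -0.5])

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

fp32_bytes = weights.itemsize * len(weights)  # 4 bytes per value
int8_bytes = q.itemsize * len(q)              # 1 byte per value
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"fp32: {fp32_bytes} bytes, int8: {int8_bytes} bytes")
print(f"max round-trip error: {max_error:.6f} (scale = {scale:.6f})")
```

The weights take a quarter of the space, and the round-trip error per weight stays within about half of one quantization step (scale / 2).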

For Alpaca in particular, I have seen a 5x decrease in latency on an A100 and 10x on an AMD EPYC. I believe this is the way for users to have an AI that can generate a response as fast as their hardware allows. I have also added a link to my HF profile [2], where I have small Alpaca models converted to ONNX format. Take a look at them.

[1] https://github.com/huggingface/optimum/pull/922

[2] https://huggingface.co/nenkoru