by junrushao1994 1140 days ago
Thanks for the feedback! This is definitely something we need to do. To share some data: the default model is currently Vicuna-7b, aggressively quantized to 2.9 GB.

We are expanding coverage to more models. In particular, Dolly and StableLM are just around the corner and only need some cleanup work.

As a brand-new project, we are just starting to collect data points on which GPU models are well supported, and we are fixing issues as they are reported. Please don't hesitate to report problems in our GitHub issues!

2 comments

I see, the 2.9 GB requirement seems to imply 3-bit weights?
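
As a rough sanity check on that inference, here is a back-of-the-envelope sketch in Python. The parameter count of 7e9 is an assumption (the nominal "7B"; the real count may differ slightly), and the helper name bits_per_weight is made up for illustration. It also ignores any per-group scales or unquantized layers, so it only gives an average number of bits per parameter.

```python
def bits_per_weight(size_bytes: float, n_params: float) -> float:
    """Average number of bits stored per parameter for a given model size."""
    return size_bytes * 8 / n_params


# Assumption: a nominal 7 billion parameters for a "7B" model.
N_PARAMS = 7e9

# "2.9 GB" could mean decimal gigabytes or binary gibibytes; try both.
size_decimal = 2.9e9
size_binary = 2.9 * 2**30

print(round(bits_per_weight(size_decimal, N_PARAMS), 2))
print(round(bits_per_weight(size_binary, N_PARAMS), 2))
```

Either reading lands between 3 and 4 bits per parameter on average, which fits 3-bit weights plus some overhead, though the thread itself does not confirm the exact scheme.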

In any case, I am happy to see these projects taking shape. Perhaps one could eventually make the level of quantization dynamic, based on the available VRAM etc. :)
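
To make the "dynamic quantization level" idea concrete, here is a minimal sketch of what such a selection could look like. Everything in it is hypothetical: the function name pick_bit_width, the candidate bit widths, and the 20% memory reserve for activations and caches are assumptions for illustration, not anything the project is known to do.

```python
from typing import Optional, Sequence


def pick_bit_width(
    free_vram_bytes: float,
    n_params: float,
    candidates: Sequence[int] = (8, 4, 3),  # hypothetical supported widths
    reserve_fraction: float = 0.2,          # assumed headroom for activations etc.
) -> Optional[int]:
    """Return the largest candidate bit width whose weights fit in the budget."""
    budget = free_vram_bytes * (1 - reserve_fraction)
    for bits in sorted(candidates, reverse=True):
        weight_bytes = n_params * bits / 8
        if weight_bytes <= budget:
            return bits
    return None  # nothing fits, even at the lowest precision


# Example: a nominal 7B-parameter model on GPUs with different free memory.
for free_gb in (16, 8, 4, 2):
    print(free_gb, pick_bit_width(free_gb * 1e9, 7e9))
```

The sketch prefers the highest precision that fits, on the assumption that fewer bits means lower quality; a real implementation would also need to account for per-group scales and the memory used at inference time.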

I will definitely play around with it (on Linux though, not a phone!)

When people tried 3-bit quantization for 7B models before, it did not go well: the detrimental side effects were significant. Are you using some new quantization technique that mitigates that?