by vikp
1054 days ago
I would use textsynth (https://bellard.org/ts_server/) or llama.cpp (https://github.com/ggerganov/llama.cpp) if you're running on CPU.
- I wouldn't use anything larger than a 7B model if you want decent speed.
- Quantize to 4-bit to save RAM and run inference faster.
Speed will be around 15 tokens per second on CPU (tolerable), and 5-10x faster with a GPU.
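
To put rough numbers on the above, here's a back-of-the-envelope sketch in Python. It assumes the weights dominate memory and ignores quantization overhead (per-block scales), the KV cache, and runtime buffers, so real usage will be somewhat higher. The function names and the 300-token reply length are made up for illustration.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decoding rate."""
    return n_tokens / tokens_per_second


N_PARAMS = 7e9       # a 7B model
CPU_TOK_PER_S = 15   # the rough CPU figure quoted above
REPLY_TOKENS = 300   # assumed reply length, purely illustrative

fp16_gib = weight_memory_gib(N_PARAMS, 16)
q4_gib = weight_memory_gib(N_PARAMS, 4)
cpu_seconds = generation_seconds(REPLY_TOKENS, CPU_TOK_PER_S)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {q4_gib:.1f} GiB")
print(f"{REPLY_TOKENS} tokens on CPU: {cpu_seconds:.0f} s")
# A 5-10x GPU speedup would cut that to between a tenth and a fifth.
print(f"{REPLY_TOKENS} tokens on GPU: {cpu_seconds / 10:.0f}-{cpu_seconds / 5:.0f} s")
```

So 4-bit weights take roughly a quarter of the RAM that 16-bit weights do, which is what makes a 7B model comfortable on an ordinary desktop.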