by nl
1208 days ago
On a CPU, I'd estimate a maximum of around 5 tokens per second (a token here being a sub-word unit, so generally a couple of letters). I suspect the large model would be closer to 1 token per second without additional optimisation. Yes, models can be split up; see e.g. Hugging Face Accelerate.
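To make those two points concrete, here is a rough sketch in plain Python. It is not Accelerate's actual placement algorithm, only a simplified illustration of the idea of spreading a model's layers over several devices by memory budget. The function names, layer sizes, and device budgets are all made up for the example; the 5 and 1 tokens-per-second figures are the estimates from the comment above.

```python
# Illustrative only: NOT Accelerate's real algorithm. All names, layer sizes,
# and device budgets here are hypothetical.

def split_layers(layer_sizes_gb, device_budgets_gb):
    """Place layers in order, filling one device before moving to the next.

    Consecutive layers stay together, so activations only cross a device
    boundary where one device's budget runs out.
    Returns a dict mapping layer index -> device name.
    Raises ValueError if the layers do not fit in the given budgets.
    """
    devices = list(device_budgets_gb.items())  # dicts keep insertion order
    placement = {}
    dev_idx = 0
    used = 0.0
    for i, size in enumerate(layer_sizes_gb):
        # Advance to the next device while the current one cannot hold this layer.
        while dev_idx < len(devices) and used + size > devices[dev_idx][1]:
            dev_idx += 1
            used = 0.0
        if dev_idx == len(devices):
            raise ValueError(f"layer {i} does not fit on any remaining device")
        placement[i] = devices[dev_idx][0]
        used += size
    return placement


def seconds_to_generate(num_tokens, tokens_per_second):
    """Wall-clock time for a reply at a given generation speed."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Six hypothetical 2 GB layers over two small GPUs, spilling to CPU RAM.
    layers = [2.0] * 6
    budgets = {"gpu0": 5.0, "gpu1": 5.0, "cpu": 16.0}
    print(split_layers(layers, budgets))

    # A 200-token reply at the two speeds estimated above.
    print(seconds_to_generate(200, 5))  # 40.0 seconds
    print(seconds_to_generate(200, 1))  # 200.0 seconds
```

In practice you wouldn't write this yourself: Accelerate can work out a placement across GPUs, CPU RAM, and disk for you, for instance when a model is loaded with `device_map="auto"`. Layers that spill to CPU or disk run much slower, which is part of why the throughput estimates above are so low.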