by mikeayles
17 days ago
So for people wondering if it can be used to accelerate LLM inference: sadly not. I've been trying to hit 100,000 tokens/s with a dumb 3.28M-parameter model, and even that is an order of magnitude too large to benefit. It appears to be focussed more on latency than throughput. Happy to be corrected?

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?