Hacker News new | ask | show | jobs
by mysteria 579 days ago
Even with a PCIe FPGA card you're still going to be memory bound during inference. When running LLama.cpp on straight CPU memory bandwidth, not CPU power, is always the bottleneck.

Now if the FPGA card had a large amount of GPU tier memory then that would help.