They haven't been able to compete with GPUs on perf/watt. In general you end up designing an AI accelerator for the FPGA (because the models are too big to map onto a single device all at once), but it's hard to beat purpose-built tensor and vector HW on a GPU when you're running soft logic.
FPGAs are designed to fight latency as much as possible. To do this, they have networks of switches to shuttle bits across the chip and keep delays to the bare minimum, in order for synchronous logic to be able to run at the highest possible clock rates for signals that traverse the entire chip.
To meet this goal, a huge amount of effort is required to compile a design written in Verilog, VHDL, etc. into the set of bits that programs all of the switching logic and lookup tables in the chip. I'm led to believe it can sometimes take a day or more per compile.
The second factor optimized for in FPGAs is utilization, trying to use 100% of the available resources of the chip. This is never achieved in practice.
Because everything is optimized for speed, an FPGA is not very power efficient.
---
Generally, FPGAs aren't the right architecture for neural networks. If you could load all of the weights into the LUTs and leave them there, you'd get the type of speedups you want, but FPGAs of that scale just don't exist.
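To put rough numbers on that, here's a back-of-the-envelope sketch. Everything in it is an assumption for illustration, not a datasheet figure: a hypothetical device with 4 million 6-input LUTs, a hypothetical 7-billion-parameter model, and 8-bit weights. It also pretends every LUT does nothing but hold weight bits, which is wildly optimistic since you'd have no logic left to compute with.

```python
def lut_bits(num_luts: int, inputs_per_lut: int = 6) -> int:
    """Truth-table bits available if every LUT stored only weights.

    A k-input LUT holds 2**k configuration bits.
    """
    return num_luts * (2 ** inputs_per_lut)


def devices_needed(num_params: int, bits_per_weight: int,
                   num_luts: int, inputs_per_lut: int = 6) -> float:
    """How many such devices it would take just to hold the weights."""
    weight_bits = num_params * bits_per_weight
    return weight_bits / lut_bits(num_luts, inputs_per_lut)


# Illustrative assumptions only (hypothetical device and model sizes).
ASSUMED_LUTS = 4_000_000
ASSUMED_PARAMS = 7_000_000_000
ASSUMED_BITS_PER_WEIGHT = 8

if __name__ == "__main__":
    capacity = lut_bits(ASSUMED_LUTS)
    n = devices_needed(ASSUMED_PARAMS, ASSUMED_BITS_PER_WEIGHT, ASSUMED_LUTS)
    print(f"LUT capacity: {capacity / 8 / 1e6:.0f} MB")
    print(f"Devices needed just for weights: {n:.2f}")
```

Under those assumptions the weights alone are a couple of hundred times larger than the LUT storage of one device, before you spend a single LUT on arithmetic or routing.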
> I'm led to believe it can sometimes take a day or more per compile.
This is true and misleading at the same time. Filling a large FPGA takes time, but if you are working with a small FPGA the turnaround time can be 15 minutes.