To the best of our knowledge, state-of-the-art performance for forward propagation of CNNs on FPGAs was achieved by a team at Microsoft. Ovtcharov et al. have reported a throughput of 134 images/second on the ImageNet 1K dataset [28], which amounts to roughly 3x the throughput of the next closest competitor, while operating at 25 W on a Stratix V D5 [30]. This performance is projected to increase by using top-of-the-line FPGAs, with an estimated throughput of roughly 233 images/second while consuming roughly the same power on an Arria 10 GX1150. This is compared to high-performing GPU implementations (Caffe + cuDNN), which achieve 500-824 images/second, while consuming 235 W. Interestingly, this was achieved using Microsoft-designed FPGA boards and servers, an experimental project which integrates FPGAs into datacenter applications.
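For a rough sense of what those figures mean per watt, here is a quick back-of-the-envelope sketch. It only divides the quoted throughput by the quoted power. The 25 W for the Arria 10 is an assumption (the excerpt says "roughly the same power"), and the dictionary keys and the `images_per_joule` helper are names made up for illustration.

```python
# Throughput (images/second) and power (watts) as quoted in the excerpt.
platforms = {
    "Stratix V D5 (measured)": (134, 25),
    # Assumption: "roughly the same power" is taken as exactly 25 W.
    "Arria 10 GX1150 (projected)": (233, 25),
    "GPU, Caffe + cuDNN (low end)": (500, 235),
    "GPU, Caffe + cuDNN (high end)": (824, 235),
}


def images_per_joule(images_per_second, watts):
    """Images/second divided by watts, i.e. images processed per joule."""
    return images_per_second / watts


efficiency = {
    name: images_per_joule(ips, watts)
    for name, (ips, watts) in platforms.items()
}

for name, value in efficiency.items():
    print(f"{name:32s} {value:5.2f} images/s/W")
```

On these numbers the Stratix V comes out around 5.4 images/s/W against roughly 2.1 to 3.5 for the GPU, so the GPU wins on raw throughput and the FPGA on efficiency. This ignores precision, memory, and board-level differences.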
That's hard to compare. Typically FPGAs do fixed-point math, so they can do more operations with less power. GPUs have traditionally done floating point. However, with the new Pascal architecture, certain cards (P4/P40) support 8-bit integer dot products, which give a massive boost in performance/W. Power draw on the P40 is still fairly high at 250 W, but that's for an entire card with 24 GB of memory. You'd have to compare that to an FPGA with that much memory on a PCIe card if you want an apples-to-apples comparison. Something like this is appropriate for comparison: http://www.nallatech.com/store/fpga-accelerated-computing/pc...
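To make the 8-bit integer dot product concrete, here is a minimal plain-Python sketch of the idea: four int8 pairs are multiplied and summed into a wider integer accumulator. The `dp4a` and `quantize` functions below are illustrative stand-ins, not NVIDIA's API, and the symmetric max-abs scaling is an assumed quantization scheme chosen for simplicity.

```python
def dp4a(a, b, acc=0):
    """Dot product of two 4-element int8 vectors, added to an accumulator."""
    assert len(a) == len(b) == 4
    assert all(-128 <= x <= 127 for x in (*a, *b))
    return acc + sum(x * y for x, y in zip(a, b))


def quantize(values):
    """Assumed scheme: symmetric scaling so the largest magnitude maps to 127.

    Returns the int8 values and the scale needed to map them back to floats.
    """
    scale = max(abs(v) for v in values) / 127
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


# Made-up example data, not taken from any real network.
weights = [0.5, -0.25, 0.75, 1.0]
inputs = [1.0, 2.0, -1.0, 0.5]

q_weights, w_scale = quantize(weights)
q_inputs, x_scale = quantize(inputs)

# Integer multiply-accumulate, then one float rescale at the end.
approx = dp4a(q_weights, q_inputs) * w_scale * x_scale
exact = sum(w * x for w, x in zip(weights, inputs))

print(f"exact={exact:.4f} approx={approx:.4f}")
```

The multiply-accumulate work happens entirely in small integers, which is where the performance/W gain comes from. The cost is a small quantization error relative to the floating-point result.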
https://arxiv.org/pdf/1602.04283v1.pdf