The first paper is actually one I've spent a significant amount of time trying to use, to the point of collaborating with one of the authors. His conclusion was that FPGAs used to be competitive with GPUs for approximated nets, but the Tesla GPUs were such a jump forward in practical network performance that it wasn't worth trying to compete outside specialized realms like binary nets.
The second paper was interesting; I can imagine why the problem they are trying to solve would be a good fit for FPGAs. However, I'm suspicious that they implemented an entirely different algorithm on the FPGA, and didn't measure the performance of that algorithm on GPUs. I'm all for using the best algorithm for the hardware, but I worry they just used an overall better algorithm on FPGAs and conflated the results.
> However, I'm suspicious that they implemented an entirely different algorithm on the FPGA, and didn't measure the performance of that algorithm on GPUs.
I agree this was a bit suspicious. The algorithm they used on the FPGA might have done well on a GPU too. Or, perhaps more likely, if they had spent a similar amount of effort rethinking the algorithm just for the GPU, the resulting third, GPU-specialised approach might have done dramatically better.
Pragmatically, it seems they chose the GPU for their application anyway, so they had already decided the GPU was the overall winner without needing to improve it.