by baseethrowaway 3058 days ago
Accelerate the most demanding tasks of your server's workload. Crypto, audio/video encoding, compression, database lookups, neural network training, computational fluid dynamics, numerical mathematics, high frequency trading etc.

I hear this a lot, but every time I try to implement a specific algorithm (in crypto, compression, and ML so far), I find that a GPU beats the FPGA on practically every metric but power, especially total cost. No matter how nicely the problem seems to map to an FPGA, GPUs start from such high performance that I can't seem to beat them. The one exception so far is some genomic algorithms.

Are there any really good papers, projects, or products that show where FPGAs provide a major commercial benefit over a GPU?

> Are there any really good papers, projects, or products that show where FPGAs provide a major commercial benefit over a GPU?

A couple applications that come to mind:

- RF, including cellular base station hardware and radar

- ASIC prototyping

- Low production run computer hardware, including some RAID controllers

High-end AV equipment like mixing consoles often use FPGAs alongside or instead of discrete DSP chips. The application demands sub-millisecond latency and deterministic performance but is too niche to justify spinning an ASIC.

IMO, those are the main factors that justify FPGA selection - low latency and hard real-time performance. I understand that military and industrial designers make extensive use of FPGAs for these reasons; the throughput isn't necessarily any better than an ordinary processor and the cost is drastically higher, but you have absolute certainty about latency.

These all seem fair. I've been mostly looking at large-scale FPGAs in a data center, a la F1 or Microsoft's Catapult. I hadn't given much thought to use as low-run hardware, at which I'm sure they excel.

They have a pretty stable niche in high-volume signal processing because they can deliver deterministic real-time processing for data. You'll find an FPGA in most oscilloscopes, digitizers and the like. A GPU would not be well suited to that use.

I see the appeal here, but I'm surprised a good DSP or a mid-volume, old-node ASIC isn't the more common solution. Do these need such specialized processing that FPGAs make economic sense?

Scopes and other digitizers can operate at GHz sample rates, and something like an FPGA can interface those rates (over JESD, for example) with other peripherals like memory.

Scopes and digitizers are actually extremely demanding in terms of throughput, plus they do a lot of specialized processing on the samples. Even mid-range models are using top-grade FPGAs and custom ASICs. Keysight's mid-range scope is capable of 5 gigasamples per second at 8-bit sample depth on 4 channels; that's 20 GB/s of data that needs to be processed in [soft] real time.
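
For anyone who wants to check the 20 GB/s figure, here is a rough sketch of the arithmetic. The function name is made up for illustration, and it assumes all 4 channels run at the full 5 GS/s at the same time with no packing overhead, which the comment above implies but does not state.

```python
def raw_sample_bytes_per_sec(samples_per_sec, bits_per_sample, channels):
    """Raw ADC output in bytes/s (hypothetical helper, ignores framing overhead)."""
    return samples_per_sec * bits_per_sample * channels // 8

# Figures from the comment above: 5 GS/s, 8-bit samples, 4 channels.
rate = raw_sample_bytes_per_sec(5_000_000_000, 8, 4)
print(rate / 1e9, "GB/s")
```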

Regarding cost, these are expensive, complicated instruments. A bottom-of-the-barrel oscilloscope costs $300 and professional-grade units are more like $2,000-3,000. The top-grade ones can cost half a million dollars ( https://www.keysight.com/en/pcx-x205212/infiniium-z-series-o... ).

An FPGA has an area-to-speed trade-off: if you use a lot of area, things are generally quicker. The main reason is that an FPGA has really fine-grained parallelism. A GPU might have ~2000 simple cores. These UltraScale+ FPGAs likely have over a million logic cells; the one in this post has 2852K. So if we were computing just basic logic, we would have over 1000 times more parallelism than the GPU. However, most problems are not basic logic, and the FPGA does not have enough IO pins for that unless we combined the results. The count also excludes built-ins like adders, DSP cells, etc. Still, that's the general idea: fine-grained parallelism.
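
The "over 1000 times" figure is just the ratio of the two counts quoted above. A quick sketch, with the caveat that a logic cell and a GPU core are not equivalent units of work, so this is only the best-case ratio for pure bit-level logic:

```python
# Counts taken from the comment above; variable names are illustrative.
fpga_logic_cells = 2_852_000   # "2852K logic cells"
gpu_simple_cores = 2_000       # "~2000 simple cores"

parallelism_ratio = fpga_logic_cells / gpu_simple_cores
print(f"{parallelism_ratio:.0f}x more parallel units")
```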

So if you're looking for a problem that will run faster on an FPGA, you want something that is simple but that you need to compute often. That way you can duplicate it thousands of times on the FPGA. Another place an FPGA excels is with data words that are quite long. Compare a 32-bit number to a 1024-bit number. You could do whatever you're doing to the 1024-bit number in one pass with an FPGA. However, the GPU's native int size is probably 32 bits, so to perform one operation on that one number the GPU has to perform at least 32 operations. That overhead has to be carried around for every operation a GPU core performs. The HBM added to this FPGA makes it even better in cases like this. That's just the general idea, though.
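
To make the "at least 32 operations" point concrete, here is a minimal sketch of adding two 1024-bit numbers on a machine with 32-bit words. It is plain Python standing in for what a 32-bit core would do, not GPU code, and the helper names are made up for illustration. Each loop iteration is one word-sized add-with-carry, and there are 1024 / 32 = 32 of them, chained through the carry.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1

def to_limbs(value, total_bits=1024):
    """Split an integer into little-endian 32-bit limbs."""
    count = total_bits // LIMB_BITS
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def from_limbs(limbs):
    """Reassemble little-endian 32-bit limbs into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def add_limbs(a, b):
    """Add two limb arrays; returns (result limbs, carry out, word adds used)."""
    out, carry, word_adds = [], 0, 0
    for x, y in zip(a, b):
        s = x + y + carry            # one 32-bit add-with-carry
        out.append(s & LIMB_MASK)    # keep the low 32 bits
        carry = s >> LIMB_BITS       # carry feeds the next limb
        word_adds += 1
    return out, carry, word_adds

a, b = 3**600, 7**300                # both fit in 1024 bits
limbs, carry_out, word_adds = add_limbs(to_limbs(a), to_limbs(b))
print(word_adds, from_limbs(limbs) == a + b)
```

The carry chain is what makes this awkward to spread across cores, whereas an FPGA can build a single adder as wide as the data.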

So if what you are doing can take advantage of the FPGA's strengths, you could come out with a much faster solution. There is also power usage. If you're going to be building clusters for whatever processing you need, and the FPGA performs about the same as a GPU, you will use less electricity.

Thanks for these!

The first paper is actually one I've spent a significant amount of time trying to use, to the point of collaborating with one of the authors. His conclusion was that FPGAs used to be competitive with GPUs for approximated nets, but the Tesla GPUs were such a jump forward in practical network performance that it wasn't worth trying to compete outside specialized realms like binary nets.

The second paper was interesting. I can imagine why the problem they are trying to solve would be a good fit for FPGAs. However, I'm suspicious that they implemented an entirely different algorithm on the FPGA, and didn't measure the performance of that algorithm on GPUs. I'm all for using the best algorithm for the hardware, but I worry they just used an overall better algorithm on FPGAs and conflated the results.

> However, I'm suspicious that they implemented an entirely different algorithm on the FPGA, and didn't measure the performance of that algorithm on GPUs.

I agree this was a bit suspicious. It may be that the different algorithm they used for the FPGA would also have done well on a GPU. Or, perhaps more likely, if they had spent a similar amount of effort rethinking the algorithm just for the GPU, they would have ended up with a third, GPU-specialised approach that did dramatically better.

Pragmatically, it seems like they chose the GPU for their application anyway - so they had already decided the GPU was the overall winner without needing to improve it.

Have you heard of Microsoft's distributed FPGA fabric project? They are using acceleration for a lot of Bing's algorithms with excellent results all around. They are now working on bringing it to Azure.

That's Catapult, right? I have read through some of their papers. It sounds like they might be offloading some network work onto the FPGAs in the same way AWS has their custom Nitro card, but I've not really been impressed with their attempts at data processing improvements (that is, some reason for me to use it on Azure). I haven't read all of their papers, but what I have read always sounded like an after-the-fact justification for the FPGAs. They might show the FPGAs are better at a machine learning task than CPUs, but unless you are deciding whether to use FPGAs you already have, the real competition is GPUs, and they tended not to compare against a GPU.

Do you have a recommendation for a specific data processing experiment of theirs I should check out? I really feel like I just missed a paper where they proved some real advantage over other hardware, and once I found that I'd understand. I respect the Azure teams generally and assume they know what they're doing, but I can't escape the hunch that what they are doing is network acceleration, and that they are just releasing it to the public cloud because they have these sitting around anyway.