Hacker News
by d_silin 2163 days ago
I wonder if it is possible to add a (small) FPGA to a personal computer that could accelerate any specific software tasks (video/audio encoding, ML algorithms, compression, extra FPU capabilities) on user demand.
7 comments

The problem with this will be the overhead of transferring data to and from the FPGA; once you account for it, doing the computation on the CPU often makes more sense. It's obviously not a show-stopper, since GPUs have the same problem and are still useful, but it's hard to find a workload that maps well to this solution.
In a DAW, accelerating a heavy VST plugin might make sense. But often those are amenable to being translated to GPGPU code already.
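
To make the transfer-overhead trade-off above concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names, the simple "latency plus bytes over bandwidth" link model, and the numbers in the usage note are hypothetical and not measurements of any real FPGA or bus.

```python
def offload_time(n_bytes, accel_compute_s, link_bytes_per_s, link_latency_s):
    """Estimated wall time for one offloaded job (hypothetical model).

    Assumes the input is sent to the accelerator and a result of the same
    size is sent back, each transfer costing a fixed latency plus
    n_bytes / bandwidth.
    """
    one_way_s = link_latency_s + n_bytes / link_bytes_per_s
    return 2 * one_way_s + accel_compute_s


def worth_offloading(n_bytes, cpu_compute_s, accel_compute_s,
                     link_bytes_per_s, link_latency_s):
    """True if the offloaded job, including transfers, beats the CPU."""
    total_s = offload_time(n_bytes, accel_compute_s,
                           link_bytes_per_s, link_latency_s)
    return total_s < cpu_compute_s
```

With made-up numbers (1 MB each way over a 1 GB/s link with 10 µs latency), a job that takes 10 ms on the CPU and 1 ms on the accelerator comes out ahead when offloaded, while a job that takes 1 ms on the CPU and 0.1 ms on the accelerator does not, because the roughly 2 ms of transfer time dominates.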

I guess the one place where GPGPU-based solutions wouldn't work is when the code you want to accelerate is necessarily acting as some kind of Turing machine (i.e. emulation of some other architecture). However, I can't think of a situation where an FPGA programmed with the netlist for arch A, running alongside a CPU running arch B, would make more sense than just getting the arch-B CPU to emulate arch A; unless, perhaps, the instructions in arch A are very, very CISC, perhaps with analogue components (e.g. RF logic, like a cellular baseband modem).

This is normally handled in emulation by putting the inner parts of the testbench (the transactors) onto the FPGA as well, to minimize the amount of data that has to be transferred between the CPU and the FPGA. If the FPGA is to be used as a peripheral, again a division of labor needs to be found that minimizes the amount of data that needs to be communicated. But if there is FPGA logic on the same chip as the CPU cores, the overhead can be greatly reduced, and we're seeing more of that now.
I assumed this was kind of Intel's plan when they purchased Altera. I think the issue with this is the amount of time it takes to load the bitstream, but I thought I saw some things recently where progress was being made on this front.
> issue with this is the amount of time it takes to load the bitstream, but I thought I saw some things recently where progress was being made on this front

You saw correctly: work is indeed being done to build "shells" that can accept workloads without the user having to go through the FPGA tooling/build process.

It's been possible for a long time, but there are big challenges to adoption. Every FPGA is different and the image is tightly coupled to the chip, so you'd have to compile the algorithm specifically for your chip before loading, which can take hours. Then loading the image each time you change out accelerators for a different application can take minutes. Then the software that uses the accelerator would have to know which chip and which image you're running and send data to it accordingly. Then you have to remember that FPGAs sometimes aren't really that great as accelerators, since they run at such low clock speeds, have crummy memory interfaces, limited gate support for floating point or even integer multiplication, etc. CPUs commonly outperform them even at the things they're supposed to be good at.

So it's unlikely ever to gain broad acceptance because the software vendors would have to support such a high number of permutations and the return can be questionable. This is why you see far more accelerators based on ASICs that have higher clock speeds and baked-in circuitry for specific tasks, with standardized APIs.

But sure, there's nothing preventing you from buying an FPGA board, hooking it up to your PC, creating a few images that do the accelerations you want, and writing software that uses them, swapping the image in when your program loads. You could even write a smart driver that swaps the image only if it's not in use by another app, or whatever. It's just unlikely you'll ever find a bunch of third-party software that supports it.
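
The "smart driver" idea above could be sketched roughly like this. It is only an illustration under stated assumptions: `FpgaImageManager`, `acquire`, `release`, and the `load_image` callback are hypothetical names, and the callback stands in for whatever vendor tooling actually programs the board.

```python
import threading


class FpgaImageManager:
    """Hypothetical helper: swap the loaded image only when the FPGA is idle."""

    def __init__(self, load_image):
        # load_image(name) is supplied by the caller and is assumed to
        # program the board; no real vendor API is implied here.
        self._load_image = load_image
        self._lock = threading.Lock()
        self._current = None   # name of the image currently loaded
        self._users = 0        # how many apps are using that image

    def acquire(self, image):
        """Try to use `image`; return False if a different image is busy."""
        with self._lock:
            if self._current == image:
                self._users += 1          # share the already-loaded image
                return True
            if self._users > 0:
                return False              # in use by another app, don't swap
            self._load_image(image)       # idle, so swapping is safe
            self._current = image
            self._users = 1
            return True

    def release(self):
        """Signal that one user is done with the current image."""
        with self._lock:
            if self._users > 0:
                self._users -= 1
```

An application would call `acquire("some-image")` at startup and fall back to a CPU code path if it returns `False`, then call `release()` on exit. A real driver would also have to deal with the image load times mentioned earlier, which this sketch ignores.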

There absolutely is. There are PCIe cards you can plug in and use as accelerators, just like you would use a GPU. Of course programming them to do the task you want is harder, but they can do anything. Saw a great example where someone implemented memcached on a single FPGA plug-in card and replaced many Xeons with it.
Isn't that what Apple did with the Afterburner Card for the Mac Pro? I read in https://www.anandtech.com/show/15646/apple-now-offering-stan... that the card is an FPGA.

I could imagine that Apple will include something like this in their Apple Silicon SoC for ARM Macs.

The Afterburner Card is not user programmable, but maybe it will be in the future, and this was just the first try to get the hardware in the field.

Yes, and it has been done. There are FPGAs that you can connect to over PCIe, and you only have to pay the small price of writing an FPGA implementation for your use case. It usually takes just a couple of weeks (OK, maybe months).
You might actually go even faster than PCIe by pretending to be a DDR4 memory stick.
IIRC some CPUs of the Intel Atom series already have an embedded FPGA.
Intel has launched a couple of Xeon Gold CPUs (like a variant of the 6138P) with integrated FPGAs for specific markets. Nothing mass-market, though, and they don't seem to have caught on much.