by Scene_Cast2 1996 days ago
A killer tech for this would be a framework that automatically reprograms the FPGA and offloads the work if it makes sense. For example - running k-means? Have your FPGA automatically (with minimal dev effort) flash to be a Nearest Neighbor accelerator.

The problem is finding a way to make that translation happen with minimal dev effort, as software is written rather differently from hardware.
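To make the k-means example concrete: the hot loop such a framework would need to spot and offload is the assignment step, where each point is matched to its nearest centroid. A minimal sketch in plain C, assuming row-major float arrays and squared Euclidean distance (the name `nearest_centroid` and the data layout are made up for illustration, not taken from any particular tool):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: index of the centroid closest to `point`.
   `centroids` holds k rows of `dim` floats, stored row-major. */
size_t nearest_centroid(const float *point, const float *centroids,
                        size_t k, size_t dim)
{
    assert(k > 0); /* at least one centroid is required */

    size_t best = 0;
    float best_dist = 0.0f;

    for (size_t c = 0; c < k; c++) {
        /* Squared Euclidean distance; the square root is not needed
           to decide which centroid is closest. */
        float dist = 0.0f;
        for (size_t j = 0; j < dim; j++) {
            float diff = point[j] - centroids[c * dim + j];
            dist += diff * diff;
        }
        if (c == 0 || dist < best_dist) {
            best_dist = dist;
            best = c;
        }
    }
    return best;
}
```

The distance computations for different centroids (and different points) do not depend on each other, which is why this kind of loop maps naturally onto parallel hardware. The hard part, as noted above, is getting from code like this to a bitstream without hand-written HDL.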

3 comments

I recommend checking out CacheQ: https://cacheq.com/

They are working on almost exactly this. If I were an investor, or Intel or AMD, I would buy them and/or invest heavily.

Their web site is very sparse on what programming models the tool supports. Traditionally, the things you can easily accelerate automatically are algorithms you can write naturally in Fortran 77 (lots of arrays, no pointers), and that's one limit on the applicability of these automatic tools. (Other limits that other posters have pointed out are compilation+place+route runtime, and reconfiguration time.)

They are claiming you can use malloc and make "extensive" use of pointers in C programs and still have them automatically compiled for the FPGA. That's where details are needed and they are mostly missing.

I watched their 30-minute demo film. The speedups are impressive, and on the small example it's impressive that it does the partitioning automatically. However, the program contains only a single call to malloc, and all pointers are derived from that address, so it doesn't do much to convince us that the memory model and alias analysis give you more flexibility than the F77 model.
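To illustrate why aliasing matters here, a small sketch in C (the function names are hypothetical, and this says nothing about how CacheQ's tool actually analyzes code). In the first function the compiler has to assume the two pointers may overlap, so iterations can depend on each other. In the second, `restrict` promises they do not, which is roughly the guarantee the F77 model gives you for free:

```c
#include <assert.h>
#include <stddef.h>

/* dst and src may overlap, so iteration i may read a value that
   iteration i-1 just wrote. The loop must behave as if run in order. */
void add_one(int *dst, const int *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}

/* restrict: the caller promises dst and src do not overlap, so all
   iterations are independent and could run in parallel.
   Calling this with overlapping buffers is undefined behavior. */
void add_one_noalias(int *restrict dst, const int *restrict src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + 1;
}
```

Calling `add_one(buf + 1, buf, 3)` on a zeroed four-element buffer yields `{0, 1, 2, 3}`, because each iteration reads what the previous one wrote. A naive parallel version of the same loop would produce `{0, 1, 1, 1}`. With many mallocs and arbitrary pointer arithmetic, a tool has to prove this kind of overlap cannot happen before it can parallelize, and that is the part the demo does not exercise.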

You might want to check out the "Warp Processing" project: http://www.cs.ucr.edu/~vahid/warp/. It is probably exactly what you are thinking about: transparent analysis of the instruction stream at runtime, followed by synthesis of the hot spots and offloading them to the FPGA.
Huh, interesting. It seems that the work doesn't have to be explicitly parallel for this to work, which is a surprise.
Why is that surprising? A lot of the statements in an ordinary program can be executed in parallel. It's just not worth creating threads for them, since the overhead of threads is larger than the cost of executing the instructions sequentially. In fact, modern out-of-order processors track data dependencies and execute independent instructions in parallel when possible.
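A small C sketch of what "independent statements" means in practice (the function names are made up for illustration). Both functions compute the same integer sum, but the first forms one long dependency chain while the second splits the work into two chains that do not depend on each other:

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: every addition needs the previous result. */
long sum_one_chain(const int *x, size_t n)
{
    assert(n == 0 || x != NULL);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Two independent chains: s0 and s1 never read each other, so the
   two additions in the loop body have no data dependency. */
long sum_two_chains(const int *x, size_t n)
{
    assert(n == 0 || x != NULL);
    long s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        s0 += x[i];
    return s0 + s1;
}
```

Whether a given CPU or compiler actually overlaps the two chains depends on the hardware and optimization settings, so treat this as an illustration of the dependency structure rather than a benchmark. The same structure is what a hardware synthesis tool would look for when deciding what can become parallel logic.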
I recall reading papers about doing this by profiling Java apps a decade or so ago, but I would have to dig pretty deep in my HN comment history to find them.

The approach seems conceptually similar to the optimizations available via the enterprise version of GraalVM.