by CthulhuOvermind 4065 days ago
Interesting to see this here. I did my master's thesis on this SDK this past September. We compared a neural network in native C to a CPU OpenCL implementation and an FPGA implementation. The FPGA had about 8-10 times the kernel performance of an i7-2600K for the task. Interestingly enough, what caused the jump in performance was the ability to have memory close to the kernel, with enough capacity to handle the kernel's demands. The CPU was capped by the RAM-to-CPU bandwidth, around 21 GB/s. The FPGA, despite sitting behind a slower PCIe link, did not suffer, because its on-chip memory could hold the necessary data close at hand. Hence I sent the data to the kernel asynchronously, then a kernel with around 120 parallel instances would operate on it and feed the data back through PCIe.

Having OpenCL certainly reduced dev time by around 85%, I'd say. And that's from someone fluent in Verilog who didn't know OpenCL before doing this.

4 comments

Was there something preventing you from using a GPGPU? Embarrassing memory bandwidth is one of their strengths, and of course they run OpenCL.
How efficiently did OpenCL use the FPGA?

And BTW, I remember reading a paper that compared CPU/GPU/FPGA. The conclusion was: GPUs win on compute/$, FPGAs win on compute/watt. Hard to find the paper now, though.

> How efficiently did OpenCL use the FPGA?

OpenCL will always introduce some inefficiency compared with hand-written Verilog. Hopefully this will decrease in future versions of OpenCL. Note that the development time decreased by over 80%...

Interesting. Is there any reference or tutorial for a fellow FPGA developer who is interested in studying OpenCL for FPGAs?

Also, is there a link for your thesis? What kind of data processing was needed? I mostly work with signal processing for RF signals, pipelining data from ADCs.

So I'd say just pick up OpenCL in general, and then follow Altera's best practices until you feel comfortable.

No link for my thesis, but I can send you a PDF if you want. In terms of what was needed, my kernel was the simulation's innermost loop, which would take in 4 values (2 floats, a const int and a double) per neuron and use them to update the neuron's state. The simulation ran at a resolution of 1 ms, with between 1k and 100k neurons.

In essence, it was high-repetition, low-complexity, memory-heavy calculation.

In fact, what I experienced was that a) the biggest overhead is actually setting up the kernel and b) you have to take the memory requirements into account.

The biggest upgrades in performance came from switching the data transfers from synchronous to asynchronous (to alleviate memory bottlenecks as much as possible), and from increasing the number of neurons.
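
To illustrate why moving from synchronous to asynchronous transfers helps, here is a rough timing model in Python. It is a sketch under stated assumptions, not the thesis code: the function names and per-batch timings are hypothetical, and it assumes that the host-to-device transfer, the kernel, and the device-to-host transfer can fully overlap once the pipeline is full.

```python
def synchronous_total_us(batches, t_in_us, t_kernel_us, t_out_us):
    """Each batch waits for upload, kernel, and download in turn."""
    return batches * (t_in_us + t_kernel_us + t_out_us)


def pipelined_total_us(batches, t_in_us, t_kernel_us, t_out_us):
    """Stages overlap, so after the first batch fills the pipeline
    each further batch costs only as much as the slowest stage."""
    if batches == 0:
        return 0
    fill = t_in_us + t_kernel_us + t_out_us
    slowest = max(t_in_us, t_kernel_us, t_out_us)
    return fill + (batches - 1) * slowest


# Hypothetical timings in microseconds, chosen only for illustration.
sync_us = synchronous_total_us(100, 200, 500, 200)
async_us = pipelined_total_us(100, 200, 500, 200)
print(sync_us, async_us)
```

With these made-up numbers the synchronous version takes 90000 us and the pipelined one 50400 us. The gain depends on how evenly the three stages are balanced: once one stage dominates, it sets the throughput.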

The most interesting bit was that, due to the simulation characteristics (Izhikevich model of an SNN), the firing rate dropped at around 20-30k neurons. With a low firing rate, I could simulate in real time (i.e. 1 ms of simulation in 1 ms of real time) 18k neurons, and up to 80k neurons because of the disparities in firing rates.
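
For readers unfamiliar with the Izhikevich model mentioned above, a minimal Python sketch of one neuron update at a 1 ms resolution might look like this. It is not the thesis kernel: it uses the textbook equations with the standard "regular spiking" parameters (a, b, c, d) as an assumption, takes a single forward-Euler step per millisecond for simplicity, and the helper names are hypothetical.

```python
def izhikevich_step(v, u, current, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
    """Advance one neuron by dt milliseconds.

    v is the membrane potential (mV), u the recovery variable.
    Returns the new (v, u) and whether the neuron fired on this step.
    """
    fired = v >= 30.0
    if fired:
        # Spike: reset the membrane potential and bump the recovery variable.
        v, u = c, u + d
    # Forward-Euler update of the two model equations.
    v = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
    u = u + dt * a * (b * v - u)
    return v, u, fired


def count_spikes(current, steps=1000):
    """Simulate one neuron for `steps` ms with a constant input current."""
    v, u = -65.0, 0.2 * -65.0
    spikes = 0
    for _ in range(steps):
        v, u, fired = izhikevich_step(v, u, current)
        spikes += fired
    return spikes


print(count_spikes(10.0), count_spikes(0.0))
```

Each update is only a handful of multiply-adds per neuron, which fits the "high repetition, low complexity" description: the arithmetic is cheap, so keeping the per-neuron state fed to the kernel is what matters.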

Would you please send me a pdf of your thesis?
Care to mention in what field you did your bachelor's and master's? Did you get into FPGAs through CS or through EE? I'm always curious about FPGA engineers.
Heya, I did my undergrad in EE, and changed it halfway from a 4-year MEng course to a 3-year bachelor's because I was bored of the power-distribution-related stuff. I went on to do a master's in Advanced Microelectronics Design. I had contact with FPGAs in both my undergrad and my master's. I did my BEng thesis on an ARM M0 soft implementation on an FPGA that implemented an expanded instruction set with increased performance on the SHA-1 algorithm.