by hkhall 3619 days ago
As this thread is filled with people who know way, way more about CUDA and OpenCL than I do, I hope you will indulge a serious question: I get that graphics cards are great for floating-point operations and that bitwise binary operations are supported by these libraries, but are GPUs similarly efficient at them?

Some background: I occasionally find myself doing FPGA design for my doctoral work, and I am realizing that the job market when I finish may be better for me if I were fluent in GPGPU programming, since it is easier to build, manage, and deploy a cluster of GPU machines than the equivalent for FPGAs.

My current problem involves huge numbers of XOR operations on large vectors, and if OpenCL or CUDA can be learned and spun up quickly (I have a CS background), I may be inclined to jump aboard this train rather than buy another FPGA for my problem.

4 comments

http://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithm...

Throughput of integer operations ranges between 25% and 100% of floating point FMA performance. 32-bit bitwise AND, OR, XOR throughput is equal to 32-bit FMA throughput.
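To make the quoted figure concrete: an XOR over large vectors is one independent 32-bit operation per word, which is the shape a GPU kernel would take with roughly one thread per word. A minimal CPU-side sketch in Python follows. The function name `xor_words` and the sample values are hypothetical, and it only illustrates the per-element work, not GPU performance.

```python
from array import array

def xor_words(a, b):
    """Element-wise XOR of two equal-length vectors of unsigned words."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Each output word depends only on the matching input words, so on a GPU
    # every iteration of this loop could be handled by a separate thread.
    return array("I", (x ^ y for x, y in zip(a, b)))

# Type code "I" is an unsigned int, which is 32 bits on common platforms.
a = array("I", [0xFFFF0000, 0x12345678, 0x00000000])
b = array("I", [0x0F0F0F0F, 0x12345678, 0xDEADBEEF])
result = xor_words(a, b)
print([hex(w) for w in result])
```

Note that the loop does one operation for every two words read and one word written, which matters for the bandwidth discussion further down the thread.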

It depends on the arithmetic intensity, i.e. the number of ops per byte loaded. Nvidia packs their GPUs with a lot of float32 (or float64) units because some problems (e.g., convolution, or more typical HPC problems like PDEs, which will probably be done in float64) have a high flop / byte ratio.

A problem that just calculates, say, Hamming distance, or performs 1-2 integer bit ops per integer word loaded, will probably be memory bandwidth bound rather than limited by integer op throughput. More complicated operations (e.g., cryptographic hashing) that have a higher integer op / byte loaded ratio will be limited by the reduced throughput of the integer functional units rather than by memory bandwidth.
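A rough way to check which side of that line a kernel falls on is a roofline-style estimate: attainable throughput is the smaller of the peak op rate and (ops per byte loaded) x (memory bandwidth). The sketch below assumes made-up device numbers (`PEAK_OPS`, `MEM_BW`) and a made-up op count for the hash-like case; none of them describe a real GPU.

```python
def attainable_ops_per_sec(ops_per_byte, peak_ops_per_sec, mem_bytes_per_sec):
    """Roofline-style bound: capped by compute or by memory traffic."""
    return min(peak_ops_per_sec, ops_per_byte * mem_bytes_per_sec)

def is_memory_bound(ops_per_byte, peak_ops_per_sec, mem_bytes_per_sec):
    """True when memory traffic, not the functional units, sets the limit."""
    return ops_per_byte * mem_bytes_per_sec < peak_ops_per_sec

# Hypothetical device, for illustration only.
PEAK_OPS = 4e12   # integer ops per second
MEM_BW = 3e11     # bytes per second

# Hamming distance: one XOR plus one popcount per pair of 4-byte words,
# i.e. 2 ops per 8 bytes loaded.
hamming_intensity = 2 / 8

# Hash-like kernel: assume about 1000 integer ops per 64-byte block
# (an illustrative figure, not a measurement of any real hash).
hash_intensity = 1000 / 64

for name, intensity in [("hamming", hamming_intensity), ("hash", hash_intensity)]:
    bound = "memory" if is_memory_bound(intensity, PEAK_OPS, MEM_BW) else "compute"
    rate = attainable_ops_per_sec(intensity, PEAK_OPS, MEM_BW)
    print(f"{name}: {bound} bound, about {rate:.3g} ops/s attainable")
```

With these assumed numbers the Hamming-distance case reaches only a small fraction of the peak op rate, so faster integer units would not help it, while the hash-like case hits the compute ceiling.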

For "deep learning", convolution is one of the few operations that tends to be compute bound rather than memory b/w bound. It's my understanding that SGEMM (float32 matrix multiplication) has been memory b/w limited for a while on Nvidia GPUs. Though if the architecture changes (as with Pascal), the balance between compute throughput, memory b/w, and on-chip resources (smem, register file) may shift these ratios.

AMD GPUs have a reputation for fast integer operations (bitwise operations fall into that category), which is why they are often chosen for bitcoin mining. So you might want to consider learning OpenCL, since CUDA runs only on NVidia cards.

I've spent a lot of time using both OpenCL and CUDA, and I would recommend CUDA, not because I like NVidia as a company, but because your productivity will be so much higher.

NVidia has really invested in their developer resources. Of course, if the time you spend writing code and debugging driver issues isn't that important, then an AMD card using OpenCL might be the right choice.

(I'll try to be honest about my bias against NVidia, so you can more accurately interpret my suggestions. I think along the lines of Linus Torvalds with regard to NVidia... http://www.wired.com/2012/06/nvidia-linus-torvald/ )

I think both of these can be learned reasonably quickly if you know a bit about C programming. I'd also tend to agree that GPGPU is probably a better bet than FPGAs these days.