by tmurray 4980 days ago
(full disclosure: used to work for NV on CUDA and did very extensive work on Titan, so I am probably biased)

If you think your existing MPI app is going to automatically scale to a heterogeneous architecture (high-power x86 on the main CPU, Xeon Phi cores on the accelerator) and get acceptable performance, sorry, it's not going to happen.

The fundamental constraints on 2012/2013 Xeon Phi performance that determine how apps should be written are exactly the same as those of current desktop GPUs (small, high-latency local memory that is not coherent with the rest of the system; a relatively slow, high-latency link to the CPU; ugly interactions with network cards in most environments; a fundamental need to hide memory latency at all times). For any sort of performance beyond a standard Xeon, you're going to want to run a Xeon Phi as a targeted accelerator rather than offloading entire processes to it and using a standard MPI stack. This means you're going to be running in a hybrid host/device mode and using compiler directives or a specific parallel language and API to deal with on-chip execution and data transfer, which puts you in exactly the same solution space as with GPUs.

In other words: the Phi of today is not a panacea. You get better tools and more flexibility in terms of the programming model, but the fast path that any of its intended market would use in applications looks identical to GPUs.

1 comments

To my understanding, GPUs basically suck at anything with decision paths, or anything that moves away from straight matrix manipulation/signal analysis, right?
GPUs are SIMD machines, so they execute the same instruction simultaneously on all the active cores in a group. That means if you have code which branches, the hardware has to mask out the cores which follow branch B while it executes branch A, then mask out the cores which follow branch A while it executes branch B. In other words, if at least one core follows each side of the branch, it has to execute both branches.

If all cores branch in the same direction, you don't get that penalty. A large part of optimising for the GPU comes down to arranging your data and code so that this can happen.
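To make the masking concrete, here is a rough sketch in plain Python (not real GPU code). It simulates one lockstep group stepping through an if/else and counts how many branch bodies the group has to execute. The group width of 4, the even/odd condition and the function names are all made up for illustration; real hardware uses wider groups and does this in silicon.

```python
def run_branch_simd(values):
    """Simulate one lockstep group running `if v is even: A else: B`.

    Returns (results, passes), where passes is the number of branch
    bodies the whole group had to step through.
    """
    mask_a = [v % 2 == 0 for v in values]  # lanes that take branch A
    mask_b = [not m for m in mask_a]       # lanes that take branch B
    results = [None] * len(values)
    passes = 0

    if any(mask_a):  # at least one lane wants A, so the group runs A
        passes += 1
        for lane, active in enumerate(mask_a):
            if active:  # masked-off lanes sit idle during this pass
                results[lane] = values[lane] * 2
    if any(mask_b):  # at least one lane wants B, so the group runs B
        passes += 1
        for lane, active in enumerate(mask_b):
            if active:
                results[lane] = values[lane] + 1
    return results, passes


def total_passes(values, width=4):
    """Split the data into groups of `width` lanes and sum their passes."""
    total = 0
    for start in range(0, len(values), width):
        _, passes = run_branch_simd(values[start:start + width])
        total += passes
    return total


interleaved = [0, 1, 2, 3, 4, 5, 6, 7]  # every group mixes even and odd
regrouped = [0, 2, 4, 6, 1, 3, 5, 7]    # same data, evens and odds apart

print(total_passes(interleaved))  # both groups diverge
print(total_passes(regrouped))    # each group takes a single branch
```

Same data, same work per element, but the regrouped layout needs half as many passes in this toy model, which is the "arranging your data" part.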

GPUs suck at any problem that cannot be easily divided. If you can map a function over arbitrary chunks of self-contained data, GPUs will perform better.
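A small Python sketch of the difference, with hypothetical function names and toy data. The first computation only looks at its own chunk, so the chunks could be handed out in any size and any order. The second, at least as written here, needs the previous step's result before it can do the next one.

```python
def chunks(data, size):
    """Cut a list into consecutive pieces of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def scale_chunk(chunk):
    # Depends only on the items in this chunk: self-contained work.
    return [2 * x + 1 for x in chunk]


def map_chunks(data, size):
    """Apply scale_chunk to each chunk independently, then stitch results."""
    out = []
    for part in map(scale_chunk, chunks(data, size)):
        out.extend(part)
    return out


def chained(data, seed=0):
    """Each output feeds the next step, so the steps form a chain."""
    out = []
    state = seed
    for x in data:
        state = (state * 3 + x) % 7  # needs the previous state first
        out.append(state)
    return out


data = list(range(6))
# The chunk size does not change the answer, so the split is arbitrary.
print(map_chunks(data, 2) == map_chunks(data, 3))
print(chained(data))
```

This is only an illustration of the dependency structure, not a claim about how any particular GPU library would schedule either function.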