Hacker News new | ask | show | jobs
by _yosefk 3699 days ago
OpenCL was moving to LLVM IR away from programs-as-source-code-strings at some point. Still, you ought to have a JIT step if you want to be able to ship the same binary to multiple systems with GPUs not compatible at the binary level. And BTW shipping LLVM IR on the CPU would make things better in terms of supporting CPUs with incompatible ISAs... whether that ever overtakes native binaries is a question (it did to an extent, of course, with the JVM and what-not.) Of course you could move the JIT off the client device completely and do it at the App Store - prepare a bunch of binaries for all the devices out there - but still, the developers will have shipped IR.

Once you get to shipping IR instead of strings, which is nice I guess in that it should make initialization somewhat faster and the GPU drivers somewhat smaller, I'm not sure what you're going to get from being able to treat a CPU/GPU program as a single whole. Typically the stuff running on the GPU or any other sort of accelerator does not call back into the CPU - these are "leaf functions", optimized as a separate program, pretty much. I guess it'd be nice to be able to optimize a given call to such a function automatically ("here the CPU passes a constant so let's constant-fold all this stuff over here.") The same effect can be achieved today by creating a GPU wrapper code passing the constants and having the CPU call that, avoiding this doesn't sound like a huge deal. Other than that, what big improvement opportunities am I missing due to the CPU compiler not caring what the GPU compiler is doing and what code the GPU is running?

(Not a rhetorical question, I expect to be missing something; I work on accelerator design but I haven't ever thought very deeply about an integrated CPU+accelerator compiler, except for a high-level language producing code for both the CPU and some accelerators - but this is a different situation, and there you don't care what say OpenGL or OpenCL do, you generate both and you're the compiler and you use whatever opportunities for program analysis that your higher level language gives you. Here I think the point was that we miss opportunities due to not having a single compiler analyzing our C/C++ CPU code and our shader/OpenCL/... code as a single whole - it's this area where I don't see a lot of missed opportunities and asking where I'm wrong.)

2 comments

In CUDA you can write your device kernel as a templated function and the compiler will specialise it based on how it's called by the host. In some cases this can result in big speed-ups while keeping the code simple and maintainable. You could get the same speed from specialising the code by hand, or with a separate code generation tool; but having that ability built in to the compiler makes it very easy to use.

It's also handy for the compiler to be able to check for errors in shader invocations. Looking up parameters by name and setting them dynamically adds a whole class of potential runtime errors that don't exist in straight c++ programs.

func<a,b>(c,d) isn't faster than func(a,b,c,d) if the compiler sees that a,b are constants and the call to func is in the same file as its definition; the only thing CPU+GPU code adds here relatively to CPU-only code is, perhaps you need to write a wrapper along the lines of func_ab(c,d) { func(a,b,c,d) }, a bit ugly but the code is still pretty simple and maintainable. (And unlike the case with templates at least you can make sense of the symbol table etc. etc.)

As to runtime errors because of misspelling a parameter name, this is going to be fixed the first time you run the program and you live happily ever after; I don't mind these errors in Python very much and I don't mind it in C programs using strings to refer to variable names in whatever context that may happen.

Overall the amount of "dynamism" in CPU/GPU programming IMO does not result in wasting almost any optimization opportunities nor does it make the program significantly harder to get right, but I'm prepared to change my mind given counterexamples (certainly in Python it's really easy to demonstrate missed optimization opportunities relatively to C due to its dynamism... The amount of dynamism in CPU/GPU programming however is IMO trivial and hence the issues resulting from it are also rather trivial.)

I also don't see how you can possibly optimize GPU code based on some static CPU code but, on some game consoles, you can share data structures between GPU and CPU, which is nice. E.g. instead of binding a bunch of named parameters you just pass a pointer to a native C struct. This, apart from greatly simplifying the code is a performance advantage as well (the parameter binding is not free and consumes both CPU and GPU time). Though, I don't see how this could be possible to implement in a platform-independent way.