
In CUDA you can write your device kernel as a templated function, and the compiler will specialise it based on how the host calls it. In some cases this gives big speed-ups while keeping the code simple and maintainable. You could get the same speed by specialising the code by hand or with a separate code-generation tool, but having that ability built into the compiler makes it very easy to use.
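A minimal sketch of that kind of specialisation, written as plain C++ so it compiles without nvcc. The names `blur_row` and `blur_dispatch` are made up for illustration; in real CUDA the template would be a `__global__` kernel and the dispatch would launch it. Whether the compiler actually unrolls the inner loop depends on the compiler and flags.

```cpp
#include <vector>

// Hypothetical box-blur body. Radius is a template parameter, so each
// instantiation sees a constant inner-loop trip count.
template <int Radius>
void blur_row(const std::vector<float>& in, std::vector<float>& out) {
    const int n = static_cast<int>(in.size());
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int k = -Radius; k <= Radius; ++k) {
            int j = i + k;
            if (j < 0) j = 0;          // clamp at the left edge
            if (j >= n) j = n - 1;     // clamp at the right edge
            sum += in[j];
        }
        out[i] = sum / (2 * Radius + 1);
    }
}

// Host-side dispatch: each case names a separate specialisation, so the
// runtime value is turned into a compile-time one exactly once.
// The caller must size `out` to match `in`.
bool blur_dispatch(int radius, const std::vector<float>& in,
                   std::vector<float>& out) {
    switch (radius) {
        case 1: blur_row<1>(in, out); return true;
        case 2: blur_row<2>(in, out); return true;
        default: return false;  // radius not instantiated in this sketch
    }
}
```

Calling `blur_dispatch(1, in, out)` picks the `blur_row<1>` instantiation; a radius that was never instantiated is rejected rather than silently handled by a slow generic path.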

It's also handy for the compiler to be able to check shader invocations for errors. Looking up parameters by name and setting them dynamically adds a whole class of potential runtime errors that don't exist in straight C++ programs.
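To make the difference concrete, here is a rough sketch. `DynamicShader`, `ToneMapParams` and `tone_map` are hypothetical names, not any real graphics API: the first sets parameters by string, so a typo only shows up when that line runs; the second takes a typed struct, so a misspelt field fails to compile.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical by-name parameter interface: errors surface at run time.
struct DynamicShader {
    std::map<std::string, float> params{{"exposure", 1.0f}, {"gamma", 2.2f}};

    void set(const std::string& name, float value) {
        auto it = params.find(name);
        if (it == params.end())
            throw std::invalid_argument("unknown shader parameter: " + name);
        it->second = value;
    }
};

// Hypothetical typed interface: the compiler checks names and types.
struct ToneMapParams {
    float exposure;
    float gamma;
};

inline float tone_map(float x, const ToneMapParams& p) {
    return x * p.exposure;  // gamma left unused to keep the sketch short
}
```

With the first interface, `s.set("gama", 1.0f)` compiles and throws only when executed; with the second, writing `p.gama` is rejected by the compiler.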

func<a,b>(c,d) isn't faster than func(a,b,c,d) if the compiler can see that a and b are constants and the call to func is in the same file as its definition. The only thing CPU+GPU code adds here relative to CPU-only code is that you may need to write a wrapper along the lines of func_ab(c,d) { func(a,b,c,d) }. That's a bit ugly, but the code is still pretty simple and maintainable. (And unlike the case with templates, at least you can make sense of the symbol table, etc.)
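Spelled out as a sketch, with an arbitrary body and constants chosen only for illustration (`func_t` is a made-up name for the template variant). Whether the two forms really compile to the same code depends on the optimiser inlining `func` into the wrapper, which this example does not verify.

```cpp
// Runtime-parameter version: a and b are ordinary arguments.
inline int func(int a, int b, int c, int d) {
    return a * c + b * d;
}

// Template version: A and B are fixed at compile time.
template <int A, int B>
inline int func_t(int c, int d) {
    return A * c + B * d;
}

// Wrapper in the func_ab style: the constants are visible at the call
// site, so an inlining compiler can fold them just as it would for
// the template.
inline int func_ab(int c, int d) {
    return func(2, 3, c, d);
}
```

Both `func_ab(4, 5)` and `func_t<2, 3>(4, 5)` compute the same value; the wrapper just keeps an ordinary, readable symbol name.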

As for runtime errors caused by misspelling a parameter name: these get fixed the first time you run the program, and you live happily ever after. I don't mind these errors much in Python, and I don't mind them in C programs that use strings to refer to variable names, in whatever context that may happen.

Overall, the amount of "dynamism" in CPU/GPU programming IMO wastes almost no optimization opportunities, nor does it make the program significantly harder to get right, but I'm prepared to change my mind given counterexamples. (Certainly in Python it's really easy to demonstrate optimization opportunities missed relative to C because of its dynamism. The amount of dynamism in CPU/GPU programming, however, is IMO trivial, and hence the issues resulting from it are also rather trivial.)

I also don't see how you can possibly optimize GPU code based on some static CPU code. But on some game consoles you can share data structures between GPU and CPU, which is nice: e.g. instead of binding a bunch of named parameters, you just pass a pointer to a native C struct. Apart from greatly simplifying the code, this is a performance advantage as well (parameter binding is not free and consumes both CPU and GPU time). I don't see how this could be implemented in a platform-independent way, though.
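Roughly what the struct approach looks like, as a sketch only: `DrawParams`, `upload` and `read_back` are hypothetical, and a byte vector stands in for GPU-visible memory. On a real platform the struct layout (padding, alignment) would have to match what the GPU side expects, which is probably a large part of why this is hard to do portably.

```cpp
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical parameter block shared by CPU and GPU code.
struct DrawParams {
    float tint[4];
    float exposure;
    int   frame;
};
static_assert(std::is_trivially_copyable<DrawParams>::value,
              "parameter block must be safe to copy byte-for-byte");

// One copy of the whole block replaces a series of by-name bind calls.
// The vector is a stand-in for memory the GPU could read.
inline std::vector<unsigned char> upload(const DrawParams& p) {
    std::vector<unsigned char> buf(sizeof(DrawParams));
    std::memcpy(buf.data(), &p, sizeof(DrawParams));
    return buf;
}

// What the consuming side would do: reinterpret the bytes as the struct.
// Assumes buf holds at least sizeof(DrawParams) bytes.
inline DrawParams read_back(const std::vector<unsigned char>& buf) {
    DrawParams p;
    std::memcpy(&p, buf.data(), sizeof(DrawParams));
    return p;
}
```

The point is that there is no per-parameter lookup at all: the CPU fills in ordinary fields, and the whole block moves in one step.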