by WithinReason
1327 days ago
That's cool. Am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low-level primitives? You would still need a piece of sorcery that can produce high-performance OpenCL code, right? That code could be different for every device, so you would need some trial-and-error, benchmark-based compilation at the very least. Or would the OpenCL code be written by hand for each device?
I'm working on parameterizing a search space that includes more than just the local group size. The end dream is some ML-guided search to optimize the kernels :)
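To make the "benchmark-based" idea concrete, here is a rough sketch of what a parameterized search could look like. Everything in it is an assumption for illustration: only the local group size comes from this thread, while `unroll`, `vector_width`, and the `synthetic_cost` function are hypothetical stand-ins (a real tuner would compile and time an actual kernel on the target device instead).

```python
import itertools

# Hypothetical tunable parameters. Only "local_size" is mentioned in the
# thread; "unroll" and "vector_width" are made up for illustration.
SEARCH_SPACE = {
    "local_size": [16, 32, 64, 128, 256],
    "unroll": [1, 2, 4, 8],
    "vector_width": [1, 2, 4],
}


def synthetic_cost(config):
    """Stand-in for timing a kernel on a device (the numbers are invented).

    A real version would build the kernel with these parameters, run it
    several times, and return the measured runtime.
    """
    return (
        abs(config["local_size"] - 64) / 64
        + abs(config["unroll"] - 4) / 4
        + abs(config["vector_width"] - 2) / 2
    )


def candidates(space):
    """Yield every combination of parameter values as a dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def tune(space, cost_fn):
    """Exhaustive search: return the configuration with the lowest cost."""
    return min(candidates(space), key=cost_fn)


best = tune(SEARCH_SPACE, synthetic_cost)
print(best)
```

Exhaustive search only works while the space is tiny (60 combinations here). Once more parameters are added the product blows up, which is presumably where a guided search would replace `tune`.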