No changes are required in the library -- you just need to have some way of generating the code for the forward pass using e.g. cuDNN (which already has a heuristic selector!)
If I write a gemm.cl like the program above, implementing a bunch of CNN primitives, why can't I compile it with gcc?
To compile any OpenCL program, I write the kernel in a file, say sample.cl, and compile it with:
gcc -Wall sample.cl -o sample -lOpenCL
The -lOpenCL flag picks up the libOpenCL.so file from my hardware vendor (Qualcomm, Intel) and generates a binary which runs on the GPU (or wherever OpenCL is available).
Why do I need anything like the trinity compiler and optimizer?
The short answer is that there are dozens of ways to use that GEMM to do convolution (convolution algorithms), and since each layer can use a different one, there are up to NUM_ALGORITHMS ^ NUM_LAYERS ways to implement the network.
Our toolkit figures out which of those arrangements are the fast ones!
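To make the size of that search concrete, here is a rough sketch in C. It is not the toolkit's actual code or method, just a brute-force illustration. If every layer picks its algorithm independently, the number of whole-network combinations is NUM_ALGORITHMS raised to the power NUM_LAYERS, and the sketch simply tries all of them. The names `example_cost`, `switch_penalty`, `config_cost` and `find_fastest` are made up for this example, the timings are placeholders rather than measurements, and the penalty for changing algorithm between adjacent layers is an assumed stand-in for whatever makes per-layer choices interact.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

#define NUM_LAYERS 3
#define NUM_ALGORITHMS 4

/* Placeholder per-layer timings: example_cost[layer][algorithm].
   These numbers are invented for illustration, not measured. */
static const double example_cost[NUM_LAYERS][NUM_ALGORITHMS] = {
    {5.0, 3.0, 4.0, 6.0},
    {2.0, 4.0, 1.0, 3.0},
    {7.0, 2.0, 6.0, 5.0},
};

/* Number of whole-network configurations when each layer
   independently picks one algorithm: num_algorithms ^ num_layers. */
unsigned long count_configurations(unsigned num_algorithms, unsigned num_layers)
{
    unsigned long total = 1;
    for (unsigned i = 0; i < num_layers; ++i)
        total *= num_algorithms;
    return total;
}

/* Cost of one configuration: sum of per-layer costs, plus an assumed
   penalty whenever two adjacent layers use different algorithms. */
double config_cost(const double cost[NUM_LAYERS][NUM_ALGORITHMS],
                   const int choice[NUM_LAYERS], double switch_penalty)
{
    double total = 0.0;
    for (int l = 0; l < NUM_LAYERS; ++l) {
        total += cost[l][choice[l]];
        if (l > 0 && choice[l] != choice[l - 1])
            total += switch_penalty;
    }
    return total;
}

/* Try every configuration, store the cheapest in best[], return its cost. */
double find_fastest(const double cost[NUM_LAYERS][NUM_ALGORITHMS],
                    double switch_penalty, int best[NUM_LAYERS])
{
    assert(best != NULL);
    int choice[NUM_LAYERS] = {0};
    double best_cost = DBL_MAX;
    unsigned long n = count_configurations(NUM_ALGORITHMS, NUM_LAYERS);

    for (unsigned long i = 0; i < n; ++i) {
        double c = config_cost(cost, choice, switch_penalty);
        if (c < best_cost) {
            best_cost = c;
            for (int l = 0; l < NUM_LAYERS; ++l)
                best[l] = choice[l];
        }
        /* Advance choice[] like an odometer in base NUM_ALGORITHMS. */
        for (int l = 0; l < NUM_LAYERS; ++l) {
            if (++choice[l] < NUM_ALGORITHMS)
                break;
            choice[l] = 0;
        }
    }
    return best_cost;
}
```

Calling `find_fastest(example_cost, 0.0, best)` just picks the cheapest algorithm per layer, while a non-zero `switch_penalty` can change the answer because the layers are no longer independent. With 3 layers and 4 algorithms this loop visits 64 configurations; for a real network with dozens of layers, exhaustive search like this stops being practical, which is the reason to want a tool that searches more cleverly.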