how tied are the compiler aspects to the specific kernels (eg for convolutional layers)? Could load balancing and fusion logic be broken out into a library which could work for user defined kernels?
They aren't tied at all -- in fact the optimizer is a totally separate project (triNNity-optimizer) that just does graph optimization. You can add user defined kernels as long as you have some way of microbenchmarking them!