I think what is going on here is that what we see as complications are actually features, but that doesn't become clear until you are operating with your NNs in production, at scale.
What you want to be able to do is control which devices (CPUs, GPUs or co-processors[1]) execute which part of your model (eg, GPU for training, co-processors for inference, who knows what else).
Yahoo released some code to deal with similar issues, but with Caffe on YARN[2].