On AWS this was using nvidia-docker with the TensorFlow Docker images. Probably, the AWS AMI Deep Learning gives very similar performance (with same versions of CUDA, TensorFlow etc.). There's only so much you can tweak if the GPU itself is the bottleneck...