Latency in training literally does not matter. You care about throughput. In serving, where latency matters, most DL frameworks allow you to serve the model from a highly optimized C++ binary, no python needed.
quote is 'data spend 2500 ms on the wire'. That's not latency. For a nice 10GbE connection, that's optimistically 3 GB or so worth of data. Do you have 3GB of training data? Then it will spend 2500 ms on the wire to distribute to all of your nodes as part of startup.
The poster you are replying to is 100% correct.