You're right: CryptoNets used a data layout optimized for throughput, with a batch size of 4096. Since then we've done a lot of work on low-latency inference with our CHET compiler [1], as have my colleagues with LoLa [2]. It all comes down to the data layouts you use.
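
To make the layout point concrete, here is a toy sketch in plain Python. It does no encryption and is not the CHET or LoLa code; the function names and the 28x28 input size are made up for illustration. It assumes each ciphertext has 4096 SIMD slots, and only counts how many ciphertexts a single input needs under a batch-oriented layout (one ciphertext per pixel position, one slot per image) versus an image-oriented layout (many pixels of one image per ciphertext).

```python
SLOTS = 4096  # assumed SIMD slots per ciphertext, matching the batch size above


def batch_major_layout(images, slots=SLOTS):
    """One 'ciphertext' per pixel position; slot i holds that pixel of image i."""
    if len(images) > slots:
        raise ValueError("more images than slots")
    n_pixels = len(images[0])
    packed = []
    for p in range(n_pixels):
        vec = [img[p] for img in images]
        vec += [0] * (slots - len(vec))  # slots without an image are padded with zeros
        packed.append(vec)
    return packed


def image_major_layout(image, slots=SLOTS):
    """Pack consecutive pixels of one image into as few 'ciphertexts' as possible."""
    packed = []
    for start in range(0, len(image), slots):
        vec = list(image[start:start + slots])
        vec += [0] * (slots - len(vec))  # pad the last ciphertext with zeros
        packed.append(vec)
    return packed


image = list(range(28 * 28))  # one flattened, hypothetical 28x28 input

# A single image under each layout: number of ciphertexts to process.
print(len(batch_major_layout([image])))  # 784
print(len(image_major_layout(image)))    # 1
```

With the batch-oriented layout, one image still costs 784 ciphertexts, each using 1 of 4096 slots, so the work is about the same as for a full batch. That is good for throughput and bad for latency. The image-oriented layout fits the same input in one ciphertext, at the price of needing slot rotations or similar tricks for the linear layers, which this sketch does not model.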