The Tesla K40 has a peak double-precision performance of ~1.4 TFLOPS. Each SMX has 64 DP units and four warp schedulers, so four warps can be scheduled per SMX per cycle. Since a warp is 32 threads, the 64 DP units mean that at most two warps per SMX can be executing double-precision instructions at the same time. But that number is not very interesting. The memory bandwidth, on the other hand, is: the K40's GK110 has a peak of 288 GB/s. Take your code, work out its arithmetic intensity (FLOPs per byte moved to or from memory), and multiply by the bandwidth to get an upper bound on your performance, assuming you are memory bound, of course.
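As a rough sketch of that estimate, the snippet below computes the peak DP rate and the bandwidth-limited bound for a DAXPY-style loop (`y[i] = a * x[i] + y[i]`). The 15 SMX count and the 745 MHz base clock are assumptions taken from the K40's published specs, not from the text above. The function names are made up for illustration, and the DAXPY byte count assumes no cache reuse.

```python
def peak_dp_gflops(num_smx=15, dp_units_per_smx=64, clock_ghz=0.745):
    """Peak DP rate in GFLOP/s, counting one FMA as 2 FLOPs.

    num_smx and clock_ghz are assumed K40 values (base clock, no boost).
    """
    return num_smx * dp_units_per_smx * 2 * clock_ghz


def roofline_gflops(intensity, peak_gflops, bandwidth_gbs=288.0):
    """Upper bound on performance for a given arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


peak = peak_dp_gflops()

# DAXPY per element: 2 FLOPs (multiply + add), 24 bytes of traffic
# (read x, read y, write y, 8 bytes each), assuming no cache reuse.
daxpy_intensity = 2 / 24

bound = roofline_gflops(daxpy_intensity, peak)
ridge = peak / 288.0  # intensity above which the code stops being memory bound

print(f"peak DP:      {peak:.1f} GFLOP/s")
print(f"DAXPY bound:  {bound:.1f} GFLOP/s")
print(f"ridge point:  {ridge:.2f} FLOP/byte")
```

With these assumed numbers the DAXPY bound comes out at about 24 GFLOP/s, which is under 2% of the ~1.4 TFLOPS peak. That is why the bandwidth figure matters more than the DP unit count for code like this.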