$ cat <<eof > float.py
import itertools
s = sum(itertools.repeat(1.0, 100000000))
print(s)
eof
$ time python float.py
100000000.0
real 0m0.602s
user 0m0.596s
sys 0m0.004s
$ time python3 float.py
100000000.0
real 0m0.603s
user 0m0.600s
sys 0m0.000s
$ time pypy float.py
100000000.0
real 0m0.211s
user 0m0.088s
sys 0m0.004s
That's with no warmup for the pypy variant (or indeed the other python variants). Or, slightly more "robust":
$ python -m timeit -s "import itertools as i" \
"sum(i.repeat(1.0, 100000000))"
10 loops, best of 3: 594 msec per loop
$ python3 -m timeit -s "import itertools as i" \
"sum(i.repeat(1.0, 100000000))"
10 loops, best of 3: 592 msec per loop
$ pypy -m timeit -s "import itertools as i" \
"sum(i.repeat(1.0, 100000000))"
10 loops, best of 3: 68.2 msec per loop
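For anyone who wants to repeat the measurement from inside a script, here is a minimal sketch using the timeit module's Python API. The helper name best_sum_time and the smaller default count are illustrative assumptions, not part of the runs above, and the numbers it prints will depend on the machine and interpreter.

```python
import itertools
import timeit


def best_sum_time(n=1_000_000, repeats=3, loops=10):
    """Return (total, best seconds per loop) for summing n copies of 1.0.

    Hypothetical helper: mirrors `python -m timeit` by taking the best of
    `repeats` runs, each of which calls the statement `loops` times.
    """
    def stmt():
        return sum(itertools.repeat(1.0, n))

    # timeit.repeat returns one total time (for `loops` calls) per repeat.
    timings = timeit.repeat(stmt, repeat=repeats, number=loops)
    return stmt(), min(timings) / loops


if __name__ == "__main__":
    total, best = best_sum_time()
    print(total, "%.1f msec per loop" % (best * 1e3))
```

Passing n=100000000 reproduces the workload timed above, at the cost of a much longer run.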
PyPy actually does pretty well here:
$ cat float.cpp
#include <iostream>
int main() {
    double s = 0;
    for (int i = 0; i < 100000000; ++i) {
        s++;
    }
    std::cout << s << std::endl;
    return 0;
}
$ g++ --std=c++14 -O3 float.cpp -o float
$ time ./float
1e+08
real 0m0.237s
user 0m0.236s
sys 0m0.000s
Note that the C++ code uses a loop, not a lazy generator. Apparently generators may be coming in C++17 as proposal N4286.
Summing a list of numbers is easy mode for a JIT. You've got a tight loop with one type, and it can be statically shown that the type assumption will never be violated at run time. Unfortunately, unless that's actually your workload, the speed with which a JIT-based system can add numbers is not relevant to how fast it runs in practice. Any JIT that can't tie C on that workload is broken somehow.
Personally, I think people often go quite overboard with the "benchmarks are useless" idea, but this benchmark really is useless, because it will never produce any differences between JITs and thus can't show whether one is good or bad.
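To make the "one type" point concrete, here is a rough sketch that puts the benchmark's loop next to a hypothetical variant whose operand types vary from one iteration to the next. The function names and the choice of int, float and Fraction are assumptions for illustration only. Nothing here was timed, so it makes no claim about how any particular JIT handles the second case.

```python
import itertools
from fractions import Fraction


def uniform_sum(n):
    """The benchmark's shape: every addend is a float."""
    return sum(itertools.repeat(1.0, n))


def mixed_sum(n):
    """Hypothetical contrast: addends cycle through int, float and Fraction,
    so the addition has to dispatch on a different operand type each time."""
    values = itertools.cycle((1, 1.0, Fraction(1, 1)))
    return sum(itertools.islice(values, n))


if __name__ == "__main__":
    print(uniform_sum(9), mixed_sum(9))
```

Both functions compute the same value. The difference is only in how much type information the loop body can take for granted.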
Apparently none of CPython, PyPy, or gcc manages to remove the loop in this case. I actually think it is interesting that this "slow" code in CPython is within [ed: ~10x] of PyPy/JIT/machine code (C++ probably should do better; I'm not all that familiar with gcc, and maybe -O3 isn't enough to make it unroll loops and/or vectorize).
Actually code like this arguably should be a win for a high-level language with an optimization pass; ideally the whole thing should be translated to a constant at compile-time.
That I don't necessarily expect from a JIT in real time. I'd expect it from any half-decent optimizing compiler, but I'd expect it to likely be the result of several interacting and too-expensive-for-real-time optimizations.
I mean, if the JIT works that out, great, and if someone wants to show off performance numbers that show one can do that, I'm interested in the information, but I wouldn't in general discard one for failing to notice that optimization.
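The constant-folding idea discussed above can be checked by hand. This is a small sketch under the assumption of IEEE 754 doubles (which is what Python floats are on common platforms); the names looped_sum, folded_sum and EXACT_LIMIT are made up for illustration. It shows that the loop and the constant agree for counts like the benchmark's, and where that agreement would stop.

```python
import itertools

# Doubles represent every integer up to 2**53 exactly, so each partial sum
# of n ones is exact as long as n stays at or below this limit.
EXACT_LIMIT = 2 ** 53


def looped_sum(n):
    """Add 1.0 to an accumulator n times, as the benchmark does."""
    return sum(itertools.repeat(1.0, n))


def folded_sum(n):
    """What the loop reduces to if it is folded into a constant."""
    return float(n)


if __name__ == "__main__":
    n = 100000  # smaller than the benchmark's count to keep the check quick
    print(looped_sum(n) == folded_sum(n))
    # Past the limit, adding 1.0 no longer changes the accumulator.
    print(float(EXACT_LIMIT) + 1.0 == float(EXACT_LIMIT))
```

The benchmark's count of 100000000 is far below 2**53, so folding the loop into the constant would not change the printed result.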
I have a feeling 100% of the c++ time is being spent in some silliness like setting up the locale of the ostream, because my compiler totally eliminates that loop.
Probably. I glanced at the asm to make sure the loop was still there (which it was, possibly because it loops over an int, and sums doubles?) and couldn't see anything that stood out. Still a little surprised that pypy without warmup is faster than c++ for this silly thing.
[ed: On this system, eliminating the loop by hand looks like:
#include <iostream>
int main() {
    double s = 100000000;
    std::cout << s << std::endl;
    return 0;
}
$ g++ --std=c++14 -O3 float2.cpp -o float2 \
&& time ./float2
1e+08
real 0m0.001s
user 0m0.000s
sys 0m0.000s