by chrisseaton 3643 days ago
I think it's true that language implementations such as Ruby and Python spend most of their time running the C parts of the code. I did a talk saying the same thing about Ruby a couple of weeks ago, but referring to the Java code in JRuby, https://ia601503.us.archive.org/32/items/vmss16/seaton.pdf.

But this doesn't mean that a JIT is not going to help you. It means that you need a more powerful JIT which can optimise through this C code. That may mean that you need to rewrite the C in a managed language such as Java or RPython which you can optimise through (which we know works), or maybe we could include the LLVM IR of the C runtime and make that accessible to the JIT at runtime (which is a good idea, but we don't know if it's practical).

I work on an implementation of Ruby, and we make available the IR of all our runtime routines (in our case implemented in Java) to a powerful JIT, so that we can inline from the interpreter into the runtime and back again.

In the case of Python, PyPy does the same thing, allowing the JIT to optimise between the interpreter and runtime, as they're both written in RPython.

So I think the problem the Pyston project needs to solve is how to allow the JIT to see the runtime routines and optimise through them like it does with Python code.


Pypy makes your app take many times the memory for something like a 20% performance gain. Which is good, but often seems not worth the effort.
Eh...

  $ cat <<eof > float.py
  import itertools
  s = sum(itertools.repeat(1.0, 100000000))
  print(s)
  eof

  $ time python float.py 
  100000000.0

  real    0m0.602s
  user    0m0.596s
  sys     0m0.004s

  $ time python3 float.py
  100000000.0

  real    0m0.603s
  user    0m0.600s
  sys     0m0.000s

  $ time pypy float.py 
  100000000.0

  real    0m0.211s
  user    0m0.088s
  sys     0m0.004s
That's with no warmup for the pypy variant (or indeed the other python variants). Or, slightly more "robust":

  $ python -m timeit -s "import itertools as i" \
                "sum(i.repeat(1.0, 100000000))"
  10 loops, best of 3: 594 msec per loop

  $ python3 -m timeit -s "import itertools as i" \
                 "sum(i.repeat(1.0, 100000000))"
  10 loops, best of 3: 592 msec per loop

  $ pypy -m timeit -s "import itertools as i" \
              "sum(i.repeat(1.0, 100000000))"
  10 loops, best of 3: 68.2 msec per loop
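For running the same measurement from inside a script, the standard library's `timeit.repeat` is a rough equivalent of the command lines above. This is only a minimal sketch: `best_msec_per_loop` is a hypothetical helper name, and the default iteration count is reduced so that it finishes quickly (pass `n=100_000_000` to match the thread).

```python
import timeit


def best_msec_per_loop(n=1_000_000, loops=10, repeats=3):
    """Return the best per-loop time in milliseconds, as `-m timeit` reports it."""
    totals = timeit.repeat(
        stmt=f"sum(i.repeat(1.0, {n}))",
        setup="import itertools as i",
        repeat=repeats,
        number=loops,
    )
    # Each entry is the total time for `loops` executions; keep the best run.
    return min(totals) / loops * 1000.0


if __name__ == "__main__":
    print(f"10 loops, best of 3: {best_msec_per_loop():.3g} msec per loop")
```

Taking the minimum rather than the mean follows what the command-line tool does, since the fastest run is the one least disturbed by other activity on the machine.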
Pypy actually does pretty well here:

  $ cat float.cpp 
  #include<iostream>

  int main() {
    double s = 0;
    for (int i = 0; i < 100000000; ++i) {
        s++;
    }

    std::cout << s << std::endl;
    return 0;
  }

  $ g++ --std=c++14 -O3 float.cpp -o float
  $ time ./float
  1e+08

  real    0m0.237s
  user    0m0.236s
  sys     0m0.000s
Note that the C++ code uses a loop, not a lazy generator. Apparently generators may be coming in c++17 as proposal N4286.
Summing a list of numbers is easy mode for a JIT. You've got a tight loop over a single type, and it can be shown statically that the type assumption will never be violated at run time. Unfortunately, unless that's actually your workload, the speed with which a JIT-based system can add numbers is not relevant to how fast it runs in practice. Any JIT that can't tie C on that workload is broken somehow.

Personally, I think people often go quite overboard with the "benchmarks are useless" idea, but this benchmark really is useless, because it will never produce any differences between JITs and thus can't show whether one is good or bad.

> it will never produce any differences between JITs and thus can't show whether one is good or bad

It can tell you which JITs can't even manage to remove the loop, which is useful to know.

Apparently none of cpython, pypy or gcc manages to remove the loop in this case. I actually think it is interesting that this "slow" code in cpython is within [ed: ~10x] of pypy/jit/machine code (c++ probably should do better; I'm not all that familiar with gcc, so maybe -O3 isn't enough to make it unroll loops and/or vectorize).

Actually code like this arguably should be a win for a high-level language with an optimization pass; ideally the whole thing should be translated to a constant at compile-time.
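One likely reason the CPython figure lands that close: in CPython both `sum` and `itertools.repeat` are implemented in C, so the benchmark barely executes any Python bytecode. A rough sketch of the comparison, with an explicit bytecode-level loop added for contrast (the function names are made up for illustration, and `N` is smaller than in the thread so that it runs quickly):

```python
import itertools
import timeit

N = 1_000_000


def c_level_sum(n=N):
    # On CPython, iteration and addition both happen inside C builtins.
    return sum(itertools.repeat(1.0, n))


def bytecode_loop(n=N):
    # Same arithmetic, but every step is dispatched by the interpreter loop.
    s = 0.0
    for _ in range(n):
        s += 1.0
    return s


if __name__ == "__main__":
    for fn in (c_level_sum, bytecode_loop):
        seconds = timeit.timeit(fn, number=10) / 10
        print(f"{fn.__name__}: {seconds * 1000:.1f} msec per loop")
```

Both functions return the same value; only the amount of interpreter work differs. No timings are quoted here because they depend on the machine and the interpreter version.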

Ah right I think that's because the accumulator is a double. I missed that. I think it should still be possible but compilers probably don't bother.
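A small illustration of why the double accumulator matters: floating-point addition is not associative, and repeated `+ 1.0` stops being exact once the total passes 2**53. So a compiler cannot in general replace n additions with the constant n unless it first proves that the result is identical. For the 100000000 iterations used here every intermediate sum is still exactly representable, which is why folding the loop is possible in principle.

```python
# Regrouping the same three additions changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Past 2**53 a double can no longer represent every integer,
# so adding 1.0 leaves the value unchanged.
limit = 2.0 ** 53
print(limit + 1.0 == limit)  # True

# At the scale of the benchmark the additions are still exact.
print(float(100_000_000) + 1.0 == 100_000_001.0)  # True
```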
That I don't necessarily expect from a JIT in real time. I'd expect it from any half-decent optimizing compiler, but I'd expect it to likely be the result of several interacting and too-expensive-for-real-time optimizations.

I mean, if the JIT works that out, great, and if someone wants to show off performance numbers that show one can do that, I'm interested in the information, but I wouldn't in general discard a JIT for failing to notice that optimization.

Real life coding is not a loop.

Try that on a web server or an image processing box.

I have a feeling 100% of the c++ time is being spent in some silliness like setting up the locale of the ostream, because my compiler totally eliminates that loop.
Probably. I glanced at the asm to make sure the loop was still there (which it was, possibly because it loops over an int, and sums doubles?) and couldn't see anything that stood out. Still a little surprised that pypy without warmup is faster than c++ for this silly thing.

[ed: On this system, eliminating the loop by hand looks like:

  #include<iostream>
  int main() {
    double s = 100000000;

    std::cout << s << std::endl;
    return 0;
  }

  $ g++ --std=c++14 -O3 float2.cpp -o float2 \
    && time ./float2
  1e+08

  real    0m0.001s
  user    0m0.000s
  sys     0m0.000s
Just for completeness.]
On Pypy's benchmark site, the speedup is a lot higher than 20%. I usually experience a 2x-3x speedup with the kind of code I run, sometimes more.

> http://speed.pypy.org/