Hacker News new | ask | show | jobs
by gergo_barany 2953 days ago
Interesting work! I have a bunch of comments and questions.

> The goal of the thesis was to evaluate language features of Python that were hypothesized to cause performance issues.

In another life I did something similar using a similar compiler simulation technique, looking at other Python features like redundant reference count operations, boxing of numbers, dynamic type checks etc. See G. Barany, Python Interpreter Performance Deconstructed. Dyla'14. http://www.complang.tuwien.ac.at/gergo/papers/dyla14.pdf

After obtaining the numbers in that paper, the work didn't really go anywhere; there were no really obvious optimizations to try based on the data. But it was fun!

Anyway, questions:

1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?

I ask this in particular because if I understand your thesis correctly, you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value. That's a lot faster. And you would get it automatically if you started from bytecode. (Plus, it would be easier to parse, but if you have fun parsing stuff, that's reasonable too.)

2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?

3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.

4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?

4 comments

Regarding the bytecode - it was always considered an internal implementation detail subject to change (unlike the JVM bytecode) and in 3.6 they have in fact made a fairly major change[1]:

"The Python interpreter now uses a 16-bit wordcode instead of bytecode which made a number of opcode optimizations possible."

They haven't been shy about changing it in the past either, since there's no guarantee of stability, so it's likely to continue to change in incompatible ways.

[1] https://docs.python.org/3/whatsnew/3.6.html#optimizations

Good point, and shows that I haven't kept up with the latest developments here :-) Though in the context of a one-off research project, I still think picking a concrete Python version and using its byte/wordcode is the better choice. It won't be an enterprise-quality project with a long lifetime, but neither is the author's solution of parsing (a subset of) Python manually.
> you look up local variables in hash tables every time they are referenced. This is not what Python does: It maps variable names to integer indices during compilation to bytecode, and the bytecode just takes those embedded constant indices and indexes into an array to obtain a local variable's value.

This is only true for function arguments right? Module level bindings and class and object attributes are looked up in dictionaries. I think the same for variables used in closures too?

And for any local variables, not just arguments.
> Module level bindings and class and object attributes are looked up in dictionaries.

Depends on __slots__, yes?

> the work didn't really go anywhere;

That's really too bad.

> there were no really obvious optimizations to try based on the data.

Is that because Python already is the way it is? In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?

Asking for a friend ;-)

> Is that because Python already is the way it is?

Partly, yes. But also because the optimizations I would have come up with were all tried by Stefan Brunthaler before. See some citations in the paper, in particular to: http://arxiv.org/abs/1310.2300 where he claims a 4x speedup over CPython using purely interpreter-based and safe optimizations. So I guess I should have written "no really obvious and new optimizations to try".

Anyway, Stefan's work in general is very interesting for implementors of dynamic language interpreters. A lot of it harks back to things that were first developed for Smalltalk, which I think is where you are coming from?

> In other words, if you started from scratch, how would you design a language differently so that it doesn't run into these issues?

I don't think there would be a need to start from scratch. Python 3 is a pretty good language! Also, it's not clear that it needs to be fast, it's just fun to think about how it might be fast.

That said, I think there is only one key feature missing to enable aggressive optimizations: An annotation on functions/methods that says that the function may be optimized even at the cost of some loss in reflective capabilities. In current Python any function's code can be inspected and changed at run time, as can its local variables by callees who can walk up the call stack. This means that every optimization that compiles Python code to machine code or even just changes the interpreter's handling of local variables can, in general, change the program's semantics.

I think it would suffice to add a @static decorator to disable these reflective features for individual functions. (And disabling dynamic lookup of globals would be good too.) The interpreter could then recognize that you want those optimized and could do whatever magic it likes without caring about boring corner cases like invalidating its optimizations if someone tries to patch the function's code at runtime.

This would not be a big thing to do, and there would be no real need to restart from scratch; Python 3 is a pretty good programming language!

Everything else would pretty much fall out automatically from that using known/already implemented techniques, especially with the type hints that would allow you to gradually move a fully dynamic, interpreted application to a statically typed, compiled one function by function. Such a Python would still not win the speed crown, but it could beat the current one by miles on a lot of applications.

That work is great!

> We have presented the first limit study that tries to quantify the costs of various dynamic language features in Python.

This is spot on what we were doing as well, that's great to have this as a reference.

> 1. If I understand the source on GitHub correctly, you parse Python source code yourself. I'm fairly sure your simulation would be a lot more faithful if you compiled Python bytecode instead. Did you consider this, and if yes, was there a particular reason not to do it that way?

We did not consider this actually. This would be a very interesting concept to explore. For the unoptimized version of Cannoli we do look up variables in a list of hash tables (which represent the current levels of scope). We did perform a scope optimization that then uses indices to access scope elements and this was much faster. However, it meant that the use of functions like `exec` and `del` were no longer permitted since we would not be able to statically determine all scope elements at run time (consider `exec(input())`, this could introduce anything into scope and we can't track that).

If you know, how does CPython resolve scope if it maps variable names to indices? In the case of `exec(input())` and say the input string is `x = 1`, how would it compile bytecode to allocate space for x and index into the value? I don't have much experience with the CPython source, so please excuse me if the question seems naive :)!

> 2. Where do you actually make useful use of Rust's static ownership system? I've only skimmed that part of the thesis very quickly, but I missed how you track ownership in Python programs and can be sure that things don't escape. Can you give an example of a Python program using dynamic allocation that your compiler maps to Rust with purely static ownership tracking and freeing of the memory when it's no longer used?

Elements of the Value enum (that encapsulates all types) relied on `Rc` and `RefCell` to defer borrow checking to run time. Consider a function who has a local variable that instantiates some object. Once that function call has finished Cannoli will pop that local scope table and all mappings will be dropped when it goes out of scope. The object encapsulated in a `Rc` will have it's reference count decremented to 0 and be freed.

This is how I've interpreted the Rust borrow checker, I will say that this was the first time I had ever used Rust so it's possible that I am not completely right on this. But once that table goes out of scope, all elements should be dropped by the borrow checker and any Rc should be decremented/dropped.

> 3. Related to 2: Why bother with any notion of ownership at all? Did you try mapping everything to Rust's reference counting and just letting it do its best? I'm wondering how much slower that would be. Python is also reference counted, after all, and I guess the Rust compiler should have more opportunities to optimize reference counting operations.

I did defer a lot of borrow checking to run time with Rc, but I tried to use this as little as possible to maximize optimizations that may result from static borrow checking.

> 4. In general, do you have an idea why your code is slower than Python, besides the hash table variable lookup issue I mentioned above?

If you remove the 3 outlier benchmarks (that are slow because of Rust printing and a suboptimal implementation of slices), Cannoli isn't too far off from CPython. And in fact, with the ray casting benchmark, Cannoli began to outperform CPython at scale. This leads me to believe that the computations in Cannoli are faster than CPython. However, there is still a lot of work to do to create a more performant version of Cannoli. The compiler itself was only developed for ~4 months, I have no doubt that more development time would yield a better results.

That being said, I think the biggest slowdown comes from features of Rust that might not have been utilized. This is just speculation, but I think the use of lifetimes could benefit the compiled code a lot. I also think there may be more elegant solutions to some of the translations (e.g. slices), that could provide speedup. But I can't say that there is one thing causing the slowdown, and profiling the benchmarks (excluding the outliers) support that.

Thanks for your answers!

> This leads me to believe that the computations in Cannoli are faster than CPython.

Right, the speed difference due to output buffering seems annoying. I wonder if you can change the stdout buffering mode to be more similar to Python's (if Python's stdout is not line buffered, I don't recall).

Also, you might want to try to run the same benchmarks PyPy uses: http://speed.pypy.org/

These are still micro-ish benchmarks, but not quite as micro as some of yours.

> However, there is still a lot of work to do to create a more performant version of Cannoli.

I saw in another post that you are going off to industry to do something else, but if you will continue development and evaluation of Cannoli and want to discuss further, feel free to drop me a line at the email address in the paper linked above.

In Python 2, the bytecode optimization that lets you access variables by index is turned off if your function has exec code in it that may modify locals.

So if your function is:

   def foo(): exec "a=1"; return a
Then running dis.dis on foo to disassemble the bytecode it you will see:

              8 LOAD_NAME                0 (a)
while you normally would see:

              6 LOAD_FAST                0 (a)
You can't use the exec in locals thing in Python 3 at all I believe.
I just tested it in Python3:

    def foo():
        exec("a=1")
        return a

    print(foo())
Fails with a NameError:

    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        print(foo())
      File "test.py", line 3, in foo
        return a
    NameError: name 'a' is not defined
I get the same error with your example, but this works fine (Python 3.6.4):

    exec("a = 1")
    print(a)
This will print "1".
That is because the exec runs in the global scope. When Python sees a variable that is not assigned to in the local scope, it is assumed to be a global variable, so when exec creates a new local variable, the load still fails because it looks into the globals dictionary.

But you can do this:

  def foo():
    exec("a = 1")
    return locals()['a']
Ahh, I see, this makes more sense. Thanks for the clarification!