Hacker News new | ask | show | jobs
by DannyBee 4864 days ago
Speaking as a compiler guy, and having a hand in a few successful commercial JITs: The only reason he thinks they aren't slow is because they haven't yet reached the limits of making the JIT faster vs the program faster. Yes, it's true that the languages are not slow in the sense of being able to take care of most situations through better optimization strategies. As a compiler author, one can do things like profile types/trace/whatever, and deoptimize if you get it wrong. You can do a lot. You can recognize idioms, use different representations behind people's back, etc.

But all those things take time that is not spent running your program. On average, you can do pretty well. But it's still overhead. As you get farther along in your JIT, optimization algorithms get trickier and trickier, your heuristics, more complex. You will eventually hit the wall, and need to spend more time doing JIT'ing than doing real work to make optimizations to some code. This happens to every single JIT, of course. This is why they try to figure out which code to optimize. But even then, you may find there is too much of it.

Because of this, the languages are slower, it's just the overhead of better JIT algorithms, not slower code. In practice, you hope that you can optimize enough code well enough that nobody cares, because the ruby code takes 8ms, and the C code takes 5ms.

For example: Almost all of the allocations and copying can be optimized, but depending on the language, the algorithms to figure out what you can do safely may be N^3.

Also, PyPy is still pretty young in its life cycle (in this iteration of PyPy:P) for folks to say that they can make stuff much faster if they only had a few things. It really needs a very large set of production apps being rin by a very large set of folks for quite a while to see where the real bottlenecks still are. Past a certain point, you run out of optimization algorithm bullets. The way compilers get the last 20% is by tuning the algorithms for 10 years.

Of course, i'm not trying to slag on PyPy, I think they've done an amazing job of persevering through multiple rewrites to get somewhere that seems to be quite good now. I just am a little wary of a fairly young JIT saying that all big performance problems fall into a few categories.

6 comments

Interesting point of view, the problem in compiler construction is well known ("Proebsting's law", though it says it's more like 18 years instead of 10.)

The issue with benchmarks is surely well known, also by the PyPy authors; I wonder what the biggest application is that they have benchmarked or that runs on PyPy.

Your point on the JIT compiler interrupting program execution is certainly valid, too, but not necessarily so. One could easily do the code generation in a separate background thread and let execution switch over only if necessary. But, as you have already said, a latency issue certainly exists. This is one of the cases where interpreters usually have a leg up, and there are promising ways of optimizing interpreters.

Yes, you could do a background thread, with some caveats:

1. On most current CPU's, this will cause really bad cache/memory thrashing, enough to probably impact the program.

2. This may actually cause significant slowdown, depending on how long it takes to optimize a given set of code (IE it may be better to spend 100ms paused optimizing than 5000ms in the background). This is, of course, a latency issue.

3. State of the art for most JIT's is still to use one thread. The number of folks doing actual parallel code generation is nil. So sadly, even if you had 4 cores, 3 empty, you'll still, at best, get to use one of them for the background thread doing the optimizing. There are parts that are trivial to parallelize if you've structured the JIT "right", but they aren't always the parts that are slow.

Background compilation in a separate thread actually works pretty well. IE9 has been shipping it with Chakra for a while, and Firefox is now getting it (and it improved the benchmarks a lot, especially on ARM).
Good to hear it's gotten better. Admittedly, I wasn't thinking about browser based JITs when I said that :)

I'm actually curious if you have any stats on how much of the time this is being done on actual busy machines where it's going to compete for L1/etc resources vs how often it's able to be offloaded onto an otherwise empty core.

IE i expect their to be a significant difference in the use cases for JIT's like PyPy, which are probably going to sit on shared servers that folks are trying to maximize utilization of, vs desktops where I imagine most browsing probably doesn't use all cores at 100%.

> Admittedly, I wasn't thinking about browser based JITs when I said that :)

Don't HotSpot and JRockit also do background (de)compilation & swapping of generated code?

Yes, but in hotspot's case I cannot remember if it is actually turned on in both "server" and "client"
ad 1) Hm, this seems to be a good point, but what's with the following line of thinking: some thread A interprets a program P, while another thread B compiles P to native machine code (P'). Now, if another thread C would start executing P' (taking the data/snapshot from A), then C's caches should build up and remain accurate. Of course, if this happens too often, then the caching behavior will be shitty. I always wondered (based on my interest in interpretation), how much I-cache misses the instruction cache flushes after inline-caching in native machine code cause. (If you have some data on that, please let me know.)

ad 3) I am well aware of that. However, I remember that at PLDI'11 there was a talk from Univ. of Edinburgh chaps doing parallel trace-based dynamic binary translation. Obviously, DBT is less work than a high level, full-blown JIT, but at least it's not nil :)

I wonder what the biggest application is that they have benchmarked or that runs on PyPy.

speed.pypy.org has benchmark info on Django, Twisted and some other large, non-trivial codebases.

I know these from the site, and have looked at the Django benchmark that's listed there. I think it's a rather small benchmark that does exercise lots of Django internals (but that benchmark comes from Unladden Swallow). I don't know for the twisted ones, though.

What I actually wanted to know, what the biggest application is, i.e., a not benchmark.

We're in no shortage of production code. The problems are quite trivial - two people doing the same think might have different stuff in mind (so you cannot have a heuristic that works for everyone) and the fact that we're running on a shoestring budget compared to other JIT-for-dynamic-language projects. We don't even have 3 people full time.

As for the "single thing" - it's just the next thing on the infinite list of things that can be optimized better. Having a better assembler backend would be good (but smaller) win, etc. etc. For now and for quite a bit in the future, it's clear what to do to make X Y or Z faster.

I know it's for a simpler core language, but LuaJIT is quite a shoestring project too, with one main developer. Yet LuaJIT posts impressively fast results, even though Lua makes no distinction between objects and hash tables. In this and other ways, Lua can be compared to Javascript...but it does have a fast, high-quality JIT implementation, despite not a ton of money or even mindshare. I'm curious why you think this is (or if you think the facts are different).

P.S.: "hash_set" in the slides should be "hash_map."

I know, Mike is really good. However, this sort of approach does not scale toward larger teams hence the limits of what sort of language you can potentially implement. Lua is much much simpler than Javascript, which is again simpler than Python (Python is really vast). Also, can you find two Mikes?
Lua does not seem simpler than javascript in any way that would affect a JIT compiler.

Lua is a fairly simple language for the user, in the sense that it uses a few general mechanisms rather than a plethora of more specialized ones. However this "simplicity" can be misleading because these general mechanisms are very powerful, and can generally be used to implement most things other languages have specialized abstractions for. This sort of powerful and general mechanism is actually fairly hard to write an optimizing compiler for, because there's little stated explicitly about what's going to happen at runtime, and few obvious constraints about what's allowed to happen.

The reason LuaJIT (and most modern JIT compilers, although LuaJIT seems better than average) is so fast is because it uses very very local (in terms of both location and time) runtime context to know when to specialize operations that are conceptually much richer, e.g., "this number is really an integer" (Lua does not have a separate integer type), or "I know what function/operator will be called here (even though it conceptually dispatches through a table or metatable), so I can inline it or use a direct call."

I'm less sure about python (never used it much), but almost all the differences seem be user-facing ones -- large amounts of syntax-sugar for things that Lua offers sufficient mechanism for, but no built-in UI (in the programming-language sense). As it's the power/complexity of the mechanism that matters, not the UI, such user-facing richness doesn't really affect a JIT compilers.

I claim you're wrong. I've been working on a range of JIT compilers, but the PyPy is the biggest one. The problem is not "how hard the semantics are", but "how much of the semantics you need to support", especially that interaction between them is a headache. Lua does not have the equivalent of where statement in JS (albeit most JIT compilers just don't optimize it) for example.

I won't argue about the JS (although DOM interaction comes to mind as a big headache), but in Python syntax is trivial. It's all the semantics and not just how they're, but how much of it. descriptors, metaclassses, new/old style classes, tons of builtin types, tons of builtin modules, all of it the user will expect to seamlessly integrate with the JIT compiler.

Mike Pall + Lars Bak = JIT Dream Team :)
I develop a compiler for ActionScript 3. Though it’s not a great language, it does have a few distinct advantages over its cousin JavaScript. First, type annotations. Obviously the more knowledge you have about the structure of the program, the better you can help it run well. Having a class model instead of a prototype model also helps—much as I like working with prototypes—because you can easily eliminate late binding (“hash lookups”) and allocations, two of the performance killers mentioned in the slides. The operative word there is easily. You can analyse and JIT all you want, but you cannot feasibly solve at runtime the fundamental constraints imposed by the language.

Say a compiler author needs twice as much effort and cleverness to make programs in language X run fast than for language Y. That means that—all other things being equal—implementations of X will be twice as slow as Y for the same basic quality of implementation.

> For example: Almost all of the allocations and copying can be optimized, but depending on the language, the algorithms to figure out what you can do safely may be N^3.

... and this is pretty much his point: He can keep optimizing, but the moment you start passing complex objects around and copying them all over the place, instead of passing raw buffers around and operating on them in place, you've massively raised the bar in terms of the complexity of the necessary optimizations needed.

.) Just wondering, are there any languages/runtime-systems based on the idea of a fixed memory layout? No heap just "preallocated" buffers? I remember that was / probably still is quite common in the embedded world. I guess it is done easily with globals in C

.) Any profilers providing information reg. the heap-allocs as part of the execution costs?

.) Any Runtimes / VM actually optimizing the layout of those omni-present List/Array/Hashtable/Bag/Set/Dict of MyObjectTypes to have elements laid out as close as possible in memory? (The position of the actual objects that is not only the pointers to the objects within the containers)

Are we "safe" as heap allocs / gc / memory handling is really of no significant impact compared to other issues? Or is memory handling an large part of the "What Andy giveth, Bill taketh away." story?

.) At our workplace, we use preallocation for almost everything. When we do dynamic allocations, typically it is from free-lists from pools that we pre-allocate.

Almost all of our allocation/freeing costs are cheap O(1). Pretty much everything is zero-copy. That is why I cringe a little when I hear that GC's are "faster than manual MM". Because manual MM done well can eliminate almost all of the costs of memory management.

I also cringe a little when I see "malloc" deep inside C functions.

I really like the basic convention in C that you should pass needed allocations in as arguments (and it is virtually always possible) allowing these allocations to be members of of members of other allocations, aggregating allocations so there are much fewer.

.) Information regarding heap-allocs is relatively easy to obtain with "oprofile", look up the time spent in "malloc" and "free" and the related functions underneath them.

.) C makes it very easy to use an array of the values you want, rather than an indirection. For Lists, I don't think that is a useful idea. For hash tables, given that many entries are empty, I think it is a better trade-off to have your hash array as an indirection, though the buckets could be explicitly put together. And so on.

I often encountered dynamic allocation overheads are noticeable or even huge chunks of my runtime when I had to work with code that used them. Reducing the dynamic allocation really helps. We're not safe, we need to work towards it explicitly.

No heap? Hmmm... As the slidedeck pointed out, much time is spent in both allocation and copying of memory, and in practice those things are most often strings. Given that 1) the languages under consideration declare strings as immutable, and 2) the strings enter the system at fixed points (e.g. source code or external inputs), and 3) exit at fixed points (e.g. sent out a socket).... Imagine a system that does not allow heap allocation or memory copying at all. Just doesn't allow them!

    var structure = "dog" + "house"
    print structure
does not need to allocate or move memory. That's how the script languages do it now, but should they? All the really need to do is print "dog" and then print "house", there's nothing that says those 8 letters need to be ever adjacent in memory.

Since it's so darn easy (and sloppy) to call malloc() and memcpy() the script-engine writers are going to keep doing so until they're given a MacrocosmicGod-like environment that won't allow malloc() or memcpy() and so will have to come up with a clever workaround. maybe it will be something like linked-lists of immutable blocks of memory? Something like ropes? Something smarter than that?

Anyway, script-engine writers, please forget that malloc() and memcpy() exist, and see what you come up with. It'll be wonderful. (P.S. I was in the script-engine business for a dozen years and didn't solve the problem, but that's just because I'm not smart enough.)

So does the JIT optimization algorithms depend on popular conventions and patterns of how people write code with a language? Like in JS theres bad language features people avoid and certain patterns on how to code something. If such things changed would the optimizations start failing? I guess I'm wondering if speed is somewhat related to trends in that language.
Once you get past generic dataflow based optimizations that have some notion of optimality[1], yes, you start to write optimizations that target idioms.

For example: Most compilers started doing structure layout and reorg optimizations (transforming structures with arrays into arrays of structures, and vice versa) to tackle specific benchmarks. In some cases, they discovered it also was useful generally.

But whatever the benchmarks are, that's often what gets targeted. You can't optimize in a vacuum, you need to know what you are trying to develop optimization algorithms to do. This usually occurs by taking user complaints/programs/whatever, finding out why they are slow, seeing if there is a common solution, and developing an optimization to do it.

So yeah, if you are targeting certain patterns, and the patterns change, ..

[1] IE You've removed all possible redundant computations, made it so everything is only computed among paths it is actually used, the calculations occur in lifetime optimal fashion, etc.

To counter the overhead of the JIT algorithm, would it be possible to do that work in a background thread or cache the JIT optimizations to disk?
See above. It varies. In some cases, it may be worse to do this than to stop the program. It's not as simple as "always work in the background until ready" .