| > direct threaded (using computed gotos)
|
| I've seen different people mean different things by this, do you mean the IR is a list of bytecode handler addresses, and then the end of every handler is a load+indirect jump? Or is there also a dispatch table? In my experience the duplication of the dispatch sequence (i.e. no dispatch "loop") is worth 10-40% and then eliminating the dispatch table on top of that a bit more.
|
| CPUs work hard to predict indirect branches these days, but the BTB is only so big. Getting rid of any indirect call or jump, regardless if that is through a dispatch table, is a big win, perhaps 2-3x, because CPUs have enormous reorder buffers now and can really load a ton of code if branch prediction is good, which it won't be for any large program with pervasive indirect jumps.
|
| > it always spills temporaries to memory.
|
| We're actively working on keeping things in registers rather than spilling. In my experience that can be a 2x-4x performance win.
|
| > It's not uncommon to see call sites with hundreds of different classes
|
| Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?
It's the former. Each bytecode is a handler address, and every handler ends with a load + jump. There's no dispatch table. (There are compilation options that let you use a dispatch table instead, but I doubt anybody does, since you'd have to specifically opt in when you compile Ruby.)
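To make "handler address plus load + jump" concrete, here is a minimal sketch of direct threading in a toy stack VM. It is not YARV's code: the opcodes, `cell_t`, and `run` are made up for illustration, and it assumes GCC or Clang, since `&&label` and `goto *ptr` are GNU C extensions. The `handlers` array is only consulted once, while translating opcodes into addresses; it is never touched during execution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes; these are not YARV instructions. */
enum { OP_PUSH, OP_ADD, OP_HALT, OP_COUNT };

/* One cell of threaded code: a handler address or an inline operand. */
typedef union { void *handler; intptr_t operand; } cell_t;

static intptr_t run(const intptr_t *bytecode, size_t len)
{
    /* Label addresses (GNU C extension). Used only by the translation
     * pass below, never while the program is executing. */
    static void *const handlers[OP_COUNT] = { &&do_push, &&do_add, &&do_halt };

    cell_t code[64];
    intptr_t stack[64];

    assert(len <= 64);

    /* Translation pass: replace each opcode with its handler address. */
    for (size_t i = 0; i < len; ) {
        intptr_t op = bytecode[i];
        assert(op >= 0 && op < OP_COUNT);
        code[i++].handler = handlers[op];
        if (op == OP_PUSH) {            /* PUSH carries one inline operand */
            assert(i < len);
            code[i].operand = bytecode[i];
            i++;
        }
    }

    const cell_t *pc = code;
    intptr_t *sp = stack;

    /* The dispatch sequence, duplicated at the end of every handler:
     * load the next handler address, then jump to it indirectly. */
#define DISPATCH() goto *(pc++)->handler

    DISPATCH();

do_push:
    *sp++ = (pc++)->operand;
    DISPATCH();

do_add:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();

do_halt:
    return sp[-1];
#undef DISPATCH
}
```

Running the program `{ OP_PUSH, 40, OP_PUSH, 2, OP_ADD, OP_HALT }` through `run` returns 42. A table-dispatched variant would instead keep small opcodes in `code` and do `goto *handlers[*pc++]`, which costs one extra load per instruction.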
> Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?
In one of our production applications, the most popular inline cache sees over 300 different classes and ~600 shapes (this is only for instance variable reads; I haven't measured method calls yet, but I suspect it's similar).
The VM only has a monomorphic cache (YJIT generates polymorphic caches), and neither the VM nor the JIT does inlining right now.
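For anyone unfamiliar with the distinction, here is a rough sketch of a monomorphic versus a polymorphic inline cache for an instance variable read. This is not how YARV or YJIT lay out their caches: the struct names, `POLY_WAYS`, and `slow_lookup` are all hypothetical, and the "slow path" is a stand-in for a real shape lookup.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t shape_id_t;

/* Monomorphic cache: remembers exactly one shape and its ivar index. */
typedef struct {
    shape_id_t shape;
    uint32_t   index;
    int        valid;
    unsigned   misses;
} mono_cache_t;

/* Polymorphic cache: remembers up to POLY_WAYS shapes (hypothetical size). */
#define POLY_WAYS 4
typedef struct {
    shape_id_t shape[POLY_WAYS];
    uint32_t   index[POLY_WAYS];
    int        used;
    unsigned   misses;
} poly_cache_t;

/* Stand-in for the expensive lookup; a real VM would consult the shape. */
static uint32_t slow_lookup(shape_id_t shape) { return shape % 8; }

static uint32_t mono_read(mono_cache_t *c, shape_id_t shape)
{
    if (c->valid && c->shape == shape)
        return c->index;                 /* hit: one compare */
    c->misses++;                         /* miss: overwrite the single entry */
    c->shape = shape;
    c->index = slow_lookup(shape);
    c->valid = 1;
    return c->index;
}

static uint32_t poly_read(poly_cache_t *c, shape_id_t shape)
{
    assert(c->used <= POLY_WAYS);
    for (int i = 0; i < c->used; i++)
        if (c->shape[i] == shape)
            return c->index[i];          /* hit: up to POLY_WAYS compares */
    c->misses++;
    uint32_t idx = slow_lookup(shape);
    if (c->used < POLY_WAYS) {           /* record the new shape if room remains */
        c->shape[c->used] = shape;
        c->index[c->used] = idx;
        c->used++;
    }
    return idx;
}
```

If a site alternates between just two shapes, the monomorphic version misses on every read while the polymorphic one misses twice and then always hits. With hundreds of shapes at one site, as in the numbers above, a small fixed-size polymorphic cache like this one would fill up and fall back to the slow path for the rest.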