
by tenderlove 1202 days ago

I'll take a stab at this.

YARV (Ruby's VM) is already direct threaded (using computed gotos), so there's no dispatch loop to eliminate. YARV is a stack-based virtual machine, and the machine code that YJIT generates writes temporary values to the VM stack. In other words, it always spills temporaries to memory. We're actively working on keeping things in registers rather than spilling.
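To make the spilling point concrete, here is a minimal sketch in C of the two styles, evaluating `a + b * c`. It is not YARV or YJIT code: the function names and the fixed-size stack are made up for illustration, and `volatile` is there only to stop the C compiler from optimizing the memory stack away.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical illustration only; none of these names come from YARV or YJIT. */
enum { STACK_MAX = 16 };

/* Stack-machine style: every temporary is written to and read back from
 * an in-memory stack. `volatile` forces the loads and stores to happen. */
int64_t eval_with_vm_stack(int64_t a, int64_t b, int64_t c) {
    volatile int64_t stack[STACK_MAX];
    int sp = 0;

    stack[sp++] = a;                              /* push a */
    stack[sp++] = b;                              /* push b */
    stack[sp++] = c;                              /* push c */

    sp--;                                         /* mul: pop two, push product */
    stack[sp - 1] = stack[sp - 1] * stack[sp];

    sp--;                                         /* add: pop two, push sum */
    stack[sp - 1] = stack[sp - 1] + stack[sp];

    assert(sp == 1);                              /* exactly one result remains */
    return stack[0];
}

/* Register style: temporaries are plain locals, which a compiler is free
 * to keep in machine registers without touching memory. */
int64_t eval_in_locals(int64_t a, int64_t b, int64_t c) {
    int64_t product = b * c;
    return a + product;
}
```

Both functions compute the same value; the first one just pays for several extra stores and loads per operation, which is the cost being described above.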

Ruby programs tend to be extremely polymorphic. It's not uncommon to see call sites with hundreds of different classes (and now that we've implemented object shapes, hundreds of object shapes). YJIT is not currently splitting or inlining, so we unfortunately encounter megamorphic sites more frequently than we'd like.

I'm sure there's more stuff but I hope this helps!

2 comments

titzer 1202 days ago

> direct threaded (using computed gotos)

I've seen different people mean different things by this. Do you mean the IR is a list of bytecode handler addresses, and then the end of every handler is a load+indirect jump? Or is there also a dispatch table? In my experience the duplication of the dispatch sequence (i.e. no dispatch "loop") is worth 10-40%, and eliminating the dispatch table on top of that is worth a bit more.

CPUs work hard to predict indirect branches these days, but the BTB is only so big. Getting rid of any indirect call or jump, regardless of whether it goes through a dispatch table, is a big win, perhaps 2-3x, because CPUs have enormous reorder buffers now and can really load a ton of code if branch prediction is good, which it won't be for any large program with pervasive indirect jumps.

> it always spills temporaries to memory. We're actively working on keeping things in registers rather than spilling.

In my experience that can be a 2x-4x performance win.

> It's not uncommon to see call sites with hundreds of different classes

Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?


tenderlove 1202 days ago

> I've seen different people mean different things by this. Do you mean the IR is a list of bytecode handler addresses, and then the end of every handler is a load+indirect jump? Or is there also a dispatch table? In my experience the duplication of the dispatch sequence (i.e. no dispatch "loop") is worth 10-40%, and eliminating the dispatch table on top of that is worth a bit more.

It's the former. Each bytecode is the handler address and every handler does a load + jump. There's no dispatch table (there are compilation options that allow you to use a dispatch table, but I doubt anybody does that since you'd have to specifically opt in when you compile Ruby).
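For readers who haven't seen this style, here is a minimal sketch of a direct-threaded interpreter in C. It is not YARV's code: the opcodes, `cell_t`, and `run_direct_threaded` are invented for illustration, and it relies on the labels-as-values ("computed goto") extension supported by GCC and Clang. Opcode numbers are translated to handler addresses once, before execution; after that, each handler ends with its own load + indirect jump, so there is no dispatch loop and no table lookup on the hot path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy instruction set; not YARV's. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };
enum { CODE_MAX = 64, STACK_MAX = 16 };

/* A code cell is either a handler address or an inline operand. */
typedef union {
    void   *handler;
    int64_t operand;
} cell_t;

int64_t run_direct_threaded(const int64_t *bytecode, size_t len) {
    /* Used only while translating, never while executing. */
    static void *const handlers[] = { &&do_push, &&do_add, &&do_mul, &&do_halt };
    cell_t  code[CODE_MAX];
    int64_t stack[STACK_MAX];
    int     sp = 0;

    /* Translation pass: replace each opcode number with its handler address. */
    assert(len <= CODE_MAX);
    for (size_t i = 0; i < len; i++) {
        int64_t op = bytecode[i];
        assert(op >= OP_PUSH && op <= OP_HALT);
        code[i].handler = handlers[op];
        if (op == OP_PUSH) {              /* PUSH carries one inline operand */
            i++;
            assert(i < len);
            code[i].operand = bytecode[i];
        }
    }

    const cell_t *pc = code;

/* The dispatch sequence, duplicated at the end of every handler:
 * load the next handler address and jump to it. */
#define DISPATCH() goto *(pc++)->handler

    DISPATCH();

do_push:
    assert(sp < STACK_MAX);
    stack[sp++] = (pc++)->operand;
    DISPATCH();
do_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
do_mul:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] *= stack[sp];
    DISPATCH();
do_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef DISPATCH
}
```

Running it on the program `PUSH 2, PUSH 3, PUSH 4, MUL, ADD, HALT` computes `2 + 3 * 4`. A dispatch-table variant would instead keep opcode numbers in the code stream and do `goto *handlers[*pc++]` in each handler, which adds one more load per instruction.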

> Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?

In one of our production applications, the most popular inline cache sees over 300 different classes and ~600 shapes (this is only for instance variable reads; I haven't measured method calls yet but suspect it's similar).

The VM only has a monomorphic cache (YJIT generates polymorphic caches), and neither the VM nor the JIT does inlining right now.
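As a rough sketch of the difference, here is a toy monomorphic and polymorphic inline cache for instance variable reads, keyed on a shape id, in C. Everything in it is an assumption made for illustration (the `object_t` layout, the 4-entry limit, and the stand-in `slow_lookup_index`); it does not reflect how YARV or YJIT actually lay out their caches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object layout: a shape id plus a few field slots. */
typedef struct {
    uint32_t shape_id;
    int64_t  fields[4];
} object_t;

/* Stand-in for the expensive lookup that maps a shape to a field index.
 * The counter exists only so cache misses can be observed. */
int slow_lookups = 0;
int slow_lookup_index(uint32_t shape_id) {
    slow_lookups++;
    return (int)(shape_id % 4);
}

/* Monomorphic cache: remembers exactly one shape. */
typedef struct {
    uint32_t shape_id;
    int      index;
    int      valid;
} mono_cache_t;

int64_t read_ivar_mono(mono_cache_t *ic, const object_t *obj) {
    if (!ic->valid || ic->shape_id != obj->shape_id) {
        /* Miss: take the slow path and overwrite the single entry. */
        ic->index    = slow_lookup_index(obj->shape_id);
        ic->shape_id = obj->shape_id;
        ic->valid    = 1;
    }
    return obj->fields[ic->index];
}

/* Polymorphic cache: remembers up to POLY_MAX shapes. */
enum { POLY_MAX = 4 };
typedef struct {
    uint32_t shape_ids[POLY_MAX];
    int      indexes[POLY_MAX];
    int      count;
} poly_cache_t;

int64_t read_ivar_poly(poly_cache_t *ic, const object_t *obj) {
    for (int i = 0; i < ic->count; i++) {
        if (ic->shape_ids[i] == obj->shape_id) {
            return obj->fields[ic->indexes[i]];   /* hit */
        }
    }
    /* Miss: slow path, then remember the shape if there is room.
     * Once the cache is full, every unseen shape stays on the slow
     * path, which is the megamorphic case. */
    int index = slow_lookup_index(obj->shape_id);
    if (ic->count < POLY_MAX) {
        ic->shape_ids[ic->count] = obj->shape_id;
        ic->indexes[ic->count]   = index;
        ic->count++;
    }
    return obj->fields[index];
}
```

With two shapes alternating at one site, the monomorphic cache misses on every read while the polymorphic one misses only twice. With hundreds of shapes at one site, as in the numbers above, a small polymorphic cache fills up and most reads fall through to the slow path.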


titzer 1201 days ago

Thanks for the replies. I could keep picking your brain, but maybe it's more efficient for me to read some documentation. Are there some design docs or FAQs or summaries of the execution strategies that you can point me to? Thanks.


ignoramous 1202 days ago

> In my experience that can be a 2x-4x performance win.

What's the state of the art in register allocation? I see that the Android Runtime makes use of SSA form to allocate registers in linear time [0]. Are other language runtimes pushing the boundaries further and in different ways?

[0] https://www.arxiv-vanity.com/papers/2011.05608/


ignoramous 1202 days ago

> In other words, it always spills temporaries to memory. We're actively working on keeping things in registers rather than spilling.

Curious: What register allocation algorithm do the current Ruby JITs use? Is that influencing the work on this new JIT too?
