by titzer 1202 days ago

> direct threaded (using computed gotos)

I've seen different people mean different things by this. Do you mean the IR is a list of bytecode handler addresses, and the end of every handler is a load + indirect jump? Or is there also a dispatch table? In my experience, duplicating the dispatch sequence (i.e. no dispatch "loop") is worth 10-40%, and eliminating the dispatch table on top of that is worth a bit more.
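For readers unsure what the dispatch-table variant looks like, here is a minimal sketch in GNU C (labels-as-values is a GCC/Clang extension). The opcode names, the one-byte operand encoding, and `run_token_threaded` are all made up for illustration and have nothing to do with YARV's actual instruction set. The dispatch sequence is duplicated at the end of every handler, so there is no central loop, but each dispatch still goes through `table`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a tiny stack machine (illustration only). */
enum { T_PUSH, T_ADD, T_MUL, T_HALT, T_COUNT };

/* Token threading: the code stream holds small opcode numbers, and every
 * dispatch is "load opcode, index the table, indirect jump". */
intptr_t run_token_threaded(const uint8_t *pc)
{
    /* &&label takes the address of a label (GNU C extension). */
    static void *const table[T_COUNT] = { &&t_push, &&t_add, &&t_mul, &&t_halt };
    intptr_t stack[64];          /* no overflow checks in this sketch */
    intptr_t *sp = stack;

/* The dispatch sequence, replicated after each handler instead of
 * living in one shared loop header. */
#define DISPATCH() goto *table[*pc++]

    DISPATCH();

t_push:                          /* operand: one signed byte */
    *sp++ = (int8_t)*pc++;
    DISPATCH();
t_add:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();
t_mul:
    sp--;
    sp[-1] *= sp[0];
    DISPATCH();
t_halt:
    return sp[-1];
#undef DISPATCH
}
```

Because each handler ends in its own indirect jump, the branch predictor sees a separate branch per handler rather than one shared branch for the whole interpreter.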

CPUs work hard to predict indirect branches these days, but the branch target buffer (BTB) is only so big. Getting rid of any indirect call or jump, whether or not it goes through a dispatch table, is a big win, perhaps 2-3x. CPUs have enormous reorder buffers now and can keep a lot of code in flight when branch prediction is good, which it won't be for any large program with pervasive indirect jumps.

> it always spills temporaries to memory. We're actively working on keeping things in registers rather than spilling.

In my experience that can be a 2x-4x performance win.

> It's not uncommon to see call sites with hundreds of different classes

Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?

2 comments

tenderlove 1202 days ago

> I've seen different people mean different things by this. Do you mean the IR is a list of bytecode handler addresses, and the end of every handler is a load + indirect jump? Or is there also a dispatch table? In my experience, duplicating the dispatch sequence (i.e. no dispatch "loop") is worth 10-40%, and eliminating the dispatch table on top of that is worth a bit more.

It's the former. Each bytecode is the handler address, and every handler does a load + jump. There's no dispatch table. (There are compilation options that let you use one, but I doubt anybody does, since you'd have to opt in specifically when you compile Ruby.)
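To make "each bytecode is the handler address" concrete, here is a rough sketch of direct threading with computed gotos in GNU C. This is not YARV's code: the opcodes, `cell_t`, and `run_direct_threaded` are invented for the example. The opcode-to-address table is consulted once, while translating the program, and never again during execution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a tiny stack machine (illustration only). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT, OP_COUNT };

/* A cell of threaded code: either a handler address or an immediate. */
typedef union { void *handler; intptr_t imm; } cell_t;

/* Translates `n` words of bytecode into `code` (room for n cells),
 * then executes the threaded code and returns the top of the stack. */
intptr_t run_direct_threaded(const intptr_t *bytecode, size_t n, cell_t *code)
{
    /* &&label is a GNU C extension (GCC and Clang). */
    static void *const handlers[OP_COUNT] = { &&do_push, &&do_add, &&do_mul, &&do_halt };

    /* One-time translation: replace each opcode with its handler address. */
    for (size_t i = 0; i < n; i++) {
        intptr_t op = bytecode[i];
        assert(op >= 0 && op < OP_COUNT);
        code[i].handler = handlers[op];
        if (op == OP_PUSH) {             /* PUSH carries one operand word */
            i++;
            assert(i < n);
            code[i].imm = bytecode[i];
        }
    }

    intptr_t stack[64];                  /* no overflow checks in this sketch */
    intptr_t *sp = stack;
    const cell_t *pc = code;

/* Every handler ends with: load the next address, jump to it. */
#define NEXT() goto *(pc++)->handler

    NEXT();

do_push:
    *sp++ = (pc++)->imm;
    NEXT();
do_add:
    sp--;
    sp[-1] += sp[0];
    NEXT();
do_mul:
    sp--;
    sp[-1] *= sp[0];
    NEXT();
do_halt:
    return sp[-1];
#undef NEXT
}
```

Compared with a table-based scheme, this trades a wider code stream (a pointer per instruction) for one fewer dependent load on every dispatch.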

> Sure, the question is always about the dynamic frequency of such call sites. What kind of ICs does YARV use? Are monomorphic calls inlined?

In one of our production applications, the most popular inline cache sees over 300 different classes and ~600 shapes. (This is only for instance variable reads; I haven't measured method calls yet, but I suspect it's similar.)

The VM only has a monomorphic cache (YJIT generates polymorphic caches), and neither the VM nor the JIT does inlining right now.

titzer 1201 days ago

Thanks for the replies. I could keep picking your brain, but maybe it's more efficient for me to read some documentation. Are there some design docs or FAQs or summaries of the execution strategies that you can point me to? Thanks.

ignoramous 1202 days ago

> In my experience that can be a 2x-4x performance win.

What's the state of the art in register allocation? I see that the Android Runtime uses SSA form to allocate registers in linear time [0]. Are other language runtimes pushing the boundaries further and in different ways?

[0] https://www.arxiv-vanity.com/papers/2011.05608/