by Octokiddie 1261 days ago
Any ideas on why miniwasm performs better on all the benchmarks except "trap," on which it performs decidedly worse?

The benchmarks were run on macOS, and the trap actually executes an interrupt for debugging; macOS then checks whether the process is being debugged. Wasm3 just prints a message and calls exit(1).

As to why the rest are faster: I spent a lot of time optimizing the interpreter and learning the best way to write interpreters. It's mostly jump threading and Mixed Data.

I found that most Wasm interpreters are not particularly good at calls. Wizard is not as fast as wasm3 or wamr in raw speed, but is much faster on calls, particularly because it does not copy arguments (value stacks can be overlapped). But Wizard's primary motivation is to be memory efficient, so it interprets in-place. It also supports instrumentation.
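To make the "value stacks can be overlapped" point concrete, here is a minimal sketch in C. It is not Wizard's implementation; `ValueStack`, `add2`, `call_overlapped` and `overlapped_call_demo` are hypothetical names invented for this illustration. The idea shown is only that the callee's frame begins at the caller's first argument slot, so the arguments become the callee's first locals without being copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 256

typedef struct {
    int64_t slots[SLOTS]; /* one value stack shared by every frame */
    size_t  sp;           /* index of the next free slot */
} ValueStack;

/* A callee sees its frame as an array whose first entries are its arguments. */
typedef void (*Callee)(ValueStack *vs, int64_t *frame);

static void push(ValueStack *vs, int64_t v)
{
    assert(vs->sp < SLOTS);
    vs->slots[vs->sp++] = v;
}

/* Hypothetical callee: computes frame[0] + frame[1] in place. */
static void add2(ValueStack *vs, int64_t *frame)
{
    frame[0] = frame[0] + frame[1];           /* result overwrites the first argument slot */
    vs->sp = (size_t)(frame - vs->slots) + 1; /* pop the frame, leaving one result */
}

/* Call without copying: the callee frame starts at the caller's first argument. */
static void call_overlapped(ValueStack *vs, Callee f, size_t nargs)
{
    assert(vs->sp >= nargs);
    f(vs, &vs->slots[vs->sp - nargs]);
}

int64_t overlapped_call_demo(void)
{
    ValueStack vs = { .sp = 0 };
    push(&vs, 40);                 /* caller pushes the arguments as operands */
    push(&vs, 2);
    call_overlapped(&vs, add2, 2); /* no argument copy happens here */
    assert(vs.sp == 1);            /* the result sits where the first argument was */
    return vs.slots[0];
}
```

An interpreter that gives every frame its own operand stack would instead copy `nargs` values into the new frame on each call, which is the cost the comment above says Wizard avoids.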

Nice work!

Don't take this as anything other than speculation: I wonder if wasm3 is using musttail with opaque function calls in the instruction handlers. It will demolish performance, which is why I am only using computed gotos in mine (when available). Even switch-case is faster than musttail when you have to leave the tco-jumps. Which is (as an example) why one should not measure performance by Fibonacci number generation. :)
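For readers who have not seen the technique, here is a rough sketch of computed-goto dispatch with a switch-case fallback "when available". It is not taken from miniwasm or wasm3; the four-opcode bytecode and the name `run_threaded` are made up for the example, and the computed-goto branch relies on the GNU C "labels as values" extension (GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy bytecode for illustration; not Wasm and not from any
 * interpreter discussed in this thread. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 64

int64_t run_threaded(const uint8_t *pc)
{
    int64_t stack[STACK_MAX];
    size_t sp = 0;

#if defined(__GNUC__)
    /* GNU C "labels as values": one label address per opcode. */
    static void *const labels[] = { &&op_push, &&op_add, &&op_mul, &&op_halt };
    /* Every handler ends with its own indirect jump, so there is no
     * central dispatch loop. Opcodes are not validated in this sketch. */
#define DISPATCH() goto *labels[*pc++]

    DISPATCH();
op_push:
    assert(sp < STACK_MAX);
    stack[sp++] = *pc++;          /* one-byte immediate operand */
    DISPATCH();
op_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();
op_mul:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] *= stack[sp];
    DISPATCH();
op_halt:
    assert(sp >= 1);
    return stack[sp - 1];
#undef DISPATCH
#else
    /* Portable fallback: a single switch inside a loop. */
    for (;;) {
        switch (*pc++) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = *pc++;
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            return -1;            /* unknown opcode */
        }
    }
#endif
}
```

Running it on the program `PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT` evaluates (2 + 3) * 4. Because all handlers live in one function, the compiler can keep `pc` and `sp` in registers across handlers, which is the property being compared against musttail here.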
> I wonder if wasm3 is using musttail with opaque function calls in the instruction handlers. It will demolish performance, which is why I am only using computed gotos in mine (when available). Even switch-case is faster than musttail when you have to leave the tco-jumps.

This doesn't match my experience. After working on this problem a lot, I came to the conclusion that musttail with opaque function calls is one of the best ways of getting good code out of the compiler: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...

I meant having an opaque function inside your instruction handler. My assembly looks like crap if something doesn't get inlined. Because I have no way of achieving this, I simply cannot use TCO. It runs Fibonacci faster, but anything that uses memory is way worse, because it pushes and pops a ton of registers in the instruction handler itself and not in the slow-path opaque function.

By instruction handler I mean a dispatch function: it handles a single instruction.

Reading your post, it says so under Limitations: opaque calls trash performance. I guess we agree, but then again I was just reading my assembly, so I had no reason to doubt myself.

Yes, our solution was to make all fallback functions into tail calls. It solves the problem, but requires a lot of discipline and can be a bit awkward.
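A rough sketch of what "fallback functions as tail calls" can look like, under stated assumptions: this is not the code from the linked post or from any interpreter mentioned here, and `VM`, `Handler`, `run_tailcall` and the three-instruction bytecode are hypothetical. It uses the `musttail` attribute where the compiler reports it (Clang, and recent GCC) and degrades to ordinary `return` calls otherwise. The slow path has the same signature as the handlers, is reached by a tail call, and resumes dispatch with another tail call, so the fast path never has to save registers around a normal call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL /* not available: tail calls are then only an optimization */
#endif

#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#else
#  define NOINLINE
#endif

/* Hypothetical toy instruction set for illustration. */
enum { INSN_PUSH, INSN_DIV, INSN_HALT };

typedef struct { int64_t stack[64]; unsigned slow_path_hits; } VM;

/* Handlers and fallbacks share one signature, as musttail requires. */
typedef int64_t Handler(VM *vm, const uint8_t *pc, int64_t *sp);

static Handler op_push, op_div, op_halt, div_by_zero_slow;
static Handler *const table[] = { op_push, op_div, op_halt };

/* pc points at the next opcode; jump to its handler. */
#define NEXT() MUSTTAIL return table[*pc](vm, pc, sp)

static int64_t op_push(VM *vm, const uint8_t *pc, int64_t *sp)
{
    assert(sp < vm->stack + 64);
    *sp++ = pc[1];                /* one-byte immediate operand */
    pc += 2;
    NEXT();
}

static int64_t op_div(VM *vm, const uint8_t *pc, int64_t *sp)
{
    assert(sp - vm->stack >= 2);
    if (sp[-1] == 0)              /* rare case: leave via a tail call, not a normal call */
        MUSTTAIL return div_by_zero_slow(vm, pc, sp);
    sp--;
    sp[-1] /= sp[0];
    pc += 1;
    NEXT();
}

/* Slow path: in this sketch it records the event, pushes 0 as the result,
 * and resumes dispatch with another tail call. */
static NOINLINE int64_t div_by_zero_slow(VM *vm, const uint8_t *pc, int64_t *sp)
{
    vm->slow_path_hits++;
    sp--;
    sp[-1] = 0;
    pc += 1;
    NEXT();
}

static int64_t op_halt(VM *vm, const uint8_t *pc, int64_t *sp)
{
    (void)pc;
    assert(sp - vm->stack >= 1);
    return sp[-1];
}

int64_t run_tailcall(VM *vm, const uint8_t *code)
{
    vm->slow_path_hits = 0;
    return table[*code](vm, code, vm->stack);
}
```

The awkward part mentioned above shows up even at this size: the fallback cannot simply return a value to `op_div`, it has to carry the whole interpreter state and continue the dispatch itself.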

I recently saw this, which is a very interesting approach for using non-tail-call fallback functions without trashing the code: https://chromium-review.googlesource.com/c/v8/v8/+/4116584

That's very interesting! Thanks!