Hm, the OSX crash is probably because they have a different ABI from Linux -- Google "osx ia32 abi" for details-- I'm probably violating stack alignment.
Partial register stalls are where you write part of the register, then read the full register:
mov al, byte [PTR]
add ebx, eax
But I am emitting this:
mov al, byte [PTR]
add byte [PTR+3], al
So there's no stall.
I got jit4 about as far as is reasonable for a single-pass compiler, but I intend to implement a proper BF->IR->ASM compiler to implement some more complex multi-pass optimizations I have in mind.
Yes, I have encountered the problem before that OS X crashes on incorrect stack alignment even where other platforms tolerate it.
Good point about the stall -- I always thought that partial register stall happened at the point that you do the partial write, since the logical contents of the register now depend on its previous contents. I didn't realize that the dependency logic was sophisticated enough to allow the partial read without depending on the entire register's value.
I added some optimizations using Flex to identify patterns, and got a 3.8x speedup on the Mandelbrot benchmark -- https://github.com/rmmh/beefit