Do you plan to use the signal handler trick eventually? It's less portable, but in my tests it cuts the total overhead roughly in half (from 29% with masking to 14%).
Sorry, I should have been clearer. I believe we use masking on 32-bit platforms, which is faster than explicit bounds checks. On 64-bit platforms we use guard pages. We don't actually need a signal handler, because we don't need to recover gracefully from a fault the way we do on the Web; we can just crash.
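To make the masking vs. explicit-check distinction concrete, here is a minimal sketch in C. It is not the actual generated code: `heap`, `HEAP_SIZE`, `load_checked`, `load_masked`, and `sandbox_demo` are hypothetical names, and it assumes the sandbox memory size is a power of two so that a single AND can force any index into range. Guard pages are not shown, since they rely on the OS mapping rather than on per-access code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sandbox memory. The size is assumed to be a power of two,
   so (HEAP_SIZE - 1) is a valid mask for every in-range index. */
#define HEAP_SIZE ((uint32_t)1 << 16)
#define HEAP_MASK (HEAP_SIZE - 1u)

static uint8_t heap[HEAP_SIZE];

/* Explicit bounds check: a compare and a branch on every access.
   Returns 0 on success, -1 if addr is out of range. */
static int load_checked(uint32_t addr, uint8_t *out) {
    if (addr >= HEAP_SIZE) {
        return -1;
    }
    *out = heap[addr];
    return 0;
}

/* Masking: no branch. An out-of-range address wraps to some in-range
   address instead of faulting, so the access can never leave the heap. */
static uint8_t load_masked(uint32_t addr) {
    return heap[addr & HEAP_MASK];
}

/* Small illustration of how the two strategies differ on a bad address. */
static void sandbox_demo(void) {
    uint8_t v = 0;
    heap[5] = 42;

    /* In-range access: both strategies read the same byte. */
    assert(load_checked(5, &v) == 0 && v == 42);
    assert(load_masked(5) == 42);

    /* Out-of-range access: the check rejects it, the mask wraps it. */
    assert(load_checked(HEAP_SIZE + 5u, &v) == -1);
    assert(load_masked(HEAP_SIZE + 5u) == 42);
}
```

Note the trade-off this shows: masking keeps a bad access inside the sandbox but silently reads the wrong location, whereas an explicit check (or a guard page) detects the bad access at the cost of a branch (or a reliance on the OS to fault).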