Generational GC might make heap-allocating your activation records a viable option, even if you're not satisfied with CPython levels of performance. I mean, in a sense that's what CHICKEN Scheme does, right?
That is the argument made by Andrew Appel’s paper “Garbage Collection Can Be Faster Than Stack Allocation” (https://www.cs.princeton.edu/~appel/papers/45.pdf), and it’s the compilation strategy used by SML/NJ, as set out in Appel’s book “Compiling with Continuations”.
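For anyone who hasn't seen the idea, here's a rough sketch of what "heap-allocated activation records" means. It's a toy in Python, not how SML/NJ or any real compiler does it: the names `Frame` and `fact` are made up for illustration, and the "call" and "return" phases are spelled out by hand. Each pending call becomes a small heap object linked to its caller, and a frame turns into garbage as soon as it has been returned through.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """Hypothetical heap-allocated activation record for one pending fact(n) call."""
    n: int                    # the local variable this call needs after its callee returns
    parent: Optional["Frame"]  # link to the caller's frame (plays the role of the return address)


def fact(n: int) -> int:
    # "Call" phase: every nested call allocates a Frame on the heap
    # instead of pushing onto a machine stack.
    frame: Optional[Frame] = None
    while n > 1:
        frame = Frame(n, frame)
        n -= 1

    # "Return" phase: walk back up the chain of frames. Once we move past
    # a frame nothing references it any more, so the collector can reclaim it.
    result = 1
    while frame is not None:
        result *= frame.n
        frame = frame.parent
    return result


print(fact(5))
```

Since the frames are ordinary short-lived objects, a generational collector would reclaim most of them cheaply in the nursery, which is roughly the case the paper argues for.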
It's pretty hard to beat the pushj/popj approach, except that you have to lock out the GC before returning (as it may be scanning the activation records to find roots).