It seems to me that the correct term is "Static Binary Translation" (SBT) for what you call "ahead of time binary translation".
And the correct term for "JIT-based emulation" is "Dynamic Binary Translation" (DBT).
At least these are the terms you should use if you want to find some literature on this subject.
We're not talking about a JIT or AOT compiler here because it's not really compilation (compilation usually means translating to a lower-level language).
I think a lot of people say JIT rather than DBT because the term JIT is better known, and the confusion grows when Apple says they do "Dynamic translation for JITs".
What that means is that they use DBT to handle applications that themselves use a JIT.
You are correct, static binary translation is what Rosetta does first. That, however, is what I called niche technology in another post; most of the research so far has focused on dynamic binary translation.
Furthermore, SBT, even for user mode binaries, can rarely reach the performance levels that we see with Rosetta2. There are many issues in determining what is code, where are the branch destinations in case of indirect branches, etc. What we have here is certainly a feat of engineering on its own.
> There are many issues in determining what is code, where are the branch destinations in case of indirect branches, etc.
Yes, handling indirect branches seems a bit complex, and I'm not a specialist in the field.
But I'm pretty sure that indirect branches are rare enough that an additional indirection is relatively inexpensive.
Adding a simple address mapping table should cover most cases.
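To make the mapping-table idea concrete, here is a rough Python sketch. Everything in it is hypothetical (the class name, the addresses, the fallback function); it only illustrates the lookup-with-fallback pattern and says nothing about how Rosetta 2 actually does it.

```python
class IndirectBranchTable:
    """Toy guest-address -> host-address map consulted at every indirect branch."""

    def __init__(self, static_map, fallback):
        # static_map: targets the ahead-of-time translator managed to discover.
        # fallback: called for targets it missed (e.g. translate on demand).
        self._map = dict(static_map)
        self._fallback = fallback
        self.misses = 0

    def resolve(self, guest_target):
        host = self._map.get(guest_target)
        if host is None:
            # Slow path: the static pass did not know this target.
            self.misses += 1
            host = self._fallback(guest_target)
            self._map[guest_target] = host  # remember it for next time
        return host


def translate_on_demand(guest_addr):
    # Stand-in for a real runtime translator: just invents a host address.
    return 0x90000000 + guest_addr


table = IndirectBranchTable({0x401000: 0x10000, 0x401020: 0x10040},
                            translate_on_demand)
known = table.resolve(0x401000)     # fast path, found in the static map
unknown = table.resolve(0x500000)   # slow path, goes through the fallback
again = table.resolve(0x500000)     # now cached, no second miss
```

The extra cost per indirect branch is one table lookup on the fast path; the slow path only runs the first time an unknown target shows up.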
An interesting question would also be whether Apple has added features to the hardware to improve the translation?
We know, for example, that Apple introduced a special register [1] to temporarily switch from the ARM memory consistency model to the TSO (Total Store Order) consistency model used by x86.
That is marketing terminology (because "emulation is slow"). Full static transpiling is not a solvable problem - you can't actually take an x86 app, run it through some converter, and get an ARM app out. It's just not a thing and it never will be (without cheating and, like, literally embedding an emulator in the app).
Anything less than that is emulation, and requires dynamic elements. All modern emulators use JIT, and caching the result is similar to AoT translation; plus, JIT can sometimes be faster than AoT because it can take advantage of runtime profiling. You can also never guarantee full AoT translation, even for binaries without self-modifying code, unless you have additional metadata (like a list of all branch destinations), so Rosetta cannot possibly claim it does that with full coverage. On top of that you need to add a level of indirection to all indirect branches, as you cannot statically change all function pointers in data structures (that's an even harder problem). At that point you're adding enough bookkeeping gunk to the translated code that it is no longer a straight translation, like Apple would want you to believe. JIT is binary translation too, so by Apple marketing standards, QEMU, Dolphin, and basically every other modern emulator are also "translation". Which is just not a useful distinction.
So everyone saying that "Rosetta 2 is AoT translation" as if that means it's fundamentally better/faster than other emulation technologies is just falling for marketing.
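A toy illustration of the coverage problem, as a Python sketch under made-up assumptions: the "ISA" is a dict of address -> (opcode, argument), and `discover_code` is a hypothetical recursive-descent pass that follows only the control flow it can see statically. Code reachable only through an indirect jump is never found.

```python
def discover_code(program, entry):
    """Follow direct control flow from entry; return (found addresses, unresolved jumps)."""
    seen = set()
    unresolved = []          # indirect jumps whose targets are unknown statically
    worklist = [entry]
    while worklist:
        pc = worklist.pop()
        while pc in program and pc not in seen:
            seen.add(pc)
            op, arg = program[pc]
            if op == "jmp":            # direct jump: target is known
                worklist.append(arg)
                break
            if op == "jcc":            # conditional: taken target + fallthrough
                worklist.append(arg)
            elif op == "jmp_ind":      # indirect jump: target lives in a register
                unresolved.append(pc)
                break
            elif op == "ret":
                break
            pc += 1
    return seen, unresolved


program = {
    0: ("mov", None),
    1: ("jcc", 4),
    2: ("mov", None),
    3: ("jmp_ind", "rax"),   # jumps to address 10 at runtime, invisible statically
    4: ("ret", None),
    10: ("mov", None),       # only reachable through the indirect jump
    11: ("ret", None),
}
seen, unresolved = discover_code(program, 0)
missed = sorted(set(program) - seen)
```

Here `missed` ends up holding addresses 10 and 11, which is exactly the code a purely static translator would have to either guess at or leave to a runtime fallback.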
Whatever you call it, it's not fundamentally different from any other emulator in a way that puts it in another class of technology. It is not straight converting x86 to ARM. That's just not a thing and it never will be. The end result is that the CPU is going to be executing a series of translated basic blocks interspersed with code added by the translation to glue everything together, which is the same thing every JIT-based emulator does, and will have the same performance characteristics, and the fact that some of that work can be done ahead of time is not a fundamental difference.
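For what "translated basic blocks plus glue" looks like in the abstract, here is a minimal Python sketch of a dispatch loop with a translation cache. The guest block format, `translate`, and `run` are all invented for illustration; this is not modelled on Rosetta 2, QEMU, or any real emulator.

```python
def translate(guest_block):
    """Turn a toy guest block into a host callable that returns the next guest pc."""
    body, term = guest_block
    total = sum(n for op, n in body if op == "add")   # fold the block body once

    def host_block(state):
        state["acc"] += total
        if term[0] == "jlt":                          # branch if acc < limit
            _, limit, taken, fallthrough = term
            return taken if state["acc"] < limit else fallthrough
        return None                                   # "halt"

    return host_block


def run(guest_blocks, entry):
    cache = {}                       # guest pc -> translated block
    state = {"acc": 0}
    stats = {"translated": 0, "executed": 0}
    pc = entry
    while pc is not None:
        block = cache.get(pc)
        if block is None:            # glue: translate on first use only
            block = translate(guest_blocks[pc])
            cache[pc] = block
            stats["translated"] += 1
        pc = block(state)            # run translated code, get next guest pc
        stats["executed"] += 1
    return state, stats


guest_blocks = {
    0: ([("add", 1)], ("jlt", 10, 0, 1)),   # loop until acc reaches 10
    1: ([("add", 100)], ("halt",)),
}
state, stats = run(guest_blocks, 0)
```

The loop body runs ten times but is translated once. Whether that translation happens before launch or on first execution changes when the cost is paid, not what the hot path looks like afterwards.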
If you want to look for reasons why Rosetta 2 is faster than other emulators, look for places where Apple cheated and made their CPUs implement x86 things like its memory consistency model. That can have massive gains. I bet if you port a decent JIT-based emulator to use that feature on M1, and compare it to Rosetta 2 for number crunching inner loops and such, you'll find you can get very similar performance numbers out of it once the JIT cache is warm.
It'll be interesting when people take a deep dive into specific things Rosetta 2 does.