by ajross
256 days ago
A typical CALL uses a 16-bit displacement and encodes in three bytes. A RET encodes in one. On arm64, all instructions are four bytes, so the BL and BX that effect the branching are already 8 bytes of instructions. Plus non-leaf functions need to push and pop the return address via some means (which generally depends on what the surrounding code is doing, so it isn't a fixed cost). Obviously making that work requires not just parallel dispatch for all the individual bits, but a stack engine in front of the cache that can remember what it was doing. Not free. But it's 100% a big win in cache footprint.
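
To make the arithmetic concrete, here is a rough tally of the bytes spent on one call/return pair. It takes the byte counts from the comment at face value rather than checking them against the ISA manuals. The constant and function names are made up for illustration, and the variable push/pop cost for non-leaf functions is left out.

```python
# Byte counts as stated in the comment; taken at face value, not verified.
X86_CALL_BYTES = 3    # CALL with a 16-bit displacement
X86_RET_BYTES = 1     # RET
ARM64_INSN_BYTES = 4  # every arm64 instruction is four bytes


def call_return_bytes(call_bytes: int, ret_bytes: int) -> int:
    """Bytes spent on one call instruction plus its matching return."""
    return call_bytes + ret_bytes


# Hypothetical names, for illustration only.
x86_pair = call_return_bytes(X86_CALL_BYTES, X86_RET_BYTES)
arm64_pair = call_return_bytes(ARM64_INSN_BYTES, ARM64_INSN_BYTES)

print(f"x86 call+ret:   {x86_pair} bytes")
print(f"arm64 call+ret: {arm64_pair} bytes")
```

Under those assumptions the arm64 pair costs twice the bytes of the x86 pair, before counting whatever a non-leaf function spends saving and restoring the return address.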