Interesting. I couldn't find any documentation on how this happens (does it need compiler/linker support to know whether free() is used?), but I did find the source code for this __simple_malloc: https://git.musl-libc.org/cgit/musl/tree/src/malloc/lite_mal...
Though I'm curious: don't a lot of malloc implementations use a bump allocator as long as a simple fragmentation heuristic stays below some limit? Presumably musl, somewhere inside malloc(), has a static bool (or a static function pointer) that tracks whether dlsym() has ever returned the address of free(). How much faster is the musl implementation than one using a simple fragmentation heuristic? Presumably both come down to well-predicted conditional branches (and/or indirect branches whose targets are well predicted by the BTB).
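
To make the question concrete, here is roughly the shape I have in mind. This is only a sketch under my own assumptions, not musl's code: every name in it (bump_alloc, general_alloc, alloc_impl, my_malloc, note_free_in_use, in_arena, ARENA_SIZE) is made up for illustration, the "general" allocator just forwards to the system malloc(), and the switch is triggered by an explicit call rather than by anything dlsym() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch; none of these names come from musl. */
#define ARENA_SIZE 4096

/* The union forces the byte array to the strictest fundamental alignment. */
static union {
    max_align_t force_align;
    unsigned char bytes[ARENA_SIZE];
} arena;
static size_t arena_used;

/* Bump allocation: round the cursor up to the alignment, then advance it.
 * Nothing is ever returned to the arena. */
static void *bump_alloc(size_t n)
{
    const size_t align = _Alignof(max_align_t);
    assert((align & (align - 1)) == 0);   /* the mask trick needs a power of two */
    size_t start = (arena_used + align - 1) & ~(align - 1);
    if (n == 0)
        n = 1;                            /* keep zero-size results distinct */
    if (start > ARENA_SIZE || n > ARENA_SIZE - start)
        return NULL;                      /* arena exhausted */
    arena_used = start + n;
    return arena.bytes + start;
}

/* Stand-in for a full allocator that supports free(). */
static void *general_alloc(size_t n)
{
    return malloc(n);
}

/* The "static function pointer" from the comment above: it starts out on the
 * cheap path and is flipped once the program is known to need free(). */
static void *(*alloc_impl)(size_t) = bump_alloc;

static void *my_malloc(size_t n)
{
    return alloc_impl(n);                 /* one indirect call per allocation */
}

/* Assumed trigger; how a real libc would detect this is the open question. */
static void note_free_in_use(void)
{
    alloc_impl = general_alloc;
}

/* Helper for checking which path served a pointer. */
static int in_arena(const void *p)
{
    uintptr_t lo = (uintptr_t)arena.bytes;
    uintptr_t x = (uintptr_t)p;
    return x >= lo && x < lo + ARENA_SIZE;
}
```

With this layout the per-call cost of the switch is a single indirect call through alloc_impl, which is the part I'd expect the BTB to predict well. A fragmentation-heuristic version would presumably replace that with a conditional branch on some counter, so the comparison I'm asking about is really between those two.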