The advantages of chopping your pipeline stages in half, so that each is 10 FO4s long rather than the 16 FO4s most people use. You've generally got 2 FO4s of latching and 2 of clock skew, so IBM was seeing 6 FO4s of useful work per stage compared to 12 with Intel. Or at least the overhead was 4 FO4s per stage in the mid-2000s; I've got no idea what it is in the early 2020s.
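To make the arithmetic concrete, here's a rough sketch that just subtracts the fixed overhead from each stage length. The function name and the 2 + 2 FO4 defaults are only for illustration, taken from the mid-2000s figures above, not from any real design data.

```python
def stage_breakdown(stage_fo4, latch_fo4=2, skew_fo4=2):
    """Return (useful FO4s, useful fraction) for one pipeline stage.

    Assumes a fixed per-stage overhead of latching plus clock skew,
    using the mid-2000s ballpark of 2 FO4s each.
    """
    overhead = latch_fo4 + skew_fo4
    useful = stage_fo4 - overhead
    return useful, useful / stage_fo4


for label, stage in [("10 FO4 stage", 10), ("16 FO4 stage", 16)]:
    useful, fraction = stage_breakdown(stage)
    print(f"{label}: {useful} FO4s useful, {fraction:.0%} of the cycle")
# 10 FO4 stage: 6 FO4s useful, 60% of the cycle
# 16 FO4 stage: 12 FO4s useful, 75% of the cycle
```

So on these numbers the shorter stage clocks 16/10 = 1.6x faster, but needs twice as many stages to do the same amount of logic and spends a bigger share of each cycle on overhead.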
And, if you have enough threads per core, it's relatively simple to switch to another thread when an instruction stalls. Unfortunately, most of our software is designed for machines with a few fast cores.
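As a back-of-the-envelope illustration of why "enough threads" matters, here's a toy model, not a description of any particular chip. It assumes every thread alternates between a fixed burst of work and a fixed stall, that switching threads costs nothing, and that stalls from different threads overlap freely. All the names and numbers are made up for the example.

```python
def core_utilization(threads, run_cycles, stall_cycles):
    """Fraction of cycles the core does useful work in a toy model.

    Each thread runs for run_cycles, then stalls for stall_cycles.
    Thread switches are assumed free, so the core saturates once the
    other threads can cover one thread's stall.
    """
    return min(1.0, threads * run_cycles / (run_cycles + stall_cycles))


# Hypothetical workload: 10 cycles of work, then a 30-cycle stall.
for n in (1, 2, 4, 8):
    print(n, "threads:", core_utilization(n, 10, 30))
# 1 threads: 0.25
# 2 threads: 0.5
# 4 threads: 1.0
# 8 threads: 1.0
```

With one thread the core sits idle three quarters of the time; with four, the stalls are completely hidden. The catch is that you need software that actually has four runnable threads per core.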