Small benchmarks can go either way, but for large codebase (especially C++) inliner is more important than just about anything else. So GCC wins, because it has better inliner.
There are many measures of benchmark sizes. One important measure is size of codes that account for 99% of execution time. If your codebase is a million lines but your hotspot is a thousand lines, benchmark result is sensitive to optimization quirks and in some sense benchmark is small.
Anecdotal, but I've seen similar improvements over g++ in my code.