It's usually a 0.5% improvement on a micro benchmark.
One of my suspicions is that at the low end where I operate the marginal cost of higher speed is essentially zero. My firmware spends more than 99.9% of its time sleeping, so micro optimizations of a few percent are meaningless. At the other end, superscalar processors are a moving target for micro optimizations. Further, a lot of tasks look like init -> process data -> clean up. Over time the process data part has gotten very large, making the init and clean up parts of the code a smaller and smaller percentage of the execution time. Micro optimizations in those parts of the code provide no value. Next is the constant movement to push the data processing into either specialized CPU instructions or GPUs.
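The init -> process data -> clean up point is essentially Amdahl's law. A rough sketch of the arithmetic, where the 2% share and the 5% local gain are made-up numbers for illustration, not measurements:

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the
    runtime is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# Hypothetical: init + clean up are 2% of runtime and get 5% faster.
print(f"{overall_speedup(0.02, 1.05):.4f}")

# Hypothetical: the same 5% gain applied to the 98% data-processing part.
print(f"{overall_speedup(0.98, 1.05):.4f}")
```

With those assumed numbers the first case comes out at roughly a 0.1% gain overall and the second at a bit under 5%, which is why the same optimization matters or not depending on where it lands.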
A single optimization pass might only improve a microbenchmark a bit, but all passes taken together significantly speed up most programs. In the embedded software that I have experience with, we eventually had to turn on optimizations because otherwise we would have had to switch to a new hardware platform to run increasingly demanding workloads.
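To put a number on how small per-pass gains can add up, here is a back-of-the-envelope sketch. It assumes every pass gives an independent, multiplicative gain, which real compiler passes do not guarantee (they overlap and interact), and the count of 100 passes at 0.5% each is purely hypothetical:

```python
def combined_speedup(per_pass_gains):
    """Multiply independent per-pass gains (0.005 means 0.5% faster)."""
    total = 1.0
    for gain in per_pass_gains:
        total *= 1.0 + gain
    return total

# Hypothetical: 100 passes, each worth 0.5% on its own.
print(f"{combined_speedup([0.005] * 100):.3f}")
```

Under that idealized assumption the total is about 1.65x, far more than any single pass would suggest.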