I would say that the geometric mean is the usual way of averaging benchmark scores. It has the property that a given relative speedup on a component benchmark always has the same effect on the aggregate score, no matter which component it is or how much that component has already been improved. With an arithmetic mean, the component benchmarks with longer runtimes will dominate the aggregate. Normalizing the results before applying the arithmetic mean doesn't really help either -- the first X% improvement to a component benchmark would still be valued more than the second X% speedup.
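
To make that concrete, here is a small sketch in Python. The runtimes are made-up numbers (seconds, lower is better), not results from any real benchmark suite, and the helper names are just for illustration. It halves the runtime of one component at a time and compares how each mean reacts, then applies two successive 10% runtime reductions to a normalized score.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # n-th root of the product, computed through logs to avoid overflow
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical runtimes in seconds for three component benchmarks.
baseline = [1.0, 10.0, 100.0]

# Halve the runtime of one component at a time; record aggregate / baseline.
gm_ratios, am_ratios = [], []
for i in range(len(baseline)):
    improved = list(baseline)
    improved[i] /= 2.0
    gm_ratios.append(geometric_mean(improved) / geometric_mean(baseline))
    am_ratios.append(arithmetic_mean(improved) / arithmetic_mean(baseline))

print("geometric: ", [round(r, 4) for r in gm_ratios])
print("arithmetic:", [round(r, 4) for r in am_ratios])

# Normalized scores (runtime / baseline runtime), with two successive
# 10% runtime reductions applied to the first component only.
step0 = [1.0, 1.0, 1.0]
step1 = [0.9, 1.0, 1.0]
step2 = [0.81, 1.0, 1.0]

# Arithmetic mean: absolute gain from the first and second reduction.
first_drop = arithmetic_mean(step0) - arithmetic_mean(step1)
second_drop = arithmetic_mean(step1) - arithmetic_mean(step2)
print("arithmetic drops:", first_drop, second_drop)

# Geometric mean: relative change from the first and second reduction.
gm_step1 = geometric_mean(step1) / geometric_mean(step0)
gm_step2 = geometric_mean(step2) / geometric_mean(step1)
print("geometric step ratios:", gm_step1, gm_step2)
```

With these numbers the geometric mean shrinks by the same factor (the cube root of 0.5) whichever component is halved, while the arithmetic mean barely moves for the 1-second benchmark and drops sharply for the 100-second one. On the normalized scores, the arithmetic mean credits the first 10% reduction more than the second, whereas the geometric mean changes by the same ratio both times.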