Do you mean 10 preceding versions, or 10 repeated timings of the same version? If you repeat the timing for each version many times, why is that not enough to smooth out the noise?
Instead of averaging, I can recommend my go-to L-estimator for this sort of thing: the midsummary. Take the average of the 40th and 60th percentiles as your measure of central tendency of performance. It only looks at the middle of the distribution, so a few unusually slow runs barely move it.
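In case it helps, here is a minimal sketch of that estimator in plain Python. The names `percentile` and `midsummary` and the sample timings are made up for illustration, and the percentile uses linear interpolation between closest ranks, which is one common convention among several.

```python
def percentile(sorted_data, p):
    """p-th percentile (0-100) of already-sorted data, linear interpolation."""
    if not sorted_data:
        raise ValueError("need at least one value")
    k = (len(sorted_data) - 1) * p / 100.0   # fractional rank
    lo = int(k)                              # floor, since k >= 0
    hi = min(lo + 1, len(sorted_data) - 1)
    return sorted_data[lo] + (sorted_data[hi] - sorted_data[lo]) * (k - lo)


def midsummary(timings, lower=40.0, upper=60.0):
    """Average of the lower and upper percentiles of the timings."""
    data = sorted(timings)
    return (percentile(data, lower) + percentile(data, upper)) / 2.0


# Hypothetical timings (seconds) for one version, with one slow outlier run.
timings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 100.0]

mean_time = sum(timings) / len(timings)
mid_time = midsummary(timings)
print(f"mean = {mean_time:.2f}, midsummary = {mid_time:.2f}")
```

With these numbers the single 100-second run drags the mean well above every typical timing, while the midsummary stays in the middle of the bulk of the runs.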