You don't need to go to such complicated lengths. Just perform enough tests (as in, a statistically large enough number) and a distribution will form. That also captures the variability of real-world network effects.
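To make the "enough tests and a distribution will form" point concrete, here is a rough sketch of how repeated runs could be summarized. The function name `summarize`, the browser labels, and every number in it are made up for illustration; they are not measurements from any real test, and the interval uses a simple normal approximation.

```python
import math
import statistics


def summarize(samples, z=1.96):
    """Return count, mean, sample standard deviation, and an approximate 95% CI."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Normal approximation; for very few runs a t-interval would be wider.
    half_width = z * stdev / math.sqrt(n)
    return {"n": n, "mean": mean, "stdev": stdev,
            "ci": (mean - half_width, mean + half_width)}


# Placeholder values only (minutes of battery life per run), NOT real data.
runs_a = [412.0, 398.0, 405.0, 421.0, 409.0, 401.0]
runs_b = [371.0, 390.0, 362.0, 385.0, 379.0, 368.0]

for name, runs in (("browser A", runs_a), ("browser B", runs_b)):
    s = summarize(runs)
    print(f"{name}: n={s['n']} mean={s['mean']:.1f} stdev={s['stdev']:.1f} "
          f"ci=({s['ci'][0]:.1f}, {s['ci'][1]:.1f})")
```

If the two intervals stay clearly apart as more runs are added, the difference is probably not just network noise.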
What about different ads served to different browsers? Someone running, say, Opera will have a different ad profile than a Chrome user even when completely blank cookie-wise.
It's hardly complicated. I've put such tests together in an afternoon. In fact, whatever is added in complexity is offset by the fact that fewer tests are necessary. With this approach you can also remove any questions about compression, use of HTTP/2, etc., which could affect the tests depending on server-side choices about how data is served to either platform. Equal always equals better.
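The "fewer tests are necessary" claim can be illustrated with the textbook sample-size estimate for comparing two means, n ≈ 2·((z_α/2 + z_β)·σ/δ)² runs per browser. This is only a sketch under a normal approximation; the helper name `runs_per_browser` and the noise and effect values below are hypothetical, not taken from any actual test.

```python
import math
from statistics import NormalDist


def runs_per_browser(sigma, delta, alpha=0.05, power=0.8):
    """Approximate runs needed per browser to detect a mean difference `delta`
    when run-to-run standard deviation is `sigma` (two-sided test)."""
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical numbers: we want to detect a 10-minute difference in battery life.
noisy = runs_per_browser(sigma=20.0, delta=10.0)       # uncontrolled conditions
controlled = runs_per_browser(sigma=5.0, delta=10.0)   # normalized conditions
print(f"noisy setup: {noisy} runs, controlled setup: {controlled} runs")
```

Because the required count scales with σ², halving the run-to-run noise cuts the number of runs to roughly a quarter.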
But those metrics are important. If servers serve more optimized pages to Edge users for some reason, that's a freaking important fact to know.
This is about real-world data and real experiences, and how they affect actual users.
You can normalize the tests to the point where there is absolutely zero difference between the browsers, of that I'm sure, but that will not reflect any actual cases that real users experience.
Within those specifications, the objection about the ad-block becomes irrelevant. If the browser just works better, then users don't care and can simply enjoy more battery time.
The case for more normalized tests is to find out which browser is factually better designed/written.