Is anyone else surprised by the sample size? I'm not sure that eight baseballs provide enough data for any relevant conclusions, especially when the differences are so small to begin with.
Assuming there have always been 50% light balls and 50% heavy balls, the probability of picking 4/4 heavy balls from the earlier period and 4/4 light balls from the later period by chance is 0.4%.
This statistic might be less relevant, though, if the balls were not randomly selected.
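For anyone who wants to check the 0.4% figure, here is a rough sketch of the arithmetic. It assumes each ball is an independent 50/50 draw between heavy and light, which is the assumption stated above; the function name and the "either direction" variant are my own additions for illustration, not something from the study.

```python
from fractions import Fraction


def prob_clean_split(n_per_group: int = 4, p_heavy: Fraction = Fraction(1, 2)) -> Fraction:
    """Chance that all early balls are heavy AND all late balls are light.

    Assumes every ball is an independent draw that is heavy with
    probability p_heavy (hypothetical model, per the 50/50 assumption).
    """
    p_light = 1 - p_heavy
    return p_heavy ** n_per_group * p_light ** n_per_group


# The case described above: 4 heavy early balls, 4 light late balls.
one_direction = prob_clean_split()

# If a clean split in the opposite direction would have been just as
# striking, the chance of either clean split is twice as large.
either_direction = 2 * one_direction

print(f"one direction:    {one_direction} = {float(one_direction):.2%}")
print(f"either direction: {either_direction} = {float(either_direction):.2%}")
```

With four balls per period this works out to 1/256, which rounds to the 0.4% quoted above. None of this addresses the random-selection caveat; it only checks the arithmetic.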
If all the balls had shown variation within the same tolerances, that would be true, but the balls were grouped in a meaningful way, and the balls within each group showed negligible differences among themselves.
Because of this tight variation within the groups, 8 balls are enough IMHO.