It could be more useful for comparing model performance than vibes or benchmarks alone. For example, you could compare the average line count per change or the revert rate for each model. Perhaps a paper will come out in the near future that scrapes AI usage in public repos for a broader dataset.
If, say, a certain version of Claude tends to be better at front-end than back-end work, that matters when deciding how to use it in the future, just as it does when managing human developers.
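
As a rough sketch of the per-model comparison mentioned above, something like the following could work once each change is tagged with the model that produced it. The record fields (`model`, `lines_changed`, `reverted`), the model names, and the numbers are all made-up placeholders for illustration, not real data or an existing schema.

```python
from collections import defaultdict

# Hypothetical records, one per AI-attributed change.
# Field names and values are illustrative assumptions only.
commits = [
    {"model": "model-a", "lines_changed": 120, "reverted": False},
    {"model": "model-a", "lines_changed": 80, "reverted": True},
    {"model": "model-b", "lines_changed": 40, "reverted": False},
    {"model": "model-b", "lines_changed": 60, "reverted": False},
]


def summarize_by_model(records):
    """Return average lines per change and revert rate for each model."""
    totals = defaultdict(lambda: {"changes": 0, "lines": 0, "reverts": 0})
    for record in records:
        t = totals[record["model"]]
        t["changes"] += 1
        t["lines"] += record["lines_changed"]
        t["reverts"] += int(record["reverted"])
    return {
        model: {
            "avg_lines_per_change": t["lines"] / t["changes"],
            "revert_rate": t["reverts"] / t["changes"],
        }
        for model, t in totals.items()
    }


summary = summarize_by_model(commits)
for model, stats in sorted(summary.items()):
    print(
        f"{model}: {stats['avg_lines_per_change']:.1f} lines/change, "
        f"{stats['revert_rate']:.0%} reverted"
    )
```

Adding a field such as the area of the codebase touched (front-end vs. back-end) to each record and grouping on it as well would give the kind of breakdown described above, though with small samples the differences may just be noise.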