Closeness to the original audio samples doesn't tell you much about how it sounds, really. These codecs try to drop information in ways that are specifically as inaudible as possible to humans.
So, you could easily have an encoding which sounds worse than a different encoding, even though it's actually numerically closer to the original, sample for sample.
For example, humans hear very poorly, if at all, above 20 kHz, so you can pretty safely drop any information about sounds above that. A bad codec would keep that information around and instead drop something in the usually audible range. And then it'd sound worse, even though it's really close numerically to the original, thanks to all the information above 20 kHz that it didn't drop.
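Here's a rough toy version of that 20 kHz example, just to make the numbers concrete. Everything in it is made up for illustration: the 96 kHz sample rate, the 1 kHz and 30 kHz tones, their amplitudes, and the two "codecs", which are just "throw away one of the tones" and not anything a real codec does. It only shows that mean squared error against the original can rank the two the wrong way round.

```python
import math

SAMPLE_RATE = 96_000  # Hz; assumed, chosen so a 30 kHz tone can be represented


def tone(freq_hz, amplitude, n=SAMPLE_RATE):
    """One second of a sine tone as a list of float samples."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]


def mse(a, b):
    """Mean squared error between two equally long sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical "original": a quiet audible tone plus a louder ultrasonic one.
audible = tone(1_000, 0.3)
ultrasonic = tone(30_000, 0.6)
original = [a + u for a, u in zip(audible, ultrasonic)]

# Toy "sensible codec": drops the ultrasonic part, keeps the audible part.
sensible = audible
# Toy "bad codec": keeps the ultrasonic part, drops the audible part.
bad = ultrasonic

mse_sensible = mse(original, sensible)
mse_bad = mse(original, bad)

print(f"sensible codec MSE: {mse_sensible:.4f}")
print(f"bad codec MSE:      {mse_bad:.4f}")
print("bad codec is numerically closer:", mse_bad < mse_sensible)
```

The "bad" output has the smaller error even though it has thrown away the only thing a listener could have heard.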
So, yeah, you actually need humans to rate the quality.