It can. That's part of the Blu-ray spec. But it's not standardized in streaming video AFAIK (not that Netflix has to care about that, they have their own player) and, even if the feature exists, somebody still has to go do it.
A speech recognition model can give you a reading on how understandable the speech is, and you can use that reading to guide the dialogue channel's volume in the mix.
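Rough sketch of what I mean, in Python. It assumes you already have some intelligibility score between 0 and 1 from a recognizer (say, average word confidence). The function names, the 0.9 target and the 6 dB cap are all made up for illustration, not anything a real player does.

```python
def dialogue_gain_db(intelligibility, target=0.9, max_boost_db=6.0):
    """Map an intelligibility score in [0, 1] to a dialogue boost in dB.

    Hypothetical helper: the score is assumed to come from a speech
    recognizer; target and max_boost_db are arbitrary illustrative values.
    """
    # Clamp the score so out-of-range inputs cannot produce odd gains.
    score = min(max(intelligibility, 0.0), 1.0)
    if score >= target:
        return 0.0  # speech is already clear enough, leave the mix alone
    # Linear ramp: the further below target, the more boost, up to the cap.
    return max_boost_db * (target - score) / target


def apply_gain(samples, gain_db):
    """Scale a list of float samples by a gain given in decibels."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]
```

You would run this per scene or per few seconds of audio, feed the result to the dialogue channel only, and smooth the gain over time so it doesn't pump.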
OTOH, a lot of these models end up trained on features that are very different from what humans hear, so the model's score may not track what a listener can actually make out.
The defaults for automatic sound mixing will almost always be wrong. And they will differ in how they are wrong from consumer box to consumer box.