50 states, represented probably 500 different ways... full names, partial names, abbreviations, abbreviations with dots, abbreviations with dots and extra spaces, different cases...
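For what it's worth, most of those variants (case, dots, extra spaces) collapse with a few lines of normalization. A rough sketch in Python, assuming a hand-made lookup table; `STATE_ABBREVIATIONS` and `normalize_state` are hypothetical names, and the table is deliberately incomplete:

```python
import re

# Tiny illustrative lookup; a real table would cover every state and territory.
STATE_ABBREVIATIONS = {
    "california": "CA",
    "new york": "NY",
    "texas": "TX",
}
VALID_CODES = set(STATE_ABBREVIATIONS.values())


def normalize_state(raw):
    """Return a two-letter code, or None if the input is not recognized."""
    # Turn dots into spaces, collapse runs of whitespace, trim the ends.
    cleaned = re.sub(r"\s+", " ", raw.replace(".", " ")).strip()
    # Abbreviations: "N. Y." becomes "N Y", then "NY".
    compact = cleaned.replace(" ", "").upper()
    if compact in VALID_CODES:
        return compact
    # Full names: case-insensitive lookup in the table.
    return STATE_ABBREVIATIONS.get(cleaned.lower())
```

Partial names like "Calif" come back as `None` here, so they would still need a human or a fuzzier matcher.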
That data happens to take the form of human input. In that case, the different ways in which participants choose to enter their address are part of what you are collecting. That does not suggest the data is raw or dirty; you just gathered superfluous information as a side effect of your methodology. I think it is better described as overly complicated.
If the problem were the opposite, it would also make sense to say that the data is too simple rather than too clean. That gets to what the article is saying: all data collection is inherently biased.
Well, you can't answer "50 states" just because you had 100,000 people. What if the source of that data was a CA insurance query form where they must enter, specifically, their CA address? (I'm assuming you'd still get other states, but it might not be 50.)