HN Mirror



	by eximius 2720 days ago
	50 states, probably. ;) Edit: assuming a very US-centric dataset...

2 comments

icedchai 2720 days ago

50 states, represented probably 500 different ways... full names, partial names, abbreviations, abbreviations with dots, abbreviations with dots and extra spaces, different cases...
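As a rough illustration of collapsing those variants, here is a minimal Python sketch. The lookup table and the `normalize_state` helper are hypothetical, only three states are listed, and partial names such as "Calif" are not handled; a real dataset would need a full table and probably fuzzy matching.

```python
import re

# Hypothetical lookup table; only a few states are listed to keep the sketch short.
STATE_ABBREVIATIONS = {
    "california": "CA",
    "new york": "NY",
    "texas": "TX",
}
VALID_CODES = set(STATE_ABBREVIATIONS.values())


def normalize_state(raw):
    """Return a two-letter code for a free-text state entry, or None if unrecognized."""
    # Turn dots into spaces, collapse runs of whitespace, and ignore case.
    cleaned = re.sub(r"\s+", " ", raw.replace(".", " ")).strip().lower()
    if cleaned in STATE_ABBREVIATIONS:
        return STATE_ABBREVIATIONS[cleaned]
    # Handle abbreviations such as "N.Y." or "n. y." by removing the remaining spaces.
    compact = cleaned.replace(" ", "").upper()
    return compact if compact in VALID_CODES else None
```

Entries that come back as `None` are the ones worth looking at by hand, since they are either typos, partial names, or places the table does not cover.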


porphyrogene 2720 days ago

That data happens to take the form of human input. In that case the different ways in which participants choose to enter their address are part of what you are collecting. That does not suggest that the data is raw or dirty; you just gathered superfluous information as a side effect of your methodology. I think it is better described as overly complicated.

If the problem were the opposite, it would also make sense to say that the data is too simple rather than too clean. That gets to what the article is saying: that all data collection is inherently biased.


rhacker 2720 days ago

Well, you can't answer "50 states" just because you had 100,000 people. What if the source of that data was a CA insurance query form where they must enter, specifically, their CA address? (I'm assuming you'd still get other states, but it might not be 50.)


teej 2720 days ago

Plus Washington DC, the US territories, and US military bases abroad.
