|
|
|
|
|
by axegon_
359 days ago
|
|
Not a data engineer but my work revolves around processing a ton of data(let's call it partial data engineering). Much of the data I get is inputted by humans from different sources and platforms and countries. My biggest pain in a nutshell - the human factor. Believe it or not, people have managed to misspell "Austria" over 11,000 times(accents, spaces, different encodings, alphabets, languages, null characters and so on. Multiply that by 250-something countries and multiply that by around 90-100 other fields which suffer from similar issues and multiply that by 2.something billion rows and you get the picture. |
|
My solution was to enable the user to create a 'dictionary' of valid values for each column in the table. In your case you would create a list of all valid country names for that particular column. When inserting a new row, it checks to make sure the country name matches one of the values. I thought this might slow things down significantly, but testing shows I can insert millions of rows with just a minor performance hit.
The next step is to 'auto correct' error values to the closest matching one instead of just rejecting it.