by apohn
3160 days ago
>Dirty data engineering, like modeling, will soon be largely automated.

I don't agree. For every modern tech company that collects data that lends itself to automated data cleaning, there's a 40+ year old company that defined what data to collect in 1990, designed an "automated system" in 1995, and has been shoehorning improvements onto that system ever since. At my last job I was given access to a database with 150+ tables and no data dictionary. The person who wrote the load process and ETL (the output was a lot of summaries) had left 10 years before, and nobody truly understood how anything actually worked or what the downstream dependencies were. It took me a week of digging just to find out which of those 150 tables were only temp tables for one of the many queries that ran on that system. It's going to be a while before somebody figures out how to clean that data automatically, or even how to find the issues in it. That's the reality of the world of data for many organizations.
When I talk about automated data cleaning, I mean things like preprocessing text, handling missing values, discarding duplicates, removing noisy/uninformative variables and outliers, spelling correction, and feature interactions and transformations. All of these can be (and are being) largely automated. [1] [2]
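To make that concrete, here is a rough sketch of a few of those steps (duplicates, missing values, outliers, uninformative variables) in plain Python. It is only an illustration under my own assumptions: the `clean` function, the column names, the sample rows, and the 3.5 cutoff are all made up for the example, and it is not the method used in [1] or [2].

```python
import statistics


def clean(rows, numeric_col, max_dev=3.5):
    """Hypothetical helper: apply a few routine cleaning steps to a list of dicts."""
    # 1. Discard exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Impute missing numeric values with the median of the observed ones.
    observed = [r[numeric_col] for r in unique if r[numeric_col] is not None]
    median = statistics.median(observed)
    for r in unique:
        if r[numeric_col] is None:
            r[numeric_col] = median

    # 3. Drop outliers: rows further than max_dev median absolute
    #    deviations (MAD) from the median. Skipped when the MAD is zero.
    mad = statistics.median([abs(r[numeric_col] - median) for r in unique])
    if mad > 0:
        unique = [r for r in unique
                  if abs(r[numeric_col] - median) / mad <= max_dev]

    # 4. Drop uninformative columns (the same value in every remaining row).
    if len(unique) > 1:
        constant = [c for c in unique[0]
                    if len({r[c] for r in unique}) == 1]
        unique = [{k: v for k, v in r.items() if k not in constant}
                  for r in unique]
    return unique


# Made-up sample data: one duplicate, one missing value, one wild outlier,
# and a "source" column that never varies.
rows = [
    {"id": 1, "source": "crm", "amount": 10.0},
    {"id": 2, "source": "crm", "amount": 12.0},
    {"id": 2, "source": "crm", "amount": 12.0},
    {"id": 3, "source": "crm", "amount": None},
    {"id": 4, "source": "crm", "amount": 11.0},
    {"id": 5, "source": "crm", "amount": 9000.0},
]
cleaned = clean(rows, "amount")
```

None of this needs to know what the data means, which is why it automates well. It also only works once you already know which table and which column to point it at.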
A data lake with 150+ undocumented tables is garbage in, garbage out, both for machines and for humans. I'd almost label that barrier "Data not available" rather than "Dirty data". While it is a reality for some companies, such a company really needs a DB admin or data engineer, rather than trying to shoehorn an (expensive) data scientist into those roles.
[1] https://people.csail.mit.edu/kalyan/dsm/
[2] https://www.ijcai.org/proceedings/2017/352