|
|
|
|
|
by followmeon
3160 days ago
|
|
Dirty data is not as much as a problem for me than human-biased data. Dirty data engineering, like modeling, will soon be largely automated. Let's say you are predicting store sales. You create a feature that holds the store sales of one year back. The feature works really well and you are happy with your evaluation. But you captured bias: The previous model the store used was "predict today's sales by looking at last year's sales". Store managers fitted their sales tactics to this model (when the model predicted too much sales, the store managers do their best to get rid of the surplus inventory, for instance: by adding discounts or moving the products to a more prominent spot in the store). So in the end you end up with a model with good evaluation, but you actually have (over)fitted to previous policies/models. You have not created the best possible sales predictor. How to ever find this out, without a costly intimate deep-dive in the data and data generation processes? |
|
I don't agree. For every modern tech company that collects data that lends itself to automated data cleaning, there's a 40+ year old company that defined what data to be collected in 1990, designed an "automated system" in 1995 and has been shoehorning improvements on that system since then.
At my last job I was given access to a database with 150+ tables with no data dictionary. The person who wrote the load process and ETL (the output was a lot of summaries) had left 10 years before and nobody truly understood how anything actually worked or the downstream dependencies. It took me a week of digging just to find out which of those 150 tables were just temp tables for one of the many queries that executed on that system.
It's going to be a while before somebody figures out how to clean that data automatically, or even find issues in that data. That's the reality of the world of data for many organizations.