by kmax12
3160 days ago
In "What barriers are faced at work?", I really wish they had broken down the "dirty data" response into more categories. In particular, I'd love to know whether people are dealing with data quality issues, feature engineering issues, or something else altogether. In my opinion, this is representative of the problems with data science tools today. There is so much focus on the machine learning algorithms rather than on getting data ready for the algorithms. While there is a question that lets respondents pick which of 15 different modeling algorithms they use, there's nothing that asks what technologies people use to deal with "dirty data", which is agreed to be the biggest challenge for data scientists. I think formal study of data preparation and feature engineering is too frequently ignored in the industry.
Two examples of this: Kaggle Datasets supports wiki-like editing of metadata (file and column descriptions) and makes it easy to see, fork, and build on all the analytics created on the data so far.
We're just getting started with each of these products: we want Kaggle Datasets to support a fully collaborative model around working with all your data in the future, and Kaggle Kernels to support every analytics and machine learning use case.