|
|
|
|
|
by radarsat1
3160 days ago
|
|
Of course everyone agrees that "cleaning data" is difficult and boring, and it's always mentioned, but what I don't really understand is what kind of tools people expect for this beyond what are already available. E.g. pandas is pretty good at merging tables, re-ordering, finding doubles, filling or dropping unknowns etc. There are also tools for visualizing large amounts of data, look for outliers, etc. Beyond the basic tools it seems to me that each dataset requires decisions to be made that can't be automated. (e.g. do I drop or fill the unknowns?) I don't see how this could be improved, as every decision has a solid, semantic implication related to whatever is the overarching research question. So statements like "getting data ready for the algorithms" seem kind of meaningless to me, in the sense of general methodologies. How could you possibly "get the data ready" without considering what it is, how it will be used, etc. How can it possibly be generalized to anything beyond the specific requirements of each problem instance? I'm just really curious what you are imagining when you say that better tools are needed here. |
|
We have actually used it to compete on Kaggle without having to perform manual feature engineering with success.
[0] https://www.featuretools.com