

by radarsat1 3160 days ago

Of course everyone agrees that "cleaning data" is difficult and boring, and it's always mentioned, but what I don't really understand is what kind of tools people expect for this beyond what is already available. E.g. pandas is pretty good at merging tables, re-ordering, finding duplicates, filling or dropping unknowns, etc. There are also tools for visualizing large amounts of data, looking for outliers, etc. Beyond the basic tools it seems to me that each dataset requires decisions to be made that can't be automated (e.g. do I drop or fill the unknowns?). I don't see how this could be improved, as every decision has a solid, semantic implication related to whatever the overarching research question is.
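To make the drop-or-fill question concrete, here is a minimal sketch in plain Python. The readings and the helper names `drop_unknowns` and `fill_unknowns` are made up for illustration and are not from any library. It only shows that the same column gives different summary statistics depending on the choice, which is why the choice depends on the research question.

```python
from statistics import mean

# Hypothetical sensor readings; None marks an unknown value.
readings = [10.0, 12.0, None, 11.0, None, 50.0]


def drop_unknowns(values):
    """Keep only the known values."""
    return [v for v in values if v is not None]


def fill_unknowns(values, fill_value):
    """Replace every unknown with a chosen constant."""
    return [fill_value if v is None else v for v in values]


dropped = drop_unknowns(readings)
filled_zero = fill_unknowns(readings, 0.0)
filled_mean = fill_unknowns(readings, mean(dropped))

# Same data, three cleaning decisions, three different pictures.
for label, values in [
    ("dropped", dropped),
    ("filled with 0", filled_zero),
    ("filled with mean", filled_mean),
]:
    print(f"{label}: n={len(values)}, mean={mean(values):.2f}")
```

Filling with zero pulls the mean down, while filling with the mean keeps the mean but inflates the sample size. Neither is wrong in general; which one is acceptable depends on what the unknowns actually mean.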

So statements like "getting data ready for the algorithms" seem kind of meaningless to me, in the sense of general methodologies. How could you possibly "get the data ready" without considering what it is, how it will be used, etc. How can it possibly be generalized to anything beyond the specific requirements of each problem instance?

I'm just really curious what you are imagining when you say that better tools are needed here.

3 comments

kmax12 3160 days ago

I am the lead contributor of a Python library called Featuretools[0]. It is intended to perform automated feature engineering on time-varying and multi-table datasets. We see it as bridging a gap between pandas and machine learning libraries like scikit-learn. It doesn't necessarily handle data cleaning, but it does help get raw datasets ready for machine learning algorithms.

We have actually used it to compete successfully on Kaggle without having to perform manual feature engineering.

[0] https://www.featuretools.com


ScottBurson 3160 days ago

Wow, this looks very cool!


IanCal 3160 days ago

I'm starting to build up various utilities to help with this kind of thing, but I fully agree. The decisions require understanding the business requirements (do I use source X or Y for field 1, what errors are OK, what types of error are worst, etc), but the process of finding some of these could be better.

One simple one is missing data. Missing data is rarely a null; I've seen (in one field, in one dataset):

    N/A
    NA
    " "
    Blank # literally the string "Blank"
    NULL # Again, the string
    No data
    No! Data
    No data was entered for this field
    No data is known
    The data is not known
    There is no data

And many, many more. None of these can be reliably identified automatically, but some processes are useful, such as:

Pull out the most common items, manually mark some as "equivalent to blank", and remove them.

Identify common substrings that match known text (N/A, NULL, etc.) and bring up those examples.

I'd like to extend this with more clustering and analysis to bring out other issues that are common in general but rare in any one specific form. There are lots of similar problems with encodings, etc. too.
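The two processes above could be sketched roughly as follows. This is only an illustration using the standard library: the sample values, the marker lists, and the function names (`normalize`, `most_common_values`, `flag_candidates`) are assumptions, not part of any existing tool. The output is a list of candidates for a human to review, not an automatic decision.

```python
from collections import Counter

# Hypothetical raw values from a single field.
values = [
    "N/A", "NA", " ", "Blank", "NULL", "No data", "42",
    "No data was entered for this field", "The data is not known",
    "17", "N/A", "Banana",
]

# Assumed marker lists; a real dataset would need its own.
EXACT_MARKERS = {"", "na", "blank"}
SUBSTRING_MARKERS = ("n/a", "null", "no data", "not known")


def normalize(value):
    """Lowercase and collapse whitespace so near-identical strings group together."""
    return " ".join(value.lower().split())


def most_common_values(values, n=5):
    """Most frequent normalized values, for manual marking as 'equivalent to blank'."""
    return Counter(normalize(v) for v in values).most_common(n)


def flag_candidates(values):
    """Distinct values that look like a missing-data marker, in first-seen order."""
    flagged, seen = [], set()
    for value in values:
        key = normalize(value)
        if key in seen:
            continue
        seen.add(key)
        if key in EXACT_MARKERS or any(m in key for m in SUBSTRING_MARKERS):
            flagged.append(value)
    return flagged


print(most_common_values(values, n=3))
for candidate in flag_candidates(values):
    print(repr(candidate))
```

Short markers such as "na" are matched exactly rather than as substrings, so that a legitimate value like "Banana" is not flagged. Entries such as "No! Data" would still slip through, which is where the clustering mentioned above would have to help.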

Other things that might be good are clearer ways to supply general conditions I expect to hold true, and then to have the most egregious violations brought to my attention so I can either clear them out or deal with them in some other way. A good way of recording issues that have already been analysed and found to be OK would be great too.
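One possible shape for this, as a rough sketch only: each expected condition is a function that returns 0 when the condition holds and a positive severity otherwise, and a set of already-reviewed (record, rule) pairs suppresses known-OK cases. The records, rule names, and the `find_violations` helper are all hypothetical.

```python
# Hypothetical records to check.
records = [
    {"id": 1, "age": 34, "price": 19.99},
    {"id": 2, "age": -5, "price": 5.00},
    {"id": 3, "age": 212, "price": 12.50},
    {"id": 4, "age": 40, "price": -1.00},
]


def age_in_range(rec):
    """Severity is the distance outside the assumed valid range 0..120."""
    age = rec["age"]
    if age < 0:
        return -age
    if age > 120:
        return age - 120
    return 0


def price_non_negative(rec):
    """Severity is how far below zero the price is."""
    return max(0.0, -rec["price"])


RULES = {
    "age_in_range": age_in_range,
    "price_non_negative": price_non_negative,
}

# Issues already analysed and found to be OK, as (record id, rule name) pairs.
acknowledged = {(3, "age_in_range")}


def find_violations(records, rules, acknowledged):
    """Return unacknowledged violations as (severity, id, rule), worst first."""
    violations = []
    for rec in records:
        for name, rule in rules.items():
            severity = rule(rec)
            if severity > 0 and (rec["id"], name) not in acknowledged:
                violations.append((severity, rec["id"], name))
    return sorted(violations, reverse=True)


for severity, rec_id, name in find_violations(records, RULES, acknowledged):
    print(f"record {rec_id} breaks {name} (severity {severity})")
```

Severities from different rules are not really comparable, so in practice the ranking would probably need per-rule scaling or a ranking within each rule.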


philvb 3159 days ago

Yes, completely agree that each dataset requires decisions to be made that can't be automated, but there are huge opportunities for tools to assist users in understanding what cleaning decisions they might want to make and how those decisions affect the data. Most data cleaning tools do a very poor job of helping the user visualize and understand the impact cleaning has on data - they're usually very low level (such as pandas).

As an example of such a tool: Trifacta (disclaimer: I work here), https://www.trifacta.com/products/wrangler/. We're trying to improve data cleaning with features such as suggesting transforms the user might want, integrating data profiling through all stages to help discover and understand issues, and previewing transforms so the user can understand their impact.

I think there's a huge opportunity for better tools in the problem space.
