by gsheni 1477 days ago
If your dataset has hundreds of columns, do you really want to individually identify the correct type for each column?

With Woodwork, an open source library for rich semantic data typing, we made type identification fast, simple, and effortless.

Read about our work to understand how we added type inference for natural language columns.

1 comment

For a long time, data analysis products have had "profiling" tools

https://en.wikipedia.org/wiki/Data_profiling

which can look at the values in a column and make some inferences about the column, such as "these are all integers between 35 and 89". Most of those work at the level of the whole column, but I worked at a firm that developed a convolutional network classifier that could take either a single data point (say "1999-08-24") or the column header text plus the data point ("Independence Date", "1998-08-24") and guess at the data type (e.g. "date", "address", ...).
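To make the whole-column kind of profiling concrete, here is a rough Python sketch. It is only an illustration of the idea, not how any particular product (or Woodwork) does it. The function name `profile_column` and the two rules it tries (integers, then ISO dates) are my own assumptions.

```python
from datetime import date


def profile_column(values):
    """Describe what all string values in a column have in common.

    Hypothetical helper for illustration; real profilers check many more rules.
    """
    if not values:
        return "empty"
    # Rule 1: every value parses as an integer.
    try:
        numbers = [int(v) for v in values]
        return f"integers between {min(numbers)} and {max(numbers)}"
    except ValueError:
        pass
    # Rule 2: every value parses as an ISO date (YYYY-MM-DD).
    try:
        dates = [date.fromisoformat(v) for v in values]
        return f"dates between {min(dates)} and {max(dates)}"
    except ValueError:
        pass
    # Nothing matched for the whole column.
    return "free text"


print(profile_column(["35", "89", "60"]))
print(profile_column(["1999-08-24", "1998-08-24"]))
```

Note that a single odd value makes the whole column fall through to "free text", which is one reason a per-value classifier is attractive.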

The classifier worked really well but wasn't explainable. Another disadvantage was that there were some things it was never going to figure out, such as the checksum on credit card numbers:

https://en.wikipedia.org/wiki/Luhn_algorithm
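
For reference, the Luhn check is only a few lines when written as an explicit rule, which is the kind of thing a hand-written validator handles easily. A minimal Python sketch follows. The function name `luhn_valid` and the choice to ignore spaces and hyphens are my own assumptions.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    # Allow common separators; anything else non-numeric is rejected.
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit (the check digit) to the left,
    # doubling every second digit.
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("79927398713"))  # test number from the Wikipedia article
print(luhn_valid("79927398710"))  # same number with a wrong check digit
```

A rule like this is deterministic and explainable, so it could sit alongside a learned classifier rather than replace it.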