
by aarondia 1492 days ago

This is a great point and something that we're actively working on improving in Mito. If you have millions of rows of data, it's not enough to just scroll through them; you need tools to build your understanding.

Some of the tools that you mentioned exist in Mito today. For example, Mito generates summary information about each column (all of the .describe() info along with a histogram of the data). And we're creating features for gaining a global understanding of the data too.
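For readers who want the same kind of per-column summary in plain pandas, here is a rough sketch. It is not Mito's implementation; `summarize_column` and the `price` column are hypothetical names made up for illustration, and the histogram is computed with NumPy rather than drawn.

```python
import numpy as np
import pandas as pd


def summarize_column(series: pd.Series, bins: int = 10) -> dict:
    """Hypothetical helper: .describe() statistics plus histogram counts."""
    clean = series.dropna()  # np.histogram cannot bin missing values
    counts, edges = np.histogram(clean, bins=bins)
    return {
        "describe": series.describe().to_dict(),
        "hist_counts": counts.tolist(),
        "hist_edges": edges.tolist(),
    }


# Made-up example data with one missing value.
df = pd.DataFrame({"price": [1.0, 2.0, 2.5, 3.0, 10.0, None]})
summary = summarize_column(df["price"], bins=3)
print(summary["describe"])
print(summary["hist_counts"])
```

The histogram makes the single large value stand out in a way that the mean and standard deviation alone do not.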

In practice, one of the main ways that we see people use Mito is for that initial exploration of the data. Often the first thing that users do when they import data into Mito is to correct the column dtype, delete columns that are irrelevant to their analysis, and filter out/replace missing values.
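Those three first steps look roughly like this in plain pandas. This is only a sketch of the equivalent operations, not the code Mito generates; the DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up data as it might arrive from a CSV: everything is a string.
df = pd.DataFrame({
    "order_id": ["1", "2", "3", "4"],
    "amount": ["10.5", "n/a", "7", None],
    "internal_note": ["a", "b", "c", "d"],
})

# 1. Correct the column dtypes.
df["order_id"] = df["order_id"].astype("int64")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # unparseable -> NaN

# 2. Delete columns that are irrelevant to the analysis.
df = df.drop(columns=["internal_note"])

# 3. Either filter out rows with missing values, or replace them.
filtered = df.dropna(subset=["amount"])
replaced = df.fillna({"amount": 0.0})

print(filtered)
print(replaced)
```

Whether to drop or fill missing values depends on the analysis; filling with 0.0 here is just a placeholder choice.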

1 comment

pbronez 1492 days ago

It would be super fun to implement an intelligent head() function that shows a representative sample rather than the first X rows. Do the profiling & identify a collection of rows that represent the overall distribution.

You could develop some IP around efficient and effective ways to do this. Probably would require an ensemble of unsupervised methods.
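As a much simpler stand-in for the ensemble idea, a single-column version can be sketched with quantile bins: split one numeric column into equal-frequency bins and take one row from each. `representative_head` is a hypothetical name, and stratifying on one column is an assumption made to keep the example short; it does not profile the whole table.

```python
import pandas as pd


def representative_head(df: pd.DataFrame, column: str, n: int = 5, seed: int = 0) -> pd.DataFrame:
    """Hypothetical helper: one random row from each of n quantile bins of `column`."""
    # Equal-frequency bins; labels=False yields integer bin ids 0..n-1.
    bins = pd.qcut(df[column], q=n, labels=False, duplicates="drop")
    sample = df.groupby(bins).sample(n=1, random_state=seed)
    return sample.sort_values(column)


# Made-up data: head() would only ever show the five smallest values.
df = pd.DataFrame({"value": range(100)})
sample = representative_head(df, "value", n=5)
print(df.head())
print(sample)
```

With heavily repeated values, `duplicates="drop"` can merge bins, so fewer than `n` rows may come back.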


aarondia 1492 days ago

That's a cool idea! One helpful .head() variant could surface the rows with the most unusual data types. That would help you identify which columns have mixed dtypes: mostly numbers, plus some cells that are supposed to be numbers but are actually strings because they contain extra decimal points.
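A rough sketch of that check in pandas, assuming the column is held as an object column with a mix of Python floats and strings. `mixed_type_rows` is a hypothetical name, and the data is made up.

```python
import pandas as pd


def mixed_type_rows(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cells whose Python type differs from the column's most common type."""
    type_names = series.map(lambda value: type(value).__name__)
    dominant = type_names.value_counts().idxmax()
    return series[type_names != dominant]


# Made-up column: mostly floats, one string with an extra decimal point.
df = pd.DataFrame({"amount": [10.5, 7.0, "3.14.15", 2.0, 8.25]})
odd = mixed_type_rows(df["amount"])
print(odd)
```

If the whole column was read in as strings, comparing types will not help; in that case running `pd.to_numeric(series, errors="coerce")` and looking at which non-null cells turn into NaN is one alternative.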
