Hacker News
by santiagobasulto 1492 days ago
I like this. It's a "friendlier" way to browse data. That said, I have to add:

Exploring large datasets requires a COMPLETELY different mindset. When your data starts growing, it's impossible to keep it all in a visual format (for 2 reasons[0]) and you have to start thinking analytically. You have to start looking at the statistical values of your data to understand its shape. That's why the `.describe()` and `.info()` methods in Pandas are so useful. After many years doing this, I can "see" the shape of my data just by looking at the statistical information about it (mean, median, std, min, max, etc).

After some time you don't need to rely on visual tools; you can just run a few methods, look at some numbers, and understand all your data. Kinda feels like the operator in The Matrix who is watching the green numbers descend and knows what's going on behind the scenes.

[0] Your eyes are really inefficient at capturing information and there's only so much memory available: try loading a 15GB CSV in Excel.
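
For readers who haven't used them, here is a minimal sketch of the `.describe()` and `.info()` calls mentioned above. The DataFrame and its column names are made up for illustration; a real dataset would be loaded from a file or database.

```python
import pandas as pd

# Hypothetical toy data standing in for a much larger dataset.
df = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, None, 250.0],
    "qty": [1, 2, 2, 3, 1],
    "city": ["Lima", "Quito", None, "Lima", "Bogota"],
})

# Column dtypes, non-null counts and memory usage.
df.info()

# count, mean, std, min, quartiles (including the median) and max
# for the numeric columns.
summary = df.describe()
print(summary)
```

Even on this toy frame, the gap between the median and the max of `price` hints at an outlier without plotting anything.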

7 comments

I would caution against this approach in general (unless you’re working with unusually uniform data from a deterministic source; in my world that is rarely the case). Summary statistics are useful but taken in isolation they can mislead. One loses the ability to get a feel for interesting non-aggregated phenomena.

I find it’s important to actually “touch” the raw data, even if only in a buffered, random sampling sort of way, to get a feel for it. Sometimes with big datasets, looking through rows of data feels tedious and meaningless, but I’ve often picked up on things I wouldn’t have without actually looking at the raw data. Raw data is often flawed, but there’s often some signal in it that tells a story, so it’s important not to overlook that signal by viewing the data only through the lens of aggregate statistics.
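
One way to do that kind of buffered random sampling in pandas is sketched below. It assumes the data is a CSV too large to load at once; `sample_rows` is a hypothetical helper, and the in-memory CSV only stands in for a real file path.

```python
import io

import pandas as pd


def sample_rows(source, frac=0.01, chunksize=100_000, seed=0):
    """Read `source` in chunks and keep a random fraction of each chunk."""
    pieces = []
    for i, chunk in enumerate(pd.read_csv(source, chunksize=chunksize)):
        # A different seed per chunk, so chunks are sampled independently
        # while the overall result stays reproducible.
        pieces.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(pieces, ignore_index=True)


# Stand-in for a large file on disk (hypothetical columns).
csv_text = "id,value\n" + "\n".join(f"{i},{i % 7}" for i in range(10_000))
sample = sample_rows(io.StringIO(csv_text), frac=0.01, chunksize=1_000)
print(sample.head())
```

Only one chunk is in memory at a time, so the sample can be browsed row by row even when the full file cannot.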

The next step is to visualize the data multidimensionally in something like Tableau. Tableau works on very large datasets (it has an internal columnstore format called Hyper) and can dynamically disaggregate and drill down. Insights are usually obtained by looking at details, not aggregates.

A good example of what you are warning against is Anscombe’s quartet

https://en.wikipedia.org/wiki/Anscombe's_quartet
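
A quick check of the quartet using only the standard library; the values are the published Anscombe data, and `pearson` is a small helper written for this sketch.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# (mean of y, variance of y, correlation of x and y) per dataset.
stats = {
    name: (round(mean(ys), 2), round(variance(ys), 2), round(pearson(xs, ys), 2))
    for name, (xs, ys) in quartet.items()
}
for name, row in stats.items():
    print(name, row)
```

The four rows come out nearly identical, yet the scatter plots look nothing alike, which is the point of the example.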

Histograms and Boxplots (and IQRs) don't lie tho...
Boxplots don't lie, but they can mislead as any summary statistic, data viz or model can. https://blog.bioturing.com/wp-content/uploads/2018/11/BoxVio...

Whether a histogram misleads depends totally on the bin width tho.
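
A small illustration of the bin-width point, using made-up data with two clusters and a tiny text histogram; `histogram` is a hypothetical helper, not a library function.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the left edge of a bin."""
    return Counter((v // bin_width) * bin_width for v in values)


# Two clusters (around 2 and around 8) with a gap in between.
data = [1, 1, 2, 2, 2, 3, 7, 8, 8, 8, 9, 9]

for width in (10, 5, 2):
    bins = histogram(data, width)
    print(f"bin width {width}:")
    for left in range(0, 10, width):
        print(f"  [{left}, {left + width}) {'#' * bins[left]}")
```

With a width of 10 everything lands in one bar, with 5 the data looks flat, and only the narrower bins show the empty stretch between the two clusters.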

If you want to use open-source Python-based visualizations instead of Tableau, the following tools allow the creation of custom plots - including the ability to export the underlying code.

- bamboolib (proprietary license - acquired by Databricks in order to run within the Databricks notebooks)

- mito (GPL license)

- dtale (MIT license)

If you can write visualisations in Python itself, I am a big fan of Altair's syntax (https://github.com/altair-viz/altair), which is based on vega-lite. A while back, I wrote a brief guide and comparison of the main plotting libraries: https://datapane.com/reports/87NNEJ7/the-ultimate-guide-to-p...

One benefit of having them in actual code is that you can programmatically automate the creation of things like dashboards and reports. For instance, schedule a script to share an interactive plot every Monday morning, or build a live dashboard that updates every 10 minutes. This opens up a lot of possibilities that would be impossible in a traditional drag-and-drop tool.
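
As a rough sketch of that idea, the script below writes a dated HTML summary that a scheduler such as cron could run every Monday. It uses plain pandas rather than any particular plotting library; `build_report`, the file naming and the data are all assumptions made for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path

import pandas as pd


def build_report(df, out_dir, today=None):
    """Write the summary statistics of `df` to a dated HTML file."""
    today = today or date.today()
    out_path = Path(out_dir) / f"report-{today.isoformat()}.html"
    html = f"<h1>Summary for {today.isoformat()}</h1>\n" + df.describe().to_html()
    out_path.write_text(html, encoding="utf-8")
    return out_path


# Hypothetical weekly metrics; a real script would query a database or API.
df = pd.DataFrame({"visits": [120, 98, 143, 110], "signups": [12, 7, 15, 9]})
report_path = build_report(df, tempfile.mkdtemp(), today=date(2024, 1, 1))
print(report_path.name)
```

Swapping the `describe()` table for a chart exported by a plotting library would follow the same pattern: generate the artifact in code, then let the scheduler handle the timing.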

> programmatically automate the creation of things like dashboards and reports.

That's an awesome use case for Python, and that sort of script generation is one of the main reasons that we see people adopting Python/Mito. And specifically, graphing[1] is one of the most popular features in Mito.

Mito generates Plotly [2] graphs, and of course generates the Plotly graph code too, so you can customize the graphs to your perfect liking (Plotly has great documentation and a lot of customizations) or schedule the script to run automatically.

[1] https://docs.trymito.io/how-to/graphing [2] https://plotly.com/

Thanks for mentioning Altair. I am personally also a big fan.

I am one of the co-founders of bamboolib and we are actively thinking about adding support for Altair to the Plot Creator (instead of just relying on Plotly).

Since we are talking about other viz options in Python, there are of course also matplotlib, seaborn, plotly, and more.

Of course, `.head()`, `.tail()`, `.iloc` and other mechanisms for viewing subsets of the data are always important. But would you really caution AGAINST this? Like, literally telling someone NOT to use summary statistics to explore a dataset?
No, I’m more cautioning against using summary statistics in isolation without looking at the raw data.

I was more responding to the statement that one can “see” the shape of data through them without needing visual tools. The lens of summary statistics is a very narrow one; it’s necessary but almost always insufficient. Even .ilocs are insufficient: it’s hard to know what to .iloc for. One really needs to browse the data interactively to get a good sense of it.

Ah, ok. Sorry, I misunderstood. Yes, we’re on the same page. As usual, a good balance is necessary.
This is a great point and something that we're actively working on improving in Mito. If you have millions of rows of data, it's not enough to just scroll through your data; you need tools to build your understanding.

Some of the tools that you mentioned exist in Mito today. For example, Mito generates summary information about each column (all of the .describe() info along with a histogram of the data). And we're creating features for gaining a global understanding of the data too.

In practice, one of the main ways that we see people use Mito is for that initial exploration of the data. Often the first thing that users do when they import data into Mito is to correct the column dtype, delete columns that are irrelevant to their analysis, and filter out/replace missing values.

It would be super fun to implement an intelligent head() function that shows a representative sample rather than the first X rows. Do the profiling & identify a collection of rows that represent the overall distribution.

You could develop some IP around efficient and effective ways to do this. Probably would require an ensemble of unsupervised methods.
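
A deliberately naive sketch of the idea, far simpler than an ensemble of unsupervised methods: pick rows at evenly spaced ranks of one numeric column so the preview spans the whole range. `representative_head` is a hypothetical function, not part of pandas or Mito.

```python
import pandas as pd


def representative_head(df, column, n=5):
    """Return up to n rows spread evenly across the sorted values of `column`."""
    ordered = df.sort_values(column, kind="stable")
    n = min(n, len(ordered))
    if n <= 1:
        return ordered.head(n)
    last = len(ordered) - 1
    # Evenly spaced positions from the smallest to the largest value.
    positions = [round(i * last / (n - 1)) for i in range(n)]
    return ordered.iloc[positions]


df = pd.DataFrame({"value": range(101)})
preview = representative_head(df, "value", n=5)
print(preview)
```

Unlike `.head()`, which here would show only the five smallest rows, the preview includes the minimum, the maximum and points in between.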

That's a cool idea! One helpful .head() function could surface the rows with the most unusual data types. It could help you identify which columns have mixed dtypes: mostly numbers, plus some cells that are supposed to be numbers but are actually strings because of additional decimals.
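
Something along these lines can already be approximated in plain pandas. The sketch below flags columns holding more than one Python type and lists the cells that fail numeric conversion; `mixed_columns` and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical column that should be numeric but contains stray strings.
df = pd.DataFrame({
    "price": [1.5, 2, "3.1.4", 4, "n/a"],
    "qty": [1, 2, 3, 4, 5],
})


def mixed_columns(df):
    """Names of columns whose cells hold more than one Python type."""
    return [
        col for col in df.columns
        if df[col].map(lambda v: type(v).__name__).nunique() > 1
    ]


print(mixed_columns(df))

# Rows whose price cannot be parsed as a number.
bad_rows = df[pd.to_numeric(df["price"], errors="coerce").isna()]
print(bad_rows)
```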
Good points! I also think that this is an area that Mito could do better in. While we do provide pretty cool summary stats [1] and graphing capabilities [2], there isn't a great view for the summary stats of the entire dataframe. It's def on the roadmap -- but this comment makes me think we should move on it quick.

Thanks for the feedback!

[1] https://docs.trymito.io/how-to/summary-statistics

[2] https://docs.trymito.io/how-to/graphing

I find the world is full of datasets with < 200 datapoints, and that is where Excel (in my experience) is great. With such datasets it often makes sense to look through the data at particular outliers.

Also, even with huge datasets I tend to always look at a random sample, and the "most extreme" datapoints -- mainly because in my experience there is a good chance some parts of the data are malformed, and need to be recollected/fixed. Of course, if you trust your data collection you don't need this!
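
In pandas, that habit of checking a random sample plus the most extreme datapoints might look like the sketch below; the sensor readings are invented for illustration.

```python
import pandas as pd

# Hypothetical readings with two obviously malformed values.
df = pd.DataFrame({
    "sensor": list("abcdefghij"),
    "reading": [20.1, 19.8, 20.4, -999.0, 20.0, 19.9, 20.2, 1e6, 20.3, 20.1],
})

random_rows = df.sample(n=3, random_state=0)  # a reproducible random sample
lowest = df.nsmallest(2, "reading")           # most extreme low values
highest = df.nlargest(2, "reading")           # most extreme high values

print(random_rows)
print(lowest)
print(highest)
```

The extremes surface the placeholder `-999.0` and the implausible `1e6` immediately, which a random sample alone could easily miss.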

+1 - this is also how I operated as a Data Scientist myself
> try loading a 15GB CSV in Excel.

Or visualising it in r or pandas without meaningful subsampling.

One cool library I saw recently for helping on the visualisation side is https://github.com/vegafusion/vegafusion

It allows you to use Altair in Python for visualising data, but does the computation in the backend using Arrow DataFusion. Not for 15GB perhaps, but cool nonetheless.

I have an Excel template for handling a relatively large amount of data. Nowhere near 15GB on one sheet. I use it for preprocessing experimental data from a single experiment. There are about 10 chart tabs built in so I can visually inspect the data looking for errors (and go back and inspect the raw instrument data when something looks off).

The aggregate data is around 1.5 million experimental results. MiniTab is too unwieldy and requires too much manual reformatting of the data sheets.

Is this something I should be looking at in R or project Jupyter? Does one make better visualizations than the other?

Ggplot is extremely powerful if you can grok its grammar, which takes some getting used to. But I'd assume that if you see a graph in a scientific paper it's made with ggplot.

With many data points to explore, you are always going to be at the edges of what your hardware and software can produce.

The last really big datasets I worked with were for my thesis, and I had to subsample to below 10% to get results within 10 minutes or so. That was basically plotting MIDI recordings of piano performances, so nothing gigantic.

In all seriousness, Excel can’t be the right option for 15GB of alphanumeric data (one sheet?)
Do you as a rule look at a sample of the individual raw data, non aggregated?
Usually aggregated... then I can start looking at "subsets". For example, step 1 is to look at the whole dataset. Then you identify that there are a lot of rows with a type of missing value, so you look at the statistical attributes of that subset (all the rows where value X is null).

From time to time you can do a `.head()/.tail()` or an `.iloc[X:Y]` to check some things visually. But just as a "refresher".
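
A minimal sketch of that aggregate-then-subset workflow, with invented columns where `age` plays the role of "value X":

```python
import pandas as pd

# Hypothetical survey data with some missing ages.
df = pd.DataFrame({
    "age": [34, 51, None, 29, None, 62],
    "income": [40_000, 72_000, 15_000, 38_000, 12_000, 90_000],
})

# Step 1: statistics for the whole dataset.
print(df.describe())

# Step 2: statistics for the subset where age is null.
missing_age = df[df["age"].isna()]
print(missing_age.describe())

# Occasional visual "refresher" on a few raw rows.
print(df.head(2))
print(df.tail(2))
print(df.iloc[2:4])
```

Here the rows with a missing `age` have a much lower mean `income` than the rest, the kind of pattern that comparing the subset against the whole tends to reveal.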

This sort of bouncing back and forth between the aggregate and the raw data is something that Mito is really great at. To view aggregate info, users tend to either look at graphs or pivot tables of their data in Mito. They use that aggregate view to identify subsets that need some further investigation/cleaning/transforming. And then they filter down to that subset, make the correction, and use the aggregate view again to see the results.

Practically, this just looks like moving between two tabs in the spreadsheet!

Something that we don't support right now, but would love to support in the future is cross-filtering. It would be a powerful/easy way of supporting that back and forth workflow.