by wenc 1492 days ago
I would caution against this approach in general (unless you’re working with unusually uniform data from a deterministic source; in my world that is rarely the case). Summary statistics are useful, but taken in isolation they can mislead. One loses the ability to get a feel for interesting non-aggregated phenomena.

I find it’s important to actually “touch” the raw data, even if only in a buffered, random-sampling sort of way, to get a feel for it. With big datasets, looking through rows of data can feel tedious and meaningless, but I’ve often picked up on things I wouldn’t have noticed otherwise. Raw data is often flawed, but there’s often some signal in it that tells a story, so it’s important not to lose that signal behind a lens of aggregate statistics.
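To make the "buffered, random sampling" idea concrete, here is a minimal sketch using reservoir sampling (Algorithm R), which picks k rows uniformly from a stream without loading everything into memory. The function name `reservoir_sample` and the fake `record-N` rows are made up for illustration; this is one way to do it, not necessarily what the commenter had in mind.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Hypothetical usage: eyeball 5 rows out of a stream of 1000 records.
rows = (f"record-{n}" for n in range(1000))
for row in reservoir_sample(rows, k=5, seed=0):
    print(row)
```

Because it only needs one pass, the same function works on a file handle or a database cursor, which is handy when the dataset is too big to open in one go.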

The next step is to visualize the data multidimensionally in something like Tableau. Tableau works on very large datasets (it has an internal columnstore format called Hyper) and can dynamically disaggregate and drill down. Insights are usually obtained by looking at details, not aggregates.


A good example of what you are warning against is Anscombe’s quartet

https://en.wikipedia.org/wiki/Anscombe's_quartet
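For anyone who wants to check this themselves, here is a small stdlib-only sketch that computes the usual summary statistics for the four datasets. The numbers are the standard published quartet values; the `summarize` helper is just a name picked for this example.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Mean and sample variance of both columns, rounded to 2 decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "var_x": round(variance(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_y": round(variance(ys), 2),
    }

for name, (xs, ys) in quartet.items():
    print(name, summarize(xs, ys))
```

All four print nearly the same row of numbers, yet the scatter plots look nothing alike: one roughly linear cloud, one curve, one line with an outlier, and one vertical stack with a single far-off point.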

Histograms and Boxplots (and IQRs) don't lie tho...

Boxplots don't lie, but they can mislead, as any summary statistic, data viz, or model can. https://blog.bioturing.com/wp-content/uploads/2018/11/BoxVio...

Whether a histogram misleads depends totally on the bin width tho.
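A quick toy illustration of the bin-width point, using made-up data and a hypothetical `bin_counts` helper (plain Python, no plotting library): the same twelve values look like one lump, two equal bars, or two clearly separated clusters depending only on the width.

```python
def bin_counts(values, width):
    """Count integer values per bin; each bin is labeled by its left edge."""
    counts = {}
    for v in values:
        start = (v // width) * width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Made-up data: two clusters with a gap between 3 and 7.
data = [1, 1, 2, 2, 2, 3, 7, 8, 8, 8, 9, 9]

for width in (1, 5, 10):
    print(f"width={width}: {bin_counts(data, width)}")
```

With width 10 everything lands in one bar, with width 5 you get two equal bars and no hint of a gap, and only width 1 shows that nothing falls between 4 and 6.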

If you want to use open-source Python-based visualizations instead of Tableau, the following tools allow the creation of custom plots - including the ability to export the underlying code.

- bamboolib (proprietary license - acquired by Databricks in order to run within the Databricks notebooks)

- mito (GPL license)

- dtale (MIT license)

If you can write visualisations in Python itself, I am a big fan of Altair's syntax (https://github.com/altair-viz/altair), which is based on vega-lite. A while back, I wrote a brief guide and comparison of the main plotting libraries: https://datapane.com/reports/87NNEJ7/the-ultimate-guide-to-p...

One benefit of having them in actual code is that you can programmatically automate the creation of things like dashboards and reports. For instance, schedule a script to share an interactive plot every Monday morning, or build a live dashboard that updates every 10m. This opens up a lot of possibilities that would be impossible in a traditional drag-and-drop tool.
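As a rough sketch of what "report as code" can look like, here is a stdlib-only example that renders a small HTML report from a list of (name, value) pairs. The `render_report` function, the metric names, and the numbers are all invented for illustration; a real setup would more likely hand a chart object to a plotting or reporting library, and the scheduling itself would be done by something like cron rather than by this script.

```python
from datetime import date
from html import escape
from statistics import mean

def render_report(title, rows, today=None):
    """Build a tiny HTML report from (name, value) pairs."""
    today = today or date.today()
    table_rows = "".join(
        f"<tr><td>{escape(str(name))}</td><td>{value:.2f}</td></tr>"
        for name, value in rows
    )
    average = mean([value for _, value in rows])
    return (
        f"<h1>{escape(title)} ({today.isoformat()})</h1>"
        f"<table>{table_rows}</table>"
        f"<p>Average: {average:.2f}</p>"
    )

# Hypothetical metrics; in practice these would come from a query or a file.
metrics = [("signups", 120.0), ("churned", 8.0), ("active", 950.0)]
print(render_report("Weekly metrics", metrics))
```

Since the report is just the return value of a function, rerunning it every Monday is a scheduling problem rather than a clicking problem.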

> programmatically automate the creation of things like dashboards and reports.

That's an awesome use case for Python, and that sort of script generation is one of the main reasons that we see people adopting Python/Mito. And specifically, graphing[1] is one of the most popular features in Mito.

Mito generates Plotly [2] graphs, and of course generates the Plotly graph code too, so you can customize the graphs to your perfect liking (Plotly has great documentation and a lot of customizations) or schedule the script to run automatically.

[1] https://docs.trymito.io/how-to/graphing [2] https://plotly.com/

Thanks for mentioning Altair. I am personally also a big fan.

I am one of the co-founders of bamboolib and we are actively thinking about adding support for altair to the Plot Creator (instead of just relying on Plotly).

Since we are talking about other viz options in Python, there are of course also matplotlib, seaborn, plotly, and more.

Of course, `.head()`, `.tail()`, `iloc` and other mechanisms for viewing subsets of the data are always important. But would you really caution AGAINST this? Like, literally telling someone NOT to use summary statistics to explore a dataset?

No, I’m more cautioning against using summary statistics in isolation without looking at the raw data.

I was more responding to the statement that one can “see” the shape of data through them without needing visual tools. The lens of summary statistics is a very narrow one: necessary, but almost always insufficient. Even .iloc calls are insufficient, since it’s hard to know what to .iloc for. One really needs to browse the data interactively to get a good sense of it.

Ah, ok. Sorry, I misunderstood. Yes, we’re on the same page. As usual, a good balance is necessary.