Hacker News
by tenfingers 3827 days ago
My biggest problem with ggplot is that it's slow, unbearably slow. It also forces all the data to be materialized in a single data frame, which is feasible only for small (as in: fits in memory) datasets.

Very often, to produce specialized plots, I have to send data to the canvas in chunks by performing pre-processing myself. ggplot really doesn't work in this scenario. Combined with the general slowness, it forces me to use alternatives quite frequently.
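
A minimal sketch of what that kind of chunked pre-processing can look like, in Python with only the standard library. The function name `accumulate_histogram`, the bin range, and the `fake_chunks` data source are all hypothetical stand-ins, not any plotting package's API. The idea is only that the bin counts are folded together chunk by chunk, so the full dataset never has to sit in memory at once.

```python
from typing import Iterable, List, Sequence


def accumulate_histogram(chunks: Iterable[Sequence[float]],
                         lo: float, hi: float, nbins: int) -> List[int]:
    """Fold chunks of values into fixed-width bin counts, one chunk at a time."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for chunk in chunks:
        for x in chunk:
            if lo <= x <= hi:
                # Clamp so that x == hi lands in the last bin.
                idx = min(int((x - lo) / width), nbins - 1)
                counts[idx] += 1
    return counts


def fake_chunks():
    # Hypothetical stand-in for reading a large file piece by piece.
    yield [0.1, 0.2, 0.9]
    yield [1.5, 1.7]
    yield [2.0, 3.0, 2.5]


counts = accumulate_histogram(fake_chunks(), lo=0.0, hi=3.0, nbins=3)
print(counts)
```

Only the `nbins` counts ever reach the plotting layer, whatever the size of the input.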

It's a bummer, really, because I'd like my plots to have a consistent visual style, and doing that across different plotting packages is an issue.

I very often resort to gnuplot when it comes to huge datasets and/or incremental plotting. The same is true in Python (matplotlib is also very slow, regardless of the backend). But at least, if you use seaborn (https://github.com/mwaskom/seaborn), you can freely mix the convenience of plotting from a DataFrame with simply supplying data arrays.

ggplot is really awesome for what it does, but 1) the syntax doesn't really please me (it feels plainly forced onto the wrong context), 2) it doesn't scale, which forces me to use alternatives too frequently, and 3) trying to customize the plot style beyond a few minor tweaks is pure hell.

5 comments

At the end of the day, the whole point of ggplot is to produce a graphical representation of some aspect of the data. How much information can you possibly cram into ONE graphic and have it be readable by a human? Your problem is really a data reduction problem and not a plotting/graphics problem.
The amount of data that goes into a plot is unrelated to its visual complexity.

The "problem" is that ggplot also takes care of the transformation/reduction step for you.

For example, a KDE plot can draw on a potentially limitless amount of data while still producing a very simple plot. The same goes for most smoothers.

However, if I have to produce the KDE/smoothed line myself, I lose almost all the advantages of using ggplot: I have to calculate the visual density manually, and scaling and attaching labels is another PITA.
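
To make the "do it yourself" step concrete, here is a minimal sketch of a Gaussian KDE evaluated on a fixed grid, in plain Python. The function name `gaussian_kde`, the sample values, the grid, and the hand-picked bandwidth are all assumptions for illustration; this is not what ggplot does internally, and a real implementation would normally choose the bandwidth from the data.

```python
import math
from typing import List, Sequence


def gaussian_kde(samples: Sequence[float], grid: Sequence[float],
                 bandwidth: float) -> List[float]:
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    # Normalisation so the estimate integrates to 1 over the real line.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for g in grid:
        total = sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        density.append(norm * total)
    return density


samples = [1.0, 1.2, 0.8, 3.0, 3.1]           # hypothetical observations
grid = [i * 0.1 for i in range(-20, 61)]      # 81 points from -2.0 to 6.0
density = gaussian_kde(samples, grid, bandwidth=0.4)
```

The curve handed to the plotting layer has `len(grid)` points no matter how many samples went in, but the axis scaling and labels that ggplot would otherwise derive for you are now your job.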

On top of that, as others have said, ggplot already struggles with thousands of entries. A simple 5x5 faceted scatterplot with ~10k points might take seconds to render on recent hardware. When I plot data interactively for exploration, I might do this hundreds of times a day. I lose all the convenience just in the time wasted on rendering.

This is a very valid point that I feel we often overlook. Most folks don't think like a statistician, and over-complicating figures is the best way to render them useless. All is lost if your audience can't understand what you're trying to convey (:
Which is why ggvis has a different architecture that will make working with large datasets easier. Not to mention the pipe as a unifying interface across ggvis and dplyr, which makes it easier to do efficient data manipulation within the visualisation.
Most of the world's data analysis still happens on data sets that are at most a couple of thousand observations. It'd be neat if ggplot2 was faster for large datasets, but I can imagine that's not a priority.
The last time I measured the runtime, it was slow even for medium-sized data sets -- in comparison to lattice or to R's "native" plot functions. It's probably not a problem for interactive use, but it becomes annoying when generating reports with many visualizations.
Actually, there is a trick that allows you to store the state of your visualisation as you add layers. You don't need to have all your data in just one data frame.

http://koaning.io/casino-gambling-simulations-in-r.html

ggplot is slow (though not as slow as it was 5 or so years ago) and Matplotlib is even slower.

I took to the ggplot syntax immediately. What specifically do you find odd/forced about it?