

by gravypod 3295 days ago

I've been very troubled by coming to this stuff as a programmer. I'm having the same instant dissatisfied response that your students are having with looping structures.

I've recently started working on some projects where I need to do a lot of data visualization, storytelling, and investigation "into the data". As a programmer, getting into this stuff is far worse than I expected. Nothing works the way I would think makes sense. My biggest problem is that I'm thinking like a programmer, not like a mathematician. I expect objects, segregation or elimination of state, application and reduction, re-usability, and algorithms.

Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)?

What follows, below this line, is my groveling about the things that have bothered me. Be warned if you don't like rambling and complaining.

-------

Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators"). Matplotlib is unintuitive and poorly documented for anyone who isn't a mathematician (.plot(lons, lats, latlons=True) is correct). Dealing with anything more than 100,000 data points is a pain to iterate on. State everywhere it shouldn't be (matplotlib.pyplot).

While I've been working on this project I probably (each spin) spend an hour or two getting the data out of a format that doesn't make sense from a programmer's perspective, another 5 to 10 minutes writing an application/reduction, and then another hour going back into the strange data formats that matplotlib will take. All the while I'm re-running expensive computations and waiting because I have no good persistence layer for my project.
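
One way to stop re-running expensive computations is a small on-disk cache. The following is only a sketch using the standard library: `disk_cached`, `CACHE_DIR`, and `expensive_reduction` are hypothetical names, the arguments must be picklable, and the cache is not invalidated when the function body changes.

```python
import hashlib
import pickle
import tempfile
from functools import wraps
from pathlib import Path

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = Path(tempfile.gettempdir()) / "analysis_cache"


def disk_cached(func):
    """Cache func's return value on disk, keyed by its name and arguments."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        # Hash the pickled call signature to get a stable file name.
        raw_key = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
        path = CACHE_DIR / (hashlib.sha256(raw_key).hexdigest() + ".pkl")
        if path.exists():
            # Cache hit: load the stored result instead of recomputing.
            return pickle.loads(path.read_bytes())
        result = func(*args, **kwargs)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper


@disk_cached
def expensive_reduction(n):
    # Stand-in for a slow computation.
    return sum(i * i for i in range(n))


first = expensive_reduction(1000)   # computed (or loaded from an earlier run)
second = expensive_reduction(1000)  # loaded from disk
```

The second call reads the pickled result back instead of recomputing, which is the persistence behaviour the paragraph above is missing.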

There are just things in this community that are common that I'd never dream of. What follows is a list of these things.

1. Functions with 20-40 arguments are the norm for some reason. They also love to throw in a few insane defaults, undocumented options, and even magical flags (not enums).

Things like "draw a line, connect the dots" mean you need to know 5 to 7 arguments of a massive function. In C/Java when I need some flags they probably look like this:

    some_operation(some_data, DO_A | DO_C | DO_Z)

Or, if someone was feeling really nice and defined an enum & used varargs, it looks more like this:

    some_operation(some_data, SomeOperationFeatures.DO_A, SomeOperationFeatures.DO_C, SomeOperationFeatures.DO_Z)

Where all of these have appropriate documentation. My IDE plays nice and can complete these things. My compiler likes it and can typecheck these things. I like it because I know all of my options available (SomeOperationFeatures.).

With matplotlib you have things like `linestyle=""`. You have to go to a webpage, look through the docs, and figure out what you want. It's worth reading the docs [1] if you never have. This could very easily have been LineStyle.DOTTED, LineStyle.DASHED, LineStyle.BLANK. IDEs would have played nice. The 3.6 runtime's typechecking would have played nice. You would be able to see what your options are (LineStyle.).
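
A minimal sketch of what such an enum could look like on the caller's side. `LineStyle` and `line_kwargs` are hypothetical names, not part of matplotlib; only the string values are matplotlib's own linestyle codes.

```python
from enum import Enum


class LineStyle(Enum):
    """Hypothetical enum; the values are matplotlib's linestyle strings."""
    SOLID = "-"
    DASHED = "--"
    DASHDOT = "-."
    DOTTED = ":"
    BLANK = ""


def line_kwargs(style: LineStyle) -> dict:
    """Translate the enum into the keyword argument that plot() accepts."""
    return {"linestyle": style.value}


# Usage with matplotlib (not executed here):
#   ax.plot(xs, ys, **line_kwargs(LineStyle.DOTTED))
dotted = line_kwargs(LineStyle.DOTTED)
```

With a wrapper like this, typing `LineStyle.` in an editor lists the options, at the cost of maintaining the mapping yourself.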

2. Non-standard ways of treating python-isms

Pandas, for some reason, cannot stick to python-isms. I can't do simple things like...

    if not df: # Check if DF is empty
        return ...

    for row in df: # Iterate through the rows of a DF
        row.date = datetime(row.year, row.month, row.day, ...) # Create a new column in the row based on the row's data.

    subset = [a for a in df if some_condition(a)] # Do simple filtering
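
For comparison, pandas does have its own spellings for these three operations. This is a sketch with made-up columns (`year`, `month`, `day`, `value`) and a made-up threshold standing in for `some_condition`.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018],
    "month": [1, 6, 3],
    "day": [15, 30, 2],
    "value": [0.5, 1.5, 2.5],
})

# Emptiness: `if not df` raises ValueError, so use the .empty attribute.
if df.empty:
    raise SystemExit("no data")

# New column from other columns: to_datetime assembles year/month/day columns.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Row iteration: `for row in df` yields column labels; itertuples() yields rows.
dates = [row.date for row in df.itertuples(index=False)]

# Filtering: a boolean mask plays the role of the list comprehension.
subset = df[df["value"] > 1.0]
```

None of this makes the plain-Python forms work, but it shows where the equivalent operations live.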

Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between, two ways of doing things.

3. All these libraries separate logically grouped concepts.

Let's say I have time series data from 10 sensors.

    class SomeMagicalSample:
        def __init__(self, a, b, c, d, ..., occurred):
            self.a = a
            ...
            self.occurred = occurred

With this code I can generate very complex filtering, combinations, and what not. Things like extracting "real" meaning from measured values become easy to express.

    def get_magical_scalar(self): return ... some interpolation ...

    def is_some_magical_type(self): return ... some check ...

Now I can use my already tried and true reduction and application.

    sum(map(SomeMagicalSample.get_magical_scalar,
            filter(SomeMagicalSample.is_some_magical_type, samples)))

Pandas, matplotlib, numpy, scipy and the lot are designed to make me avoid this style of organization. I'm instead forced to do something like this.

    a = [...]
    b = [...]
    c = [...]
    d = [...]
    ....
    occurred = [...]

Then I have to jump through hoops to keep all of this data in the same order, shift it around together.
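
One way to avoid keeping parallel lists in sync by hand is to keep the objects as the source of truth and derive the columns only at the plotting boundary. A sketch using only the standard library; `Sample`, its two sensor fields, the method bodies, and `to_columns` are all placeholders for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class Sample:
    """Hypothetical two-sensor sample; the class in the post has more fields."""
    a: float
    b: float
    occurred: datetime

    def get_magical_scalar(self) -> float:
        # Placeholder for the "interpolation" in the post.
        return (self.a + self.b) / 2

    def is_some_magical_type(self) -> bool:
        # Placeholder check.
        return self.a > 0


def to_columns(samples):
    """Turn a list of samples into {field name: list of values}, order preserved."""
    return {f.name: [getattr(s, f.name) for s in samples] for f in fields(Sample)}


samples = [
    Sample(1.0, 3.0, datetime(2017, 1, 1)),
    Sample(-1.0, 5.0, datetime(2017, 1, 2)),
    Sample(2.0, 4.0, datetime(2017, 1, 3)),
]

# Object-style reduction, exactly as in the post.
total = sum(map(Sample.get_magical_scalar,
                filter(Sample.is_some_magical_type, samples)))

# Columnar view, generated in one place so the lists cannot drift apart.
columns = to_columns(samples)
```

A dict of equal-length lists like `columns` is also something `pandas.DataFrame(...)` accepts directly, so the conversion can stay in a single helper.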

4. Because everything is meaningless lists of numbers there are no ways to reuse code.

Most of the code I have written to show off a single value over time, or pull some data out of some other data and visualize it, is never going to be used again. Unless I want to look at this exact same thing this code will not be useful. If there was some way to pass objects around, hide the internals, and process them independently of their meaning then this would not be the case.

The one case where this was not true in the past few days was when I rendered a model's prediction into a pcolormesh and drew it onto a basemap. By passing it a basemap it will automatically find the place to generate data for with the model. This was an undocumented feature that I had to read the source of basemap to find was possible (pulling the top left and bottom right Lat Lons from a basemap regardless of projection).

Maybe these warts just hurt for a little while? Do these go away? Are there alternatives that can handle >10 million data points? I don't have a good analysis framework setup for the work I'm doing. Maybe this is the issue. I don't even know what a good analysis framework would look like.

[1] - https://matplotlib.org/api/lines_api.html#matplotlib.lines.L...

6 comments

david_eads 3295 days ago

I'm actually so with you on a lot of this. The inability to use Pythonisms with Pandas is insane and I had to do a data analysis where I really, genuinely needed to do some looping and some simple map/reduce and it almost drove me insane.

You might like [Agate](http://agate.readthedocs.io/) better.

I haven't done a ton of Jupyter in the newsroom yet, but what I've found myself doing is abstracting out the stuff I want to do in normal Python into one or more utility modules and having those return dataframes into my notebook. That way I can mostly write normal Python but have access to some of the nicer pandas features and get to do more exploratory work.

I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I or more likely a teammate would build in D3 around the specifics of the data.


gravypod 3295 days ago

> The inability to use Pythonisms with Pandas is insane and I had to do a data analysis where I really, genuinely needed to do some looping and some simple map/reduce and it almost drove me insane.

I recently started a project that I got to write from the ground up by myself. I was happy with the processing side of things. I was very sad with the data I was getting in and putting out. There's some impedance mismatch that doesn't need to exist.

> You might like [Agate](http://agate.readthedocs.io/) better.

I looked at the front page and definitely wasn't enjoying what I was seeing. It, at first, looked like more complexity piled up on top of things that don't need it. Then I saw this link: http://agate.readthedocs.io/en/1.6.0/cookbook/compute.html#l...

This is definitely worth a try. Much closer to what I was thinking.

> I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I or more likely a teammate would build in D3 around the specifics of the data.

Sadly in my field matplotlib is the professional tool (hah!). The end goal is the matplotlib plots. I'd be all fine for tweaking things in a designing program and putting it up but I'd be upset with myself.

My end goal is to have a single script in a repository that installs, runs, and then compiles my papers. I don't want anyone to need to look at sub-standard copies of my plots. I want anyone to be able to jump in and check my work and create derivative works.

Sadly this is not common in science today so there aren't really good tools for this sort of thing at the composition side. Even worse plotting isn't common in the computer world so tools for that don't exist either.


david_eads 3295 days ago

> I recently started a project that I got to write from the ground up by myself. I was happy with the processing side of things. I was very sad with the data I was getting in and putting out. There's some impedance mismatch that doesn't need to exist.

Impedance mismatch is a great way to put it. For me, if I can deal with that mismatch so that newbies/journalism colleagues don't have to, I'll do it.

> Sadly in my field matplotlib is the professional tool (hah!). The end goal is the matplotlib plots. I'd be all fine for tweaking things in a designing program and putting it up but I'd be upset with myself.

I used to work in science and have found journalism to have better solved many of these issues (at the expense, of course, of specialization and depth -- even a yearlong project isn't quite the same as decades of experience working in a single area). The solutions aren't pure or pretty -- they're more about workflow and held together with duct tape and baling wire. But the competitive pressure to deliver data that has a good user experience on deadline is very powerful and has led to some effective practices.


kinkrtyavimoodh 3295 days ago

I am a programmer and I use Pandas quite a bit. While I agree that it's a little counter-intuitive at times, I have found it to be an extremely useful and important Python package that there is just no reasonable substitute for.


bigger_cheese 3295 days ago

> Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)?

I use SAS for this in my day job. It's not a free program, but it is powerful for this type of stuff.

I typically use SQL queries (via SAS's proc sql command) to manipulate and process my data, but you can also programmatically manipulate your data sets using SAS's "datastep" language.

SAS has support for macro expansions, which make some of your examples (like manipulating 10 sensors at once) pretty trivial. But this is getting into programming language territory; I would not expect someone new to or unfamiliar with programming to grasp all of this intuitively.

edit: Here's some code I have in production that counts how many (of 8) sensors are reading high in a given time frame.

    array aads (*) TP_AD1_TOP_STACK_TC1 -- TP_AD1_TOP_STACK_TC8;
    NO_AD1_TEMPERATURES_HIGH = 0;
    do j = 1 to dim(aads);
        if aads(j) gt 160 then NO_AD1_TEMPERATURES_HIGH = NO_AD1_TEMPERATURES_HIGH + 1;
    end;

The downside is that SAS is a commercial package and is not free. I have heard a lot of good things about "R", which is supposedly quite similar, but have not had the opportunity to use it myself.


stdbrouw 3295 days ago

As someone who has used SAS for many, many different projects: it is terrible, vastly inferior to Pandas or R, and the only reason to ever use it is when you're forced to. Even simple stuff like functions that operate on data have to be hacked on with macros.

Case in point, your production SAS code could be replaced with this Pandas code (and the R code would look very similar):

  temperatures[[TEMPERATURE_COLUMNS]].apply(lambda t: (t > 160).sum(), axis=1)

or if your data is in proper long form

  data.temperature.gt(160).groupby(data.time).sum()
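
A runnable sketch of both forms, assuming `TEMPERATURE_COLUMNS` is a list of column names. The column names (`tc1` to `tc3`), the timestamps, and the readings are made up for illustration; only the 160 threshold comes from the SAS code.

```python
import pandas as pd

# Hypothetical wide-form data: one column per sensor, one row per timestamp.
TEMPERATURE_COLUMNS = ["tc1", "tc2", "tc3"]
temperatures = pd.DataFrame(
    {"tc1": [150, 170, 165], "tc2": [161, 140, 190], "tc3": [100, 100, 200]},
    index=pd.Index(["t0", "t1", "t2"], name="time"),
)

# Wide form: compare every cell to the threshold, then count True per row.
high_wide = temperatures[TEMPERATURE_COLUMNS].gt(160).sum(axis=1)

# Long form: one row per (time, sensor) reading.
data = temperatures.reset_index().melt(
    id_vars="time", var_name="sensor", value_name="temperature"
)

# Flag high readings first, then group the flags by time and count them.
high_long = data["temperature"].gt(160).groupby(data["time"]).sum()
```

Both results hold the number of sensors above 160 for each timestamp, indexed by time.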


gravypod 3295 days ago

I'd like to get my analysis systems as "inclusive" as possible. I'd be using my internal SQL server and just fall into python for my processing if I didn't care about sharing my work.

SAS looks good though. I've looked at it many times and it is a clean solution if you really are in the "big games".


bigger_cheese 3295 days ago

Yeah, that is a good point about trying to separate analysis from the database.

My work is going in the opposite direction, unfortunately. We are starting to use Hadoop, which makes it quite difficult to do things "outside of the database"; there is just too much data to work with locally.


mirimir 3295 days ago

SQL plus R is a good combo.


BeetleB 3295 days ago

Funny you talk about SAS that way.

In my former team, we used SAS for a while and once I introduced the team to Pandas, they happily ditched SAS.


zoombini29 3295 days ago

> Pandas, for some reason, cannot stick to python-isms. I can't do simple things like...
>
>     if not df: # Check if DF is empty
>         return ...

This part is a gotcha, but it's also a reflection that allowing if checks for things other than empty leads to subtle bugs. (there are long mailing list posts about it and about the bugs that were uncovered). See here for some explanation about why numpy does it: https://github.com/numpy/numpy/issues/8622


tnecniv 3295 days ago

I feel your pain. You can pretty much blame either MATLAB origins or a relentless pursuit of runtime efficiency for most of these problems.


stdbrouw 3295 days ago

Many of the things you list are indeed annoyances when doing data analysis in Python and they make things harder than they should be, but others are typical grievances I see from people new to it, and these do actually go away once you've been working with e.g. Pandas for longer.

> Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators").

What makes Pandas so great is that you can apply arbitrary functions to rows and columns, with the full expressivity of Python. In some cases it might be clunkier (though you should almost never need `.loc` and other indexing methods) but mostly it's just `df.groupby(...).apply(...)` or vectorized methods like `df.column + df.other_column`. This is a huge improvement over having half of your analysis in database queries and half in a programming language.
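
A small sketch of the two patterns named above. The frame, its column names, and the `spread` function are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sensor": ["a", "a", "b", "b"],
    "reading": [1.0, 3.0, 10.0, 30.0],
    "offset": [0.5, 0.5, 1.0, 1.0],
})

# Vectorized arithmetic on whole columns, no explicit loop.
df["corrected"] = df["reading"] + df["offset"]


def spread(values):
    """Arbitrary Python function applied to each group's values."""
    return values.max() - values.min()


# Split-apply-combine: group by sensor, apply the function, collect the results.
spread_by_sensor = df.groupby("sensor")["corrected"].apply(spread)
```

The result is a Series indexed by sensor, so it can feed straight into further column arithmetic.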

> Matplotlib is unintuitive and poorly documented

Try https://seaborn.pydata.org/ for statistical graphics.

> Pandas also implements its own versions of standard python objects! You need to know, and go back and forth between, two ways of doing things.

This sucks but is unavoidable, because Python does not have fast data types with support for missing values built in, so all your columns would have to be of mixed type (the actual type + None) and everything would slow down and simple things like computing the mean of a column with missing values would not work.
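
To illustrate the missing-value point with a made-up three-element column: the plain-Python mean fails on `None`, while a pandas Series stores the gap as NaN in a float column and skips it.

```python
import statistics

import pandas as pd

raw = [1.0, None, 3.0]

# Plain Python: the mean of a list containing None raises TypeError.
try:
    statistics.mean(raw)
    plain_mean_failed = False
except TypeError:
    plain_mean_failed = True

# pandas stores the column as float64 with NaN for the gap and skips it.
column = pd.Series(raw)
column_mean = column.mean()
```

Here `column_mean` is computed over the two present values only.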

Note that you don't actually "need to go back and forth" because Pandas will happily convert plain Python objects to their Numpy equivalents for you.

> 3. All these libraries separate logically grouped concepts.

It's not functional, you're just going to have to deal with that. But split-apply-combine and similar patterns are quite elegant in Pandas: http://pandas.pydata.org/pandas-docs/stable/groupby.html

> 4. Because everything is meaningless lists of numbers there are no ways to reuse code.

A lot of data analysis is throw-away code. Some of it can be abstracted into reusable code, some of it can't.

Lastly, don't forget that Python does have a lot of things going for it when it comes to data analysis, from geospatial tools (http://toblerity.org/shapely/) to Bayesian modeling (http://pymc-devs.github.io/pymc3/index.html), as well as interactive coding with Jupyter and Hydrogen for the Atom editor (https://github.com/nteract/hydrogen).
