| I've been very troubled by coming to this stuff as a programmer. I'm having the same instant dissatisfied response that your students are having with looping structures. I've recently started working on some projects where I need to do a lot of data visualization, storytelling, and investigation "into the data". As a programmer, getting into this stuff is far worse than I expected. Nothing works the way I would think makes sense. My biggest problem is that I'm thinking like a programmer, not like a mathematician. I expect objects, segregation or elimination of state, application and reduction, reusability, and algorithms. Are there any good frameworks that allow for processing, caching, data visualization (layout -> data population -> rendering), then exporting to some format (PNG/PDF/TeX)? What follows, below this line, is my grumbling about the things that have bothered me. Be warned if you don't like rambling and complaining.
-------

Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table, but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators"). Matplotlib is unintuitive and poorly documented for anyone who isn't a mathematician (.plot(lons, lats, latlons=True) is correct). Dealing with anything more than 100,000 data points is a pain to iterate on. State is everywhere it shouldn't be (matplotlib.pyplot).

While working on this project, on each iteration I probably spend an hour or two getting the data out of a format that doesn't make sense from a programmer's perspective, another 5 to 10 minutes writing an application/reduction, and then another hour going back into the strange data formats that matplotlib will take. All the while I'm re-running expensive computations and waiting, because I have no good persistence layer for my project.

There are just things that are common in this community that I'd never dream of doing. What follows is a list of these things.

1. Functions with 20-40 arguments are the norm for some reason. They also love to throw in a few insane defaults, undocumented options, and even magical flags (not enums). Something like "draw a line, connect the dots" requires knowing 5 to 7 arguments of a massive function. In C/Java, when I need some flags, they probably look like this:

    some_operation(some_data, DO_A | DO_C | DO_Z)

Or, if someone was feeling really nice and defined an enum & used varargs, it looks more like this:

    some_operation(some_data, SomeOperationFeatures.DO_A, SomeOperationFeatures.DO_C, SomeOperationFeatures.DO_Z)
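For comparison, Python's standard library can express the same bit-flag pattern with `enum.Flag`. This is only a minimal sketch: `SomeOperationFeatures` and `some_operation` are the placeholder names from the examples above, and the function body is made up purely for illustration.

```python
from enum import Flag, auto


class SomeOperationFeatures(Flag):
    """Hypothetical feature flags; members combine with | like C bit flags."""
    DO_A = auto()
    DO_C = auto()
    DO_Z = auto()


def some_operation(some_data, features):
    """Illustrative placeholder: report which features were requested."""
    enabled = [f.name for f in SomeOperationFeatures if f in features]
    return some_data, enabled


result = some_operation(
    [1, 2, 3],
    SomeOperationFeatures.DO_A | SomeOperationFeatures.DO_Z,
)
```

Because the flags are real class members, an editor can complete `SomeOperationFeatures.` and a type checker can reject a stray string.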
All of these have appropriate documentation. My IDE plays nice and can complete these things. My compiler likes it and can typecheck these things. I like it because I can see all of my available options (SomeOperationFeatures.).

With matplotlib you have things like `linestyle=""`. You have to go to a webpage, look through the docs, and figure out what you want. It's worth reading the docs [1] if you never have. This could very easily have been LineStyle.DOTTED, LineStyle.DASHED, LineStyle.BLANK. IDEs would have played nice. Python 3.6's type annotations would have played nice. You would be able to see what your options are (LineStyle.).

2. Non-standard ways of treating Python-isms. Pandas, for some reason, cannot stick to Python-isms. I can't do simple things like...

    if not df:  # Check if DF is empty
        return ...
    for row in df:  # Iterate through the rows of a DF
        row.date = datetime(row.year, row.month, row.day, ...)  # Create a new column in the row based on the row's data.
    subset = [a for a in df if some_condition(a)]  # Do simple filtering
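For reference, here is a sketch of how those three operations are usually spelled in pandas. The column names (`year`, `month`, `day`, `value`) and the filter condition are assumptions made up for illustration, not taken from any real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018],
    "month": [1, 6, 3],
    "day": [15, 30, 2],
    "value": [0.5, 2.0, 3.5],
})

# Emptiness check: `if not df` raises ValueError, so use the .empty attribute.
is_empty = df.empty

# New column from existing columns: assemble datetimes in one vectorized call
# instead of looping over rows (`for row in df` iterates column labels).
df["date"] = pd.to_datetime(df[["year", "month", "day"]])

# Row-wise access when it is really needed: itertuples yields namedtuples.
values = [row.value for row in df.itertuples(index=False)]

# Filtering: index with a boolean mask instead of a list comprehension.
subset = df[df["value"] > 1.0]
```

None of this removes the complaint that there are two vocabularies to learn; it only shows where the pandas one lives.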
Pandas also implements its own versions of standard Python objects! You need to know two ways of doing things, and go back and forth between them.

3. All these libraries separate logically grouped concepts. Let's say I have time series data from 10 sensors.

    class SomeMagicalSample:
        def __init__(self, a, b, c, d, ..., occurred):
            self.a = a
            ...
            self.occurred = occurred
With this code I can build very complex filtering, combinations, and whatnot. Things like extracting "real" meaning from measured values become easy to express.

        def get_magical_scalar(self): return ... some interpolation ...
        def is_some_magical_type(self): return ... some check ...

Now I can use my already tried and true reduction and application.

    sum(map(SomeMagicalSample.get_magical_scalar,
            filter(SomeMagicalSample.is_some_magical_type, samples)))
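A runnable sketch of that style, using a dataclass. The fields (`a`, `b`, `occurred`) and the bodies of both methods are invented placeholders, since the real interpolation and type check are not given.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SomeMagicalSample:
    a: float
    b: float
    occurred: datetime

    def get_magical_scalar(self) -> float:
        # Placeholder for "some interpolation": midpoint of the two readings.
        return (self.a + self.b) / 2

    def is_some_magical_type(self) -> bool:
        # Placeholder for "some check": keep only positive first readings.
        return self.a > 0


samples = [
    SomeMagicalSample(1.0, 3.0, datetime(2017, 1, 1)),
    SomeMagicalSample(-1.0, 5.0, datetime(2017, 1, 2)),
    SomeMagicalSample(2.0, 4.0, datetime(2017, 1, 3)),
]

# Filter, map, and reduce over whole objects; each sample keeps its fields together.
total = sum(map(SomeMagicalSample.get_magical_scalar,
                filter(SomeMagicalSample.is_some_magical_type, samples)))
```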
Pandas, matplotlib, numpy, scipy and the lot are designed to make me avoid this style of organization. I'm instead forced to do something like this.

    a = [...]
    b = [...]
    c = [...]
    d = [...]
    ....
    occurred = [...]
Then I have to jump through hoops to keep all of this data in the same order and shift it around together.

4. Because everything is meaningless lists of numbers, there is no way to reuse code. Most of the code I have written to show off a single value over time, or to pull some data out of some other data and visualize it, is never going to be used again. Unless I want to look at this exact same thing, this code will not be useful. If there were some way to pass objects around, hide the internals, and process them independently of their meaning, this would not be the case.

The one case where this was not true in the past few days was when I rendered a model's prediction into a pcolormesh and drew it onto a basemap. By passing it a basemap, it will automatically find the area the model needs to generate data for. This relied on an undocumented feature that I only found by reading the basemap source (pulling the top-left and bottom-right lat/lons from a basemap regardless of projection).

Maybe these warts just hurt for a little while? Do these go away? Are there alternatives that can handle >10 million data points? I don't have a good analysis framework set up for the work I'm doing. Maybe this is the issue. I don't even know what a good analysis framework would look like.

[1] - https://matplotlib.org/api/lines_api.html#matplotlib.lines.L... |
You might like [Agate](http://agate.readthedocs.io/) better.
I haven't done a ton of Jupyter in the newsroom yet, but what I've found myself doing is abstracting out the stuff I want to do in normal Python into one or more utility modules and having those return dataframes into my notebook. That way I can mostly write normal Python but have access to some of the nicer pandas features and get to do more exploratory work.
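A rough sketch of that split, assuming a hypothetical `Reading` record and made-up helpers `clean_readings` and `readings_to_frame`. In practice the helpers would live in their own module and the notebook would only import them.

```python
from dataclasses import dataclass, asdict

import pandas as pd


@dataclass
class Reading:
    """Hypothetical record type used by the plain-Python side."""
    sensor: str
    value: float


def clean_readings(readings):
    """Ordinary Python: drop negative values (an illustrative rule)."""
    return [r for r in readings if r.value >= 0]


def readings_to_frame(readings):
    """Convert to a DataFrame only at the boundary with the notebook."""
    return pd.DataFrame([asdict(r) for r in readings], columns=["sensor", "value"])


# In the notebook: call the utility, then explore with pandas.
frame = readings_to_frame(clean_readings([
    Reading("a", 1.5), Reading("b", -2.0), Reading("a", 0.5),
]))
per_sensor = frame.groupby("sensor")["value"].sum()
```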
I don't mind that matplotlib is kind of awful -- that data viz would never go in a published piece in any event. I just want some hints as to what I, or more likely a teammate, would build in D3 around the specifics of the data.