by stdbrouw 3295 days ago

Many of the things you list are indeed annoyances when doing data analysis in Python and they make things harder than they should be, but others are typical grievances I see from people new to it, and these do actually go away once you've been working with e.g. Pandas for longer.

> Pandas, one of the biggest "offenders", is trying to be an in-memory database with only one table but ends up having far fewer features and a far clunkier interface (want to do a simple map/reduce? Welcome to chaining a strange combination of '.loc', '&', and ':,' "operators").

What makes Pandas so great is that you can apply arbitrary functions to rows and columns, with the full expressivity of Python. In some cases it might be clunkier (though you should almost never need `.loc` and other indexing methods) but mostly it's just `df.groupby(...).apply(...)` or vectorized methods like `df.column + df.other_column`. This is a huge improvement over having half of your analysis in database queries and half in a programming language.
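To make that concrete, here is a minimal sketch of the two idioms mentioned above. The data frame, its column names, and the `price_spread` helper are made up for illustration; only `groupby`, `apply`, and column arithmetic are actual Pandas features.

```python
import pandas as pd

# Hypothetical sales data, invented purely for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "price": [10.0, 12.0, 8.0, 9.0],
    "quantity": [3, 1, 5, 2],
})

# Vectorized arithmetic: works on whole columns at once, no explicit loop.
df["revenue"] = df["price"] * df["quantity"]


def price_spread(prices):
    """Hypothetical helper: gap between the highest and lowest price."""
    return prices.max() - prices.min()


# An arbitrary Python function applied to each group of rows.
spread_by_region = df.groupby("region")["price"].apply(price_spread)

# The same revenue computed row by row (axis=1). This is slower than the
# vectorized version, but it accepts any Python function of a row.
row_wise = df[["price", "quantity"]].apply(
    lambda row: row["price"] * row["quantity"], axis=1
)
```

The row-wise version is only there to show that you can drop down to plain Python when nothing vectorized fits; for simple arithmetic the column expression is the one to prefer.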

> Matplotlib is unintuitive and poorly documented

Try https://seaborn.pydata.org/ for statistical graphics.

> Pandas also implements it's own versions of standard python objects! You need to know, and go back and forth between two, ways of doing things.

This sucks but is unavoidable. Python does not have fast built-in data types that support missing values, so all your columns would have to be of mixed type (the actual type plus `None`). Everything would slow down, and simple things like computing the mean of a column with missing values would not work.
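A small sketch of the missing-value problem, using made-up numbers. The plain-Python mean here is the naive `sum(...) / len(...)`, which is an assumption about how you would write it by hand:

```python
import pandas as pd

# A plain Python list has to mix floats with None to mark a gap.
raw = [1.5, None, 2.5]

try:
    sum(raw) / len(raw)
    plain_python_worked = True
except TypeError:
    # float + None is undefined, so the naive mean fails.
    plain_python_worked = False

# Pandas stores the same data as float64 and represents the gap as NaN.
column = pd.Series(raw)

# mean() skips missing values by default (skipna=True).
column_mean = column.mean()
```

In plain Python you would have to filter out the `None` entries yourself before every aggregation; the Pandas column handles it once, at the type level.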

Note that you don't actually "need to go back and forth" because Pandas will happily convert plain Python objects to their NumPy equivalents for you.
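For instance, a rough sketch with invented column names: you hand Pandas ordinary lists, and it picks the NumPy dtypes on its own.

```python
import numpy as np
import pandas as pd

# Plain Python containers go in...
df = pd.DataFrame({
    "count": [1, 2, 3],
    "ratio": [0.5, 1.5, 2.5],
})

# ...and NumPy-backed columns come out. The exact integer width can
# depend on platform and library version, so check the kind, not the size.
count_is_numpy_int = np.issubdtype(df["count"].dtype, np.integer)
ratio_is_numpy_float = np.issubdtype(df["ratio"].dtype, np.floating)

# Going back to plain Python objects is a single call when some other
# library insists on them.
counts_as_list = df["count"].tolist()
```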

> 3. All these libraries separate logically grouped concepts.

It's not functional; you're just going to have to deal with that. But split-apply-combine and similar patterns are quite elegant in Pandas: http://pandas.pydata.org/pandas-docs/stable/groupby.html
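As a sketch of what split-apply-combine means in practice, here are the three steps written out by hand next to the chained version. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "cat"],
    "weight": [4.0, 20.0, 5.0, 30.0, 6.0],
})

# The three steps spelled out by hand.
pieces = {}
for name, group in df.groupby("species"):   # split
    pieces[name] = group["weight"].mean()   # apply
manual = pd.Series(pieces)                  # combine

# The same computation as one chained expression.
chained = df.groupby("species")["weight"].mean()

# transform() combines the per-group result back onto the original rows
# instead of returning one row per group.
group_mean = df.groupby("species")["weight"].transform("mean")
df["weight_vs_group"] = df["weight"] - group_mean
```

Once the pattern clicks, most of the "chaining strange operators" complaints turn into a `groupby` followed by an aggregation or a `transform`.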

> 4. Because everything is meaningless lists of numbers there are no ways to reuse code.

A lot of data analysis is throw-away code. Some of it can be abstracted into reusable code, some of it can't.

Lastly, don't forget that Python does have a lot of things going for it when it comes to data analysis, from geospatial tools (http://toblerity.org/shapely/) to Bayesian modeling (http://pymc-devs.github.io/pymc3/index.html), as well as interactive coding with Jupyter and Hydrogen for the Atom editor (https://github.com/nteract/hydrogen).