by spangry 3350 days ago
I haven't tried this yet, but am praying that it delivers even half of what it promises. For whatever reason I just can't get my head around pandas, despite multiple attempts.

If this also turns out to be inscrutable I may be forced to conclude that I'm stupid...

8 comments

>For whatever reason I just can't get my head around pandas, despite multiple attempts.

You need to work with pandas consistently for a month or two, and then it'll all click.

pandas is not complex, nor deep. It is, however, very broad. Most of the time it is "Here's what I need to do. I'm sure there's an API or two in pandas that will let me do this," and then you spend an hour or so looking at the documentation to find those APIs.

My first month or two was: "I need to do this. Let me Google". Pretty much every time someone had asked that same question on SO.

If you stick to it for 2 months, you'll eventually "learn" all the routine tasks and Googling stuff becomes only occasional.

And it does help if you're familiar with NumPy.

Have you tried this video series? The guy explains it really nicely and is really bright.

https://pythonprogramming.net/search/?q=pandas

Thanks for this! It's funny in a way: I'm trying to learn the basics, but don't have a clear idea of what the basics actually are. This looks like it could be just the ticket. Cheers!
In case you're looking for more, this tutorial series hit the front page of HN a few years ago: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-stru...

Modern pandas is a bit more idiomatic now though: https://tomaugspurger.github.io/modern-1.html

Are you trying to learn pandas just to learn pandas or do you have a motivating example?
What have you wanted to use it for?

Pandas is basically an R data frame for Python. A sloppy description of that is a text mode spreadsheet.

The description of Bonobo doesn't immediately invite the comparison to Pandas, to me anyway.

You're very right, as I'm using both pandas and bonobo for different reasons.

Mostly, when I want a quasi-mathematical look over a dataset, pandas is my tool of choice. For all those data pipeline things that reasonably fit on one computer, I do use bonobo.

I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...
No, I don't have real-life public code available. I'm going to see what I can extract from an old commercial project for publication, but I can't guarantee anything.
Luigi is a simple tool from Spotify that seems to solve a similar data workflow (with DAGs) problem as Bonobo. Airflow from Airbnb is a more complex tool, and I understand that Spotify has lately moved from Luigi to Airflow.

Can anybody comment on how Bonobo compares to Luigi?

Usually just simple data analysis, really nothing far outside of the 'statistics' lib. Currently it's more the exploration and discovery part of the exercise I'm struggling with. I've got a few hundred thousand csv files representing various aspects of Australia's national energy market (e.g. outcomes of 5 minute supply auctions). I'm trying to make my way through that, figure out what's relevant and wrangle the relevant stuff in some organised fashion.

Is pandas the wrong kind of tool for this type of thing? Going off what rdorgueil has said, I'm beginning to suspect so. Is there a data-wrangling 'gold standard' library for python?

I'm just learning pandas as well but I think it is the right tool for the job. I am using django-pandas so I can do easy ORM stuff. If I were to sketch out your use case:

Create an object/class called

    AuctionResult
     - some datetime
     - value
Then you'd query it: qs = AuctionResult.objects.all()

then you load it into a pandas dataframe:

df = read_frame(qs)

After that you can do all sorts of the fun stuff I imagine.

I don't see why pandas won't work for your case. It sounds like most if not all the csvs contain the same columns and type of data. You could easily create a pandas dataframe that combines them all, then use any plotting library like matplotlib and/or seaborn to plot. If you need help provide some examples of the csvs you are trying to parse.
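
A minimal sketch of the "combine them all" step, assuming every file shares the same columns. The column names `settlement_time` and `price`, the helper `load_csv_dir`, and the two demo files are made up for illustration; they are not taken from the actual market data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_dir(directory, pattern="*.csv"):
    """Read every matching CSV in `directory` into one DataFrame.

    Hypothetical helper: assumes all files share the same header.
    """
    frames = []
    for path in sorted(Path(directory).glob(pattern)):
        frame = pd.read_csv(path, parse_dates=["settlement_time"])
        # Remember which file each row came from, handy when exploring.
        frame["source_file"] = path.name
        frames.append(frame)
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Tiny demo with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text(
        "settlement_time,price\n2017-01-01 00:00,50.0\n2017-01-01 00:05,52.5\n"
    )
    Path(tmp, "b.csv").write_text(
        "settlement_time,price\n2017-01-01 00:10,49.0\n"
    )
    combined = load_csv_dir(tmp)

print(combined)
```

With a few hundred thousand files it may be worth loading one subset at a time (for example one month per call) rather than everything at once, since the whole result has to fit in memory.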
Pandas is definitely the most popular (and imo best) data wrangling library for Python.
You should also have a look at this book (http://www.goodreads.com/book/show/14744694-python-for-data-...). It helps that the author of the pandas library is also the author of the book. From the description of your use case, you seem to be doing exploratory data analysis, and pandas can definitely handle that.
Wait a little to buy it. There is a new edition in the oven.
McKinney's book is good. Unfortunately there have been several reasonably important breaking API changes since it was published, so it's now to be taken with some salt.
You're not alone! I think pandas made some design decisions around their transformation functions that make it a lot more cumbersome to use than R's dplyr. It's not obvious from the documentation, though.

As an example from the pandas docs [1], in dplyr you can do

> gdf <- group_by(df, col1)

> summarise(gdf, avg=mean(col1))

In pandas this is similar to

> df.groupby('col1').agg({'col1': 'mean'})

But dplyr's summarise is much more flexible than agg, as you can do all kinds of things to any number of columns. E.g.

> summarise(gdf, some_name = f1(col1) + f2(col2))

But in pandas, agg only lets you apply one function to one column at a time.

[1] http://pandas.pydata.org/pandas-docs/stable/comparison_with_...

Swings and roundabouts. I'm a big fan of dplyr, and R definitely does some things better than Pandas, but I've never found anything as flexible as pd.pivot_table for cross tabulations. For instance, the lack of multiindexing in R is a big drawback.
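
For anyone who hasn't used it, here is a small sketch of the kind of cross tabulation pd.pivot_table gives you. The data and the column names (`region`, `fuel`, `mwh`) are invented for the example.

```python
import pandas as pd

# Made-up data: energy output by region and fuel type.
df = pd.DataFrame({
    "region": ["NSW", "NSW", "NSW", "VIC"],
    "fuel":   ["coal", "coal", "wind", "wind"],
    "mwh":    [10, 5, 3, 7],
})

# Rows are regions, columns are fuels, cells are summed output.
# fill_value=0 replaces combinations that never occur in the data.
table = pd.pivot_table(
    df,
    values="mwh",
    index="region",
    columns="fuel",
    aggfunc="sum",
    fill_value=0,
)

print(table)
```

Passing a list to `index` or `columns` gives a MultiIndex on that axis, which is where the multiindexing mentioned above comes in.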
you can supply a map, ie:

gdf = df.groupby('col1').agg({'col2': np.mean, 'col3': np.std, 'col4': lambda x: np.mean(x) / np.std(x)})

once you've got your grouped dataframe, go nuts

gdf['some_name'] = gdf['col1'].apply(f1) + gdf['col2'].apply(f2)

The summarise example I gave creates a single new column (some_name) that is a function of two columns from the grouped data frame. Passing a map to agg just creates multiple columns, each a function of at most a single column in the dataframe.

(I should have used a better example, like summarise(gdf, some_col = f(cola / colb)))
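
For what it's worth, one way to get a single aggregate out of several columns in pandas is groupby(...).apply with a function that receives the whole group. This is only a sketch: the grouping column `key`, the data, and the helper `ratio_mean` (standing in for f) are made up.

```python
import pandas as pd

# Made-up data with a grouping column and two value columns.
df = pd.DataFrame({
    "key":  ["a", "a", "b", "b"],
    "cola": [1.0, 2.0, 3.0, 4.0],
    "colb": [2.0, 4.0, 1.0, 2.0],
})


def ratio_mean(group):
    # `group` is a DataFrame holding one group's rows, so the
    # function can combine as many columns as it likes.
    return (group["cola"] / group["colb"]).mean()


# Select the value columns, then reduce each group to one number.
result = (
    df.groupby("key")[["cola", "colb"]]
      .apply(ratio_mean)
      .rename("some_col")
)

print(result)
```

The catch is that apply calls a Python function once per group, so it tends to be slower than the built-in agg reductions on large data.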

Totally a preference thing. I strongly prefer pandas to dplyr having worked with both.
You can supply a map to agg, something like {col1:'sum', col2: 'mean',...}
Yes, but it is still applying that function to a single column (the summarise example I gave could aggregate multiple columns into a single result, e.g. sum(col1 / col2))
I enjoyed this free Edx course:

https://courses.edx.org/courses/course-v1:Microsoft+DAT208x+...

There are some practical exercises that you do in your browser that really help you get a grasp of it.

Don't miss Pandas, they are really cool!

Don't feel bad. Pandas is powerful but it has (at least when I used it) some truly abominable documentation.
Agreed, the documentation could be a lot better. I wonder about a visual approach to using DataFrames and Series, with all the different methods, to demonstrate more clearly what's being done.
I really liked this visual explanation of pivot and reshape:

http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pi...

(I'm not affiliated with the site)

You mean excel?
And some annoying interface decisions, like d['x'] is a column but d[1:3] is a row slice.
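
A quick illustration of that inconsistency, using a made-up frame `d` with a default integer index. The explicit `.iloc` form is shown alongside as the unambiguous way to ask for rows by position.

```python
import pandas as pd

d = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]})

# A string key selects a column and returns a Series.
col = d["x"]

# A slice selects rows by position and returns a DataFrame.
rows = d[1:3]

# The explicit positional indexer does the same thing without
# relying on the overloaded [] behaviour.
rows_explicit = d.iloc[1:3]

print(type(col).__name__)
print(rows)
```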
pandas does a lot. This does nothing. That's the difference.