| HN Mirror



	by spangry 3350 days ago
	I haven't tried this yet, but am praying that it delivers even half of what it promises. For whatever reason I just can't get my head around pandas, despite multiple attempts. If this also turns out to be inscrutable I may be forced to conclude that I'm stupid...

8 comments

BeetleB 3350 days ago

>For whatever reason I just can't get my head around pandas, despite multiple attempts.

You need to work with pandas consistently for a month or two, and then it'll all click.

pandas is not complex, nor deep. It is, however, very broad. Most of the time it is "Here's what I need to do. I'm sure there's an API or two in pandas that will let me do this," and then you spend an hour or so looking at the documentation to find those APIs.

My first month or two was: "I need to do this. Let me Google". Pretty much every time someone had asked that same question on SO.

If you stick to it for 2 months, you'll eventually "learn" all the routine tasks and Googling stuff becomes only occasional.

And it does help if you're familiar with NumPy.

unixhero 3350 days ago

Have you tried this video series? The guy explains it really nicely and is really bright.

https://pythonprogramming.net/search/?q=pandas

spangry 3350 days ago

Thanks for this! It's funny in a way: I'm trying to learn the basics, but don't have a clear idea of what the basics actually are. This looks like it could be just the ticket. Cheers!

gjreda 3349 days ago

In case you're looking for more, this tutorial series hit the front page of HN a few years ago: http://www.gregreda.com/2013/10/26/intro-to-pandas-data-stru...

Modern pandas is a bit more idiomatic now though: https://tomaugspurger.github.io/modern-1.html

fnord123 3349 days ago

Are you trying to learn pandas just to learn pandas or do you have a motivating example?

maxerickson 3350 days ago

What have you wanted to use it for?

Pandas is basically an R data frame for Python. A sloppy description of that is a text mode spreadsheet.
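To make the "text mode spreadsheet" description concrete, here is a minimal sketch. The column names and values are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical data; "region" and "price" are illustrative column names.
df = pd.DataFrame(
    {
        "region": ["NSW", "VIC", "NSW", "VIC"],
        "price": [50.0, 40.0, 70.0, 60.0],
    }
)

# Keep only some rows, much like a spreadsheet filter.
expensive = df[df["price"] > 45.0]

# Summarise one column per group, much like a spreadsheet subtotal.
mean_price = df.groupby("region")["price"].mean()
```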

The description of Bonobo doesn't immediately invite the comparison to Pandas, to me anyway.

rdorgueil 3350 days ago

You're very right, as I'm using both pandas and bonobo for different reasons.

Mostly, when I want a quasi-mathematical look over a dataset, pandas is my tool of choice. For all those data pipeline things that reasonably fit on one computer, I do use bonobo.

RobinL 3350 days ago

I'm an avid Pandas user. Some stuff at work has come up recently that calls for ETL - and I'm trying to figure out what the best tools are. Is any of your Bonobo code public? I'd be curious to see what a real-life project looks like...

rdorgueil 3350 days ago

No, I don't have real-life public code available. I'm going to see what I can extract from an old commercial project for publication, but I can't guarantee anything.

dirtyaura 3349 days ago

Luigi is a simple tool from Spotify that seems to solve a similar data workflow (with DAGs) problem as Bonobo. Airflow from AirBnB is a more complex tool and I've understood that Spotify has lately moved from Luigi to Airflow.

Can anybody comment on how Bonobo compares to Luigi?

spangry 3350 days ago

Usually just simple data analysis, really nothing far outside of the 'statistics' lib. Currently it's more the exploration and discovery part of the exercise I'm struggling with. I've got a few hundred thousand csv files representing various aspects of Australia's national energy market (e.g. outcomes of 5 minute supply auctions). I'm trying to make my way through that, figure out what's relevant and wrangle the relevant stuff in some organised fashion.

Is pandas the wrong kind of tool for this type of thing? Going off what rdorgueil has said, I'm beginning to suspect so. Is there a data-wrangling 'gold standard' library for python?

jtchang 3350 days ago

I'm just learning pandas as well but I think it is the right tool for the job. I am using django-pandas so I can do easy ORM stuff. If I were to sketch out your use case:

Create an object/class called

    AuctionResult
     - some datetime
     - value

Then you'd query it: qs = AuctionResult.objects.all()

then you load it into a pandas dataframe:

df = read_frame(qs)

After that you can do all sorts of the fun stuff I imagine.

eanzenberg 3349 days ago

I don't see why pandas won't work for your case. It sounds like most if not all the csvs contain the same columns and type of data. You could easily create a pandas dataframe that combines them all, then use any plotting library like matplotlib and/or seaborn to plot. If you need help provide some examples of the csvs you are trying to parse.
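A rough sketch of the "combine them all" step, assuming every CSV shares the same columns. The file names and the `interval` and `price` columns are hypothetical; two tiny files are written to a temporary directory only so the example runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

# Write two tiny CSVs so the sketch is self-contained.
# File names and columns are hypothetical stand-ins for the real data.
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"interval": ["00:00", "00:05"], "price": [50.0, 55.0]}).to_csv(
    os.path.join(tmp_dir, "auction_a.csv"), index=False
)
pd.DataFrame({"interval": ["00:10", "00:15"], "price": [60.0, 65.0]}).to_csv(
    os.path.join(tmp_dir, "auction_b.csv"), index=False
)

# Read every CSV in the directory and stack the frames row-wise.
paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
frames = [pd.read_csv(path) for path in paths]
combined = pd.concat(frames, ignore_index=True)
```

With a few hundred thousand files, loading everything at once may not fit in memory, so reading the files in batches or filtering each one before concatenating is probably worth considering.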

RobinL 3349 days ago

Pandas is definitely the most popular (and imo best) data wrangling library for Python.

abhirag 3350 days ago

You should also have a look at this book (http://www.goodreads.com/book/show/14744694-python-for-data-...). It helps that the author of the pandas library is also the author of the book. From the description of your use case, you seem to be doing exploratory data analysis, and pandas can definitely handle that.

neves 3349 days ago

Wait a little to buy it. There is a new edition in the oven.

hprotagonist 3349 days ago

McKinney's book is good. Unfortunately there have been several reasonably important breaking API changes since it was published, so it's now to be taken with some salt.

closed 3350 days ago

You're not alone! I think pandas made some design decisions around their transformation functions that make it a lot more cumbersome to use than R's dplyr. It's not obvious from the documentation, though.

As an example from the pandas docs [1], in dplyr you can do

> gdf <- group_by(df, col1)

> summarise(gdf, avg=mean(col1))

In pandas this is similar to

> df.groupby('col1').agg({'col1': 'mean'})

But dplyr's summarise is much more flexible than agg, as you can do all kinds of things to any number of columns. E.g.

> summarise(gdf, some_name = f1(col1) + f2(col2))

But in pandas, agg applies one function to one column at a time.

[1] http://pandas.pydata.org/pandas-docs/stable/comparison_with_...

RobinL 3349 days ago

Swings and roundabouts. I'm a big fan of dplyr, and R definitely does some things better than Pandas, but I've never found anything as flexible as pd.pivot_table for cross tabulations. For instance, the lack of multiindexing in R is a big drawback.
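For anyone who hasn't used it, a small sketch of a cross tabulation with pd.pivot_table and a two-level row index. The data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical generation data; all names and numbers are made up.
df = pd.DataFrame(
    {
        "region": ["NSW", "NSW", "NSW", "VIC", "VIC"],
        "fuel": ["coal", "coal", "wind", "coal", "wind"],
        "year": [2016, 2016, 2016, 2016, 2016],
        "output": [10.0, 5.0, 4.0, 8.0, 6.0],
    }
)

# Rows are indexed by (region, year), columns by fuel, cells hold summed output.
table = pd.pivot_table(
    df,
    values="output",
    index=["region", "year"],
    columns="fuel",
    aggfunc="sum",
)

# Passing two keys to `index` produces a MultiIndex on the rows.
nsw_coal = table.loc[("NSW", 2016), "coal"]
```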

madenine 3349 days ago

you can supply a map, ie:

gdf = df.groupby('col1').agg({'col2': np.mean, 'col3': np.std, 'col4': lambda x: np.mean(x) / np.std(x)})

once you've got your grouped dataframe, go nuts

gdf['some_name'] = gdf['col1'].apply(f1) + gdf['col2'].apply(f2)
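A runnable sketch of the map-to-agg idea, with made-up data and string function names in place of the NumPy callables (that swap is an assumption made here, not part of the comment above). One thing to watch: after groupby('col1') the grouping column becomes the index of the result, so this version combines the aggregated columns instead of reading 'col1' back as a column.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b"],
        "col2": [1.0, 3.0, 10.0, 20.0],
        "col3": [2.0, 4.0, 6.0, 10.0],
    }
)

# One aggregation function per column, supplied as a map.
gdf = df.groupby("col1").agg({"col2": "mean", "col3": "max"})

# The result is an ordinary DataFrame indexed by col1,
# so aggregated columns can be combined element-wise.
gdf["some_name"] = gdf["col2"] + gdf["col3"]
```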

closed 3349 days ago

The summarise example I gave creates a single new column (some_name) that is a function of two columns from the grouped data frame. Passing a map to agg just creates multiple columns, each a function of at most a single column in the dataframe.

(I should have used a better example, like summarise(gdf, some_col = f(cola / colb)))
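For what it's worth, one way to get a single per-group value out of two columns in pandas is groupby followed by apply rather than agg. This is a sketch with invented data; ratio_of_sums is a hypothetical helper standing in for f, and it is not meant as a claim that this matches dplyr's ergonomics.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b"],
        "cola": [1.0, 3.0, 10.0, 20.0],
        "colb": [2.0, 2.0, 5.0, 5.0],
    }
)


def ratio_of_sums(group):
    # Hypothetical aggregation: receives each group's sub-DataFrame
    # and uses two of its columns to produce one number.
    return group["cola"].sum() / group["colb"].sum()


# Select the value columns explicitly, then apply the function per group.
some_col = df.groupby("col1")[["cola", "colb"]].apply(ratio_of_sums)
```

The result is a Series indexed by col1, with one value per group.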

upquark 3349 days ago

Totally a preference thing. I strongly prefer pandas to dplyr having worked with both.

theghostofjr 3349 days ago

You can supply a map to agg, something like {col1:'sum', col2: 'mean',...}

closed 3349 days ago

Yes, but it is still applying that function to a single column (the summarise example I gave could aggregate multiple columns into a single result, e.g. sum(col1 / col2))

neves 3349 days ago

I enjoyed this free Edx course:

https://courses.edx.org/courses/course-v1:Microsoft+DAT208x+...

There are some practical exercises that you do in your browser that really help you get the grasp of it.

Don't miss Pandas, they are really cool!

Blackthorn 3350 days ago

Don't feel bad. Pandas is powerful but it has (at least when I used it) some truly abominable documentation.

goatlover 3350 days ago

Agreed, the documentation could be a lot better. I wonder about a visual approach to using DataFrames and Series, with all the different methods, to demonstrate more clearly what's being done.

neves 3349 days ago

I really liked this visual explanation of pivot and reshape:

http://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pi...

(I'm not affiliated with the site.)

Nydhal 3350 days ago

You mean excel?

sixo 3350 days ago

And some annoying interface decisions, like d['x'] is a column but d[1:3] is a row slice.
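A small sketch of that inconsistency on a made-up frame, along with the explicit .loc and .iloc indexers that sidestep it:

```python
import pandas as pd

# Made-up frame with the default integer index.
d = pd.DataFrame({"x": [10, 20, 30, 40], "y": [1, 2, 3, 4]})

column = d["x"]   # a string key selects a column (a Series)
rows = d[1:3]     # a slice selects rows by position (a DataFrame)

# The explicit indexers make the intent unambiguous.
same_column = d.loc[:, "x"]
same_rows = d.iloc[1:3]
```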

rjurney 3349 days ago

pandas does a lot. This does nothing. That's the difference.