by bicubic 3644 days ago
> This strikes me as a bad idea, because data-frames are better thought of as a collection of columns

The dataframe is a collection of records, so len() tells you how many records are in the dataset you're dealing with. You also have len(df.columns) and df.shape.
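As a minimal sketch of those three calls: the frame below is made up for illustration. Only the water_year column name comes from this thread; rain_octsep and all the values are assumptions.

```python
import pandas as pd

# Hypothetical data: three records, two columns.
df = pd.DataFrame({
    "water_year": ["1980/81", "1990/91", "2000/01"],
    "rain_octsep": [1182, 1098, 1156],
})

n_rows = len(df)          # number of records (rows)
n_cols = len(df.columns)  # number of columns
shape = df.shape          # (rows, columns) as a tuple

print(n_rows, n_cols, shape)
```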

> Second, the apply method seems totally redundant

df.water_year refers to a column. You can certainly use the syntax you wrote, provided you crafted a function that manipulates a column in some way. E.g. if you had a function that returns the first 2 elements of whatever it was given, passing a column to that function would return a view into that column with only the first 2 rows. Passing the same function to apply would instead process every element in the (string) column, returning a brand new column where each row is the first 2 letters of the corresponding row of the input.
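A rough sketch of that difference, assuming a small made-up water_year column of strings. The helper name first_two is hypothetical, not from the article.

```python
import pandas as pd

# Hypothetical string column; values are illustrative only.
df = pd.DataFrame({"water_year": ["1980/81", "1990/91", "2000/01"]})

def first_two(x):
    """Return the first two elements of whatever is passed in."""
    return x[:2]

# Called on the column itself: slices the Series, giving its first two rows.
head_rows = first_two(df.water_year)

# Passed to apply: called once per element, giving the first two
# characters of each string as a new Series of the same length.
prefixes = df.water_year.apply(first_two)

print(head_rows.tolist())
print(prefixes.tolist())
```

Same function, two different results: one shorter column, one column of shorter strings.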

Both of these behaviours make perfect sense if you think about them in terms of expected Python and Numpy which Pandas is built on.

2 comments

> Both of these behaviours make perfect sense if you think about them in terms of expected Python and Numpy which Pandas is built on.

My PyData London presentation "Pandas from the Inside" [1, 2] explains in detail how pandas gets its speed from numpy, with benchmarks comparing slow vs fast ways to do common operations. Column-wise operations can be three orders of magnitude faster than iterating by row.

[1] https://www.youtube.com/watch?v=Dr3Hv7aUkmU

[2] https://github.com/SteveSimmons/PyData-PandasFromTheInside
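To make the row-wise versus column-wise contrast concrete, here is a small sketch that is not taken from the talk. The column names a and b and the data are assumptions, and it only shows that both approaches give the same answer; timing them (e.g. with timeit) is left to the reader, since the speedup depends on the data and the operation.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with 1000 rows.
df = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000) * 2})

# Row-wise: a Python-level loop that touches one record at a time.
row_sum = pd.Series([row.a + row.b for row in df.itertuples(index=False)])

# Column-wise: a single vectorized operation over whole columns,
# carried out by numpy rather than by the Python interpreter.
col_sum = df["a"] + df["b"]

print(row_sum.tolist() == col_sum.tolist())
```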

Thanks for clearing that up, now it does make sense. In R most functions handle vectors as well as scalars without distinction, so normally one would use the function directly, whereas if you wanted to process each element of a vector individually you'd use apply(). Here it works the other way around.

Well, that's because R doesn't have scalars, just vectors containing a single value.