by bicubic
3644 days ago
> This strikes me as a bad idea, because data-frames are better thought of as a collection of columns

The dataframe is a collection of records, and the len operator tells you how big a dataset you're dealing with. You also have len(df.columns) and df.shape.

> Second, the apply method seems totally redundant

df.water_year refers to a column. You can certainly use the syntax you wrote, provided you crafted a function that manipulates a column in some way. E.g. if you had a function that returns the first 2 elements of what it was given, passing a column to that function would return a view into that column with only the first 2 rows. Passing the same function into apply would process every element in the (string) column and return its first 2 letters, finally returning a brand new column where each row is the first 2 letters of the corresponding row of the input.

Both of these behaviours make perfect sense if you think about them in terms of what you'd expect from Python and NumPy, which pandas is built on.
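A minimal sketch of both points, for anyone who wants to try it. Only the column name water_year comes from the discussion; the data values, the rain_octsep column and the helper first_two are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only the column name water_year comes from the thread.
df = pd.DataFrame({
    "water_year": ["1980/81", "1981/82", "1982/83"],
    "rain_octsep": [1182, 1098, 1156],
})

n_rows = len(df)          # number of records (rows)
n_cols = len(df.columns)  # number of columns
shape = df.shape          # (rows, columns) in one tuple


def first_two(x):
    """Return the first two elements of whatever is passed in."""
    return x[:2]


# Called directly on the column: slices the Series, giving its first 2 rows.
head = first_two(df.water_year)

# Passed to apply: called once per element, so each string is sliced instead.
prefixes = df.water_year.apply(first_two)

print(n_rows, n_cols, shape)
print(head.tolist())
print(prefixes.tolist())
```

The same function gives two different results because in the first call x is the whole Series, while under apply x is each individual string.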
|
My PyData London presentation "Pandas from the Inside" [1, 2] explains in detail how pandas gets its speed from numpy, with benchmarks comparing slow vs fast ways to do common operations. Column-wise operations can be three orders of magnitude faster than iterating by row.
[1] https://www.youtube.com/watch?v=Dr3Hv7aUkmU
[2] https://github.com/SteveSimmons/PyData-PandasFromTheInside
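To get a feel for the row-wise versus column-wise gap on your own machine, here is a rough sketch. It is not taken from the presentation: the frame, the column names a and b, and the two helper functions are assumptions chosen for illustration, and the measured ratio will depend on data size, dtypes and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame with two integer columns.
df = pd.DataFrame({"a": np.arange(10_000), "b": np.arange(10_000)})


def sum_by_row(frame):
    """Slow path: a Python-level loop that visits every row."""
    return pd.Series([row.a + row.b for row in frame.itertuples(index=False)])


def sum_by_column(frame):
    """Fast path: a single vectorised operation on whole columns."""
    return frame["a"] + frame["b"]


row_result = sum_by_row(df)
col_result = sum_by_column(df)

# Time each approach a few times; absolute numbers vary between machines.
row_time = timeit.timeit(lambda: sum_by_row(df), number=3)
col_time = timeit.timeit(lambda: sum_by_column(df), number=3)
print(f"row-wise: {row_time:.4f}s, column-wise: {col_time:.4f}s")
```

Both functions compute the same values; only the column-wise one hands the loop over to NumPy.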