by lottin
3644 days ago
As an R user I noticed a couple of oddities. First, len(df) returns the number of rows rather than the number of columns. This strikes me as a bad idea, because data frames are better thought of as a collection of columns. Typically you want to loop over the columns of a data frame rather than over its rows, which is much more costly performance-wise.
Second, the apply method seems totally redundant. Why call a method that calls a function when you can simply call the function directly, as in df['year'] = base_year(df.water_year)?
I'm probably missing something here.
|
A dataframe is a collection of records, so len tells you how big the dataset you're dealing with is. You also have len(df.columns) and df.shape.
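A quick sketch of those three ways of asking for the size. The data and the rain_octsep column name are made up for illustration; only water_year comes from the thread.

```python
import pandas as pd

# Hypothetical toy data, not the dataset discussed in the article.
df = pd.DataFrame({
    "water_year": ["1990/1991", "1991/1992", "1992/1993"],
    "rain_octsep": [1182, 1098, 1156],
})

n_rows = len(df)           # number of rows (records)
n_cols = len(df.columns)   # number of columns
shape = df.shape           # (rows, columns) tuple

print(n_rows, n_cols, shape)
```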
> Second, the apply method seems totally redundant
df.water_year refers to a column. You can certainly use the syntax you wrote, provided you've written a function that manipulates a whole column in some way. The two calls are not equivalent, though. E.g. if you had a function that returns the first 2 elements of whatever it is given, passing a column to that function would return a view into that column with only the first 2 rows. Passing the same function to apply would instead process every element of the (string) column and take its first 2 letters, returning a brand new column where each row holds the first 2 letters of the corresponding input row.
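To make the difference concrete, here is a minimal sketch. first_two is a hypothetical helper and the data is made up; it assumes a string column with the default integer index.

```python
import pandas as pd

def first_two(x):
    # Hypothetical helper: the first two elements of whatever is passed in.
    return x[:2]

# Made-up string column for illustration.
df = pd.DataFrame({"water_year": ["1990/1991", "1991/1992", "1992/1993"]})

# Called directly, the function receives the whole column and slices rows.
whole_column = first_two(df.water_year)

# Through apply, the function receives one string at a time and slices letters.
per_element = df.water_year.apply(first_two)

print(whole_column.tolist())
print(per_element.tolist())
```

The direct call gives back 2 rows, while apply gives back a column of the same length as the input.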
Both of these behaviours make perfect sense if you think about them in terms of what you'd expect from Python and NumPy, which Pandas is built on.