|
As someone who is, uh, fluent in R (begrudgingly), allow me to retort: while you're right that in R a data frame is essentially a list of columns, this strikes me as a flaw in R. Others coming to R expect to be able to loop over the observations in a data frame, or to get the number of observations by taking the length of the data structure. Indeed, for most of my real-world work that's what I actually want to do: iterate over customers or units that have multiple observations, stored as rows in the df, with variables describing characteristics of each observation. I assure you, for everyone else coming to R, it is a genuine "WTF" moment when they loop across a data frame and find themselves iterating over variables rather than observations, or when they accidentally take the length of the data frame to be the number of observations rather than the number of variables. I've got a glorious real-world story of a bug caused by exactly that, on a 1 x 0 dimension data frame returned by consultants' code... I have no idea if that's how it's actually implemented in pandas though... As for the apply thing: I'm guessing that has to do with Python syntax and the nature of functions/methods/data frames, but I agree with you, it's a bit kludgy to me too. But I guess that's because what you're actually doing is applying a scalar function across a sequence of values, not calling a function that takes a sequence as an argument. Your example there is very R'y because the function application would be automatically vectorised; in Python there's no such (necessary) thing. 
The reason this "kind of" works "naturally" in R is actually because R is weird and takes an efficiency hit by not having unboxed scalar values at all: even single numbers are actually vectors, as are the results of operations/functions on them, so you have no scalar operations at all (but for many applications you don't actually notice: [1] + [1] = [2] is effectively the same as 1 + 1 = 2 in an unvectorised language, barring the R resource hit, which is insignificant in smaller examples/problems). |
The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.
If I recall correctly, pandas uses a dictionary to implement data-frames, so iterating over rows in pandas has the same performance hit as in R.
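
To make the column-versus-row point concrete, here is a rough sketch in plain Python. It assumes a data frame is nothing more than a dictionary of columns; that is a toy model for illustration, not a claim about how pandas is actually implemented, and the names `frame`, `customer` and `visits` are made up.

```python
# Toy model: a "data frame" as a plain dict mapping column name -> list of values.
# This is an illustrative assumption, not pandas' actual internals.
frame = {
    "customer": ["a", "b", "c"],
    "visits": [3, 1, 7],
}

# Iterating over the dict walks the variables (columns), not the observations.
columns_seen = [name for name in frame]

# len() counts columns, which is the trap described in the comment above.
n_columns = len(frame)

# The number of observations has to come from the length of a column instead.
n_rows = len(next(iter(frame.values()))) if frame else 0

# Looping over rows means rebuilding every row by pulling one value
# out of each column, which is where the extra cost comes from.
rows = [dict(zip(frame, values)) for values in zip(*frame.values())]

print(columns_seen)
print(n_columns, n_rows)
print(rows[0])
```

With a column-wise layout like this, anything per-column is cheap and anything per-row has to touch every column, which is the trade-off both comments are arguing about.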