by lottin 3644 days ago

As an R user I noticed a couple of oddities. First,

  len(df)

returns the number of rows rather than the number of columns. This strikes me as a bad idea, because data-frames are better thought of as a collection of columns. Typically you want to loop over the columns of a data-frame and not so much over its rows, which is performance-wise much more costly.

Second, the apply method seems totally redundant. Why call a method that calls a function when you can simply call the function directly:

  df['year'] = base_year(df.water_year)

Probably I'm missing something here.

6 comments

bicubic 3644 days ago

> This strikes me as a bad idea, because data-frames are better thought of as a collection of columns

The dataframe is a collection of records, so the len operator tells you how big the dataset you're dealing with is. You also have len(df.columns) and df.shape.
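A minimal sketch of the three calls mentioned above, on a made-up frame (the column names and values are assumptions for illustration, not data from the original article):

```python
import pandas as pd

# Hypothetical toy data: 3 observations (rows), 2 variables (columns).
df = pd.DataFrame({
    "water_year": ["1980/81", "1981/82", "1982/83"],
    "rain": [1.5, 2.0, 0.7],
})

n_rows = len(df)          # number of rows (records)
n_cols = len(df.columns)  # number of columns (variables)
shape = df.shape          # (rows, columns) as a tuple

print(n_rows, n_cols, shape)
```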

> Second, the apply method seems totally redundant

df.water_year refers to a column. You can certainly use the syntax you wrote, provided you crafted a function that manipulates a column in some way. E.g. if you had a function that returns the first 2 elements of what was given, passing a column to that function would return a view into that column with only the first 2 rows. Passing the same function into apply would process every element in the (string) column and return the first 2 letters, finally returning a brand new column where each row is the first 2 letters of the corresponding row of the input.
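The "first 2 elements" example can be sketched roughly like this. The helper name `first_two` and the sample strings are assumptions made up for illustration:

```python
import pandas as pd

# Hypothetical string column standing in for df.water_year.
df = pd.DataFrame({"water_year": ["1980/81", "1981/82", "1982/83"]})

def first_two(x):
    """Return the first two elements of whatever is passed in."""
    return x[:2]

# Called directly: the function receives the whole column (a Series),
# so the slice keeps only the first two rows.
direct = first_two(df.water_year)

# Passed to apply: the function receives one string at a time,
# so the slice keeps the first two characters of every row.
elementwise = df.water_year.apply(first_two)

print(direct.tolist())
print(elementwise.tolist())
```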

Both of these behaviours make perfect sense if you think about them in terms of expected Python and Numpy which Pandas is built on.

stevesimmons 3643 days ago

> Both of these behaviours make perfect sense if you think about them in terms of expected Python and Numpy which Pandas is built on.

My PyData London presentation "Pandas from the Inside" [1, 2] explains in detail how pandas gets its speed from numpy, with benchmarks comparing slow vs fast ways to do common operations. Column-wise operations can be three orders of magnitude faster than iterating by row.

[1] https://www.youtube.com/watch?v=Dr3Hv7aUkmU

[2] https://github.com/SteveSimmons/PyData-PandasFromTheInside
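A rough way to see the row-wise versus column-wise gap for yourself is sketched below. This is not taken from the presentation; the frame, its size, and the function names are assumptions, and the measured ratio will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame with two numeric columns.
df = pd.DataFrame({"a": np.arange(10_000), "b": np.arange(10_000)})

def by_row(frame):
    # Slow path: iterrows builds a Series for every row, then adds in Python.
    return [row["a"] + row["b"] for _, row in frame.iterrows()]

def by_column(frame):
    # Fast path: a single vectorised addition over whole columns.
    return frame["a"] + frame["b"]

slow = timeit.timeit(lambda: by_row(df), number=1)
fast = timeit.timeit(lambda: by_column(df), number=1)
print(f"row-wise: {slow:.4f}s, column-wise: {fast:.4f}s")
```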

lottin 3644 days ago

Thanks for clearing that up, now it does make sense. In R most functions handle vectors as well as scalars without distinction, so normally one would use the function directly. Whereas if you wanted to process each element of a vector individually then you'd use apply(). It works the other way around.

_Wintermute 3644 days ago

Well that's because R doesn't have scalars, just vectors containing a single value.

ACow_Adonis 3644 days ago

As someone who is, uh, fluent in R (begrudgingly), allow me to retort:

While you're right that in R a data frame is essentially a list of columns, this strikes me as a flaw in R. Others coming to R expect to be able to loop over the observations in a data frame, or get the number of observations by taking the length of the data structure. Indeed for most of my real world work that's what I actually want to do: iterate over customers or units that have multiple observations, stored as rows in the df with variables describing characteristics regarding that observation. I assure you, for everyone else coming to R, that is a genuine "WTF" moment when they loop across a data frame and find themselves iterating across variables rather than observations, or that they accidentally took the length of the data frame to be the number of observations rather than the number of variables: and I've got a glorious real world story of a bug caused by that on a 1 x 0 dimension data frame being returned by consultants' code...

I have no idea if that's how it's actually implemented in pandas though...

As for the apply thing: I'm guessing that has to do with Python syntax and the nature of functions/methods/data frames, but I agree with you it's a bit kludgy to me too. But I guess that's because what you're actually doing is applying a scalar function across a sequence of values, not actually calling a function that takes a sequence as an argument. In your example there, which is very R'y because the function application would be automatically vectorised, in Python there's no such (necessary) thing. The reason this "kind of" works "naturally" in R is actually because R is weird and takes an efficiency hit by not having unboxed scalar values at all: even single numbers are actually vectors, as is the result of the returned operations/functions on them, so you actually have no scalar operations at all (but for many applications you don't actually notice: [1] + [1] = [2] is effectively the same as 1 + 1 = 2 in an unvectorised language, barring the R resource hit which is insignificant in smaller examples/problems).

lottin 3644 days ago

Iterating over variables may seem counter-intuitive but it actually is the right thing to do when you have a data-frame.

The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.

If I recall correctly pandas uses a dictionary to implement data-frames, therefore iterating over rows in pandas has the same performance hit as in R.

nzjrs 3644 days ago

> The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.

Pandas saves its users the 'proper' step of 'converting the variables to a common data type', and lets me iterate over rows to get the observations. That seems like a win to me, no?

ACow_Adonis 3644 days ago

Ah, it's starting to come back to me...

Does this mean that pandas effectively implements the dataframe as a simple hash on columns vs R which does it as a list? Because if so, yes, that means that they'll probably be relatively comparable in practice.

But I don't think it's right to say there's a "right way" to do things with "datasets" though (I'm calling them that as a general concept for these rectangular data structures across languages and platforms, though I appreciate there are differences between their implementations). I do think there's an aesthetic and real effect drawn from the choices of each though, and I can speak loosely about preferences, style, pluses and minuses.

If pandas's underlying implementation does follow a column-based philosophy, then yes, I agree it's an interesting/weird choice to go with the row-based notions mentioned earlier in spite of this.

That being said, I think there's reasonable grounds to critique your notion that if you want to iterate over observations that you should have to split things out into matrices of different types. It's true, of course, that it might be more efficient to do so given how R chose to implement dataframes, but I would argue that the point of bringing disparate types of data together (in R or elsewhere) into a rectangular data structure that mixes types across the members of an observation is because you likely want to do operations on observations that involve mixed data.

It seems curious to me, therefore, that this is relatively inefficient and the preference is given to columns in R. And I've met enough people who were also caught out by this to think it's not just me.

SAS, for instance, for all its failures and quirks, effectively does this: pulls together basic mixed data types into a rectangular data structure for relatively efficient, compiled, row-based iterative operations across mixed data types. It's in this one area of analysis and arbitrary row-based data munging where SAS, I think, wipes the floor with R and the R data frame.

Now, I speak SAS and R quite fluently, as well as Lisp, from which the R implementation evolved, and when I look at the R data frame, I don't see beautiful design for observation based mixed data-type munging or analysis, I see a linked list of vectors. The R data structure philosophy of course plays to its strengths when you're doing modelling and things on finite columns of fixed variable types in data sets, but its weakness is in row based mixed-type data munging and analysis on messy data of mixed types (which is, also, I think R's and the data frame's dirty little insecurity).

It's an insecurity specifically because a lot of the real world data experience of what many people face and how many people think about data, and the reason they bring data into a rectangular mixed-type data asset...is because that's what they want to do...which could explain why pandas went that particular way: observations are often the general subject of analysis.

(or they might have done it with no particular thought, I don't know.)

lottin 3643 days ago

Yes, internally Pandas stores the data as a series of homogeneous arrays, which correspond to one or more columns in the data-frame. Details here: http://www.jeffreytratner.com/slides/pandas-under-the-hood-p...

I agree with what you say except that I consider data-frames one of R's strengths. What makes R data-frames great is that the language is designed around these data structures, thus allowing most of their inherent limitations to be overcome by following "good practices". The problem of porting data-frames to other environments as in the case of pandas in my opinion is precisely a lack of language support, which makes the whole thing feel a little stitched together.

sin7 3644 days ago

If you are fluent in R, why are you looping over a data frame?

ACow_Adonis 3643 days ago

I'm not saying I'm doing it (although sometimes I will for readability, small problems that can't be naively vectorised, and where I have to make code readable for non-R people).

But not everything is naively vectorisable or best expressed as a vector operation, which is an idea that offends some R programmers.

The truth is a lot of real world analysis is done where the observation is the unit of natural analysis, and not the variable, and lots of people from other languages think in rows vs columns.

Common Lisp realised this, and you've got there a language that allows for efficient expression of scalar, compiled loops, vectors and vectorisation/functional application, so I think this shows it's not entirely an either/or dichotomy in practice and is more about design/implementation choices and trade-offs.

My point is not that R gets it wrong, it's that you can't say the R way is the "right way".

keypusher 3644 days ago

Iterating over the rows is much more intuitive to me, just like rows in a database. In their example dataframe each row is a year, and columns represent different information about that year. So, if I wanted to compare rain from oct-sep on a yearly basis, I would iterate over the years (rows) and then grab that column by name.

filmor 3644 days ago

It's inconsistent though, as iterating over a dataframe like

    for c in df:

will return the column labels. I expect `len(obj)` to return the same as `len([i for i in obj])`.
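The mismatch described above can be reproduced with a small sketch (the frame and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: 3 rows, 2 columns.
df = pd.DataFrame({"year": [1980, 1981, 1982], "rain": [1.5, 2.0, 0.7]})

labels = [c for c in df]   # iterating a DataFrame yields its column labels
n_iterated = len(labels)   # counts columns
n_len = len(df)            # counts rows

print(labels, n_iterated, n_len)
```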

sin7 3644 days ago

Between dplyr, ifelse, and apply family functions, I don't think I've ever had to iterate over a data frame in R.

nzjrs 3644 days ago

As a heavy pandas user, I don't agree with the len() comment. Can you give an example?

lottin 3644 days ago

In R a data-frame is a list of vectors (in Python parlance, a dictionary of arrays). Therefore the length of a data-frame is the number of columns and an iteration over a data-frame iterates over its columns. Iterating over the rows can be done but it's generally better avoided because it's highly inefficient. The reason is that since the columns have different types each row has to be represented as a list. This is also true in Python, as far as I know.
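For the pandas side of this claim, a small sketch suggests something similar happens: each column keeps its own dtype, but a row pulled out of a mixed-type frame comes back as a generic object-typed Series. The frame below is a made-up example:

```python
import pandas as pd

# Hypothetical frame mixing integer, string, and float columns.
df = pd.DataFrame({
    "year": [1980, 1981],
    "station": ["A", "B"],
    "rain": [1.5, 2.0],
})

year_kind = df["year"].dtype.kind   # "i": integer column
rain_kind = df["rain"].dtype.kind   # "f": float column

# iterrows hands back each row as a Series; with mixed column types
# the row has to fall back to the generic object dtype.
_, first_row = next(df.iterrows())
row_dtype = first_row.dtype

print(df.dtypes)
print(row_dtype)
```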

nurettin 3644 days ago

Sure, if your column data is completely independent and you don't need more than one column at a time in a given algorithm, it is natural to iterate over columns instead of rows. However, if you need multiple columns (or data properties) at each iteration, which is more likely the case in my experience, then you end up iterating over the rows.

lottin 3644 days ago

That's what Pandas encourages you to do! In my experience iterating is rarely needed at all if you have functions that operate on arrays.
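As a hedged illustration of that point, here is a case that needs two columns per observation, written once as a row loop and once with an array function. The column names `rain_winter` and `rain_summer` and their values are assumptions invented for this sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row per year, two rainfall measurements per row.
df = pd.DataFrame({
    "rain_winter": [10.0, 3.0, 8.0],
    "rain_summer": [4.0, 3.5, 6.0],
})

# Row-by-row version: look at both columns for each observation.
wetter_loop = [
    "winter" if row.rain_winter > row.rain_summer else "summer"
    for row in df.itertuples(index=False)
]

# Array version: the comparison and the selection both work on whole columns.
wetter_vec = np.where(df["rain_winter"] > df["rain_summer"], "winter", "summer")

print(wetter_loop)
print(wetter_vec.tolist())
```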

Bromskloss 3644 days ago

> returns the number of rows rather than the number of columns. This strikes me as a bad idea

I don't know. In my eyes, "rows" is a name that refers to the first dimension of a possibly high-dimensional array. "Columns" would refer to the next dimension (and then I don't have any more names).

0. rows

1. columns

2. …

3. …
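This reading matches how NumPy arrays behave: len() reports the size of the first dimension, whatever the number of dimensions. A tiny sketch with an arbitrary, made-up shape:

```python
import numpy as np

# Hypothetical 3-dimensional array; the shape is arbitrary.
arr = np.zeros((3, 4, 5))

n_first = len(arr)   # size of dimension 0
shape = arr.shape    # sizes of all dimensions

print(n_first, shape)
```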