| HN Mirror


by ACow_Adonis 3644 days ago

As someone who is, uh, fluent in R (begrudgingly), allow me to retort:

While you're right that in R a data frame is essentially a list of columns, this strikes me as a flaw in R. Others coming to R expect to be able to loop over the observations in a data frame, or to get the number of observations by taking the length of the data structure. Indeed, for most of my real-world work that's what I actually want to do: iterate over customers or units that have multiple observations, stored as rows in the df with variables describing characteristics of each observation. I assure you, for everyone else coming to R, it is a genuine "WTF" moment when they loop across a data frame and find themselves iterating across variables rather than observations, or when they accidentally take the length of the data frame to be the number of observations rather than the number of variables. I've got a glorious real-world story of a bug caused by exactly that, on a 1 x 0 data frame returned by a consultant's code...

I have no idea if that's how it's actually implemented in pandas though...
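The length/iteration surprise described above can be sketched in plain Python by modelling a column-oriented frame as a dict of lists. This is only an illustration of the layout, not R's or pandas' implementation, and the column names and values are made up.

```python
# Hypothetical toy data: a column-oriented "frame" stored as a dict of lists.
frame = {
    "customer": ["a", "a", "b", "c"],
    "spend": [10.0, 12.5, 7.0, 3.0],
    "active": [True, True, False, True],
}

# Taking the length of the container counts variables (columns), not observations.
n_variables = len(frame)

# Iterating the container walks over variables, not observations.
iterated = [name for name in frame]

# The number of observations is the length of any one column
# (0 if there are no columns at all).
n_observations = len(next(iter(frame.values()), []))

# To visit observations, the columns have to be zipped back into rows.
rows = list(zip(*frame.values()))

print(n_variables, n_observations)
print(iterated)
print(rows[0])
```

For what it's worth, in pandas `len(df)` returns the number of rows, while iterating directly over a DataFrame yields the column labels.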

As for the apply thing: I'm guessing that has to do with Python syntax and the nature of functions/methods/data frames, but I agree with you that it's a bit kludgy. I guess that's because what you're actually doing is applying a scalar function across a sequence of values, not calling a function that takes a sequence as an argument. Your example is very R'y because in R the function application would be automatically vectorised; in Python there's no such (necessary) thing. The reason this "kind of" works "naturally" in R is that R is weird and takes an efficiency hit by not having unboxed scalar values at all: even single numbers are actually vectors, as are the results of operations/functions on them, so you have no scalar operations at all. For many applications you don't notice: [1] + [1] = [2] is effectively the same as 1 + 1 = 2 in an unvectorised language, barring the R resource hit, which is insignificant in smaller examples/problems.
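A rough sketch of the distinction between a scalar function applied across a sequence and a function that takes a sequence, using only the standard library. The `vectorise` helper is hypothetical, written here for illustration; it is not a pandas or R API.

```python
import math

def vectorise(scalar_fn):
    """Wrap a scalar function so it maps over a list of values."""
    def wrapper(values):
        return [scalar_fn(v) for v in values]
    return wrapper

values = [1.0, 4.0, 9.0]

# math.sqrt is a scalar function: handing it a list raises TypeError.
try:
    math.sqrt(values)
    scalar_accepts_list = True
except TypeError:
    scalar_accepts_list = False

# Explicitly applying the scalar function across the sequence works.
applied = [math.sqrt(v) for v in values]

# The wrapped version behaves more like an R function: sequence in, sequence out.
vsqrt = vectorise(math.sqrt)
roots = vsqrt(values)

# Treating a "scalar" as a length-1 sequence mirrors the [1] + [1] = [2] remark.
one_plus_one = [a + b for a, b in zip([1], [1])]

print(scalar_accepts_list, applied, roots, one_plus_one)
```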

2 comments

lottin 3644 days ago

Iterating over variables may seem counter-intuitive but it actually is the right thing to do when you have a data-frame.

The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.
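The "convert, then loop over rows" recipe might look something like this in plain Python. It is a sketch under the assumption that every column has a sensible numeric meaning; the column names and values are invented.

```python
# Hypothetical columns of different types (logical, integer, floating point).
columns = {
    "active": [True, False, True],
    "visits": [3, 0, 7],
    "spend": [10.5, 0.0, 22.25],
}

# Step 1: coerce every column to one common type (float here).
numeric = {name: [float(v) for v in col] for name, col in columns.items()}

# Step 2: transpose the columns into a row-major matrix (a list of rows).
matrix = [list(row) for row in zip(*numeric.values())]

# Step 3: loop over rows; every cell now has the same type.
row_totals = [sum(row) for row in matrix]

print(matrix)
print(row_totals)
```

Note that the coercion step only makes sense when each variable can be meaningfully turned into a number; a column of customer names, say, would not survive it.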

If I recall correctly, pandas uses a dictionary to implement data-frames, so iterating over rows in pandas has the same performance hit as in R.


nzjrs 3644 days ago

> The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.

Pandas saves its users the 'proper' step of 'converting the variables to a common data type', and lets me iterate over rows to get the observations. That seems like a win to me, no?


ACow_Adonis 3644 days ago

Ah, it's starting to come back to me...

Does this mean that pandas effectively implements the dataframe as a simple hash on columns vs R which does it as a list? Because if so, yes, that means that they'll probably be relatively comparable in practice.

But I don't think it's right to say there's a "right way" to do things with "datasets" (I'm calling them that as a general concept for these rectangular data structures across languages and platforms, though I appreciate there are differences between their implementations). I do think there's an aesthetic and a real effect drawn from the choices of each, though, and I can speak loosely about preferences, style, pluses and minuses.

If pandas' underlying implementation does follow a column-based philosophy, then yes, I agree it's an interesting/weird choice to go with the row-based notions mentioned earlier in spite of this.

That being said, I think there are reasonable grounds to critique your notion that if you want to iterate over observations you should have to split things out into matrices of different types. It's true, of course, that it might be more efficient to do so given how R chose to implement data frames, but I would argue that the point of bringing disparate types of data together (in R or elsewhere) into a rectangular data structure that mixes types across the members of an observation is that you likely want to do operations on observations that involve mixed data.

It seems curious to me, therefore, that this is relatively inefficient and that preference is given to columns in R. And I've met enough people who were also caught out by this to think it's not just me.

SAS, for instance, for all its failures and quirks, effectively does this: it pulls together basic mixed data types into a rectangular data structure for relatively efficient, compiled, row-based iterative operations across mixed data types. It's in this one area of analysis and arbitrary row-based data munging where SAS, I think, wipes the floor with R and the R data frame.
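As a loose Python analogue of that kind of row-based pass over mixed-type observations (not SAS syntax or semantics, and with invented field names and data), consider:

```python
from itertools import groupby
from typing import NamedTuple

class Observation(NamedTuple):
    customer: str    # character variable
    amount: float    # numeric variable
    refunded: bool   # logical flag

# Hypothetical records, already sorted by customer.
observations = [
    Observation("a", 10.0, False),
    Observation("a", 5.0, True),
    Observation("b", 7.5, False),
    Observation("b", 2.5, False),
    Observation("c", 4.0, True),
]

# One pass over the rows; each step sees every variable of the observation.
net_by_customer = {}
for customer, group in groupby(observations, key=lambda o: o.customer):
    total = 0.0
    for obs in group:
        # Mixed-type logic on a single row: a flag decides how a number is used.
        total += -obs.amount if obs.refunded else obs.amount
    net_by_customer[customer] = total

print(net_by_customer)
```

The point of the sketch is only that the observation, with all its differently typed fields, is the unit the loop works on.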

Now, I speak SAS and R quite fluently, as well as Lisp, from which the R implementation evolved, and when I look at the R data frame, I don't see beautiful design for observation based mixed data-type munging or analysis, I see a linked list of vectors. The R data structure philosophy of course plays to its strengths when you're doing modelling and things on finite columns of fixed variable types in data sets, but its weakness is in row based mixed-type data munging and analysis on messy data of mixed types (which is, also, I think R's and the data frame's dirty little insecurity).

It's an insecurity specifically because this is the real-world data experience many people face and how many people think about data: the reason they bring data into a rectangular mixed-type data asset is that this is what they want to do. That could explain why pandas went that particular way: observations are often the general subject of analysis.

(or they might have done it with no particular thought, I don't know.)


lottin 3644 days ago

Yes, internally Pandas stores the data as a series of homogeneous arrays, each of which corresponds to one or more columns in the data-frame. Details here: http://www.jeffreytratner.com/slides/pandas-under-the-hood-p...
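A toy illustration of that storage idea, in plain Python: columns are grouped by element type into homogeneous blocks, and a block can hold more than one column. This is not pandas' actual internal code; it assumes every element of a column shares the type of the first, and the data is made up.

```python
from collections import defaultdict

# Hypothetical frame with mixed column types.
frame = {
    "customer": ["a", "b"],
    "visits": [3, 7],
    "spend": [10.5, 22.25],
    "refunds": [0.0, 4.0],
}

# Group columns by element type: each block is homogeneous and may hold
# one or more columns of the frame.
blocks = defaultdict(dict)
for name, values in frame.items():
    element_type = type(values[0])
    blocks[element_type][name] = values

block_columns = {t.__name__: list(cols) for t, cols in blocks.items()}

# Reassembling a single row means reaching into every block.
row0 = {name: cols[name][0] for cols in blocks.values() for name in cols}

print(block_columns)
print(row0)
```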

I agree with what you say except that I consider data-frames one of R's strengths. What makes R data-frames great is that the language is designed around these data structures, thus allowing most of their inherent limitations to be overcome by following "good practices". The problem of porting data-frames to other environments as in the case of pandas in my opinion is precisely a lack of language support, which makes the whole thing feel a little stitched together.


sin7 3644 days ago

If you are fluent in R, why are you looping over a data frame?


ACow_Adonis 3643 days ago

I'm not saying I'm doing it (although sometimes I will for readability, small problems that can't be naively vectorised, and where I have to make code readable for non-R people).

But not everything is naively vectorisable or best expressed as a vector operation, which is an idea that offends some R programmers.
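One small example of a calculation that resists naive vectorisation is a running level that is clamped at zero at every step, so each value depends on the previous clamped result. The data below is invented, and the sketch is only meant to show why "cumulative sum, then clamp" is not the same thing.

```python
from itertools import accumulate

# Hypothetical daily changes to a stock level that can never go below zero.
changes = [5, -8, 3, -1, -4, 6]

# Scalar loop: each step depends on the previous *clamped* level.
levels = []
level = 0
for change in changes:
    level = max(0, level + change)
    levels.append(level)

# Naive "vectorised" attempt: cumulative sum first, clamp afterwards.
naive = [max(0, total) for total in accumulate(changes)]

print(levels)
print(naive)
```

The two results differ because the naive version lets the deficit carry forward instead of resetting at zero.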

The truth is a lot of real world analysis is done where the observation is the unit of natural analysis, and not the variable, and lots of people from other languages think in rows vs columns.

Common Lisp realised this, and there you've got a language that allows for efficient expression of scalar, compiled loops, vectors and vectorisation/functional application, so I think this shows it's not entirely an either/or dichotomy in practice and is more about design/implementation choices and trade-offs.

My point is not that R gets it wrong, it's that you can't say the R way is the "right way".
