by hendiatris 2930 days ago
>A R factor is a sequence type much like a character atomic vector except that the values of the factor are constrained to a set of string values, called “levels”. For example, if you have a table of measurements of some widgets and each row corresponds to a single measurement of a single widget, you could have a factor-typed column called measurement.type containing the values “length”, “width”, “height”, “weight”, and “hue”, with the corresponding numeric measurements stored in a “value” column.

This is a very bad example of what factors are for in R, because it makes it seem like factors are for defining variables or keys in key value pairs. You can use them for that, but it isn't the intended use. A better example would be:

Suppose you were comparing the amount of sugar in fruits across several growing locations, and you had three columns:

| Fruit | Location | Density (g/L) |

Fruit would be a factor variable (say, with the levels apple, banana, and orange), and Location could be one too, if it took a discrete set of values (as opposed to lat/lon coordinates).
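To make that concrete, here is a minimal sketch of such a table in base R. The location names and density values are invented purely for illustration, and the column names are my own choice:

```r
# Hypothetical data: all values below are made up for illustration.
sugar <- data.frame(
  fruit    = factor(c("apple", "banana", "orange", "apple", "banana", "orange")),
  location = factor(c("north", "north", "north", "south", "south", "south")),
  density  = c(100, 120, 90, 110, 125, 95)  # sugar density in g/L
)

# Each factor column carries its set of allowed categories as levels.
levels(sugar$fruit)
levels(sugar$location)

# Factors act as grouping variables: mean density per fruit.
mean_by_fruit <- tapply(sugar$density, sugar$fruit, mean)
mean_by_fruit
```

The point is that the factor describes a category each observation belongs to, which is what grouping and modelling functions expect.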

This author seems to forget that R was built for working with data in an analytical setting, unlike all of the languages he's comparing it to. It has crept into other areas, but that seems to be because, in the hands of a skilled user, it makes implementing a data analysis solution far easier. I'm sure someone will come in and say how much better pandas is, but on small datasets I'll stick with R, especially given how brittle and buggy matplotlib is.


> This is a very bad example of what factors are for in R, because it makes it seem like factors are for defining variables or keys in key value pairs

That is the approach for tidy data, which is used a lot in the R tidyverse (http://tidyr.tidyverse.org/articles/tidy-data.html)

>> This is a very bad example of what factors are for in R, because it makes it seem like factors are for defining variables or keys in key value pairs

> That is the approach for tidy data, which is used a lot in the R tidyverse (http://tidyr.tidyverse.org/articles/tidy-data.html)

Do you have a reference to where Hadley et al. suggest using factors in a key-value system? I'm reading Wickham's books at the moment and have not seen this assertion. Indeed, I believe he would not state this, as he explains the utility of factors explicitly:

    A factor is a vector that can contain only predefined values, and is used to store categorical
    data... Factors are useful when you know the possible values a variable may take, even if you don’t
    see all values in a given dataset...
Advanced R, pp. 21-22
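A small sketch of what that quote describes, in base R. The fruit values are hypothetical; the behaviour shown (unseen levels are kept, values outside the levels become NA) is standard for factors:

```r
# "orange" is a known level even though it never appears in the data.
fruit <- factor(c("apple", "banana", "apple"),
                levels = c("apple", "banana", "orange"))

# table() still reports the unseen level, with a count of zero.
counts <- table(fruit)
counts

# Assigning a value outside the levels produces NA. R also emits an
# "invalid factor level" warning, which is suppressed here to keep output quiet.
checked <- fruit
suppressWarnings(checked[2] <- "kiwi")
is.na(checked[2])
```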
It was my interpretation of the original article quote that it was referring to a tidy schema, but I could be incorrect. (The gather() function of tidyr names its parameters key and value as well, and the function is described as "Gather columns into key-value pairs": http://tidyr.tidyverse.org/reference/gather.html)

If you are interested in this topic and haven't seen Advanced R, I'd recommend taking a look - the book explains why those functions have key-value pairs as parameter names. Note that the functions you cite aren't related to factors.
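For what it's worth, a quick sketch of gather() on a made-up widget table, assuming tidyr is installed. The table and column names are hypothetical. As far as I can tell, the key column comes back as a plain character vector unless you opt in with factor_key = TRUE:

```r
library(tidyr)  # assumes the tidyr package is installed

# Hypothetical wide table: one row per widget, one column per measurement.
widgets <- data.frame(
  widget = c("a", "b"),
  height = c(1.0, 2.0),
  width  = c(0.5, 0.7),
  stringsAsFactors = FALSE
)

# Gather the measurement columns into key-value pairs.
long <- gather(widgets, key = "measurement", value = "value", height, width)

# By default the key column is a character vector, not a factor.
class(long$measurement)

# factor_key = TRUE stores the key as a factor, keeping the column order as levels.
long_f <- gather(widgets, key = "measurement", value = "value", height, width,
                 factor_key = TRUE)
levels(long_f$measurement)
```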