by louden 3778 days ago
R is a language with a lot of gotchas. I usually get burned by characters being converted to factors in read.csv(), and by converting factors to numeric (it runs, but it returns the internal level codes rather than the values you intended). The R Inferno (http://www.burns-stat.com/documents/books/the-r-inferno/) has a lot of other gotchas and is worth a read for people who use the language.
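A minimal sketch of the factor-to-numeric gotcha, using a made-up vector rather than a real file. Note that since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE, so this mostly bites with older R versions or explicitly created factors.

```r
# Numbers that arrived as a factor, e.g. from an old-style read.csv() call.
# (The values are hypothetical, chosen only for illustration.)
f <- factor(c("10", "20", "5", "20"))

# Gotcha: as.numeric() returns the internal level codes, not the labels.
codes <- as.numeric(f)
print(codes)   # 1 2 3 2

# Go through the labels to recover the actual numbers.
values <- as.numeric(as.character(f))
print(values)  # 10 20 5 20

# Equivalent idiom that converts each level only once.
values2 <- as.numeric(levels(f))[f]
```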

That said, the power, flexibility and user community make it my go-to for any first crack at an analysis of data.

3 comments

What makes R infuriating is that most of the complexity and gotchas aren't inherent to the problem.

So you learn Clojure and you inevitably meet the collections. And it takes like 5 minutes to tell you how to map and how to reduce and then the lecture ends with "and it just works". And in fact it does just work.

Then you learn functional R and the first five minutes are the same as the Clojure experience. Then the slow-motion train wreck starts: "And R likes mapping so much, we have nine microscopically different apply statements for lists and tables, and they input some things and output other things, and if you pick the wrong one the failure looks like the Trinity nuclear test but more impressive". Every R language lecture is like that: five minutes of how real languages do it, then the rest of the 45 minutes is endless pitfalls and accidents. It's like a 45-minute-long fever dream or nightmare, "... and if you accidentally tapply, table apply, to a list, then it coerces the input to ...", and you drift back to Cthulhu, or maybe away from, whatever.
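To make the complaint concrete, here is a small sketch of how sapply() changes its return type depending on what the function returns, and how vapply() pins it down. It does not reproduce the tapply case mentioned above; the toy functions are assumptions for illustration only.

```r
# Same call shape, four different kinds of result.
squares <- sapply(1:3, function(i) i^2)          # numeric vector
pairs   <- sapply(1:3, function(i) c(i, i^2))    # 2 x 3 matrix
ragged  <- sapply(1:3, function(i) seq_len(i))   # list (lengths differ)
empty   <- sapply(integer(0), function(i) i^2)   # empty list, not numeric(0)

# vapply() takes a template for each result, so the output type is fixed
# and a mismatch is an error instead of a silent change of shape.
safe       <- vapply(1:3, function(i) i^2, numeric(1))
safe_empty <- vapply(integer(0), function(i) i^2, numeric(1))
```

When the shape of the result matters downstream, vapply() (or lapply() plus an explicit conversion) tends to be the less surprising choice.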

Pragmatically, if you teach R as a statistical analysis language, what looks weird often enough turns out to be super convenient. But if you try to teach and learn R as a general-purpose computational language, you wonder if it's a joke: nobody would actually use Intercal or BF to run analysis, would they?

It's a very powerful system in spite of the language. Think of PC hardware architecture going back to the old XT days: it's sinfully ugly, but it's quite capable. R is no PDP-11 or VAX, that's for sure.

Factors are one of the worst things in the R world. I don't recall ever needing factors, yet they creep in through many functions (read.csv, cut).

Btw, there's a nice readr package (from the Hadleyverse) that has a read_csv function that does away with factors by default.
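For comparison, a sketch of getting the same behaviour in base R with the stringsAsFactors argument of read.csv(). It reads from an inline string (hypothetical data) so it runs without a file, and it deliberately avoids readr to stay dependency-free.

```r
# Hypothetical two-row CSV, supplied inline through the `text` argument.
csv <- "name,score\nann,1.5\nbob,2.0\n"

# Old default behaviour (before R 4.0.0): strings become factors.
old_style <- read.csv(text = csv, stringsAsFactors = TRUE)

# Ask for plain character columns explicitly.
as_text <- read.csv(text = csv, stringsAsFactors = FALSE)

str(old_style$name)  # a factor
str(as_text$name)    # a character vector
```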

You should use factors for data cleaning and verification.

So you have "sex" on the questionnaire, and a factor will very quickly identify contamination such as "often", "not yet", various misspellings, etc.
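A rough sketch of that cleaning workflow, assuming a made-up questionnaire column and an assumed set of two valid answers:

```r
# Hypothetical raw answers, including the kinds of contamination described.
sex <- c("male", "female", "Female", "often", "femal", "male", "not yet")

# Tabulating the factor shows every distinct value and how often it occurs.
counts <- table(factor(sex))
print(counts)

# Compare the observed levels against the values we expect (an assumption).
expected   <- c("female", "male")
unexpected <- setdiff(levels(factor(sex)), expected)
print(unexpected)

# Fixing the levels up front turns anything unexpected into NA.
cleaned <- factor(sex, levels = expected)
print(sum(is.na(cleaned)))  # 4
```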

How would you represent categorical data then? R's primary use case isn't text processing. And Hadley Wickham isn't always right.

As character, for instance (in particular, character vectors can do everything factors can do when used in conjunction with `unique`, and ordered factors can be represented as a combination of character and numeric vectors). Factors work better, but only barely. In particular, they are nowadays not any more efficient than using character (!). They used to be, which is why they are liberally used everywhere in R’s base libraries.

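One way to read the `unique` remark, sketched with a hypothetical vector: `unique()` supplies the set of categories and `match()` supplies the integer codes, which is the same information a factor stores.

```r
# Hypothetical categorical data held as plain character.
colour <- c("red", "blue", "red", "green", "blue")

# The distinct categories, in order of first appearance.
lv <- unique(colour)
print(lv)      # "red" "blue" "green"

# Integer codes, as a factor would store them.
codes <- match(colour, lv)
print(codes)   # 1 2 1 3 2

# The equivalent factor, built with the same level order.
f <- factor(colour, levels = lv)

# Counting per category works on the character vector directly.
counts <- table(colour)
```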
"In particular, they are nowadays not any more efficient than using character"

How could a comparison of two strings of unknown size be as efficient as comparing two integers? I'm curious to learn something new.

R uses a global string cache so any string comparison is just comparing two pointers.

You will (inevitably?) run into factors when importing data from SPSS files... sure, you can discard them upon reading... but are you sure you don't want access to the value labels in the future?

Factors are weird because no other language has anything like them, but they are actually quite a clever way to group data. It just takes a while to get used to them.

I actually use factors a fair amount, and having factor-like data shoved into numeric values gets you to some bad places statistically.

You must not do a lot of regression with categorical data, then. I use commands like `lm(y ~ (x1 + x2) * factor_variable, data = d)` and `xyplot(y ~ x1 | factor_1, groups = factor_2, data = d)` all the time.

Those also work just fine with strings.

Via an implicit call to factor, right?
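A quick sketch that checks this on simulated data (all names and numbers below are made up): fitting with a character column and fitting with the same column converted to a factor give the same coefficients, which is consistent with lm() treating the strings as a factor when it builds the model matrix.

```r
set.seed(1)

# Simulated data with a character grouping column (hypothetical).
d <- data.frame(
  x = rnorm(30),
  g = rep(c("a", "b", "c"), each = 10),
  stringsAsFactors = FALSE
)
d$y <- 1 + 2 * d$x + ifelse(d$g == "b", 0.5, 0) + rnorm(30, sd = 0.1)

# Fit with the grouping variable left as character.
fit_chr <- lm(y ~ x * g, data = d)

# Fit again with the same column converted to a factor explicitly.
d_f <- transform(d, g = factor(g))
fit_fct <- lm(y ~ x * g, data = d_f)

# The column in `d` is still character, yet the fits agree.
print(is.character(d$g))
print(all.equal(coef(fit_chr), coef(fit_fct)))
```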

Factors are great, and surprisingly powerful even outside of statistical computing. With that being said, I prefer to create them on purpose rather than having read.csv attempt to be helpful.

A factor already is a vector of numeric values, which happen to have names.
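A small sketch of what that looks like from inside R, using a hypothetical three-level factor. Strictly, the values are integer codes and the "names" live in a `levels` attribute, and R itself does not report a factor as numeric.

```r
# Hypothetical factor with an explicit level order.
f <- factor(c("low", "high", "low", "mid"), levels = c("low", "mid", "high"))

# The underlying storage is an integer vector.
print(typeof(f))       # "integer"

# Stripping the class shows the codes and the "levels" attribute.
print(unclass(f))

# The codes index into the levels.
print(as.integer(f))   # 1 3 1 2
print(levels(f))       # "low" "mid" "high"

# Despite the integer storage, is.numeric() is FALSE for factors.
print(is.numeric(f))   # FALSE
```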