HN Mirror


by mshron 1803 days ago

I highly recommend anybody getting into R to skip the base language (which indeed is ancient and full of gotchas) and go straight for the Tidyverse[1]. You can always go back in and learn how to do things the old way later.

Over the last decade, the R community has largely standardized around tools like dplyr, ggplot, tibble, purrr, and so on that make doing data science work way easier to reason about. Much more ergonomic. At my company we switched from using Python to using R for most analytical data science work because the Tidyverse tools make it so much easier to avoid bugs and weird join issues than you get in a more imperative programming environment.

[1] https://www.tidyverse.org/

5 comments

uryga 1803 days ago

i would recommend getting comfortable with doing stuff with base R, then trying tidyverse. Starting with dplyr might get you results quick, but its "special evaluation" actively confuses your understanding of how the base language actually works (speaking from experience with an R course and subsequently helping other confused folks)

Consider this example:

  # base R
  starwars[starwars$height < 200 & starwars$gender == "male", ]
  
  # dplyr
  starwars %>% filter(
    height < 200,
    gender == "male"
  )

(Source: https://tidyeval.tidyverse.org/sec-why-how.html)

Where'd `height` and `gender` come from in the dplyr version? They're just columns in a DF, not variables, and yet they act like variables... Well that's the dplyr magic baby!

dplyr (and other tidystuff) achieves this "niceness" by doing a whole bunch of what amounts to gnarly metaprogramming[1] -- that example was taken from a whole big chapter about "Tidy evaluation", describing how it does all this quote()-ing and eval()-ing under the hood to make the "nicer" version work. it's (arguably) more pleasant to read and write, but much harder to actually understand -- "easy, but not simple", to paraphrase a slightly tired phrase.

---

[1] IIRC it works something like this. the expressions

  height < 200
  gender == "male"

are actually passed to `filter` as unevaluated ASTs (think lisp's `quote`), and then evaluated in a specially constructed environment with added variables like `height` and `gender` corresponding to your dataframe's columns. IIRC this means it can do some cool things like run on an SQL backend (similar to C#'s LINQ), but it's not something i'd expose a beginner to.
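
To make the footnote concrete, here is a rough base-R sketch of the same idea. It is not how dplyr actually implements `filter` (its real machinery is more involved); `toy_filter` and the `people` data frame are made-up names for illustration, and the only calls used are base R's `substitute()` and `eval()`.

```r
# toy_filter is a hypothetical helper, NOT dplyr's implementation.
toy_filter <- function(df, cond) {
  expr <- substitute(cond)                 # capture the unevaluated expression (the "AST")
  keep <- eval(expr, df, parent.frame())   # evaluate it with df's columns visible as variables
  df[!is.na(keep) & keep, , drop = FALSE]  # keep rows where the test is TRUE (drop FALSE and NA)
}

# Made-up data, standing in for the starwars data frame.
people <- data.frame(
  name   = c("a", "b", "c"),
  height = c(150, 210, 180),
  gender = c("male", "male", "female")
)

# height and gender are never defined as variables, yet this works:
res <- toy_filter(people, height < 200 & gender == "male")
print(res)
```

Only row "a" passes both tests. Names that are not columns are looked up in the caller's environment, which is why the third argument to `eval()` is `parent.frame()`.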


canjobear 1803 days ago

My experience is that this weird evaluation order stuff is only confusing for students with a lot of programming experience who already expect nice lexical scope. For those coming in from Excel, the tidyverse conventions are no problem and are in fact easier than all the pedantic quoting you have to do in something like Pandas. It only gets confusing when you want to write new tidyverse functions, and even then, base R isn’t any simpler: the confusing evaluation order is built into R itself at the deepest level.


uryga 1803 days ago

EDIT: i gotta admit, you sound like you've got more experience with teaching R than me. so perhaps my opinions here are a bit strong for what they're based on, i.e. tutoring a couple of non-programmer friends and my own learning process. still...

> My experience is that this weird evaluation order stuff is only confusing for students with a lot of programming experience who already expect nice lexical scope

fair point, but for the most part, R itself does use pretty standard lexical scoping unless you opt into "non-standard evaluation" by using `substitute`[1]. so building a mental model of lexical scoping and "standard evaluation" is a pretty important thing to learn. after that, the student can see how quoting can "break" it, or at least be able to understand a sentence like "you know how evaluation usually works? this is different! but don't worry about it too much for now". and i think dropping someone new straight into tidyverse stuff gets in the way of this process.

> and even then, base R isn’t any simpler: the confusing evaluation order is built into R itself at the deepest level.

i mean, quoting can't really work without being deeply integrated into the language, can it? besides:

- AFAICT base R data manipulation functions don't use it a lot. [2]

- for the most part, R's evaluation order can be ignored (at a certain learning stage) because it's not observable if you stick to pure stuff, which you probably should anyway.

---

[1] http://adv-r.had.co.nz/Computing-on-the-language.html#captur...

[2] admittedly, stuff with `formula`s is similarly wacky, and if you're doing stats you're going to run into that sooner or later...


canjobear 1803 days ago

It's true that if you write code that is pure and error-free then you will never bump up against R's strangeness.

But try out this bit of base R:

  hello = function(cats, dogs) { return(cats) }
  hello(100, honk) # where honk has never been defined
  hello(100, print("hello!"))
  hello(100)


uryga 1802 days ago

yeah, the intersection of lazy evaluation and side-effects (incl. errors/exceptions) gets confusing, you definitely have to be there to help the student out of a jam. but i think it's useful to start out pretending R follows strict evaluation (because it's natural[1]) and then, once the student gets their bearings, you can introduce laziness.

---

[1] well, not "natural", but aligned with how math stuff is usually taught/done. in most cases, when asked to evaluate `f(x+1)`, you first do `x+1` and then take `f(_)` of that.
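
For anyone without an R session handy, here is a small sketch of the laziness being discussed, reusing the `hello` function from the parent comment. `uses_dogs` is a hypothetical name added for contrast; it assumes `honk` is still undefined in the session.

```r
hello <- function(cats, dogs) { return(cats) }

# dogs is never touched inside hello, so none of these arguments are evaluated:
r1 <- hello(100, honk)             # honk is undefined, but it is never looked up
r2 <- hello(100, print("hello!"))  # print() never runs, so nothing is printed
r3 <- hello(100)                   # a missing argument is fine if it is never used

# As soon as the body touches the argument, it gets evaluated.
# uses_dogs is a made-up function for this sketch.
uses_dogs <- function(cats, dogs) { dogs; cats }
r4 <- tryCatch(uses_dogs(100, honk), error = function(e) "error")

print(c(r1, r2, r3))
print(r4)
```

The first three calls all return 100 without complaint; only the last one fails, because `dogs` is finally evaluated and `honk` does not exist.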


buixuanquy 1803 days ago

Wow, now I understand the reasons. As a Python guy, I'm having trouble understanding how this is possible in R.


uryga 1802 days ago

https://www.r-bloggers.com/2018/07/about-lazy-evaluation/

tldr: basically, R passes all function arguments as bundles of `(expr_ast, env)` [called "promises"]. normally, they get evaluated upon first use, but you can also access the AST and mess around with it. AFAIK this is called an "Fexpr" in the LISP world.

(originally i had a nice summary, but my phone died mid-writing and i'm not typing all that again, sorry!)

it's very powerful (at the cost of being slow and, i imagine, impossible to optimize). it enables lots of little DSLs everywhere - e.g. lm() from stats, aes() from ggplot2, any dplyr function - which can be both a blessing and a curse.
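
A tiny sketch of the two halves of that tldr, using only base R. `show_expr` and `twice` are made-up helper names: the first pulls the expression out of the promise without evaluating it, the second shows that a promise is evaluated once, on first use, and then cached.

```r
# 1. Access the expression behind an argument without evaluating it.
show_expr <- function(x) substitute(x)   # hypothetical helper
e <- show_expr(a + b * 2)                # a and b do not need to exist
print(class(e))                          # the captured expression is a "call" object
print(deparse(e))                        # back to source text

# 2. A promise is evaluated at most once, the first time it is used.
counter <- 0
twice <- function(x) { x; x }            # hypothetical helper: uses x two times
twice({ counter <- counter + 1; counter })
print(counter)                           # the side effect happened only once
```

The block passed to `twice` runs in the caller's environment, so `counter` ends up at 1 rather than 2 even though `x` is used twice.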


jstx1 1803 days ago

I would recommend the opposite - pick stuff from tidyverse (mostly dplyr and ggplot2) only if you need them. Knowing base R goes a long way on its own.


melling 1803 days ago

A pipe operator (`|>`) is now in base R as of 4.1.0, so one of the nice Tidyverse features is now standard

https://www.r-bloggers.com/2021/05/new-features-in-r-4-1-0/
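
A minimal example of the native pipe, assuming R 4.1.0 or later and no packages attached. The `squares` vector is made up for illustration.

```r
# Native pipe: x |> f() is parsed as f(x)
squares <- c(1, 4, 9, 16)
result <- squares |> sqrt() |> sum()

# The same computation written as nested calls:
nested <- sum(sqrt(squares))

print(result)
print(identical(result, nested))
```

Unlike magrittr's `%>%`, the right-hand side has to be written as a call (`sqrt()`, not bare `sqrt`).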


tarsinge 1803 days ago

I found the book R for Data Science (which is free http://r4ds.had.co.nz) to be a very good introduction to R with Tidyverse.


tpoacher 1803 days ago

I'm on the same page as the other commenter here, except stronger.

Avoid tidyverse like the plague, except when you can't, or when you don't actually care about the sanity of your code and are happy copy/pasting pre-prescribed snippets without needing to understand let alone modify them.


mr_toad 1802 days ago

> ggplot

One day I’ll have a whole week free so I can sit down and learn an entire graphical grammar so that I can remove the egregious amounts of chart-junk in the ggplot defaults.
