| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by riskneutral 2928 days ago
	Tidy features (like pipes) are detrimental to performance. The best things R has going for it are data.table, ggplot, stringr, RMarkdown, RStudio, and the massive, unmatched breadth and depth of special-purpose statistics libraries. Combined, this is a formidable and highly performant toolset for data analytics workflows, and I can say with some certainty that even though “base Python” might look prettier than “base R,” the combination of Python and NumPy is not necessarily more powerful or even a more elegant syntax. The data.table syntax is quite convenient and powerful, even if it does not produce the same “warm fuzzy” feeling that pipes might. NumPy syntax is just as clunky as anything in R, if not worse, largely because NumPy was not part of the base Python design (as opposed to languages like R and MATLAB that were designed for data frames and matrices). What is probably not a good idea (which the article unfortunately does) is to introduce people to R by talking about data.frame without mentioning data.table. Just as an example, the article mentions read.table, which is a very old R function which will be very slow on large files. The right answer is to use fread and data.table, and if you are new to R then get the hangs of these early on so that you don’t waste a lot of time using older, essentially obsolete parts of the language.

5 comments

roenxi 2928 days ago

> Tidy features (like pipes) are detrimental to performance.

Detrimental to the runtime performance; if you happen to be reading and processing tabular data from a csv (which is all I've ever used R for, I must admit), then you get real performance gains as a programmer. For one thing, it allows a functional style where it is much harder to introduce bugs. If someone is trying to write performant code they should be using a language with actual data structures (and maybe one that is a bit easier to parallelism than R). The vast bulk of the work done in R is not going to be time sensitive but is going to be very vulnerable to small bugs corrupting data values.

Tidyverse, and really anything that Hadley Wickham is involved in, should be the starting point for everyone who learns R in 2018.

> languages like R and MATLAB that were designed for data frames and matrices

Personal bugbear; the vast majority of data I've used in R has been 2-dimensional, often read directly out of a relational database. It makes a lot of sense why the data structures are as they are (language designed a long time ago in a RAM-lite environment), but it is just so unpleasant to work with them. R would be vastly improved by /single/ standard "2d data" class with some specific methods for "all the data is numeric so you can matrix multiply" and "attach metadata to a 2d structure".

There are 3 different data structures used in practice amongst the R libraries (matrix, list-of-lists, data.frame). Figuring out what a given function returns and how to access element [i,j] is just an exercise in frustration. I'm not saying a programmer can't do what I want, but I am saying that R promotes a complicated hop-step-jump approach to working with 2d data that isn't helpful to anyone - especially non-computer engineers.