by clarityPhone 2549 days ago
I skimmed through the book, and I think it does a very poor job of showcasing how R and Python are juxtaposed in industry.

To be fair, the book advertises showing R and Python code side-by-side. And that’s what it does. But it does so in a way that doesn't match how the languages are most often used in industry.

As a quick example, I saw no tidyverse code, which is essentially the only thing keeping R in the game. Learning R from this book won’t prepare you for writing R in most R shops.

I don’t see the utility in knowing how to do the same thing in both python and R if you’re a beginner. This is even more true if you’re not taking advantage of the strengths/weaknesses of either language.

Instead, just learn one of the languages well, and then learn the other well. Shallow dives in both will make you weak in both.

Unfortunately, 90% of data science content seems to be geared at beginners.

6 comments

> tidyverse code, which is essentially the only thing keeping R in the game

From my experience this is not the case. In biomedicine and bioinformatics few people actually use tidyverse because the data is much better represented as a matrix, and not in the "tidy" form.
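To make the matrix vs. "tidy" distinction concrete, here is a rough sketch in Python (pandas/numpy rather than R). The gene names, sample names, and values are made up for illustration. It shows the same data once as a genes-by-samples matrix, where matrix-style operations are a single call, and once melted into long form.

```python
import numpy as np
import pandas as pd

# Made-up expression values: rows are genes, columns are samples.
expr = pd.DataFrame(
    [[5.0, 6.0, 7.0],
     [1.0, 0.5, 0.2],
     [3.0, 3.1, 2.9]],
    index=["geneA", "geneB", "geneC"],
    columns=["s1", "s2", "s3"],
)

# Matrix form: sample-by-sample correlation is one call on the array.
# np.corrcoef treats each row as a variable, hence the transpose.
sample_corr = np.corrcoef(expr.to_numpy().T)

# "Tidy" form: one row per (gene, sample) observation.
tidy = expr.rename_axis("gene").reset_index().melt(
    id_vars="gene", var_name="sample", value_name="expression"
)

print(sample_corr.shape)  # (3, 3)
print(tidy.shape)         # (9, 3)
```

The long form is convenient for grouping and plotting, but anything that is naturally linear algebra would first have to be pivoted back into the matrix.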

Outside of that, corporations (well, at least the 2 I contracted with) used `data.table` explicitly. Joining 3 ad-click dataframes by userID, sessionID and closest possible time-point is one line in `data.table`.
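For comparison, a minimal sketch of the same kind of join on the Python side, using `pandas.merge_asof`. This is not the `data.table` one-liner itself. The three tables, their column names (`time`, `ad`, `slot`, `amount`), and the values are hypothetical, and the helper `nearest_join` is just a name introduced here.

```python
import pandas as pd

# Hypothetical ad-click tables; all names and values are made up.
clicks = pd.DataFrame({
    "userID": [1, 1, 2],
    "sessionID": [10, 10, 20],
    "time": pd.to_datetime(["2019-01-01 10:00:05",
                            "2019-01-01 10:03:00",
                            "2019-01-01 11:00:00"]),
    "ad": ["a", "b", "c"],
})
impressions = pd.DataFrame({
    "userID": [1, 1, 2],
    "sessionID": [10, 10, 20],
    "time": pd.to_datetime(["2019-01-01 10:00:00",
                            "2019-01-01 10:02:50",
                            "2019-01-01 10:59:58"]),
    "slot": ["top", "side", "banner"],
})
conversions = pd.DataFrame({
    "userID": [1, 2],
    "sessionID": [10, 20],
    "time": pd.to_datetime(["2019-01-01 10:03:30",
                            "2019-01-01 11:05:00"]),
    "amount": [9.99, 4.50],
})

def nearest_join(left, right):
    # merge_asof needs both sides sorted by the "on" column.
    # "by" requires exact matches; "on" is matched to the nearest time.
    return pd.merge_asof(
        left.sort_values("time"),
        right.sort_values("time"),
        on="time",
        by=["userID", "sessionID"],
        direction="nearest",
    )

joined = nearest_join(nearest_join(clicks, impressions), conversions)
print(joined)
```

Chaining two as-of joins covers the three-table case. Without a `tolerance` argument every click gets a match as long as its user/session group exists on the right side.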

Tidyverse is well suited for learning and for managing (relatively) simple datasets, but it becomes cumbersome for more complex data. It can be used for that data too, of course, but you end up adding ad-hoc solutions, and it may get in the way more than it helps.

I agree with your experiences.

I've only used base R for my medical data (subsetting dataframes and such). Very rarely do I need tidy, and I also find the pipe operator makes debugging harder. If and when I need it, I'll use it, and that's that.

I think R has many more packages in the medical field, especially statistical packages, since many fields within medicine care about inference and not just prediction/forecasting. So I disagree with the "essentially the only thing keeping R in the game". The breadth of packages in R is one of the many things that keep R in the game.

The tribalism and highly biased comments make it very toxic and harder to have an honest discourse.

They are just tools; use what makes you happy and gets the job done.

I have a similar feeling. And that is why I spent one whole chapter on data.table (and pandas). I hope more R users would like to learn and use data.table.

I agree for the most part, but R does have a few things beyond the tidyverse: built-in dataframe support, lots of domain-specific packages, more consistent interfaces for basic statistics and machine learning models, etc. Python is definitely better for matrices (because of NumPy) and anything involving custom gradient descent methods (because of TensorFlow).

I think 90% of data science content is for beginners because anything more advanced isn't best described as data science. As soon as you get beyond the initial stages of data analysis (cleaning and processing data), you're doing something best described as some other word (statistics, machine learning, etc.) - although, granted, there isn't much content in these areas if you don't know _exactly_ what you're looking for.

Even if you ignore the tidyverse, the example code for "roll your own linear regressions by hand" uses the R6 object system, which is... not even one of the two popular object systems for R (which are S3 and S4). No beginner needs to learn how to write classes in R.

`no beginner needs to learn how to write classes in R`. a) Using classes properly is great for R users at all levels; b) a major reason that classes are not widely used (by beginners) is that S3/S4 are not easy to follow. R6 provides a natural and clear way to understand and write classes (especially for beginners).

Using classes at all is unnecessary for most R users. R is really, to the extent that paradigms matter to the average R user at all (which is: not much), a functional-first language. The idiomatic way to deal with the things you would use classes for is to use functions and closures. There are people who need objects in R, which is why R has object systems available, but it is of no help to a beginner to know them: it doesn't help them to interact with the code they are going to see, and they don't have the background to understand why you would use classes instead of functions.
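To illustrate the "functions and closures instead of classes" point, here is a minimal sketch of a hand-rolled linear regression written as a closure. It is in Python rather than R and is not the book's code. `fit_line` is a hypothetical name, and the data is made up.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns a predict function."""
    # Design matrix with a column of ones for the intercept.
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    intercept, slope = coef

    def predict(new_x):
        # The fitted coefficients are captured from the enclosing scope,
        # so no class or object system is needed to carry the state.
        return intercept + slope * np.asarray(new_x, dtype=float)

    return predict

# Made-up data lying exactly on y = 1 + 2x.
predict = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(predict([4.0, 5.0]))
```

The returned function plays the role a fitted model object would play in a class-based version.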
> built-in dataframe support

Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent apis and behaviour.

> lots of domain-specific packages

That's true.

> more consistent interfaces for basic statistics and machine learning models

Couldn't disagree more - there is no one go-to library for ML in R (like sklearn in Python) and each package has its own strange interface and implementation.

> Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent apis and behaviour.

I've been fortunate to only work on projects that use built-in data frames, and have never encountered tibble or data.table in the wild.

> there is no one go-to library for ML in R (like sklearn in Python) and each package has its own strange interface and implementation.

I still disagree here; one example is the unified interface for generalized linear models. The vast majority of classifiers (RF, SVM, etc.) also have similar or identical interfaces, and there is the unified `predict` interface. Granted, `sklearn` does have a consistent API as well.

That said, some of this is just a personal preference for the vaguely functional interface in R. The object-orientedness in Python feels a little forced for some tasks in `sklearn`.

> because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it.

To make what I think your point is more explicit, people build their own things in R because R must maintain compatibility with S. So by and large, changes happen in packages and not the base language. This does lead to a proliferation of solutions for the same kinds of problems.

You mean like keras? or tensorflow? Or base random forest. You know, like the original Breiman implementation.

Python has utility. But R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.

I run a machine learning shop. Right now all of the training, application, and data management is handled via R. R is simply superior in too many ways for us to be bothered with python for the scale of work we are doing.

Since we're moving some big applications to keras/ TF we do use python and will be using more in the future. However, for almost all data management, munging, movement, visualization, and reporting, it's an R world.

> You mean like keras? or tensorflow? Or base random forest. You know, like the original Breiman implementation.

> ...

> Since we're moving some big applications to keras/ TF we do use python and will be using more in the future.

Not sure if I misunderstood, or you're contradicting yourself there.

> R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.

Not only do I disagree, I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.

> > R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.

> Not only do I disagree, I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.

I partially agree with you here. I'm extremely careful about what non-standard packages I use in R. Code quality varies wildly outside of these, likewise for documentation. But outside of neural networks, I've never found a package in Python that I felt better about in terms of code quality or documentation than its equivalent in R.

My point behind the keras/TF comment is that the libraries have front ends in both Python and R, so it's dealer's choice which one you work in (since the backends of both are identical).

The primary reason for moving these to Python is convenience and the community. Most new work is published in Python. If we find a new or interesting model we want to implement, it's probably written in Python. Rather than reskin the thing in its entirety, it's easier here to work in Python.

A couple disclaimers: my group works primarily in geospatial data, and principally in LiDAR and multispectral imagery.

The coarse division I see between R and Python is that if you come from a research/academic background (non-engineering), you probably learned to program in R. If you were an engineer, you probably learned MATLAB. If you are self-taught (Coursera, YouTube), you probably learned in Python.

R libraries are generally more geared towards academic research, and specifically, working within existing frameworks (handling geospatial data as geospatial data rather than turning it into numpy arrays). Working in Python, there is far more re-invention of the wheel, and it's always a pain in the ass to get things back into the structures they came in as.

Python has huge utility and is an important tool for certain work. But it's really, really not faster than R (it definitely used to be, but this isn't the case any more).

R has better support for more scientific programming than python.

> My point behind the keras/TF comment is that the libraries have front ends in both Python and R, so it's dealer's choice which one you work in (since the backends of both are identical).

Not as a point of argument, just additional information: R's support for keras and TF is a wrapper around the Python interface to those libraries.

> Python is definitely better for matrices (because of NumPy)

How so?

numpy is significantly faster and arguably more usable (e.g. broadcasting) than anything in R, and only recently has there been progress in more efficient matrix manipulation in R like rray[0], a wrapper for xtensor.

[0]https://github.com/r-lib/rray
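For anyone who hasn't run into broadcasting, a small sketch of what it means in numpy. The array values are arbitrary. Arrays of different shapes are combined by stretching size-1 (or missing) dimensions, without writing loops or explicitly replicating data.

```python
import numpy as np

m = np.arange(6, dtype=float).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Shape (3,) is broadcast across both rows of the (2, 3) matrix.
col_means = m.mean(axis=0)                   # [1.5, 2.5, 3.5]
centered = m - col_means                     # [[-1.5, -1.5, -1.5], [1.5, 1.5, 1.5]]

# Shape (2, 1) is broadcast across the three columns.
row_scale = np.array([[1.0], [10.0]])
scaled = m * row_scale                       # [[0, 1, 2], [30, 40, 50]]

print(centered)
print(scaled)
```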

As far as I can tell, R uses BLAS for matrix operations, and Python probably does the same, so in terms of efficiency I wouldn't expect a big difference between the two.

Both R and numpy use BLAS, and if both are linked to the same library, say OpenBLAS or Intel MKL, then performance is in fact almost identical for expensive operations like matrix multiplication. (R also ships with its own internal BLAS implementation, which is reliable but not very fast, and I believe is still single threaded, so the first thing you should do if you are using R and care about performance is to swap it out.)

For more sophisticated linear algebra algorithms, such as SVD, both will typically use LAPACK, and again, both will exhibit essentially identical performance.

There is one important difference though: when R is compiled for 64-bit machines, it can only use 64-bit floats! Meanwhile, numpy supports 32-bit and even (through software emulation) 16-bit floats. This can halve memory usage, which in turn reduces cache misses, which results in a significant speedup in cases where 64 bits of precision are not needed.
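The memory half of that claim is easy to check in numpy. A quick sketch, with an arbitrary array size; it only measures bytes and says nothing about actual speed, which would need a real benchmark.

```python
import numpy as np

n = 1000
a64 = np.ones((n, n), dtype=np.float64)
a32 = a64.astype(np.float32)
a16 = a64.astype(np.float16)

# Bytes used by the raw array data.
print(a64.nbytes)  # 8000000
print(a32.nbytes)  # 4000000
print(a16.nbytes)  # 2000000

# Operations on float32 inputs stay in float32 rather than upcasting.
product = a32 @ a32
print(product.dtype)  # float32
```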

This is really interesting! I had always just claimed that Python was faster (see the benchmark I linked above [1]) based on personal experience. I wonder if this internal implementation has something to do with it...

[1] https://julialang.org/benchmarks/

Speed. [1]

See the yellow benchmark (matrix multiply). I suspect it's memory-related.

[1] https://julialang.org/benchmarks/

Theodore Sturgeon update: 90% of all programming books are geared at beginners.

Is tidyverse really the only option? I'm a big fan of data.table + magrittr as a very powerful data munging combo.

This is what I've got my crew running. Tidyverse is basically worthless once you hit a certain scale of data. If you've got datatables representing.

List functionality within datatable is blazingly fast. Faster than anything else I've seen in python or R.

magrittr is part of the tidyverse, but I agree that data.table is a comparably powerful and sometimes faster option versus dplyr.

magrittr existed before the tidyverse and can be used standalone perfectly fine.

In all benchmarks I've seen, data.table is faster than dplyr on all tasks. Curious to see other results.

At the scale of what I'm doing the benchmarks don't sway me, but I do like the syntax of data.table - it feels a bit like relational algebra.

So then I would assume you must be working with tables of less than 1000 rows, because that's pretty much the only case where it doesn't matter. At anything more than 1k rows, the differences are substantial.

Hundreds of rows is about usual for me. I do analysis on clinical studies with human participants. Nothing too tricky, most of my munging runs in effectively zero time.

I was going to make this point, but yeah. The only thing I think people have a hard time with is how you do operations in data.table. If you are coming from plyr/dplyr, the transition can be difficult. However, I've found that the more I do, the more I prefer it, in spite of the fact that the main reason I use dt over tidy is the phenomenal performance gain.

That's pretty much the only group of people who will use this though. Those that are serious about it or have some background won't really look at another book on data science and will probably do the necessary research themselves.

Any suggestions for intermediate-advanced level articles outside of distill?