by wjn0 2557 days ago
I agree for the most part, but R does have a few things beyond the tidyverse: built-in dataframe support, lots of domain-specific packages, more consistent interfaces for basic statistics and machine learning models, etc. Python is definitely better for matrices (because of NumPy) and anything involving custom gradient descent methods (because of TensorFlow).

I think 90% of data science content is for beginners because anything more advanced isn't best described as data science. As soon as you get beyond the initial stages of data analysis (cleaning and processing data), you're doing something best described as some other word (statistics, machine learning, etc.) - although, granted, there isn't much content in these areas if you don't know _exactly_ what you're looking for.


Even if you ignore the tidyverse, the example code for "roll your own linear regressions by hand" uses the R6 object system, which is... not even one of the two popular object systems for R (which are S3 and S4). No beginner needs to learn how to write classes in R.
`no beginner needs to learn how to write classes in R`. a) Using classes properly is great for R users of all levels; b) a major reason that classes are not widely used (by beginners) is that S3/S4 are not easy to follow. R6 provides a natural and clear way to understand and write classes (especially for beginners).
Using classes at all is unnecessary for most R users. To the extent that paradigms matter to the average R user at all (which is: not much), R is really a functional-first language. The idiomatic way to deal with the things you would use classes for is to use functions and closures. There are people who need objects in R, which is why R has object systems available, but it is of no help to a beginner to know them -- it doesn't help them to interact with the code they are going to see, and they don't have the background to understand why you would use classes instead of functions.
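
To make the closure point concrete, here is a minimal sketch of state kept in a closure rather than in a class. It is written in Python rather than R purely for illustration, and `make_running_mean` is a hypothetical name, not something from the thread or from any package.

```python
def make_running_mean():
    """Return a function that tracks a running mean without defining a class."""
    count = 0
    total = 0.0

    def update(x):
        # The enclosing function's variables hold the state that a class
        # would otherwise keep in attributes.
        nonlocal count, total
        count += 1
        total += x
        return total / count

    return update


running_mean = make_running_mean()
running_mean(2.0)
running_mean(4.0)
latest = running_mean(6.0)  # mean of 2, 4 and 6
```

Each call to `make_running_mean()` returns an independent tracker, which is roughly what instantiating a small class would give you.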
> built-in dataframe support

Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent APIs and behaviour.

> lots of domain-specific packages

That's true.

> more consistent interfaces for basic statistics and machine learning models

Can't disagree more - there is no one go-to library for ML in R (like sklearn in Python) and each package has its own strange interface and implementation.

> Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent APIs and behaviour.

I've been fortunate to only work on projects that use built-in data frames, and I have never encountered tibble or data.table in the wild.

> there is no one go-to library for ML in R (like sklearn in Python) and each package has its own strange interface and implementation.

I still disagree here. One example is the unified interface for generalized linear models. The vast majority of classifiers (RF, SVM, etc.) also have similar or identical interfaces, and there is the unified `predict` interface on top of that. Granted, `sklearn` has a consistent API as well.

That said, some of this is just a personal preference for the vaguely functional interface in R. The object-orientedness in Python feels a little forced for some tasks in `sklearn`.

> because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it.

To make what I think your point is more explicit, people build their own things in R because R must maintain compatibility with S. So by and large, changes happen in packages and not the base language. This does lead to a proliferation of solutions for the same kinds of problems.

You mean like keras? Or tensorflow? Or base random forest? You know, like the original Breiman implementation.

Python has utility. But R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.

I run a machine learning shop. Right now all of the training, application, and data management is handled via R. R is simply superior in too many ways for us to be bothered with python for the scale of work we are doing.

Since we're moving some big applications to keras/TF, we do use python and will be using more in the future. However, for almost all data management, munging, movement, visualization, and reporting, it's an R world.

> You mean like keras? Or tensorflow? Or base random forest? You know, like the original Breiman implementation.

> ...

> Since we're moving some big applications to keras/ TF we do use python and will be using more in the future.

Not sure if I misunderstood, or you're contradicting yourself there.

> R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.

I not only disagree, but I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.

> > R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.

> I not only disagree, but I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.

I partially agree with you here. I'm extremely careful about what non-standard packages I use in R. Code quality varies wildly outside of these, likewise for documentation. But outside of neural networks, I've never found a package in Python that I felt better about in terms of code quality or documentation than its equivalent in R.

My point behind the keras/TF comment is that the libraries have front ends in both python and R, so it's mix and match/dealer's choice on what you like to work in (since the backends of both are identical).

The primary reason for moving these to python is convenience and the community. Most new work is published in python. If we find a new or interesting model we want to implement, it's probably written in python. Rather than reskin the thing in its entirety, it's easier here to work in python.

A couple disclaimers: my group works primarily in geospatial data, and principally in LiDAR and multispectral imagery.

The coarse division I see between R and Python is that if you come from a research/academic (non-engineering) background, you probably learned to program in R. If you were an engineer, you probably learned MATLAB. If you are self-taught (Coursera, YouTube), you probably learned in python.

R libraries are generally more geared towards academic research, and specifically, working within existing frameworks (handling geospatial data as geospatial data rather than turning it into numpy arrays). Working in python, there is far more re-invention of the wheel, and it's always a pain in the ass to get things back into the structures they came in as.

Python has huge utility and is an important tool for certain work. But it's really, really not faster than R (it def used to be, but this isn't the case any more).

R has better support for more scientific programming than python.

> My point behind the keras/TF comment is that the libraries have front ends in both python and R, so it's mix and match/dealer's choice on what you like to work in (since the backends of both are identical).

Not as a point of argument, just additional information: R's support for keras and TF is a wrapper around the Python interface to those libraries.

> Python is definitely better for matrices (because of NumPy)

How so?

numpy is significantly faster and arguably more usable (e.g. broadcasting) than anything in R, and only recently has there been progress in more efficient matrix manipulation in R like rray[0], a wrapper for xtensor.

[0] https://github.com/r-lib/rray
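
For anyone unfamiliar with the broadcasting mentioned above, here is a small sketch of what it looks like in numpy. The array names are made up for illustration, and it only shows the numpy side, not a comparison with R.

```python
import numpy as np

# A 3x4 matrix whose rows are [0..3], [4..7], [8..11].
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)

# Per-column means have shape (4,). Broadcasting stretches that vector
# across all 3 rows, so no explicit loop or replication is needed.
col_means = matrix.mean(axis=0)
centered = matrix - col_means

# A (3, 1) column of per-row means broadcasts across the 4 columns instead.
row_means = matrix.mean(axis=1, keepdims=True)
row_centered = matrix - row_means
```

The shapes are aligned from the trailing dimension, and any dimension of size 1 (or a missing leading dimension) is stretched to match the other operand.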

As far as I can tell, R uses BLAS for matrix operations, and Python probably does the same, so in terms of efficiency I wouldn't expect a big difference between the two.
Both R and numpy use BLAS, and if both are linked to the same library, say OpenBLAS or Intel MKL, then performance is in fact almost identical for expensive operations like matrix multiplication. (R also ships with its own internal BLAS implementation, which is reliable but not very fast, and I believe is still single threaded, so the first thing you should do if you are using R and care about performance is to swap it out.)

For more sophisticated linear algebra algorithms, such as SVD, both will typically use LAPACK, and again, both will exhibit essentially identical performance.

There is one important difference though: R's numeric vectors are always 64-bit floats, while numpy supports 32-bit and even (through software emulation) 16-bit floats. Using a narrower type can halve memory usage, which in turn reduces cache misses and can result in a significant speedup in cases where 64 bits of precision are not needed.
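
The memory side of that is easy to check in numpy. A minimal sketch, with an arbitrary array size chosen for illustration; it shows storage and precision only and does not measure speed:

```python
import numpy as np

n = 1_000_000  # arbitrary size for illustration
x64 = np.ones(n, dtype=np.float64)
x32 = x64.astype(np.float32)
x16 = x64.astype(np.float16)

# Storage scales with the element width: 8, 4 and 2 bytes per element.
bytes64 = x64.nbytes
bytes32 = x32.nbytes
bytes16 = x16.nbytes

# The cost is precision: machine epsilon grows as the type narrows.
eps64 = np.finfo(np.float64).eps
eps32 = np.finfo(np.float32).eps
eps16 = np.finfo(np.float16).eps
```

Whether the smaller footprint turns into a real speedup depends on the workload and the hardware, so it is worth timing your own code before committing to a narrower type.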

This is really interesting! I had always just claimed that Python was faster (see the benchmark I linked above [1]) based on personal experience. I wonder if this internal implementation has something to do with it...

[1] https://julialang.org/benchmarks/

Speed. [1]

See the yellow benchmark (matrix multiply). I suspect it's memory-related.

[1] https://julialang.org/benchmarks/