by pickdenis 2360 days ago
I know this is a dead horse, but I think R seriously shot itself in the foot with its data structures[1]. I don't really see a solution for this, as fixing it would never be backward compatible. I'll always pick Python over R because the data structures actually make sense to me as a programmer (objects that look like lists, dicts, matrices, etc. or any combination of the above, and they all behave in very predictable ways). I think this puts off a lot of other people like me.

[1]: https://jamesmccaffrey.wordpress.com/2016/05/02/r-language-v...

6 comments

True, the default semantics of R's data structures are somewhat arcane (naturally, as they're based on S [1] from the 1970s). And the current support for e.g. 64bit integers leaves something to be desired.

But behind the scenes, R is just a Lisp with some data structures that are adapted to statistics and data science.

All base data structures are immutable by default. And the vector type, for example, is extremely performant, as it's just a thinly wrapped C array. In Python you need to reach for NumPy for anything similar, and you do feel some pain when converting between native Python types and NumPy types for various functions which support one or the other.

The data frame is immensely powerful, and it has excellent performance characteristics because it's built upon vectors. A list of objects, like you'd make in Python, is just a lot slower and more unwieldy to deal with, and it's much harder to write generalizable functions on top of.
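
A rough sketch of the layout difference described above, in plain Python with no libraries. The names `rows`, `columns`, `column_mean` and `row_mean` are made up for illustration, and a dict of lists is only an approximation of a data frame, not how R implements one:

```python
# Row-oriented: a list of records, as is common in plain Python.
rows = [
    {"name": "a", "height": 1.5, "weight": 10.0},
    {"name": "b", "height": 2.5, "weight": 20.0},
    {"name": "c", "height": 3.5, "weight": 30.0},
]

# Column-oriented: one homogeneous sequence per column,
# roughly the way a data frame is laid out.
columns = {
    "name": ["a", "b", "c"],
    "height": [1.5, 2.5, 3.5],
    "weight": [10.0, 20.0, 30.0],
}


def column_mean(frame, column):
    """Generic helper: works on any numeric column of any such frame."""
    values = frame[column]
    return sum(values) / len(values)


def row_mean(records, field):
    """Row-oriented equivalent: has to pull the field out of every record."""
    return sum(record[field] for record in records) / len(records)


mean_height_cols = column_mean(columns, "height")
mean_height_rows = row_mean(rows, "height")
```

Both calls give the same answer, but in the column layout every column is already a single sequence, so a helper written for one column applies to all of them unchanged.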

Hadley Wickham's Tidyverse[2] is exactly an attempt to hide away the arcane details and create a modern, coherent and consistent language on top of R, keeping the power of all the great statistics R libraries. The fact that R behind the scenes is a Lisp, with support for macros, makes this possible. For doing data-transformations and statistics, I can't think of anything currently as powerful as CRAN + Tidyverse.

[1] https://en.wikipedia.org/wiki/S_(programming_language)

[2] https://www.tidyverse.org/

In typical Lisps a vector would be a one-dimensional array, which by default is not specialized to a particular data-type. So the most general data type would be the n-dimensional array and a vector would be a one-dimensional array. A matrix would be a two dimensional array. In Common Lisp one can also ask Lisp to generate a type-specific array (like a string, a bitvector, an array of single-floats, ...).

In R it's slightly different. The vector (being generally without dimensions) is the base data type and n-dimensional arrays are made of a vector and dimensions. A matrix is then a 2d array. Also vectors/arrays are by default type-specific.
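
To make the "vector plus dimensions" idea concrete, here is a small Python sketch. `DimVector` and its `at` method are hypothetical names invented for this example, and the indices are 0-based, unlike R's 1-based ones. It only mimics the idea of a flat vector carrying a dimension attribute, stored in column-major order:

```python
class DimVector:
    """Hypothetical illustration: a flat list plus an optional `dim` attribute."""

    def __init__(self, data, dim=None):
        self.data = list(data)
        self.dim = tuple(dim) if dim is not None else None
        if self.dim is not None:
            size = 1
            for extent in self.dim:
                size *= extent
            if size != len(self.data):
                raise ValueError("dims do not match length of data")

    def at(self, *index):
        """Look up an element by 0-based indices, column-major as in R."""
        if self.dim is None:
            (i,) = index
            return self.data[i]
        offset, stride = 0, 1
        for i, extent in zip(index, self.dim):
            offset += i * stride
            stride *= extent
        return self.data[offset]


v = DimVector([1, 2, 3, 4, 5, 6])              # plain vector, no dimensions
m = DimVector([1, 2, 3, 4, 5, 6], dim=(2, 3))  # same data viewed as a 2x3 matrix
```

Here `m.at(0, 1)` returns 3, which matches what `matrix(1:6, nrow=2)[1, 2]` gives in R; the underlying data of `v` and `m` is identical, only the dimension attribute differs.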

> support for macros

From what I've seen, R does not support macros, but rather functions which can retrieve/generate code at runtime. That's an early mechanism which was replaced by macros in Lisp. Macros in Lisp are source code transformers and can be compiled; thus they are not a runtime mechanism like in R or in earlier Lisps with so-called FEXPRs.

This 5-minute video by Wickham was eye-opening for me regarding the lispiness of R.

https://youtu.be/nERXS3ssntw

Modern Lisps don't use unquote/quote like that.

This looks more like FEXPRs from decades ago.

In 1962 the idea of macros was introduced: macros are source code transformers, which take source code and generate new source code. This can also be used in a compiled implementation, where macros translate the code before compiling.

FEXPRs are then functions which get arguments unevaluated and can decide at runtime which to evaluate and how.
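
A loose Python analogy of the distinction, since Python has neither FEXPRs nor Lisp macros. `fexpr_if` and `macro_unless` are hypothetical helpers, and passing code around as source strings is an assumption made to keep the sketch short:

```python
def fexpr_if(env, test_src, then_src, else_src):
    """FEXPR-style: receives its arguments unevaluated (as source text)
    and decides at run time which of them to evaluate."""
    if eval(test_src, {}, env):
        return eval(then_src, {}, env)
    return eval(else_src, {}, env)


def macro_unless(test_src, body_src):
    """Macro-style: takes source code and returns new source code.
    Nothing is evaluated here; the result can be compiled ahead of time."""
    return f"({body_src}) if not ({test_src}) else None"


env = {"x": 0}
# The else branch would raise ZeroDivisionError if it were evaluated eagerly.
fexpr_result = fexpr_if(env, "x == 0", "'zero'", "1 / x")

# Expansion happens first, then the expanded code is compiled once.
expanded = macro_unless("x == 0", "1 / x")
code = compile(expanded, "<expanded>", "eval")
macro_result = eval(code, {}, {"x": 4})
```

The FEXPR-like function has to exist at run time and inspects its arguments on every call, whereas the macro-like function is finished once `expanded` has been produced, which is why that style fits a compiler.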

> 64bit integers leaves something to be desired

This is something I wish there was more progress on. A serious limitation in some contexts.

From that link:

> A vector is what is called an array in all other programming languages except R

Vectors are called vectors in several "lispy" languages: Common Lisp, Scheme, Clojure...

> An array with two dimensions is (almost) the same as a matrix.

I think it's the same, not "almost" the same. At least in the current version of R:

  > class(array(1, c(2,3)))
  [1] "matrix"
  > identical(array(1, c(2,3)), matrix(1, nrow=2, ncol=3))
  [1] TRUE

In 4.0 there will be a change and the class of a matrix will be both "matrix" and "array", but I think the fact that there is no difference between a 2-dimensional array and a matrix remains.

I look at it and don't see what the problem is. I think it is in fact a very sensible progression of structures.

It's based pretty directly on S, which was designed in the mid-1970s. Yeah, it has very rough edges here, but it's hard to argue that they should have foreseen the future back then.

That said, the real value in R seems to be the libraries. Has anyone looked at a shim that could make those libraries available to Python in a reasonably natural way? If that existed, the R language itself could be allowed to finally rest in peace.

There is something to be said for building a programming language with a specific task in mind.

Being vector-aware and having data frame support built in makes R much more elegant to me than Python with its add-on libraries. It's like Scala building on top of Java and trying to add an Actor paradigm, versus Erlang, which was built around concurrency from the get-go and chose Actors as its main concurrency paradigm. You can see the same thing in other languages: PHP and C++ let you do OOP, but it's an afterthought compared to Ruby or Python.

I'm not unsympathetic to this idea, but after learning my 87th domain-specific language that couldn't be bothered with reasonable control structures, or even solid error checking, it's really starting to wear.

Statisticians aren't that interested in writing a really good programming language. And why should they be? They have better things to do. The trick is to not take on responsibility for something you don't care about, if you can help it.

You can embed an R interpreter in any language with a C interface. That said, most of the complaints I see about R reflect preferences and prior programming experience with newer programming languages. While there are things I don't like about R, it's a Scheme without s-expressions, and overall I like it.

There are various ways to call R from Python, or Python from R. They never end up being very idiomatic, which typically makes them a pain to work with.

The only thing you need to understand about R data structures is that everything is a vector, including scalars. You have atomic vectors and lists, which are a special kind of vector. Everything else is built on top of those.

This is an insight you gain very early in your R experience. It breaks down rather quickly. It's not that it isn't true; there are just too many details around the core concept.

It's possible if you provide a migration tool, something like Rust's `cargo fix`[1]. Apart from fixing small, obvious warnings, it can apply the migration from the Rust 2015 edition to the Rust 2018 edition[2]. Introducing a new R edition and a similar tool could help with this.

[1] https://github.com/rust-lang/rustfix

[2] https://doc.rust-lang.org/nightly/edition-guide/editions/tra...