by aorist 1247 days ago
The ergonomics of grouping and aggregation in R are really much better because libraries can make use of its non-standard evaluation[^0] (which in other cases also makes the language a nightmare to deal with).

Compare:

    pd_df.groupby(['date'])['failure'].count() #  pandas
    pl_df.groupby(pl.col('date')).agg(pl.count('failure')) #  polars
    dt[, .N, date] # R data.table

In both Pandas and Polars, the specification of the date has to be a string inside a list or method call, but in R it can be a bare token.

[^0]: http://adv-r.had.co.nz/Computing-on-the-language.html
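
A rough sketch of why the bare token isn't an option in Python: arguments are evaluated before the function is called, so an undefined name fails before any library code gets to look at it. The `count_by` helper and the sample rows are made up for illustration and say nothing about how pandas or Polars work internally.

```python
# Toy illustration only; `count_by` and `rows` are hypothetical.
rows = [
    {"date": "2023-01-01", "failure": 1},
    {"date": "2023-01-01", "failure": 0},
    {"date": "2023-01-02", "failure": 1},
]


def count_by(rows, key):
    """Count rows per distinct value of `key`."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


# A string works because it is an ordinary value passed to the function.
by_string = count_by(rows, "date")

# A bare name is looked up before count_by is ever called, so this fails
# unless a variable named `date` happens to exist in scope.
try:
    count_by(rows, date)
    bare_name_error = None
except NameError as exc:
    bare_name_error = exc
```

In R the callee can capture the unevaluated expression instead, which is what lets `dt[, .N, date]` refer to the column directly.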


Having migrated 1000s of lines of legacy r, all I can say is… yes, but then you have to use r. (:

R is not a replacement for pandas.

R is its own special little painful ecosystem, loved by people who don’t have to maintain the code they write.

You can complain all you like about pandas, but at the end of the day, it’s python. Python tooling works with it. The python ecosystem works with it.

It’s not without faults, but at least you’ll have a community of people to help when things go wrong.

R, not so much.

(Spoken as a jaded developer who had to support r on databricks, which is deep in the hell of “well, it’s not really a tier one language” even from their support team)

Having written tens of thousands of lines of R code that I've been maintaining and using for production pipelines for 9 years...

Sounds like you worked with (or wrote) really bad R code.

The point I’m making isn’t that you can write bad r code; you can write bad code in any language.

…the point I’m making is that when you already have bad r code, it’s a massive pain in the ass.

Bad python is terrible too, but lots of people know python, and it’s easy to find help to unduck bad python code and turn it into maintainable code.

That has not been my experience with r. Ever. At any organisation.

Your experience may vary. (:

> when you already have bad r code, it’s a massive pain in the ass.

Do you think it’s because R code tends to be written by statisticians and stats-adjacent domain experts who don’t necessarily know how to write clean code while Python code has at least some input from actual programmers? Or is this really down to the language itself?

Can you elaborate please? This hasn’t been my experience whatsoever, curious what the issues have been?

I'm not sure I can say more than I already did, but I'll try to be more specific:

The R community is categorically smaller than the python community. The support on community forums is harder to get, or non-existent (eg. with databricks).

Are you saying you've worked in places where it's easier to find people who are familiar with R to help work on a project than it is to find people who are familiar with python?

That you've found it's easier to hire people who are familiar with R than it is to hire people who are familiar with python?

I... all I can say is that has not been my experience.

The places I've worked, of all the developers a small handful of people use R, and a small subset of those are good at it.

I don't hate R. I don't think it's a bad language. I'm saying: It's harder to support, because it's obscure, rarely used by most developers, and the people who use it and know it well are rare and expensive.

As a data engineer, expected to support workflows in production: Don't use obscure crap and expect other people to support it. Not R. Not rust. Not pony.

Using R on databricks, specifically, is a) unsupported^, and b) obscure and c) buggy. Don't do it.

(^ sorry, it's a 'tier 2 language' if you speak to a DB representative, which means bugs don't count and new features don't get support)

All I can say is that my experience has been that supporting python has been less painful; it's a simple known quantity, and it's easy to scale up a team to fix projects if you need to.

Thanks for sharing. Seems your issues are more with databricks than R, but certainly R is more obscure.

At least in my experience we’ve never had issues with people learning it on the job and far fewer software issues from eg versioning, dependencies, regression bugs. It just works, there’s rarely even a need for a venv.

I’d never expect it applied as a general purpose language like python though, typical projects are <1k lines of some specific data task, perhaps our use cases are just different

I think the difference is working in teams or with other people's code. Bad python is usually fairly readable, there is a sense of "pythonic" code that the language pushes you towards. R is the complete opposite, there are 50 different ways to do every simple thing, coupled with R users generally not having much sense of good code practices.

Maybe I'm just jaded because I've inherited a 100K+ line R codebase at my job written by a single person with no unit tests and about 3 lines of comments, and it's a completely miserable experience.

data.table syntax is indeed more concise but harder to parse later, which is more important in a collaborative environment.

dplyr syntax IMO is the best balance between clarity and nonredundancy as evident in pandas/Polars code.

While it's shorter, it seems more magical? How does it know to count `failure`?

It doesn't count `failure`, just the number of rows. But neither does the pandas version: `pd_df.groupby(['date'])['failure'].count()` and `pd_df.groupby(['date']).count()` are the same except the former returns a single `pd.Series` with the count and the latter produces a `pd.DataFrame` where each column has the same count (not super useful).

e.g.

    > iris.groupby('species').count()
                sepal_length  sepal_width  petal_length  petal_width
    species
    setosa                50           50            50           50
    versicolor            50           50            50           50
    virginica             50           50            50           50

vs.

    > iris.groupby('species')['sepal_length'].count()
    species
    setosa        50
    versicolor    50
    virginica     50

I believe `count` will only give you the number of non-null rows, so the numbers from the first command could differ by column if there were null values. You can also use the `size` method to get the total number of rows, and that will return a `pd.Series` with or without a column specifier.
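
The `count` versus `size` distinction is easy to check on a tiny frame. This is just a sketch with made-up data (one missing `failure` value), not the original poster's dataset:

```python
import pandas as pd

# Made-up data: the second row has a missing `failure` value.
pd_df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "failure": [1.0, None, 0.0],
})

# count() skips the missing value in the selected column.
non_null = pd_df.groupby("date")["failure"].count()

# size() counts every row in each group, nulls included.
all_rows = pd_df.groupby("date").size()
```

Here `non_null` reports 1 for both dates, while `all_rows` reports 2 for the first date, so `size` is the closer match to data.table's `.N`.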

R is easy and fun to use.

With R, it is also impossible to understand what is actually going on.

R is Perl for math

I would do the Pandas and Polars examples differently:

    pd_df['date'].value_counts() # pandas
    pl_df.select(pl.col('date').value_counts()) # polars

Note: we could also do it the Pandas way in Polars, but square bracket indexing in Polars is not recommended.
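
For the pandas half, a quick sanity check that `value_counts` gives the same per-date row counts as grouping and taking `size`. The data is made up, and the Polars line is left out here:

```python
import pandas as pd

# Made-up data for illustration.
pd_df = pd.DataFrame({"date": ["2023-01-01", "2023-01-01", "2023-01-02"]})

# Ordered by count, largest first, by default.
via_value_counts = pd_df["date"].value_counts()

# Ordered by the group key by default.
via_groupby = pd_df.groupby("date").size()
```

The two results hold the same counts; only the default ordering differs.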