by aorist 1247 days ago

The ergonomics of grouping and aggregation in R are really much better because libraries can make use of its non-standard evaluation[^0] (which in other cases also makes the language a nightmare to deal with).

Compare:

    pd_df.groupby(['date'])['failure'].count() #  pandas
    pl_df.groupby(pl.col('date')).agg(pl.count('failure')) #  polars
    dt[, .N, date] # R data.table

In both Pandas and Polars, the specification of the date has to be a string inside a list or method call, but in R it can be a bare token.

[^0]: http://adv-r.had.co.nz/Computing-on-the-language.html

5 comments

wokwokwok 1247 days ago

Having migrated 1000s of lines of legacy r, all I can say is… yes, but then you have to use r. (:

R is not a replacement for pandas.

R is its own special little painful ecosystem, loved by people who don’t have to maintain the code they write.

You can complain all you like about pandas, but at the end of the day, it’s python. Python tooling works with it. The python ecosystem works with it.

It’s not without faults, but at least you’ll have a community of people to help when things go wrong.

R, not so much.

(Spoken as a jaded developer who had to support r on databricks, which is deep in the hell of “well, it’s not really a tier one language” even from their support team)

jasonpbecker 1247 days ago

Having written tens of thousands of lines of R code that I've been maintaining and using for production pipelines for 9 years...

Sounds like you worked with (or wrote) really bad R code.

wokwokwok 1246 days ago

The point I’m making isn’t that you can write bad r code; you can write bad code in any language.

…the point I’m making is that when you already have bad r code, it’s a massive pain in the ass.

Bad python is terrible too, but lots of people know python, and it’s easy to find help to unduck bad python code and turn it into maintainable code.

That has not been my experience with r. Ever. At any organisation.

Your experience may vary. (:

nequo 1246 days ago

> when you already have bad r code, it’s a massive pain in the ass.

Do you think it’s because R code tends to be written by statisticians and stats-adjacent domain experts who don’t necessarily know how to write clean code while Python code has at least some input from actual programmers? Or is this really down to the language itself?

bovinejoni 1246 days ago

Can you elaborate please? This hasn’t been my experience whatsoever, curious what the issues have been

wokwokwok 1246 days ago

? I'm not sure I can say more than I already did, but I'll try to be more specific:

The R community is categorically smaller than the python community. The support on community forums is harder to get, or non-existent (eg. with databricks).

Are you saying you've worked in places where it's easier to find people who are familiar with R to help work on a project than it is to find people who are familiar with python?

That you've found it's easier to hire people who are familiar with R than it is to hire people who are familiar with python?

I... all I can say is that has not been my experience.

The places I've worked, of all the developers a small handful of people use R, and a small subset of those are good at it.

I don't hate R. I don't think it's a bad language. I'm saying: It's harder to support, because it's obscure, rarely used by most developers, and the people who use it and know it well are rare and expensive.

As a data engineer, expected to support workflows in production: Don't use obscure crap and expect other people to support it. Not R. Not rust. Not pony.

Using R on databricks, specifically, is a) unsupported^, and b) obscure and c) buggy. Don't do it.

(^ sorry, it's a 'tier 2 language' if you speak to DB representative, which means bugs don't count and new features don't get support)

All I can say is that my experience has been that supporting python has been less painful; it's a simple known quantity, and it's easy to scale up a team to fix projects if you need to.

bovinejoni 1246 days ago

Thanks for sharing. Seems your issues are more with databricks than R, but certainly R is more obscure.

At least in my experience we’ve never had issues with people learning it on the job and far fewer software issues from eg versioning, dependencies, regression bugs. It just works, there’s rarely even a need for a venv.

I’d never expect it applied as a general purpose language like python though, typical projects are <1k lines of some specific data task, perhaps our use cases are just different

_Wintermute 1246 days ago

I think the difference is working in teams or with other people's code. Bad python is usually fairly readable, there is a sense of "pythonic" code that the language pushes you towards. R is the complete opposite, there are 50 different ways to do every simple thing, coupled with R users generally not having much sense of good code practices.

Maybe I'm just jaded because I've inherited a 100K+ line R codebase at my job written by a single person with no unit tests and about 3 lines of comments, and it's a completely miserable experience.

minimaxir 1247 days ago

data.table syntax is indeed more concise but harder to parse later, which is more important in a collaborative environment.

dplyr syntax IMO is the best balance between clarity and avoiding the redundancy evident in pandas/polars code.

barumrho 1247 days ago

While it's shorter, it seems more magical? How does it know to count `failure`?

aorist 1247 days ago

It doesn't count `failure` — just the number of rows. But neither does the pandas version: `pd_df.groupby(['date'])['failure'].count()` and `pd_df.groupby(['date']).count()` are the same except the former returns a single `pd.Series` with the count and the latter produces a `pd.DataFrame` where each column has the same count (not super useful).

e.g.

    > iris.groupby('species').count()
                sepal_length  sepal_width  petal_length  petal_width
    species
    setosa                50           50            50           50
    versicolor            50           50            50           50
    virginica             50           50            50           50

vs.

    > iris.groupby('species')['sepal_length'].count()
    species
    setosa        50
    versicolor    50
    virginica     50

iamlemec 1247 days ago

I believe `count` only gives you the number of non-null values, so the numbers from the first command could differ by column if there were null values. You can also use the `size` method to get the total number of rows, and that will return a `pd.Series` with or without a column specifier.
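
A minimal sketch of that difference, assuming a small made-up frame (the column names follow the thread, the values are hypothetical) in which one `failure` entry is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three rows share a date, and one 'failure' value is NaN.
df = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-01-01", "2023-01-02"],
    "failure": [1.0, np.nan, 0.0, 1.0],
})

grouped = df.groupby("date")

non_null = grouped["failure"].count()    # counts non-null values only
n_rows = grouped.size()                  # counts every row in each group
n_rows_col = grouped["failure"].size()   # same counts with a column selected

print(non_null.to_dict())  # {'2023-01-01': 2, '2023-01-02': 1}
print(n_rows.to_dict())    # {'2023-01-01': 3, '2023-01-02': 1}
```

So `size` is probably the closer match to data.table's `.N`, since both count rows regardless of missing values.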

tryptophan 1247 days ago

R is easy and fun to use.

With R it is also impossible to understand what is actually going on.

miohtama 1246 days ago

R is Perl for math

mutant_self 1246 days ago

I would do the Pandas and Polars examples differently:

    pd_df['date'].value_counts() # pandas
    pl_df.select(pl.col('date').value_counts()) # polars

Note: we could also do it the Pandas way in Polars, but square bracket indexing in Polars is not recommended.
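
For the pandas side, a rough sketch of how `value_counts` relates to the groupby form, assuming a tiny hypothetical frame (Polars is left out here because its API has changed across versions):

```python
import pandas as pd

# Hypothetical data: two rows on one date, one row on another.
df = pd.DataFrame({
    "date": ["2023-01-02", "2023-01-01", "2023-01-01"],
    "failure": [1, 0, 1],
})

by_value_counts = df["date"].value_counts()  # ordered by count, descending
by_groupby = df.groupby("date").size()       # ordered by group key

# The counts agree once both results are put in the same order.
same = by_value_counts.sort_index().to_dict() == by_groupby.to_dict()
print(same)  # True
```

The main practical difference in this sketch is ordering: `value_counts` puts the most frequent date first, while the groupby result is sorted by date.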