| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by lordgroff 1922 days ago

While I like tidyverse, I honestly have trouble using it most of the time, knowing how much slower it is. It becomes addictive, where I have trouble accepting minutes over seconds many operations take in DT.

As for the speed, Matt Dowle definitely strikes me as a person that optimizes for speed. Then of course, there is the fact that everything is in place, and parallelization is at this point baked in. It's also mature unlike a lot of other alternatives and has never lost sight of speed. Note, for example, how in pandas, in place operations have become very much discouraged over time, and are often not actually in place anyways.

Note back to tidyverse. Why do you think tidyverse breaks with DT. If you enjoy the pipe, write out DT to a function (e.g. dt) that takes a data frame, and ensure that any operations you need specific to DT return a reference to your data table object and off you go with something like this:

  df %>%
    dt(, x := y + z) %>%
    unique() %>%
    merge(z, by = "x") %>%
    dt(x < a)

There, it looks like tidyverse, but way faster.

1 comments

orhmeh09 1922 days ago

There are almost 200 magrittr-related issues in GitHub and I have had a bad time pairing data.table with tidyverse packages (and others because of e.g. IDate). DT code is like line noise to me, but I prefer to write things in it directly — the only reason I use it is because it’s fast, and guessing how it’s going to interact with tidy stuff and NSE (especially when using in place methods) is counterproductive to that goal.

link

lordgroff 1922 days ago

19 of those are open and most of them not terribly relevant. Considering the ubiquity of the package, I'd say the total number of issues is shockingly low.

As for NSE, DT uses NSE as well, but differently of course. I guess it all comes to what we "mean" by tidyverse. If we mean integration with the cast majority of packages, then yeah, it will work, but of course certain things are out of bounds. If you just want to use data table like dplyr, then tidytable is your ticket.

I'd argue the beast thing to do though is to just get used to the syntax. Data table looks like line noise until you're really comfortable with it, then the terse syntax comes across as really expressive and short. I've come to like writing data table in locally scoped blocks, pretty much without the pipe, and using mostly vanilla R (aside from data table). I think it looks pretty good actually, and I think less line noise than pandas with its endless lambda lambda lambda lambda.

link

orhmeh09 1921 days ago

I counted closed issues intentionally — this isn’t some one-off matter that’s easily resolved, as clearly hundreds of people have struggled with these issues over the years, and this should not be dismissed.

It’s far better aesthetically than Python. It’s just too different from the other libraries I use to disrupt my cognitive flow. You might say there are too many ways to do something, too, which makes it that much harder to figure out what code written by someone else (or myself three months ago) does. I also severely dislike seeing calls to eval or unevaluated code within the main body of my program —- quoted code looks awful and I trust it less.

It’d be interesting to see DT repackaged as its own tool with its own syntax. As it stands, it’s constrained by R and it has no comparable ecosystem to the tidyverse around it.

link