by danielecook 1922 days ago
Pretty impressed with the data.table benchmarks. The syntax is a little weird and takes getting used to but once you have the basics it’s a great tool.
2 comments

I use it a lot but it really breaks the tidyverse, which is what makes using R actually enjoyable. Why aren’t these other libraries (not in R; I’m talking about the others in the benchmark) consistently as fast as data.table? Are the programmers of data.table just that much better?
While I like the tidyverse, I honestly have trouble using it most of the time, knowing how much slower it is. Speed becomes addictive: I have trouble accepting operations that take minutes when they take seconds in DT.

As for the speed, Matt Dowle definitely strikes me as a person who optimizes for speed. Then of course, there is the fact that everything is done in place, and parallelization is at this point baked in. It's also mature, unlike a lot of the alternatives, and has never lost sight of speed. Note, for example, how in pandas, in-place operations have become very much discouraged over time, and are often not actually in place anyway.

Now, back to the tidyverse. Why do you think the tidyverse breaks with DT? If you enjoy the pipe, wrap DT's `[` in a function (e.g. dt) that takes a data frame, ensure that any DT-specific operations you need return a reference to your data table object, and off you go with something like this:

  df %>%
    dt(, x := y + z) %>%
    unique() %>%
    merge(z, by = "x") %>%
    dt(x < a)
There, it looks like tidyverse, but way faster.
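For anyone who wants to try the wrapper idea, here is a minimal sketch of what such a `dt` helper could look like. The helper name, the toy columns, and the use of the native `|>` pipe (R 4.1 or later) instead of `%>%` are all assumptions for illustration, not something the comment above spells out. The data.table part is guarded so the snippet still runs where the package is not installed.

```r
# Hypothetical helper: forwards everything after the first argument to `[`,
# so data.table sees the original i / j expressions unevaluated.
dt <- function(data, ...) data[...]

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  df <- data.table(x = 1:4, y = c(10, 20, 30, 40))

  res <- df |>
    dt(, z := x + y) |>   # adds column z to df by reference
    dt(z > 30)            # row filter, returns a new data.table

  print(res)
}
```

One caveat with this kind of wrapper: variables that are not columns are looked up from the helper's frame rather than the caller's, so it is safest at top level or with literal values as used here.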
There are almost 200 magrittr-related issues in GitHub and I have had a bad time pairing data.table with tidyverse packages (and others because of e.g. IDate). DT code is like line noise to me, but I prefer to write things in it directly — the only reason I use it is because it’s fast, and guessing how it’s going to interact with tidy stuff and NSE (especially when using in place methods) is counterproductive to that goal.
19 of those are open and most of them not terribly relevant. Considering the ubiquity of the package, I'd say the total number of issues is shockingly low.

As for NSE, DT uses NSE as well, but differently of course. I guess it all comes down to what we "mean" by tidyverse. If we mean integration with the vast majority of packages, then yeah, it will work, but of course certain things are out of bounds. If you just want to use data table like dplyr, then tidytable is your ticket.

I'd argue the best thing to do though is to just get used to the syntax. Data table looks like line noise until you're really comfortable with it, then the terse syntax comes across as really expressive and short. I've come to like writing data table in locally scoped blocks, pretty much without the pipe, and using mostly vanilla R (aside from data table). I think it looks pretty good actually, and I think less line noise than pandas with its endless lambda lambda lambda lambda.

I counted closed issues intentionally — this isn’t some one-off matter that’s easily resolved, as clearly hundreds of people have struggled with these issues over the years, and this should not be dismissed.

It’s far better aesthetically than Python. It’s just so different from the other libraries I use that it disrupts my cognitive flow. You might say there are too many ways to do something, too, which makes it that much harder to figure out what code written by someone else (or myself three months ago) does. I also severely dislike seeing calls to eval or unevaluated code within the main body of my program. Quoted code looks awful and I trust it less.

It’d be interesting to see DT repackaged as its own tool with its own syntax. As it stands, it’s constrained by R and it has no comparable ecosystem to the tidyverse around it.

I dropped dplyr in favor of data.table and never looked back.

https://github.com/eddelbuettel/gsir-te

Vanilla R got a bad name but once you understand the fundamentals it's quite good, fewer footguns than used to be there, and I find it easier to reason about than tidyverse.
But the hexagons! Where are its hexagons?
There are dozens of us!
> It really breaks the tidyverse

You may want to look at tidyfst.

> Are the programmers of data.table just that much better?

Pixie dust, R's C API (and yes, they're just exceptionally good).

dplyr and related packages use the existing R data frame class. (A "tibble" is just a regular R data frame under the hood.) This means that it inherits all the performance characteristics of regular R data frames. data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm. There are other more subtle reasons for the differences, but that's the absolute simplest explanation.
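The copy-on-modify point is easy to see in a small experiment. The sketch below uses made-up column names and helper functions: a base data frame changed inside a function leaves the caller's object untouched, while data.table's `:=` updates the caller's table in place. The data.table half is guarded in case the package is not installed.

```r
# Base R: the function works on its own copy, so `orig` is unchanged.
add_total <- function(df) {
  df$total <- df$x + df$y
  df
}

orig <- data.frame(x = 1:3, y = 4:6)
modified <- add_total(orig)

names(orig)      # "x" "y"
names(modified)  # "x" "y" "total"

# data.table: `:=` modifies the table by reference, even inside a function.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  DT <- as.data.table(orig)
  add_total_dt <- function(d) d[, total := x + y]
  add_total_dt(DT)

  names(DT)      # "x" "y" "total"
}
```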

Supposedly you can use data.tables with dplyr, but I haven't experimented with it in depth.

> data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm.

This is totally false. data.table inherits from data.frame. Sure, it has some extra attributes that a tibble doesn’t but the way classing works in R is so absurdly lightweight, that’s meaningless in comparison. Both tibble and data.table are data.frames at their core which are just lists of equal length vectors. You can pass a data.table wherever you pass a data.frame.
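To make the "lightweight classing" point concrete, here is a rough base R sketch. The class name `my_table` is invented for the example; it only shows that prepending a class to a data frame keeps it usable anywhere a data.frame (or a plain list of columns) is expected.

```r
df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))

typeof(df)   # "list": a data frame is a list of equal-length column vectors
length(df)   # 2, the number of columns

# Hypothetical subclass: just one extra entry in the class attribute.
tbl <- df
class(tbl) <- c("my_table", class(df))

is.data.frame(tbl)            # TRUE
inherits(tbl, "data.frame")   # TRUE

# Code written for lists or data frames keeps working on the subclass.
col_means <- vapply(tbl, mean, numeric(1))

# The same check against data.table itself, if it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(inherits(data.table::data.table(a = 1), "data.frame"))
}
```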

Thank you for the correction. I knew that tibbles were essentially just data frames with an extra class attribute, but for some reason I didn't realize this was also true of data.table. I think I assumed that data.table's reference semantics couldn't be implemented on top of the existing data frame class, but I guess I'm wrong about that. Unfortunately it's too late for me to edit my original comment.
Tibbles are not just data frames with an extra class attribute. For one, they don't have row names. For another, consider this example, demonstrating how treating tibbles as data frames can be dangerous:

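    library(tibble)

    df_iris <- iris
    tb_iris <- as_tibble(iris)

    # x[, colname] returns a vector for a data.frame but a one-column tibble for a tibble
    nunique <- function(x, colname) length(unique(x[, colname]))

    nunique(df_iris, "Species")
    #> [1] 3

    nunique(tb_iris, "Species")
    #> [1] 1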
The R-package-devel mailing list had a long discussion about this too: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...
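The difference seems to come down to whether single-column `[` subsetting drops to a vector. As a sketch, the same surprise can be reproduced in base R with `drop = FALSE`, and `[[` gives one way to write the helper so it behaves the same for both. The function names below are invented for the example.

```r
# Fragile: depends on whether `[` drops a single column to a vector.
n_unique_fragile <- function(x, colname) length(unique(x[, colname]))

# More robust: `[[` always extracts the column itself.
n_unique_robust <- function(x, colname) length(unique(x[[colname]]))

# A one-column data frame, which is what tibble-style subsetting returns.
one_col <- iris[, "Species", drop = FALSE]

n_unique_fragile(iris, "Species")   # 3: counts distinct values of a vector
length(unique(one_col))             # 1: counts columns of a data frame

n_unique_robust(iris, "Species")     # 3
n_unique_robust(one_col, "Species")  # 3

# Same check on a real tibble, if the package is installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  print(n_unique_robust(tibble::as_tibble(iris), "Species"))
}
```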
Ok, fine, to be more precise, tibbles and data frames and data tables are all implemented as R lists whose elements are vectors which form the columns of the table. And also `is.data.frame` currently returns TRUE for all of them, whether or not that is ultimately correct.
dtplyr, the dplyr backend for data table, is still IMHO not great, and will often break in subtle and not so subtle ways. Tidytable is, I think, a much more interesting implementation, and gets close to the same speeds.
Hmm, this looks very interesting! I've ended up preferring dplyr for its expressiveness in spite of the speed difference, so this might be a nice compromise for when dplyr gets too slow.
Oh, I know that, I use it daily and I’ve read some of its source code. I’m just astonished that the best-performing data frame library in the world is developed in R and it outperforms engines written with million/billion dollar companies behind them.
data.table is written primarily in C. But R happens to have a very good package system and a very good interface to C code.

And Matt Dowle has bled for that C code.

I feel like some of it is to do with the way R's generics work - being lisp-based and making use of promises. It allows for nice syntax / code while interfacing the C backend.
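As a rough illustration of the promises point, here is a base R sketch of a filter helper that captures its argument unevaluated and evaluates it against the columns of a data frame. The function name and toy data are made up, and this is not how data.table itself is implemented; it only shows the language feature that makes syntax like `DT[x > a]` possible.

```r
# Hypothetical helper: `condition` arrives as a promise, so substitute()
# can recover the expression the caller typed instead of its value.
filter_rows <- function(df, condition) {
  expr <- substitute(condition)
  # Columns of df are looked up first, then the caller's variables.
  keep <- eval(expr, df, parent.frame())
  df[keep, , drop = FALSE]
}

df <- data.frame(x = 1:5, y = letters[1:5])
threshold <- 3

big <- filter_rows(df, x > threshold)
print(big)
```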
Me too: I've tended to let the database do a lot of heavy lifting before I bring data in. Maybe I don't actually need to do that.
There’s really no harm in doing that, and it’s still a pretty good idea.

I generally try to get my data sources as far as possible with the database, then leave framework/language-specific things to the last step. If nothing else, that means someone else picking up your dataset in a different language/framework toolset doesn’t need to pick up yours as a dependency, and you’re not spending time re-implementing what a database can already do (and can do more portably).

The only downside to letting the database do some of the pre-processing is that I don't have a full raw data set to work with within either R or Python. If I decide I need an existing measure aggregated up to a different level, or a new measure, I've got to go back to the database and then bring in an additional query. So I have less flexibility within the R or Python environment. But you make a good point: there are trade-offs either way, and keeping the dataset as something like a materialized view on the database makes it a little more open to others' usage.