Hacker News


	by danielecook 1922 days ago
	Pretty impressed with the data.table benchmarks. The syntax is a little weird and takes getting used to but once you have the basics it’s a great tool.

2 comments

orhmeh09 1922 days ago

I use it a lot, but it really breaks the tidyverse, which is what makes using R actually enjoyable. Why aren’t these other libraries (not in R; I’m talking about the others in the benchmark) consistently as fast as data.table? Are the programmers of data.table just that much better?

lordgroff 1922 days ago

While I like tidyverse, I honestly have trouble using it most of the time, knowing how much slower it is. Speed becomes addictive: I have trouble accepting operations that take minutes when the same ones take seconds in DT.

As for the speed, Matt Dowle definitely strikes me as a person that optimizes for speed. Then of course, there is the fact that everything is in place, and parallelization is at this point baked in. It's also mature unlike a lot of other alternatives and has never lost sight of speed. Note, for example, how in pandas, in place operations have become very much discouraged over time, and are often not actually in place anyways.
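To make the "in place" point concrete, here is a minimal sketch, assuming the data.table package is installed. The function names (`add_total_df`, `add_total_dt`) and the columns are made up for illustration.

```r
library(data.table)

# Base data.frame: the assignment happens on a local copy (copy-on-modify),
# so the caller's object is left untouched unless the result is reassigned.
add_total_df <- function(d) {
  d$total <- d$y + d$z
  d
}

# data.table: `:=` adds the column by reference, so the caller's object changes.
add_total_dt <- function(d) {
  d[, total := y + z]
  invisible(d)
}

df <- data.frame(y = 1:3, z = 4:6)
DT <- data.table(y = 1:3, z = 4:6)

invisible(add_total_df(df))  # result discarded on purpose
add_total_dt(DT)

"total" %in% names(df)  # FALSE
"total" %in% names(DT)  # TRUE
```

If you want to protect the original table, `data.table::copy(DT)` gives an independent copy to pass in instead.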

Now, back to tidyverse: why do you think tidyverse breaks with DT? If you enjoy the pipe, wrap DT's bracket syntax in a function (e.g. dt) that takes a data frame, ensure that any DT-specific operations you need return a reference to your data table object, and off you go with something like this:

  df %>%
    dt(, x := y + z) %>%
    unique() %>%
    merge(z, by = "x") %>%
    dt(x < a)

There, it looks like tidyverse, but way faster.

orhmeh09 1922 days ago

There are almost 200 magrittr-related issues in GitHub and I have had a bad time pairing data.table with tidyverse packages (and others because of e.g. IDate). DT code is like line noise to me, but I prefer to write things in it directly — the only reason I use it is because it’s fast, and guessing how it’s going to interact with tidy stuff and NSE (especially when using in place methods) is counterproductive to that goal.

lordgroff 1922 days ago

19 of those are open and most of them not terribly relevant. Considering the ubiquity of the package, I'd say the total number of issues is shockingly low.

As for NSE, DT uses NSE as well, but differently of course. I guess it all comes down to what we "mean" by tidyverse. If we mean integration with the vast majority of packages, then yeah, it will work, but of course certain things are out of bounds. If you just want to use data table like dplyr, then tidytable is your ticket.

I'd argue the best thing to do though is to just get used to the syntax. Data table looks like line noise until you're really comfortable with it, then the terse syntax comes across as really expressive and short. I've come to like writing data table in locally scoped blocks, pretty much without the pipe, and using mostly vanilla R (aside from data table). I think it looks pretty good actually, and I think less line noise than pandas with its endless lambda lambda lambda lambda.

orhmeh09 1921 days ago

I counted closed issues intentionally — this isn’t some one-off matter that’s easily resolved, as clearly hundreds of people have struggled with these issues over the years, and this should not be dismissed.

It’s far better aesthetically than Python. It’s just so different from the other libraries I use that it disrupts my cognitive flow. You might say there are too many ways to do something, too, which makes it that much harder to figure out what code written by someone else (or myself three months ago) does. I also severely dislike seeing calls to eval or unevaluated code within the main body of my program; quoted code looks awful and I trust it less.

It’d be interesting to see DT repackaged as its own tool with its own syntax. As it stands, it’s constrained by R and it has no comparable ecosystem to the tidyverse around it.

clircle 1922 days ago

I dropped dplyr in favor of data.table and never looked back.

https://github.com/eddelbuettel/gsir-te

jrumbut 1922 days ago

Vanilla R got a bad name but once you understand the fundamentals it's quite good, fewer footguns than used to be there, and I find it easier to reason about than tidyverse.

warlog 1922 days ago

But the hexagons! Where are its hexagons?

clircle 1922 days ago

There are dozens of us!

zhdc1 1922 days ago

> It really breaks the tidyverse

You may want to look at tidyfst.

> Are the programmers of data.table just that much better?

Pixie dust, R's C API (and yes, they're just exceptionally good).

rcthompson 1922 days ago

dplyr and related packages use the existing R data frame class. (A "tibble" is just a regular R data frame under the hood.) This means that it inherits all the performance characteristics of regular R data frames. data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm. There are other more subtle reasons for the differences, but that's the absolute simplest explanation.

Supposedly you can use data.tables with dplyr, but I haven't experimented with it in depth.

_2d30 1922 days ago

> data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm.

This is totally false. data.table inherits from data.frame. Sure, it has some extra attributes that a tibble doesn’t but the way classing works in R is so absurdly lightweight, that’s meaningless in comparison. Both tibble and data.table are data.frames at their core which are just lists of equal length vectors. You can pass a data.table wherever you pass a data.frame.
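A quick base-R sketch of how lightweight that classing is. The `my_tbl` class name is hypothetical and only there for illustration; it stands in for what tibble and data.table do with their own class names.

```r
# A data.frame is a list of equal-length column vectors plus a class attribute.
is.list(iris)                  # TRUE
length(unique(lengths(iris)))  # 1, i.e. every column has the same length

# Prepending a class name is all it takes to "subclass" a data.frame.
my_tbl <- iris
class(my_tbl) <- c("my_tbl", class(my_tbl))

is.data.frame(my_tbl)           # TRUE
inherits(my_tbl, "data.frame")  # TRUE
nrow(my_tbl)                    # 150, data.frame functions still work
```

Any method not defined for `my_tbl` falls through to the data.frame one, which is why such objects can be passed wherever a data.frame is expected.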

rcthompson 1921 days ago

Thank you for the correction. I knew that tibbles were essentially just data frames with an extra class attribute, but for some reason I didn't realize this was also true of data.table. I think I assumed that data.table's reference semantics couldn't be implemented on top of the existing data frame class, but I guess I'm wrong about that. Unfortunately it's too late for me to edit my original comment.

kkoncevicius 1921 days ago

Tibbles are not just data frames with an extra class attribute. For one, they don't have row names. Second, consider this example, demonstrating how treating tibbles as data frames can be dangerous:

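    library(tibble)

    df_iris <- iris
    tb_iris <- tibble(iris)

    # x[, colname] drops to a vector for a data frame,
    # but stays a one-column tibble for a tibble
    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    #> [1] 3

    nunique(tb_iris, "Species")
    #> [1] 1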

R-devel mailing list had a long discussion about this too: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...
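One way to sidestep that difference, as a minimal sketch: extract the column with `[[`, which returns the column vector for data frames and tibbles alike. The tibble part only runs if the tibble package happens to be installed.

```r
# `[[` extracts a single column as a vector, regardless of the table class.
nunique <- function(x, colname) length(unique(x[[colname]]))

nunique(iris, "Species")  # 3

# Optional check against a tibble, skipped when tibble is not available.
if (requireNamespace("tibble", quietly = TRUE)) {
  nunique(tibble::as_tibble(iris), "Species")  # also 3
}
```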

rcthompson 1921 days ago

Ok, fine, to be more precise, tibbles and data frames and data tables are all implemented as R lists whose elements are vectors which form the columns of the table. And also `is.data.frame` currently returns TRUE for all of them, whether or not that is ultimately correct.

lordgroff 1922 days ago

dtplyr, the dplyr backend for data table, is still IMHO not great, and will often break in subtle and not so subtle ways. Tidytable is, I think, a much more interesting implementation, and gets close to the same speeds.

rcthompson 1922 days ago

Hmm, this looks very interesting! I've ended up preferring dplyr for its expressiveness in spite of the speed difference, so this might be a nice compromise for when dplyr gets too slow.

orhmeh09 1922 days ago

Oh, I know that, I use it daily and I’ve read some of its source code. I’m just astonished that the best-performing data frame library in the world is developed in R and that it outperforms engines with million/billion dollar companies behind them.

hugh-avherald 1922 days ago

data.table is written primarily in C. But R happens to have a very good package system and a very good interface to C code.

And Matt Dowle has bled for that C code.

dm319 1921 days ago

I feel like some of it is to do with the way R's generics work - being lisp-based and making use of promises. It allows for nice syntax / code while interfacing the C backend.

ineedasername 1922 days ago

Me too: I've tended to let the database do a lot of heavy lifting before I bring data in. Maybe I don't actually need to do that.

FridgeSeal 1922 days ago

There’s really no harm in doing that, and it’s still a pretty good idea.

I generally try to get my data sources as far as possible with the database, then leave framework/language-specific things to the last step. If nothing else, that means someone else picking up your dataset in a different language/framework toolset doesn’t need to pick up yours as a dependency, and you’re not spending time re-implementing what a database can already do (and can do more portably).

ineedasername 1922 days ago

The only downside to letting the database do some of the pre-processing is that I don't have a full raw data set to work with within either R or Python. If I decide I need an existing measure aggregated up to a different level, or a new measure, I've got to go back to the database and then bring in an additional query. So I have less flexibility within the R or Python environment. But you make a good point: there are trade-offs either way, and keeping the dataset as something like a materialized view on the database makes it a little more open to others' usage.