| HN Mirror


	by hadley 4191 days ago
	Data tables are extremely fast, but I think their concision makes them harder to learn, and code that uses them is harder to read after you've written it. It's very reminiscent of APL.

2 comments

arun_sriniv 4186 days ago

data.table's `DT[i, j, by]` form is actually quite consistent and is comparable to SQL: i = WHERE, j = SELECT | UPDATE, and by = GROUP BY.

This form always stays the same. For example:

  library(data.table)
  DT = data.table(x=3:7, y=1:5, z=c(1,2,1,1,2))

  DT[x >= 5, mean(y), by=z]        ## calculates the mean of y, grouped by z, on
                                   ## rows where x >= 5

  DT[x >= 5, y := cumsum(y), by=z] ## updates y in place with its cumulative sum,
                                   ## grouped by z, on rows where x >= 5

"Harder to read after you've written it" and "harder to learn" are both very subjective and pointless. One could make very similar observations about `dplyr`, but I'll refrain from it here.

I implore readers to take a look at the more than 100 reviews from users of the package on crantastic: http://crantastic.org/packages/data-table

Keeping the `i`, `j` and `by` operations together allows optimising for speed and, more importantly, memory usage (all under one consistent syntax). These are two very important aspects, especially when working on really huge data sets (10-100GB in RAM or more).

Here's a detailed benchmark (only on grouping so far) on 10 million (in MB) to 2 billion rows (100GB): https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...


ajinkyakale 4186 days ago

I agree with what Hadley said in some ways. It takes a bit more time to get used to the [i, j, by] notation, and I personally feel it's unlike most R syntax. But I don't see that stopping me from using something as fast as data.table.


arun_sriniv 4185 days ago

ajinkyakale, "harder to learn" doesn't capture the fact that data.table provides many features that dplyr, for example, just doesn't. And in addition, it is fast and memory efficient.

Rolling joins, for example, are a slightly harder concept to grasp because most of us don't know what a "rolling" join is (unless you work regularly with time series).
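
For readers who have not met the idea, here is a minimal sketch of a rolling join. The `quotes` and `trades` tables and their columns are hypothetical, made up for illustration; `roll = TRUE` and `on =` are arguments of `[.data.table` in recent versions of the package.

```r
library(data.table)

# Hypothetical data: quotes arrive at irregular times, trades happen in between.
quotes <- data.table(time = c(1, 5, 10), price = c(100, 101, 103))
trades <- data.table(time = c(2, 7, 12), qty = c(10, 20, 30))

# For each trade, take the most recent quote at or before the trade time
# (last observation carried forward). An ordinary join would find no
# exact matches here, because no trade time equals a quote time.
res <- quotes[trades, on = "time", roll = TRUE]
print(res)
```

Each row of `trades` is matched to the prevailing row of `quotes`, which is the typical time-series use of this kind of join.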

Aggregating while joining is hard to grasp not because the syntax is hard, but because the concept is inherently new. It allows us to perform operations in a more straightforward manner, which most people embrace after investing some time to understand it.
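
As a rough illustration of aggregating while joining, the sketch below uses `by = .EACHI`, which evaluates `j` once per row of `i`. The `orders` and `wanted` tables are hypothetical and exist only for this example.

```r
library(data.table)

# Hypothetical data: order lines and a small table of customers of interest.
orders <- data.table(cust   = c("a", "a", "b", "c", "c", "c"),
                     amount = c(10, 20, 5, 1, 2, 3))
wanted <- data.table(cust = c("a", "c"))

# Join orders to `wanted` and sum `amount` for each row of `wanted`
# in one step, instead of building the joined table and then grouping it.
totals <- orders[wanted, .(total = sum(amount)), on = "cust", by = .EACHI]
print(totals)
```

The join and the aggregation happen in the same `[i, j, by]` call, so the intermediate joined table never has to be held as a separate object.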

Binary-search-based subsetting, e.g. `DT[J(4:6)]`, is yet another new concept. One could use base R syntax and vector scans to subset instead. But once you learn the difference between a vector scan and a binary search, you obviously don't want to vector scan. Now we can say that learning the difference between "vector scan" and "binary search" is really hard, but that'd be missing the point.

DT[x %in% 4:6] now internally uses binary search by constructing an index automatically! So you can keep using base R syntax.
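
A small sketch of the two subsetting styles, reusing the `DT` from the example upthread. The variable names `keyed` and `scanned` are only for illustration, and whether the second form actually goes through an automatic index depends on the data.table version and its options.

```r
library(data.table)

DT <- data.table(x = 3:7, y = 1:5, z = c(1, 2, 1, 1, 2))

# Keyed subset: sort by x, mark it as the key, then look rows up
# by binary search on that key.
setkey(DT, x)
keyed <- DT[J(4:6)]

# Base-R-style syntax for the same rows; data.table may optimise this
# internally rather than scanning the whole column.
scanned <- DT[x %in% 4:6]

print(keyed)
print(scanned)
```

Both forms return the rows where x is 4, 5 or 6; the difference is in how the rows are located, which only matters once the table is large.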

And dplyr doesn't have any of these features.

In short, a huge part of "bit more time to get used" is due to data.table introducing concepts that aren't available in other tools/packages for faster and more efficient data manipulation. And I say this as a data.table user turned developer.

"harder to read after writing it" is very very subjective. I don't know what to say to that.
