


	by crayola 4857 days ago
	Interesting analysis, but it would really benefit from a section about data.table. For me and many others, data.table has almost completely replaced data.frame (of which data.table is a subclass) and completely replaced plyr. The speed and ease of use of data.table compare much more favorably with pandas than the R tools mentioned here do.

2 comments

wesm 4857 days ago

Comparisons with data.table on performance are much more favorable than with vanilla R or plyr; a lot of progress has been made in the last couple of years, too. I personally find the data.table syntax a bit obtuse at times, but it's a great library.


oddthink 4857 days ago

Aside from the performance differences, data.table makes interactive manipulation very easy, at the cost of being hard to program against. Pandas currently goes in the opposite direction.

I'd rather have R/data.table at the prompt and Python/pandas in my script, but if you have to err on one side, the Python/pandas "low magic" approach is the side to err on. Pandas does have its own strange corners, though. For example, it seems to try hard to store similarly typed columns in contiguous matrices, which leads to some unexpected casting, and I have no idea what the supposed benefit is over just keeping distinct columns.
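
For anyone who hasn't run into the casting, here is a minimal sketch of a few places where pandas upcasts values. It does not look at the internal storage at all, and it may not be the exact case meant above; the frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column frame: one integer column, one float column.
df = pd.DataFrame({"count": [1, 2, 3], "price": [1.5, 2.5, 3.5]})

# Each column keeps its own dtype while it stays a column.
column_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}

# Pulling out a single row has to fit both values into one Series,
# so the integer value is upcast to float64.
row = df.iloc[0]

# Converting the whole frame to a single NumPy array upcasts the same way.
values = df.to_numpy()

# Reindexing an integer Series onto a missing label introduces NaN,
# which forces the default integer dtype up to float64.
ints = pd.Series([1, 2, 3])
reindexed = ints.reindex([0, 1, 5])

print(column_dtypes)
print(row.dtype, values.dtype, reindexed.dtype)
```

The row and array cases follow from needing one dtype for mixed values; the reindexing case comes from NaN having no representation in the default integer dtype.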


takluyver 4857 days ago

I'd guess the benefits are related to performance - Wes is known as something of a speed junkie (see also his vbench project). I know there's quite a bit of code in pandas that makes it much faster than a naive implementation of a similar interface.

That said, if it causes unexpected behaviour, check to see whether it's a bug.


dagw 4857 days ago

If your use case fits data.table, then you probably want to use PyTables in Python. It's much faster than pandas when dealing with very large data sets, at the cost of some features you may or may not need.


crayola 4857 days ago

The benefit of data.table is not so much processing very large data (anything above 10M observations is mostly outside R's comfort zone on a reasonable machine anyway) as the ease of performing operations such as indexed joins and aggregations.
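
To make "indexed joins and aggregations" concrete for readers on the Python side of this thread, here is a rough pandas sketch of the same kind of operation. It is not data.table code, and the tables and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame(
    {"customer": ["a", "b", "a", "c"], "amount": [10.0, 20.0, 5.0, 7.5]}
)
customers = pd.DataFrame(
    {"customer": ["a", "b", "c"], "region": ["north", "south", "north"]}
).set_index("customer")

# Indexed join: look up each order's region through the index of `customers`.
joined = orders.join(customers, on="customer")

# Aggregation: total order amount per region.
totals = joined.groupby("region")["amount"].sum()

print(joined)
print(totals)
```

The join goes through the index on `customers` rather than scanning a column, which is the rough counterpart of joining on a keyed table.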
