


	by meztez 2001 days ago
	R has data.table. It is the game changer, as I agree that base R data.frames do not cut it for performance. tibble will come close once it incorporates more of data.table's performance tricks. https://h2oai.github.io/db-benchmark/

2 comments

hnracer 2001 days ago

Does R have robust CSV parsing? I remember using the default reader, and it was extremely finicky about getting the header and index flags right. It also wouldn't typecast numeric columns properly (instead they'd end up as factors and not play nice).
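For anyone wondering how a numeric column ends up non-numeric: a minimal sketch, in Python rather than R and using only the standard library, of per-column type inference. The sample data and the `infer_column_types` helper are made up for illustration, and this is not how R's reader is implemented. It only shows that a single stray token such as "N/A" is enough to make a whole column fall back to text unless the reader is told to treat it as missing.

```python
import csv
import io

# Hypothetical sample: the "price" column contains one non-numeric entry.
SAMPLE = "id,price\n1,9.99\n2,N/A\n3,4.50\n"


def infer_column_types(text, na_strings=()):
    """Label a column 'numeric' only if every non-missing value parses as a float."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    types = {}
    for i, name in enumerate(header):
        try:
            for row in data:
                # Values listed in na_strings are treated as missing and skipped.
                if row[i] not in na_strings:
                    float(row[i])
            types[name] = "numeric"
        except ValueError:
            # One unparseable value demotes the whole column to text.
            types[name] = "text"
    return types


default_types = infer_column_types(SAMPLE)
with_na_types = infer_column_types(SAMPLE, na_strings=("N/A",))
print(default_types)
print(with_na_types)
```

Declaring the missing-value marker up front is usually the simplest way to keep such a column numeric.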


st1ck 2001 days ago

The Python version of data.table has very fast CSV parsing (compared to pandas), and it didn't have issues like those you mention. Even if data.table had issues with CSV parsing, you could probably use Apache Arrow to parse the CSV into an Arrow table and then convert it to a data.table (though that is probably suboptimal).


alexhutcheson 2001 days ago

https://readr.tidyverse.org/


bostonfincs 2001 days ago

Personally, I have never had a problem with R's CSV parsing.


disgruntledphd2 2001 days ago

It happens, but mostly because other formats don't produce usable CSVs. The biggest problem is when there are free-entry text fields (common for customer/business names) and those fields aren't fully quoted; in that case base R will break.

I believe both fread and readr::read_csv do the right thing here, but the base-R perspective on data manipulation before read.csv is to use Perl (the R-core team are pretty old-school, to be fair).
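The quoting problem is easy to reproduce outside R. Here is a minimal sketch using Python's standard `csv` module (so it says nothing about how base R, fread, or readr behave internally); the rows and the "Acme, Inc." name are invented for illustration. An unquoted comma inside a free-text field yields an extra column, while a fully quoted export round-trips cleanly.

```python
import csv
import io

# Hypothetical exports: the business name contains a comma.
UNQUOTED = "id,name,revenue\n1,Acme, Inc.,1000\n"
QUOTED = 'id,name,revenue\n1,"Acme, Inc.",1000\n'


def parse(text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text)))


unquoted_rows = parse(UNQUOTED)
quoted_rows = parse(QUOTED)

# The unquoted data row has 4 fields against a 3-column header.
print(len(unquoted_rows[0]), len(unquoted_rows[1]))
# The quoted data row keeps the name intact.
print(quoted_rows[1])

# On the producing side, quoting every field avoids the ambiguity.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow([1, "Acme, Inc.", 1000])
round_trip = parse(buffer.getvalue())[0]
print(round_trip)
```

If you control the export, full quoting is the cheap fix; if you don't, the file has to be repaired before any reader can recover the columns reliably.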


eyeball 2001 days ago

h2o's data.table clone is fine

https://github.com/h2oai/datatable
