HN Mirror



	by Annatar 2615 days ago
	I cannot understand why I would use Python over R. R is designed from the ground up for massive amounts of data processing at speed and with ease. Even if Python continues accreting computational functionality, it will never be as fast or as efficient as R. Improving Python for something R is designed to do seems to me to be a huge waste of time: familiarity should not be the driving force behind replicating R's functionality. That's just so wrong.


kuzehanka 2615 days ago

> R is designed from the ground up for massive amounts of data processing at speed

What? The R ecosystem doesn't provide meaningful out-of-core capabilities, never mind the ability to handle anything approaching 'massive amounts of data'.

-- Would sure love to know why an agenda-less factual comment is getting downvoted.


javierluraschi 2615 days ago

In my experience, R is really fast, since it was designed to store data in a columnar format, which we now all know is best for data analysis. So, in most cases, scaling up computation is quite easy. To scale out, you can use Apache Spark with R; sparklyr, the interface I’ve worked on, is quite easy to use and lets you scale out computation. Just to give you an example of what’s possible, I was playing around yesterday with a ray tracing prototype someone is building and scaled it out in Spark, see https://twitter.com/javierluraschi/status/112055769372135424... It’s a misconception that R is slow or can’t scale.


kuzehanka 2615 days ago

You can plug any compute kernel you want into spark, that's not a pro or con of R.

Column stores are standard in any analytics pipeline today. They make up Python's Pandas, R's dplyr, and Java's DataFrame. How or why does R stand out for 'massive amounts of data'?

R does not have meaningful out-of-core compute offerings that compare with something like Dask.

R does not at all have cluster compute offerings that compare to Dask Distributed.

If you want to know what real performance looks like, check out Python's cudf, which will shortly fully match the Pandas API. That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.

Whatever advantages R has, perf or scalability are definitely not amongst them.
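
For readers unfamiliar with the term, "out of core" above means processing data that does not fit in memory by streaming it in pieces. A minimal sketch of the idea in plain Python follows. It is only an illustration under stated assumptions: `streaming_mean`, `chunk_size`, and the sample data are hypothetical names made up here, and this is not how Dask or any R package implements it.

```python
import csv
import io


def streaming_mean(lines, column, chunk_size=1000):
    """Mean of one CSV column, holding at most chunk_size values in memory."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    chunk = []
    for row in reader:
        chunk.append(float(row[column]))
        if len(chunk) >= chunk_size:
            # Fold the chunk into running totals, then release it.
            total += sum(chunk)
            count += len(chunk)
            chunk.clear()
    # Flush the final, possibly partial, chunk.
    total += sum(chunk)
    count += len(chunk)
    return total / count if count else float("nan")


# Hypothetical in-memory stand-in for a file too large to load at once.
sample = io.StringIO("x\n1\n2\n3\n4\n")
print(streaming_mean(sample, "x", chunk_size=2))
```

Passing an open file handle instead of the `StringIO` object would stream from disk the same way, since `csv.DictReader` accepts any iterable of lines.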


Annatar 2615 days ago

You are arguing for Python and speed in the same breath? If you want portable speed, you better "warm up a chair" and master Fortran.

Bonus: modern Fortran is a joy to develop in, far more fun than Python. And you get to compile to machine code, either for a processor or a GPU.


tylermw 2615 days ago

> That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.

I don't see how the "GPU DataFrames" provided in cuDF would enhance a raytracer in any way.


bicubic 2615 days ago

You don’t see how a GPU-accelerated numeric array would speed up ray tracing?


tylermw 2615 days ago

The bottlenecks for raytracing are primarily in scene traversal and intersection testing, which do not benefit from a GPU-accelerated array structure.
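
As a rough illustration of the "intersection testing" and "scene traversal" work mentioned above, here is a minimal Python sketch of a ray-sphere test with a brute-force loop over the scene. The function names and the sphere-only scene are assumptions made for this example and are not taken from the prototype discussed in the thread. Real ray tracers usually replace the linear loop with an acceleration structure such as a bounding volume hierarchy.

```python
import math


def ray_sphere_t(origin, direction, center, radius):
    """Distance along a unit-length ray to the nearest hit, or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c  # discriminant of t^2 + 2bt + c = 0
    if disc < 0:
        return None  # the ray's line never touches the sphere
    sqrt_disc = math.sqrt(disc)
    for t in (-b - sqrt_disc, -b + sqrt_disc):
        if t > 1e-9:  # ignore hits behind (or exactly at) the ray origin
            return t
    return None


def closest_hit(origin, direction, spheres):
    """Brute-force traversal: test every sphere, keep the nearest hit."""
    best = None
    for index, (center, radius) in enumerate(spheres):
        t = ray_sphere_t(origin, direction, center, radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return best  # (distance, sphere index) or None


# Hypothetical two-sphere scene on the z axis.
scene = [((0.0, 0.0, 5.0), 1.0), ((0.0, 0.0, 10.0), 1.0)]
print(closest_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))
```

Each ray takes a different path through the hit and miss branches, so the work per ray depends on the scene rather than being one uniform operation over a column of numbers.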
