by thauck 5024 days ago
I mostly agree with this, because I do like Ruby even though I don't use it.

The missing link here, and the reason Python gets more love from the data community, is that Python scales down to smaller data sets as well as it handles big ones. (Not sure if you meant it couldn't, but the distinction you make implies that.)


Python is surprisingly heavy-duty. But my kingdom for a seamlessly distributed or parallelized version of NumPy/SciPy! How nice would it be to just enter "C = A * B", with A living as a sparse CSC across many nodes?
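
For comparison, here is roughly what the single-machine version of that wish looks like with SciPy's sparse module today. This is only a sketch: the helper name `sparse_csc_product`, the sizes, and the density are made up for illustration, and nothing here is distributed. Recent SciPy prefers `@` for the matrix product, so that is used instead of `*`.

```python
import numpy as np
import scipy.sparse as sp


def sparse_csc_product(m, k, n, density=0.01, seed=0):
    """Build random CSC matrices A (m x k) and B (k x n); return (A, B, A @ B).

    Hypothetical helper for illustration; everything lives in one process.
    """
    A = sp.random(m, k, density=density, format="csc", random_state=seed)
    B = sp.random(k, n, density=density, format="csc", random_state=seed + 1)
    # "@" is the matrix product for SciPy sparse containers. The result is
    # itself sparse, so no dense m x n array is materialized.
    C = A @ B
    return A, B, C


if __name__ == "__main__":
    A, B, C = sparse_csc_product(2000, 3000, 1000)
    print(type(C).__name__, C.shape, C.nnz)
```

The syntax is already as short as "C = A * B"; the missing piece is having A and B partitioned across nodes behind the same interface.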
Would Disco (http://discoproject.org/) work for you?
I don't think MapReduce is a good abstraction for implementing linear algebra, and I expect the overhead to be too high (although I don't have numbers to back that up). For large problems (much more than a couple of machines' worth of RAM), you either use big-iron HPC solutions or avoid 'exact' linear algebra altogether and focus on one-pass algorithms.

For example, instead of computing an exact SVD, you would use something like a Hebbian algorithm to compute the SVD in a streaming manner (that's what Mahout implements, for example).
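
To make the streaming idea concrete, here is a minimal sketch using Oja's rule, the single-component special case of the generalized Hebbian algorithm. It is not Mahout's implementation, and it only estimates the top right singular vector of the data rather than a full SVD. The function name `oja_top_component`, the learning rate, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np


def oja_top_component(rows, dim, lr=1e-3, seed=0):
    """Estimate the top right singular vector of a row stream in one pass.

    Uses Oja's rule: w <- w + lr * y * (x - y * w), with y = w . x.
    Each row is seen once and only the vector w is kept in memory.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    for x in rows:
        y = w @ x                  # projection of the row onto the estimate
        w += lr * y * (x - y * w)  # Hebbian term plus a norm-stabilizing decay
    return w / np.linalg.norm(w)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic rows with one dominant direction, rotated by a random basis Q.
    Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
    X = (rng.normal(size=(20000, 5)) * np.array([3, 1, 0.5, 0.3, 0.1])) @ Q.T
    w = oja_top_component(X, dim=5)
    # Compare against the exact answer, which needs the whole matrix in memory.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    print("abs cosine with exact top singular vector:", abs(w @ Vt[0]))
```

The trade-off is the one described above: the result is approximate and depends on the learning rate, but the memory cost is a single vector regardless of how many rows stream past.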

No, the sparse matrix code in SciPy is plain C (not even multi-core, let alone distributed).

EDIT: or did you mean Disco offers distributed sparse CSC operations?

We (http://continuum.io/) are working on this.
Agreed, Python scales down and so is also good for small tasks. What I was saying is that Ruby - as much as I like it - does not generally scale up beyond a certain point.

Both R and Ruby have had issues with large data sets, which have been addressed to some degree in other distributions and more recent releases. Python is ready out of the box for large data sets. So what I meant to communicate is that if you know you are going to be dealing with a large data set, you might as well go straight to Python.