by stephanfroede 4093 days ago
My question: why Python?

Most (if not all) Big Data technologies are based on the JVM (Java and/or Scala), so why not just use JVM-based languages like Java, Scala and Clojure?

I have nothing against Python as such, but adding yet another language does not simplify the job.

1 comment

To understand why, I'll start with a quote from the Wikipedia page for "Big Data":

> Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set.

See the last sentence? That's the meaning used in this article. You can see that in "Mistake #1" when it mentions Python Pandas, which expects the data to fit into RAM. You can see more of it in "10s of gigabytes of data, the power of a scripting languange [sic] like Python, no matter how optimized, may not be enough" -- I have used Python to process 10s of GB of data, and renting a 60 GiB machine from Amazon costs $1.680/hour.
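To make the single-machine point concrete, here is a minimal sketch of streaming through a file one row at a time in plain Python, so the input never has to fit in RAM. It is only an illustration, not code from the article or from any project mentioned here; the function name `streaming_totals`, the column names `user` and `bytes`, and the sample rows are all made up.

```python
import csv
import io
from collections import defaultdict


def streaming_totals(lines, key_field, value_field):
    """Sum value_field per key_field, reading one row at a time.

    Memory use grows with the number of distinct keys, not with the
    number of rows, so the input does not have to fit in RAM.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)


# A tiny in-memory stand-in for a multi-gigabyte file (hypothetical data).
# With real data you would pass an open file object instead.
sample = io.StringIO("user,bytes\nalice,10\nbob,5\nalice,2.5\n")
result = streaming_totals(sample, "user", "bytes")
print(result)
```

The same loop works on a file of any size as long as the number of distinct keys stays manageable, which is often the case for the kind of aggregation people reach for a cluster to do.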

If you believe that it's not "Big Data" unless you need a cluster of machines to have enough RAM to work with it, then I can well understand why you might complain about Python in this context.

On the flip side, I've seen, or heard of, "Big Data" projects which start with the expectation that they will require a cluster, and never investigate whether 'traditional data processing applications' are adequate. E.g., in one project I developed optimizations that gave an overall 40x performance boost, so that only one machine was needed where previously my client required a cluster.

If Big Data includes using machine learning to identify patterns in data sets too large for people to understand, then we were using Python to do data mining of large chemical screening data in the late 1990s. Even earlier, Python was being used to control supercomputing tasks, where the high-level glue code was in Python and the low-level code in C or Fortran.

Unlike simple map-reduce jobs, these included codes with complex inter-node dependencies, like molecular dynamics, where the network bandwidth is another important performance factor. A molecular dynamics simulation can produce 10s of GB per day, so these can easily count as "Big Data" projects, yet they cannot be done using the JVM-based solutions you mentioned.

http://www.infoq.com/news/2014/01/bigdata-languages also gives a viewpoint on how little the specific language matters for big data analysis.