by iskander 3695 days ago
In theory, Spark lets you seamlessly write parallel computations without sacrificing expressivity. You perform collections-oriented operations (e.g. flatMap, groupBy) and the computation gets magically distributed across a cluster (alongside all necessary data movement and failure recovery).
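
To make "collections-oriented" concrete, here is a rough single-machine sketch in plain Python of what flatMap and groupBy compute. The helpers `flat_map` and `group_by` are hypothetical stand-ins written for illustration, not Spark's API; Spark would apply the same logic to partitions spread across a cluster.

```python
from collections import defaultdict
from itertools import chain

def flat_map(func, items):
    """Apply func to each item and flatten the resulting iterables into one list."""
    return list(chain.from_iterable(func(item) for item in items))

def group_by(key_func, items):
    """Group items into a dict keyed by key_func(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

# Word count over two made-up input lines
lines = ["spark is fast", "spark is simple"]
words = flat_map(str.split, lines)
counts = {word: len(group) for word, group in group_by(lambda w: w, words).items()}
```

The point of the API is that the calling code would look about this simple even when `lines` is far too large for one machine.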

In practice, Spark seems to perform reasonably well on smaller in-memory datasets and on some larger benchmarks under the control of Databricks. My experience has been pretty rough for legitimately large datasets (can't fit in RAM across a cluster) -- mysterious failures abound (often related to serialization, fat in-memory representations, and the JVM heap).

The project has been slowly moving toward an improved architecture for working with larger datasets (see Tungsten and DataFrames), so hopefully this new release will actually deliver on the promise of Spark's simple API.


Thanks for the reply, but I was looking for a use case, e.g. "with Spark I was able to do X". I don't even know where Spark would be applied.
We use it for two things:

* distributed machine learning tasks using their built-in algorithms (although note that some of them, e.g. LDA, just fall over with not-even-that-big datasets)

* as a general fabric for doing parallel processing, like crunching terabytes of JSON logs into Parquet files or doing random transformations of the Common Crawl
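
As a rough illustration of the second kind of job, here is a stdlib-only Python sketch that parses partitions of JSON log lines in parallel and pivots them into columns. Everything in it is an assumption made for illustration: the log lines, the field names, and the helpers `parse_partition` and `to_columns`. A thread pool stands in for the cluster, and the actual Parquet write is left out because it needs a third-party library.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def parse_partition(lines):
    """Parse one partition of JSON log lines, skipping malformed records."""
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # drop lines that are not valid JSON
    return records

def to_columns(records, fields):
    """Pivot row records into a column-oriented dict, similar to a columnar file layout."""
    return {field: [rec.get(field) for rec in records] for field in fields}

# Hypothetical log lines, already split into two partitions
partitions = [
    ['{"user": "a", "status": 200}', "not json"],
    ['{"user": "b", "status": 404}'],
]

# pool.map keeps the partition order, so the output is deterministic
with ThreadPoolExecutor(max_workers=2) as pool:
    parsed = [rec for part in pool.map(parse_partition, partitions) for rec in part]

columns = to_columns(parsed, ["user", "status"])
```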

As a developer, it's really convenient to spin up ~200 cores on AWS spot instances for ~$2/hr and get fast feedback as I iterate on an idea.

It originally billed itself as a replacement for Hadoop and MapReduce: an in-memory data processing pipeline. MR programs typically chain many sequential MR jobs and save the output of each job to HDFS before the next one starts. Spark covers these use cases while keeping the pipeline in memory, and it has expanded its capabilities since those early days.
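
The difference between the two styles can be sketched in a few lines of plain Python. This is only an analogy under stated assumptions: a local temporary file stands in for HDFS, and `stage_one`, `stage_two`, `run_with_disk` and `run_in_memory` are hypothetical names, not part of Hadoop or Spark.

```python
import json
import os
import tempfile

def stage_one(numbers):
    """First job: square every input value."""
    return [n * n for n in numbers]

def stage_two(squares):
    """Second job: sum the even squares."""
    return sum(s for s in squares if s % 2 == 0)

def run_with_disk(numbers):
    """MapReduce style: persist the first job's output, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage_one.json")
        with open(path, "w") as f:
            json.dump(stage_one(numbers), f)
        with open(path) as f:
            return stage_two(json.load(f))

def run_in_memory(numbers):
    """In-memory style: hand the intermediate result straight to the next stage."""
    return stage_two(stage_one(numbers))

data = [1, 2, 3, 4]
disk_result = run_with_disk(data)
memory_result = run_in_memory(data)
```

Both paths compute the same answer; the in-memory one simply skips the write and read between stages, which is where a long chain of jobs spends much of its time.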

So, real-world use cases? Any MR use case should be doable in Spark. There are plenty of companies using Spark to create analytics from streams, and some are using it for its ML capabilities (sentiment analysis, recommendation engines, linear models, etc.).

I apologize if my comment isn't as specific as you're looking for, but I know of people who use it for exactly the scenarios I've outlined above. We are probably going to use it as well, but I don't have a use case to share just yet (at least nothing concrete at the moment). Hopefully this gives you some idea of where Spark fits.

I think your question is oriented towards X being a business problem.

Netflix has users (say 100M) who have liked some movies (out of, say, 100k). The question is: for every user, find movies he/she would like but has not seen yet.

The dataset in question is large, and you have to answer this question with data regarding every user-movie pair (that would be 1e13 pairs). A problem of this size needs to be distributed across a cluster.
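
A quick back-of-the-envelope check of that figure, in Python. The user and movie counts are the "say" numbers from above; the 4 bytes per score is purely an assumption (one 32-bit float per pair) to give a sense of scale.

```python
users = 100_000_000      # "say 100M" users
movies = 100_000         # "say 100k" movies
pairs = users * movies   # every user-movie pair: 10**13

bytes_per_score = 4      # assumption: one 32-bit float per pair
total_tb = pairs * bytes_per_score / 1e12  # decimal terabytes
```

Under that assumption, materialising a score for every pair would take about 40 TB, far more than one machine's RAM.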

Spark lets you express computations across this cluster, letting you explore the problem. Spark also provides a fairly rich machine learning toolset [1], including ALS-WR [2], which was developed specifically for a competition organised by Netflix and got great results [3].

[1] http://spark.apache.org/docs/latest/mllib-guide.html
[2] http://spark.apache.org/docs/latest/mllib-collaborative-filt...
[3] http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/...

We use Spark essentially as a distributed programming framework for data processing: anything you can do on a small dataset on a single server, you can do on a huge dataset across 20 servers or 2000 servers with minimal extra development.
We primarily use it to aggregate a large-ish (10 TB/day) amount of data for insertion into an analytics database.

The code is very straightforward and it is fast.