by hcrisp 3804 days ago
Impressive, but it seems an inversion of paradigms. Small data-to-compute ratios are usually associated with high performance computing (HPC). Why use Spark when the data is small and is broadcast to each worker? You have to pay the serialization-deserialization penalties of moving the data from Python to the JVM and back again. In fact the JVM isn't really needed here at all, since all the computation is done in the pure-Python workers in an embarrassingly parallel way. Seems to me that you would just move onto an HPC system and use TensorFlow within an IPython.parallel paradigm and be done much sooner.

The "broadcast" is pretty cheap because often you already have the data in some distributed file system, or, if on a single node, the network bandwidth is pretty high. The problem with a lot of deep learning workloads is that they are very compute-intensive and as a result take a long time to run. For example, it is not uncommon for some models to take a week to train.
Deep learning workloads are typically compute-intensive, but they also tend to be extremely I/O-intensive, and convergence may depend on a synchronous step where all the nodes must finish making their contribution to the model before any of them can continue. (This may not be quite true though -- see Google's DistBelief paper -- but most frameworks work this way.) Oftentimes, adding more machines to a cluster may make training proportionally slower.
Did you actually read the article? It was using Spark to parallelize hyperparameter tuning, which is embarrassingly parallel.
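To make "embarrassingly parallel" concrete, here is a minimal sketch in plain Python, not the article's code. `train_and_score`, the toy objective, and the grid values are all hypothetical stand-ins; the point is only that each hyperparameter combination is evaluated independently, with no communication between runs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_score(params):
    """Hypothetical stand-in for one full training run; returns (score, params)."""
    learning_rate, hidden_units = params
    # Toy objective that peaks at learning_rate=0.1, hidden_units=64.
    score = -abs(learning_rate - 0.1) - abs(hidden_units - 64) / 1000.0
    return score, params


def grid_search(learning_rates, hidden_sizes, max_workers=4):
    """Evaluate every combination independently and return the best (score, params)."""
    grid = list(product(learning_rates, hidden_sizes))
    # Threads keep the sketch short; real CPU-bound training would use
    # separate processes or machines, which is the role Spark plays in the article.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_and_score, grid))
    return max(results)


best_score, best_params = grid_search([0.01, 0.1, 1.0], [32, 64, 128])
```

Because no run depends on another, the only coordination is collecting the scores at the end, which is why almost any job runner can do this.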
Why not just use GNU Parallel (or something similar) instead of Spark?
I think this could have been done with GNU Parallel. One advantage I see with Spark is that it is easier to interact with Python; for example, these two lines are all that is needed to call the relevant Python function:

  urls = sc.parallelize(batched_data)
  labelled_images = urls.flatMap(apply_batch)
So if you already have a cluster with Spark installed (like Databricks does) then it takes less work to just call your Python code than to set up a GNU Parallel cluster and write a small wrapper script. Additionally, a Python script would have to load/init the models on every call from Parallel. I agree that this is not a great demonstration of Spark's main strengths.
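The load/init point can be sketched as follows. This is an illustration only: `get_model`, the fake labelling logic, and the example URLs are hypothetical, and the list comprehension at the end just mimics `flatMap` locally. It assumes the worker process stays alive between batches, so the cached model is reused rather than reloaded.

```python
import functools

LOAD_COUNT = 0  # only here to make the one-time load observable


@functools.lru_cache(maxsize=None)
def get_model():
    """Hypothetical expensive model load; cached so each process pays for it once."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    # Fake "model": labels an image URL by its file name.
    return lambda url: (url, "label-for-" + url.rsplit("/", 1)[-1])


def apply_batch(batch):
    """Label every URL in one batch, reusing the already-loaded model."""
    model = get_model()
    return [model(url) for url in batch]


batched_data = [
    ["http://example.com/a.jpg", "http://example.com/b.jpg"],
    ["http://example.com/c.jpg"],
]
# Local equivalent of urls.flatMap(apply_batch): flatten the per-batch results.
labelled_images = [pair for batch in batched_data for pair in apply_batch(batch)]
```

A script started fresh for every batch would run the load step each time, which is the overhead the comment is pointing at.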
I think one reason would be fault tolerance. Is there a fault tolerance layer in GNU Parallel? Last time I checked their homepage (a few minutes ago), there was no reference to fault tolerance.

Another reason is, perhaps, scheduling.

What fault tolerance does Spark give you in this scheme? It cannot look into TF progress and checkpoint all state. Using Spark with TF seems like overkill -- you need to manage and install two frameworks for what should ideally be a 200-line Python wrapper or a small Mesos framework at most.
Does --retries count as fault tolerance?
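For comparison, retry-style fault tolerance amounts to roughly the following. This is a hedged sketch in Python, not how GNU Parallel or Spark implements it; `run_with_retries` and `flaky_job` are hypothetical names. Note that a retried job starts over from scratch, so nothing here checkpoints partial training state.

```python
def run_with_retries(job, arg, retries=3):
    """Call job(arg), retrying on any exception, for up to `retries` attempts in total."""
    last_error = None
    for _attempt in range(retries):
        try:
            return job(arg)
        except Exception as error:  # a real runner would be more selective
            last_error = error
    raise RuntimeError(f"job failed after {retries} attempts") from last_error


attempts = {"count": 0}


def flaky_job(x):
    """Hypothetical job that fails on its first two attempts, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("simulated worker failure")
    return x * 2


result = run_with_retries(flaky_job, 21, retries=3)
```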
Oh dear. You're right, sorry. Shouldn't have commented before actually reading the article...