by nomel 1063 days ago
> I mean, hopefully why you might need multiprocessing in python is clear?

My experience, after first getting into python, was:

I needed to do something concurrently on one set of data.

Python threading doesn't provide concurrent execution, so my program slowed down when I used threads.

So, I tried multiprocessing. My program slowed down even more, because any communication between processes uses pickle. I was trying to process one dataset in parallel, and pass big chunks back, for a final processing.

So, I saved it to disk and loaded the dataset into each process, which multiplied my memory usage by 16.

I then threw it all out and wrote the performant bits in C++, using swig to automagically make the python interface for it.

So, knowing why (concurrency) isn't necessarily enough.
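
To make the pickle cost mentioned above concrete, here is a minimal sketch. It does not start any processes; it only serializes and restores a list the way arguments and return values are serialized when they cross a process boundary. The `chunk` contents and size are made up for illustration and are not the original dataset.

```python
import pickle

# Hypothetical stand-in for "a big chunk" of results; the size is an assumption.
chunk = list(range(1_000_000))

# Data crossing a process boundary is serialized on one side...
payload = pickle.dumps(chunk)

# ...and rebuilt as a brand-new object on the other side.
restored = pickle.loads(payload)

print(f"{len(payload) / 1e6:.1f} MB serialized for {len(chunk):,} ints")
print("same contents:", restored == chunk, "| same object:", restored is chunk)
```

Every large object sent to or returned from a worker pays this serialize-and-copy cost, which is likely why passing big chunks back made things slower rather than faster.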

1 comment

I mean, of course? Concurrency != parallelism, so that makes sense.

Where multiprocessing shines is when you have an algorithm that can be fully parallelized and represented in a baby map-reduce framework, where the data being sent to each process isn't too big. The idiom I often reuse is literally the first example in the python docs:

  from multiprocessing import Pool

  def f(x):
    return x*x

  if __name__ == '__main__':
    with Pool(5) as p:
      print(p.map(f, [1, 2, 3]))
      
      # to give more of a map reduce flavor:
      print(sum(p.map(f, [1, 2, 3])))

This can be modified to fit a pretty big range of tasks in my use cases and chop hours off my workflows. It's so much faster in data science workflows to just have extra cores and use them than to spin up a cluster for distributed compute.
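
As one possible way to stretch that idiom, here is a rough sketch where each worker reduces its own chunk and hands back a single number, so very little has to be pickled on the way back. The helper names `chunked` and `partial_sum_of_squares`, the input size, and the worker count are all made up for illustration.

```python
from multiprocessing import Pool


def chunked(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def partial_sum_of_squares(chunk):
    """Map step: reduce one chunk to a single number inside the worker."""
    return sum(x * x for x in chunk)


if __name__ == '__main__':
    data = list(range(1_000_000))  # illustrative input, not a real dataset
    with Pool(4) as p:  # worker count is an arbitrary choice
        partials = p.map(partial_sum_of_squares, chunked(data, 100_000))
    print(sum(partials))  # reduce step happens in the parent process
```

The chunks still get pickled on the way out to the workers, so this only helps when the per-chunk work outweighs the cost of shipping the chunk.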