Hacker News new | ask | show | jobs
by mmmmpancakes 1061 days ago
I mean, hopefully why you might need multiprocessing in python is clear?

If you have a python task that is highly parallelizable on a single machine with multiple cores, then multiprocessing is probably the right tool to quickly see if you can dramatically speed up your code with parallelism with basically no code overhead or investment in distributed solutions (there are edge cases where it is not, but it takes very little time to test if you are an edge case).

I encounter this situation in my data science workflow routinely. It is an easy way to impress product / managers and say "hey, I made this batch algorithm 50x faster, so now it runs in 10 minutes instead of 500."

3 comments

> I mean, hopefully why you might need multiprocessing in python is clear?

My experience, after first getting into python, was:

I needed to do something concurrently on one set of data.

Python threading doesn't provide concurrent execution, so my program slowed down when I used threads.

So, I tried multiprocessing. My program slowed down even more, because any communication between processes uses pickle. I was trying to process one dataset in parallel, and pass big chunks back, for a final processing.

So, I saved it to disk, loaded the dataset into each process, multiplying my memory usage by 16x.

I then threw it all out and wrote the performant bits in C++, using swig to automagically make the python interface for it.

So, knowing why (concurrency) isn't necessarily enough.

I mean, of course? Concurrency != parallelism, so that makes sense.

Where multiprocessing shines is when you have an algorithm that can be fully parallelized and represented in a baby map-reduce framework, where the data being sent to each process isn't too big. The idiom I often reuse is literally the first example in the python docs:

  from multiprocessing import Pool

  def f(x):
    return x\*x

  if __name__ == '__main__':
    with Pool(5) as p:
      print(p.map(f, [1, 2, 3]))
      
      # to give more of a map reduce flavor:
      print(sum(p.map(f, [1, 2, 3])))
This can be modified to fit a pretty big range of tasks in my use cases and chop hours off my workflows. It's so much faster in DS workflows to just have extra cores and use them than to spin up a cluster for distributed compute.
As a simple example, Threading in Python works good for i/o-bound operations like scraping a website, whereas Multiprocessing works best for CPU-bound operations like result hashing/transforming the data you just scraped.
That's the perfect situation many parallelization packages in python ecosystem take for granted. In my case, I had big datasets that wouldn't fit into memory (or I didn't want to demand this peak RAM usage on a server), and I had to write a parallelizer that read data in chunks and fed it to workers -- something very simple but unavailable in Pandas ecosystem (most packages assume you have swallowed the entire dataset, which isn't that large, and it's easy to throw the data to process-based workers)