by LiamPa 3217 days ago
Have you timed it? Starting threads in Python is slow...

Considering everything that is involved in making a request over the internet, multithreading would have to be spectacularly slow to even come close to making the serial approach quicker:

  $ python quicktest.py 
  ['http://www.google.com', 'http://news.bbc.co.uk', 'http://news.ycombinator.com', 'http://www.cnn.com', 'http://www.foxnews.com', 'http://www.msnbc.com']
  [<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]
  Serial: 1.23853206635
  [<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]
  Multiprocess: 0.912357807159
  [<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]
  Multithreaded: 0.708998918533
edit: Here's the code:

  import requests
  import time
  from multiprocessing import Pool
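  # multiprocessing.dummy exposes the same Pool API backed by threads;
  # plain multiprocessing.Pool would start processes here too.
  from multiprocessing.dummy import Pool as ThreadPool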
  
  
  session = requests.Session()
  
  urllist = ['http://www.google.com',
             'http://news.bbc.co.uk',
             'http://news.ycombinator.com',
             'http://www.cnn.com',
             'http://www.foxnews.com',
             'http://www.msnbc.com']
  # Warm up?
  responses = []
  for url in urllist:
      responses.append(session.get(url))
  
  print urllist
  
  start = time.time()
  responses = []
  
  for url in urllist:
      responses.append(session.get(url))
  
  print responses
  print "Serial: {}".format(time.time()-start)
  
  start = time.time()
  
  pool = Pool()
  responses = pool.map(requests.get, urllist)
  
  print responses
  print "Multiprocess: {}".format(time.time()-start)
  
  start = time.time()
  pool = ThreadPool()
  responses = pool.map(requests.get, urllist)
  
  print responses
  print "Multithreaded: {}".format(time.time()-start)
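
For anyone who wants to reproduce the shape of this comparison without hitting real sites, here is a minimal sketch that swaps the HTTP call for a sleep. `fake_fetch`, `compare`, and the 50ms delay are made up for illustration and are not part of the script above; the numbers it prints say nothing about real network timings.

```python
from __future__ import print_function  # runs on Python 2 and 3

import time
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool


def fake_fetch(delay):
    """Hypothetical stand-in for requests.get: blocks for `delay` seconds."""
    time.sleep(delay)
    return delay


def compare(n_tasks=6, delay=0.05):
    """Return (serial_seconds, threaded_seconds) for n_tasks simulated requests."""
    delays = [delay] * n_tasks

    # Serial: one blocking call after another.
    start = time.time()
    for d in delays:
        fake_fetch(d)
    serial = time.time() - start

    # Threaded: pool creation is inside the timed region, as in the script above.
    start = time.time()
    pool = ThreadPool(n_tasks)
    pool.map(fake_fetch, delays)
    pool.close()
    pool.join()
    threaded = time.time() - start

    return serial, threaded


if __name__ == "__main__":
    serial, threaded = compare()
    print("Serial: {:.4f}".format(serial))
    print("Multithreaded: {:.4f}".format(threaded))
```

Because the sleeps overlap in the threaded case, thread startup would have to cost more than the combined wait time before the serial loop could win.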
Have _you_ timed it? Not in general, but for this specific case? Thread creation is expensive relative to some operations, but the speed may be entirely irrelevant to the task at hand. In this case, the author of the article is auto-curating some articles from a list of people he finds interesting. If this were done once per day as a cron job, it could almost certainly be done entirely _serially_, with zero concurrency and full blocking, and still finish fine. Adding concurrency is nice, but at this volume any method will do.

This is certainly one of the cases where you should just do whatever is simplest (to _you_, the programmer). The first step is always to optimize for cognitive overhead, i.e. make the code easy to reason about. Only after that (and relatively rarely) is it necessary to optimize for the actual bottlenecks in your code.

I went back and timed it. The overhead is at _most_ 100ms in my use case (there's some ambiguity because of other problems with the async implementation; I suspect it's actually lower than this). Given that many of the requests take 1s and this is a background task, that's totally fine.
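
One rough way to put a number on thread startup cost by itself is to time threads that do no work at all. This is only a sketch and not the measurement described above; `thread_overhead` and the thread count of 6 are assumptions chosen for illustration, and results will vary by machine and Python version.

```python
from __future__ import print_function  # runs on Python 2 and 3

import threading
import time


def noop():
    """Do nothing, so the timing covers only thread start and join."""
    pass


def thread_overhead(n_threads=6):
    """Return seconds spent starting and joining n_threads no-op threads."""
    threads = [threading.Thread(target=noop) for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


if __name__ == "__main__":
    print("Overhead for 6 threads: {:.6f}s".format(thread_overhead()))
```

Comparing that figure against the duration of a typical request shows whether the startup cost matters for the job at hand.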