Hacker News new | ask | show | jobs
by glic3rinu 2051 days ago
when you say parallel you mean the work is distributed across multiple cores? e.g. multiple python worker processes. Or is all work running single threaded in a single process (concurrent IO, but not true parallelism)?
1 comments

All running in a single thread - but the work is executed asynchronously. This is preferred for I/O heavy throughput such as many HTTP requests.
Why not distribute the concurrent work over multiple processes to get the most out of multiple cores? More event loops, better performance, and maybe more text preprocessing.
That would be a good idea, but it's non-trivial in Python as multi-processing async support is patchy. If you have to choose multi-processing vs async for HTTP requests, from experience, async is much faster (as your not CPU bound but IO bound) and easier to use.
I do both when I'm scaling to millions of requests per day. I also do some cpu bound preprocessing after the io bound async functions end.
Look into Go, specifically the colly pkg. Learnt Go just so I could use this library after wrestling with highly parralelized Python web scraping. The 1k/requests/sec/core figure is true and frankly deadly, and Go borrows a lot of useful stuff from Python like array slicing.

It was pretty remarkable to go from 32 cores being maxed for 100 concurrent workers to... A 10-15% usage spike on a 4 core XPS :)

https://github.com/gocolly/colly