by __turbobrew__ 1902 days ago
Makes sense, thank you. Wouldn't you want to use multiprocessing for the downloads instead of multithreading? I imagine you would run into the Python global interpreter lock (GIL) before being able to push through 40gb?

Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating away from boto3 and to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, golang is far more scalable.
Using multiprocessing, I've been able to quite easily saturate 20GBps EC2 NICs in Python: https://github.com/NewbiZ/s3pd

There is no reason why multiprocessing for IO in Python would use _crazily_ more memory than in another language, when done properly.
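To make "done properly" concrete, here is a minimal sketch of the chunked, multi-process pattern, not the actual s3pd or s3parcp implementation. To stay runnable without AWS credentials it copies byte ranges of a local file; the names `split_ranges`, `copy_range`, and `parallel_copy` are hypothetical, and in a real downloader the read inside `copy_range` would presumably be replaced by a ranged S3 GET. Each worker holds at most one chunk in memory at a time, so memory stays roughly bounded by `processes * chunk_size` rather than by the object size.

```python
import os
from multiprocessing import Pool


def split_ranges(total_size, chunk_size):
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def copy_range(job):
    """Worker: copy one byte range from src to the same offset in dst.

    Assumption: a real downloader would fetch this range over the network
    (for example with an HTTP Range request) instead of reading a local file.
    """
    src, dst, start, end = job
    with open(src, "rb") as fin:
        fin.seek(start)
        data = fin.read(end - start + 1)
    # dst is preallocated, so each worker writes at its own offset.
    with open(dst, "r+b") as fout:
        fout.seek(start)
        fout.write(data)
    return len(data)


def parallel_copy(src, dst, chunk_size=8 * 1024 * 1024, processes=4):
    """Copy src to dst using a pool of worker processes, one chunk per task."""
    total = os.path.getsize(src)
    # Preallocate the destination so workers can write independently.
    with open(dst, "wb") as fout:
        fout.truncate(total)
    jobs = [(src, dst, s, e) for s, e in split_ranges(total, chunk_size)]
    with Pool(processes) as pool:
        # Only small tuples and integers cross process boundaries, not the data.
        return sum(pool.imap_unordered(copy_range, jobs))


if __name__ == "__main__":
    import sys

    # Usage: python parallel_copy.py SOURCE DEST
    copied = parallel_copy(sys.argv[1], sys.argv[2])
    print(f"copied {copied} bytes")
```

Because each process has its own interpreter, the workers do not contend on a shared GIL, and the only data passed between processes is the job tuple and a byte count.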