by aloer 1902 days ago
An extreme case is when you make heavy use of S3 for EC2 workloads and want to saturate your instance's network connection. This can go up to 40 Gbps afaik.

Edit: I read this recently, and if I remember correctly there's a limit of something like a thousand parallel connections to S3.

AWS internally probably has higher limits for some of their own services, e.g. when you query data in S3 with Athena.


Makes sense, thank you. Wouldn't you want to multiprocess the downloads instead of multithreading them? I imagine you would run into the Python global interpreter lock before being able to push through 40 Gbps?
Yes, you would. The CPU and memory overhead of multiprocessing for this application is why we ended up migrating away from boto3 and to the AWS Go SDK for this specific purpose (https://github.com/chanzuckerberg/s3parcp as I mentioned in another comment). We still use boto3 in other areas, but for maxing out the network connection, golang is far more scalable.
Using multiprocessing, I've been able to quite easily saturate a 20 Gbps EC2 NIC in Python. https://github.com/NewbiZ/s3pd

There is no reason why multiprocessing for IO in Python would use _crazily_ more memory than in another language, when done properly.
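
For anyone wondering what the multiprocessing approach looks like in practice, here is a minimal sketch. It is not how s3pd or s3parcp are implemented; it only illustrates the general idea of splitting an object into byte ranges and fetching each range in a separate process so the GIL is not shared. The helper names (`split_ranges`, `fetch_range`, `parallel_download`) are made up for this example, and the worker reads from a local file as a stand-in for an S3 ranged GET so the code runs without credentials.

```python
import multiprocessing as mp
import os
import tempfile


def split_ranges(size, chunk_size):
    """Return inclusive (start, end) byte ranges covering [0, size)."""
    return [(start, min(start + chunk_size, size) - 1)
            for start in range(0, size, chunk_size)]


def fetch_range(task):
    """Hypothetical worker: read one byte range from a local file.

    Assumption: a real worker would issue an S3 ranged GET instead,
    e.g. with boto3:
        client.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")
    """
    path, start, end = task
    with open(path, "rb") as f:
        f.seek(start)
        return start, f.read(end - start + 1)


def parallel_download(path, size, chunk_size, processes):
    """Fetch every range in a pool of processes and reassemble the bytes."""
    tasks = [(path, s, e) for s, e in split_ranges(size, chunk_size)]
    with mp.Pool(processes) as pool:
        # pool.map returns results in the same order as the tasks.
        parts = pool.map(fetch_range, tasks)
    return b"".join(data for _, data in parts)


if __name__ == "__main__":
    payload = os.urandom(1_000_000)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        result = parallel_download(tmp.name, len(payload), 128 * 1024, 4)
        print(result == payload)
    finally:
        os.unlink(tmp.name)
```

Note that returning each chunk from `pool.map` copies the data back to the parent process, which costs memory and CPU. To keep the overhead down, a real tool would more likely have each worker write its range straight to the destination file at the right offset.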