HN Mirror


by banana_giraffe 1902 days ago

Yes, parallel downloads are faster in some cases. There are even extreme cases, mostly on giant EC2 nodes, where you can take this idea a step further and spin up multiple processes that each download part of a file, and really saturate your network or disk.

My favorite version of this is when you start to use shared memory of some fashion to move terabytes of data from S3 to EC2 to work on it without ever hitting a disk.
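A rough sketch of the shared-memory part, using Python's standard `multiprocessing.shared_memory` module. Each worker attaches to a named block and writes its slice at its own offset, so the assembled data never touches disk. `write_part` and `assemble_in_shared_memory` are hypothetical names for illustration, and the workers run in-process here to keep the example short; in a real setup each `write_part` call would presumably run in its own process with data it downloaded itself.

```python
from multiprocessing import shared_memory
from typing import List, Tuple


def write_part(shm_name: str, offset: int, data: bytes) -> None:
    """Attach to an existing shared block by name and copy `data` in at `offset`."""
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        shm.buf[offset:offset + len(data)] = data
    finally:
        # Detach this handle only; the creator is responsible for unlink().
        shm.close()


def assemble_in_shared_memory(
    parts: List[Tuple[int, bytes]],
) -> shared_memory.SharedMemory:
    """Create one block big enough for all parts and fill it, part by part.

    `parts` is a list of (offset, data) pairs and may arrive in any order.
    The caller must close() and unlink() the returned block when done.
    """
    total = sum(len(data) for _, data in parts)
    shm = shared_memory.SharedMemory(create=True, size=total)
    for offset, data in parts:
        # Assumption: in production this call runs inside a worker process.
        write_part(shm.name, offset, data)
    return shm
```

The block may be rounded up to a page boundary on some platforms, so readers should slice it to the known total size rather than rely on `len(shm.buf)`.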

Not for everyone, and for sure many times the extra milliseconds saved won't matter, but sometimes you really do need to get hundreds of gigabytes or terabytes of data moved as quickly as possible.

2 comments

Galanwe 1902 days ago

That's what my small Python package does (https://github.com/NewbiZ/s3pd). It's definitely not perfect (it doesn't saturate cores, because there's no event loop per process), but the download is split across multiple processes and stored in shared memory.

I've been able to saturate 20GB NICs on EC2 with it (32 cores).


killingtime74 1902 days ago

You just described Apache Spark
