| HN Mirror


by vtuulos 1902 days ago

It is a bit of a hidden gem, but Metaflow includes a Boto-based, highly parallelized, error-tolerant S3 client that Netflix routinely uses to get 10-20 Gbps of throughput between EC2 and S3.

Technically it is independent of Metaflow, so you could use it as a stand-alone, high-performance S3 client.

See docs here https://docs.metaflow.org/metaflow/data#store-and-load-objec...

And code here https://github.com/Netflix/metaflow/tree/master/metaflow/dat...

(I wrote it originally - AMA if curious)

2 comments

nicornk 1902 days ago

Really interesting going through your code, thanks for sharing. Why did you opt for running the parallelization code in a separate python process (s3op)?

You might want to update the section "Caution: Overwriting data in S3" in the docs, since S3 has offered strong read-after-write consistency since December 2020.

https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-rea...

vtuulos 1902 days ago

re: separate process - fault tolerance is a key requirement. There are myriad ways in which highly parallelized, network- and data-intensive code can fail, so isolating it in a separate process is safer than trying to try-except everything and hoping it works.
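To make the idea concrete, here is a minimal sketch of the general pattern, not the actual s3op code. It assumes a hypothetical worker that fans work out to threads inside its own Python interpreter, while the parent only looks at the exit code and the output. The names `WORKER_SOURCE`, `handle`, and `run_isolated` are made up for illustration, and the "work" is just measuring string lengths rather than talking to S3.

```python
import json
import subprocess
import sys

# Hypothetical worker program, run in its own interpreter. It stands in for
# the parallel, network-heavy part; here it only measures string lengths.
WORKER_SOURCE = r"""
import json, os, sys
from concurrent.futures import ThreadPoolExecutor

def handle(item):
    if item == "crash":
        os._exit(70)  # simulate a hard failure that try-except cannot catch
    return {"item": item, "size": len(item)}

items = json.load(sys.stdin)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, items))
json.dump(results, sys.stdout)
"""


def run_isolated(items, timeout=30):
    """Run the worker in a child process and return (ok, payload).

    On success, payload is the list of results decoded from the worker's
    stdout. On failure, payload is a short description of what went wrong.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=json.dumps(items),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "worker timed out"
    if proc.returncode != 0:
        # The child died, but the parent is intact and can retry or report.
        return False, "worker exited with code %d" % proc.returncode
    return True, json.loads(proc.stdout)
```

Calling `run_isolated(["a", "crash"])` returns a failure tuple instead of killing the caller, which is the whole point: the parent has a single failure path (non-zero exit or timeout) to handle, no matter what went wrong inside the worker.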

Good catch re: the warning about consistency! The docs were written before the change :)

brian_herman 1902 days ago

Can you do a blog post, please?

vtuulos 1902 days ago

The topic is close to my heart so maybe one day :)
