by mecredis 997 days ago
It's kind of wild that these tools just transfer a copy of these models every time they're spun up (whether it's to a Google Colab notebook or a local machine).

This must mean Hugging Face's bandwidth bill is crazy, or am I missing something (maybe they have a peering agreement, or cache things heavily)?

6 comments

Their Python module caches the downloads and checks that cache before downloading them again, but you're probably not wrong about the crazy bandwidth bill. Looks like they have crazy VC money though, considering the current climate.
The Colab notebooks are a fresh and independent session with no caching.
Google might cache further up the chain, which could help
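For anyone wondering what "check the cache before downloading" amounts to, here is a minimal sketch. It is not Hugging Face's actual implementation: `cached_fetch`, `fake_download`, and the example URL are hypothetical names made up for illustration, and keying the cache on a hash of the URL is a simplifying assumption.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: key the entry on a hash of the URL. Real tools typically
    # also track revisions or ETags so that updated files are re-fetched.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / key
    if not path.exists():
        # Cache miss: pay for the transfer once and keep the bytes on disk.
        path.write_bytes(download(url))
    return path


# Stand-in downloader that records how often it is actually called.
calls = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b"model weights"


demo_cache = Path(tempfile.mkdtemp())
first = cached_fetch("https://example.com/model.bin", demo_cache, fake_download)
second = cached_fetch("https://example.com/model.bin", demo_cache, fake_download)
print(len(calls))  # the second call is served from disk
```

A fresh Colab session starts with an empty cache directory, so the first request there always takes the download path, which is the point made above.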
Unmetered 10+ gigabit connections were on the order of $1/Mbit/mo wholesale over a decade ago, when I priced out a custom CDN. So for the cost of 100 TB of data transfer out of AWS, you could get a sustained 24/7 10 Gbit/s line (>3 PB per month at 100% utilization).

Bandwidth has always been crazy cheap.

Not all connections are created equal. Even some big providers clearly have iffy peering agreements upstream that’ll manifest as terrible performance if you have a widely-geographically-distributed bandwidth-heavy load.
Indeed. If you're not using a cloud provider, bandwidth is extremely cheap.

In fact locally I can get a 10 gbps home internet unmetered connection for $300/mo.

I'm not sure how they'd react if I transferred 1 PB/mo though :)

That’s pretty expensive. Sonic offers 1-10gbps (depending on where you live) unmetered symmetric connections for $60/mo to the Bay Area… they’re also the only ISP that petitioned the FCC in favor of net neutrality.

For work I end up transferring 50-150 gigs often, sometimes daily. Never heard a word from them that this has been a problem.

That's pretty cool, but I'd put it the other way around: Sonic is crazy cheap.
Is my math wrong here? 10 gbps -> 8s per 10 GB -> 800s per 1TB -> 80,000s per 1PB -> 22.3 hrs at full speed for 1 PB?
If you search "1pib/(10 gbps)" on google, you'll get 10.4 days.

An unmetered 10G port at a US data center is ~$1500/mo. Not particularly expensive

800,000s per 1PB, so you're off by a factor of 10.
Thanks!
Fully saturated you could transfer a few petabytes per month on a 10gig line.
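The arithmetic in this sub-thread is easy to check with a few lines. This is just a sketch assuming a perfectly saturated link with no protocol overhead; the function and constant names are mine.

```python
def transfer_seconds(num_bytes: float, link_bits_per_second: float) -> float:
    """Seconds needed to move num_bytes over a fully saturated link."""
    return num_bytes * 8 / link_bits_per_second


PB = 10**15               # decimal petabyte, in bytes
PIB = 2**50               # binary pebibyte, in bytes
LINK = 10 * 10**9         # 10 Gbit/s, in bits per second
SECONDS_PER_DAY = 86_400

secs_pb = transfer_seconds(PB, LINK)
days_pb = secs_pb / SECONDS_PER_DAY
days_pib = transfer_seconds(PIB, LINK) / SECONDS_PER_DAY
# Bytes moved in a 30-day month at 100% utilization, expressed in PB.
pb_per_30_days = LINK / 8 * 30 * SECONDS_PER_DAY / PB

print(f"1 PB:  {secs_pb:,.0f} s = {days_pb:.2f} days")
print(f"1 PiB: {days_pib:.2f} days")
print(f"30 days at line rate: {pb_per_30_days:.2f} PB")
```

So 1 PB takes 800,000 s (a bit over 9 days), the 10.4 days figure is for 1 PiB, and a saturated 10 Gbit/s line moves a little over 3 PB in a 30-day month.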
If you host copies of your data with a few big providers, could you do something smart like detect requests coming from AWS and redirect them to an S3 bucket, so you don't pay for bandwidth leaving the provider?
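A rough sketch of how that detection could work: AWS publishes its address ranges as a JSON file (ip-ranges.json) with a `prefixes` list of `ip_prefix`/`region`/`service` entries, so a server can check the client IP against those prefixes. Everything else here is an assumption for illustration: the prefixes below are documentation-only placeholder ranges rather than real AWS ranges, the bucket URLs are hypothetical, and whether this actually avoids egress charges depends on AWS's pricing for the specific setup.

```python
import ipaddress

# Placeholder data in the shape of AWS's published ip-ranges.json.
# These prefixes are reserved documentation ranges, NOT real AWS ranges.
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "EC2"},
        {"ip_prefix": "198.51.100.0/24", "region": "eu-west-1", "service": "EC2"},
    ]
}

# Hypothetical per-region mirrors of the same files.
BUCKETS = {"us-east-1": "https://models-us-east-1.example.com"}
DEFAULT_ORIGIN = "https://cdn.example.com"


def build_networks(ranges: dict) -> list:
    """Parse the prefix list into (network, region) pairs."""
    return [
        (ipaddress.ip_network(p["ip_prefix"]), p["region"])
        for p in ranges["prefixes"]
    ]


def choose_origin(client_ip: str, networks: list, buckets: dict, default: str) -> str:
    """Send clients inside a known region to that region's mirror, else the default."""
    addr = ipaddress.ip_address(client_ip)
    for network, region in networks:
        if addr in network and region in buckets:
            return buckets[region]
    return default


networks = build_networks(SAMPLE_RANGES)
in_region = choose_origin("192.0.2.17", networks, BUCKETS, DEFAULT_ORIGIN)
no_mirror = choose_origin("198.51.100.5", networks, BUCKETS, DEFAULT_ORIGIN)
outside = choose_origin("203.0.113.9", networks, BUCKETS, DEFAULT_ORIGIN)
print(in_region, no_mirror, outside)
```

A linear scan is fine for a sketch; with the full published list you would probably want a prefix tree or at least a cached lookup.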
Huggingface has a strategic partnership with AWS.

1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.

2. Huggingface probably does face insane bills compared to github. But AWS can probably develop some optimizations to save bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.

AWS's egress charges are just outrageous, so they can easily offer a huge discount without any technical improvement.
One doesn't usually opt for AWS when their goal is to reduce transfer costs.
Unless AWS agrees not to charge you transfer costs, which they often do for open source and software projects like this.
Is Hugging Face just a model repository, the way GitHub is a code repository? It seems you can rent compute (both CPU and GPU), but you are right that most models seem to be run elsewhere.
Yes, exactly.
I really wish I could configure this crap to cache somewhere other than my C: drive

Or better yet, how about asking me where I want to store my models?

On Linux there's the XDG_CACHE_HOME env variable for pip, but strangely enough there doesn't seem to be a Windows equivalent.
I haven’t used windows in a while but I thought it supported some form of cross-volume symlink? Or at least mounting an image stored on another volume to an arbitrary path.
Links in windows are a thing, but not well known. I must have been using Windows for close to 20 years before I realized they were in there.

https://learn.microsoft.com/en-us/windows/win32/fileio/hard-...

https://learn.microsoft.com/en-us/windows-server/administrat...

So not-well-known that several tools that really should know better don't check for junctions with occasionally disastrous results in a fs walk. (Using junctions sounded really clever to me until this had me up all night figuring out why the backup system crashed.)
mklink /d on windows has saved me many times.
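The link trick can also be scripted. Below is a minimal sketch in Python that moves a cache directory to another location and leaves a directory symlink behind, roughly what `mklink /d` does by hand. The helper name and the demo paths are made up for illustration. On Windows, creating symlinks normally needs an elevated prompt or Developer Mode, and as noted above, some tools that walk the filesystem handle links badly.

```python
import os
import shutil
import tempfile
from pathlib import Path


def relocate_with_symlink(cache_dir: Path, new_home: Path) -> Path:
    """Move cache_dir under new_home and leave a directory symlink at the old path."""
    new_home.mkdir(parents=True, exist_ok=True)
    target = new_home / cache_dir.name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(cache_dir), str(target))
    # target_is_directory matters on Windows and is ignored elsewhere.
    os.symlink(target, cache_dir, target_is_directory=True)
    return target


# Demo in a temp directory; "c_drive" and "d_drive" are stand-ins for two volumes.
root = Path(tempfile.mkdtemp())
old_cache = root / "c_drive" / "cache"
old_cache.mkdir(parents=True)
(old_cache / "model.bin").write_bytes(b"weights")

moved_to = relocate_with_symlink(old_cache, root / "d_drive")
print(old_cache.is_symlink(), (old_cache / "model.bin").read_bytes())
```

Tools that read the old path keep working because the link resolves to the new location.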
You can do a lot of these fully locally with things like RVC web ui or https://tryreplay.io/
https://fakeyou.com has unlimited free RVC without an account. The UI needs work, though.
wish they had something for Linux