by tonyplee 3481 days ago
I have 1/2 TB of pictures (kids' pics, videos from cell phones, SLR, etc.) backed up to multiple HDDs. If everyone put 1/2 TB or more on Google, can Google's backend really handle that, and if so for how long?

I'd also need to consider how long it would take me to download those pics back once they decide to shut down the "free" service.

2 comments

> If everyone put 1/2 TB or more on Google, can Google's backend really handle that, and if so for how long?

Not everyone is going to do that though (anytime soon anyway), so that's not a real concern. It's like asking, when Gmail first launched, "okay but what if EVERYONE uses the full gigabyte?"

I'll answer in reverse order:

> I'd also need to consider how long it would take me to download those pics back once they decide to shut down the "free" service.

This is an incredibly good point, both in terms of bandwidth considerations (particularly their rate limiting) and in terms of products randomly disappearing with limited takeout windows.

FWIW, https://get.google.com/albumarchive/<G+ UID> will net you takeout archives of your image albums. Incidentally this works with any Google account that doesn't have public photo access turned off, and is rather fun to play with (as is the site: search operator :D)

--

> can Google's backend really handle that, and if so for how long?

YouTube used to officially report that 300 hours are uploaded per minute, back in 2014. http://tubularinsights.com/hours-minute-uploaded-youtube/ says we're likely at 700hr/min now.

OK. (Been wanting to do this math for a while, actually...) Let's see. This is all back-of-the-envelope and I wouldn't mind some more concrete numbers to work with!

YT reencodes all videos into several formats.

I'm looking at http://youtu.be/1tQ5XwvjPmA, which is 1:20:58 long. It was uploaded fairly recently so has the full complement of encodings. I see:

- 5 DASH audio bitrates: 51k (27.53MB), 66k (31.93MB), and 120k (58.02MB) for clients that can decode Opus, 89k Vorbis (46.67MB), and 132k M4A (73.16MB)

- 6 DASH video sizes in both WebM/MP4 (so 12 total formats): 256x144 (43.09MB / 63.54MB); 426x240 (39.79MB / 140.34MB); 640x360 (71.80MB / 122.65MB); 854x480 (118.37MB / 266.00MB); 1280x720 (234.63MB / 548.81MB); and 1920x1080 (463.04MB / 1.05GB). (Yes, WebM is amazing compared to MP4.)

- Five legacy video formats: 176x144 3GP (39.51MB), 320x180 3GP (116.05MB), 640x360 WebM (211.30MB), 640x360 MP4 (205.97MB), and 1280x720 MP4 (621.68MB).

So, for this standard, 30fps 1080p video, YouTube is actually storing... 4.51GB of data. Huh! Nice.
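For anyone who wants to check that total, here is a minimal sketch that just adds up the sizes listed above. It assumes MB and GB are 1024-based and takes the 1.05GB entry as 1.05 * 1024 MB; the variable names are only for illustration.

```python
# Sizes in MB, copied from the listing above for the one sample video.
audio_mb = [27.53, 31.93, 58.02, 46.67, 73.16]
webm_mb = [43.09, 39.79, 71.80, 118.37, 234.63, 463.04]
# The 1080p MP4 is listed as 1.05GB; assumed here to mean 1.05 * 1024 MB.
mp4_mb = [63.54, 140.34, 122.65, 266.00, 548.81, 1.05 * 1024]
legacy_mb = [39.51, 116.05, 211.30, 205.97, 621.68]

total_mb = sum(audio_mb) + sum(webm_mb) + sum(mp4_mb) + sum(legacy_mb)
total_gb = total_mb / 1024

print(f"{total_mb:.2f} MB = {total_gb:.2f} GB")
```

Under those assumptions the sum comes to about 4619MB, which matches the 4.51GB figure.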

If this video is 1h20m, 1-(60/80) means I should subtract 25% from 4.51, and I get 3.38GB for one hour of video.

OK. Taking that figure of 700 hours... that's 2366GB (2.31TB) per minute :)

In other words YouTube needs to find disk capacity for 39.42GB of data every second.
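The chain of conversions above can be sketched like this. It follows the comment's own rounding at each step (1:20:58 treated as 80 minutes, intermediate values rounded to two decimals), so treat it as a reproduction of the envelope math rather than a precise figure.

```python
total_gb = 4.51             # all encodings of the sample video, from the listing above
video_minutes = 80          # 1:20:58 rounded down to 1h20m, as in the comment
upload_hours_per_min = 700  # the estimate quoted from tubularinsights

# Scale the sample video down to one hour of footage.
gb_per_video_hour = round(total_gb * 60 / video_minutes, 2)

# Storage needed for everything uploaded in one wall-clock minute.
gb_per_minute = gb_per_video_hour * upload_hours_per_min
tb_per_minute = round(gb_per_minute / 1024, 2)

# Same rate expressed per second, starting from the rounded TB figure.
gb_per_second = tb_per_minute * 1024 / 60

print(f"{gb_per_video_hour} GB per hour of video")
print(f"{gb_per_minute:.0f} GB ({tb_per_minute} TB) per minute")
print(f"{gb_per_second:.2f} GB per second")
```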

I'm not sure how to multiply by an increasing gradient with a back-of-the-envelope calculation, so I'll punt and pretend it was 700 hours/min all the way back to 2014, so the past 2 years. Quite inaccurate, but possibly still interesting:

(2.31 * (1024^4)) * 12 * 365 * 2 = 22249277495024025.60

Uhh.... that's... ah. 22PB. Err, 19.76PB to be precise.
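A quick sketch to sanity-check that expression. One thing worth flagging: the multiplier 12 * 365 * 2 is 8,760, whereas two years contain 60 * 24 * 365 * 2 = 1,051,200 minutes. If the 2.31TB figure is read as a per-minute rate, as derived above, the two-year total comes out about 120 times larger. The code below computes both readings side by side, assuming 1024-based units throughout.

```python
TIB = 1024 ** 4
PIB = 1024 ** 5
EIB = 1024 ** 6

rate_tb = 2.31  # TB per minute, from the estimate above

# The expression exactly as written in the comment.
as_written = rate_tb * TIB * 12 * 365 * 2
print(f"as written: {as_written / 1e15:.2f} decimal PB, {as_written / PIB:.2f} binary PB")

# Alternative reading: apply the per-minute rate to every minute in two years.
minutes_in_two_years = 60 * 24 * 365 * 2
per_minute_total = rate_tb * TIB * minutes_in_two_years
print(f"per-minute reading: {per_minute_total / PIB:.0f} binary PB "
      f"({per_minute_total / EIB:.2f} EB)")

print(f"ratio between the two: {per_minute_total / as_written:.0f}x")
```

The "22PB, err, 19.76PB" pair is the same number in decimal and binary petabytes. The per-minute reading lands in exabyte territory, so the figures that follow in the thread are best read as a lower bound.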

This is for the boring 30fps-and-under 1080p videos out there. Not the 60fps, 2K/4K/8K (!), 360° and similar stuff, and there's an increasing pile of that being uploaded.

22PB = total YouTube data need for the last two years.

1/2 TB per user (like me)

22PB = 44,000 users.

Google would need 1000 times that space in their data centers to handle 44 million users.

Also, I might think that 1/2 TB of data is very valuable, but only a little of it is interesting to a few of my friends and family members. It is probably very hard to monetize. Even I only browse it maybe a few times every few years.

If I were a PM for such a product and tried to propose that Alphabet build 1000 new YouTube-sized data centers to handle only 44 million users, I would have a hard time justifying it.
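The user arithmetic above, as a small sketch. It takes the 22PB figure at face value in decimal units (1PB = 1000TB), which is what makes the round 44,000 come out; the variable names are illustrative.

```python
youtube_two_year_tb = 22 * 1000  # 22PB expressed in TB (decimal units)
per_user_tb = 0.5                # 1/2 TB per user, as in the original question

users_per_youtube = youtube_two_year_tb / per_user_tb
scale = 1000                     # "1000 times that space"

print(f"{users_per_youtube:,.0f} users fit in 22PB")
print(f"{users_per_youtube * scale:,.0f} users need {scale}x that space")
```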

FWIW, I'm not familiar with how and where the Internet Archive gets their funding, but in 2014 they had 50PB of storage (https://archive.org/web/petabox.php). So IA can manage 50PB as a small-to-medium nonprofit. (Incidentally they've been running since '96.)

As for BackBlaze, also a medium-large business, they're now storing... https://www.backblaze.com/blog/200-petabytes-of-customer-dat...

Both IA and BackBlaze are private/nontraded, which means they have lower operating capital. Disk space is simply not that expensive now.

There's a guy on a DC++ filesharing server (find a server list - it's one of the biggest ones) who has been sharing 400TB of data for some time. Speaking of DC++, most newer clients show the total shared data for all users connected to the server you're on, and that number on some of those larger servers is usually 1-2PB.

I also saw a guy on reddit a while back who was in exactly the right place at the right time when his workplace was upgrading, and he now has a nice $200/mo electricity bill in the form of, you guessed it, 400TB of diskspace. I'm not sure if he got it all for free, but I think he may have.

So it's not a money problem; it's a space problem and a power problem. This is why flash storage is so interesting: it generates less heat, can be packed somewhat more densely, and uses less power too. Once Flash-vs-platter hits 49%/51% in terms of relative cost, things are going to get interesting.

At the moment the major retailers are just doing simple things like firmware customizations to run their disks at lower speeds (for nearline storage) or start up with the disk off and stuff like that. Facebook's cold storage datacenters also use Reed-Solomon encoding instead of RAID/ZFS, which gives redundancy with less space overhead.
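To make the space argument concrete, here is a rough sketch comparing the raw-storage overhead of plain replication with a Reed-Solomon style erasure code. The 10 data + 4 parity layout is purely an illustrative assumption, not a confirmed Facebook parameter, and the function names are made up for this example.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with N full copies."""
    return float(copies)


def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of user data with k data + m parity shards."""
    return (data_shards + parity_shards) / data_shards


# Hypothetical layouts, chosen only to illustrate the trade-off.
copies = 3
k, m = 10, 4

print(f"{copies}x replication: {replication_overhead(copies):.1f}x raw space, "
      f"survives {copies - 1} lost copies")
print(f"RS({k}+{m}): {erasure_overhead(k, m):.1f}x raw space, "
      f"survives {m} lost shards")
```

The trade-off is that rebuilding a lost shard means reading from several other disks, which is less of a problem for cold data that is rarely touched.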

I do think Google have actually done the kinds of allocations you speak of, using thin provisioning; after all, literally every new Google account gets 15GB of disk space! And then there's sync profile data, whatever internal metadata is associated with the account (such as your search history), etc., that needs to be stored too.

I fully believe Google have multiple exabyte-scale datacenters. If they don't I'll be genuinely surprised.

Using thin provisioning (which is ultimately just "how much are they really using, and how can we encourage them not to use more than X") is how they manage it.

So you're right - actually provisioning enough free storage for these users would definitely be an unpleasant task. But they carefully balance what everyone uses with what they have available.
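A toy sketch of what thin provisioning means in numbers. Only the 15GB quota comes from the discussion above; the account count and average usage are made-up placeholders, not real Google figures.

```python
def provisioning(accounts: int, quota_gb: float, avg_used_gb: float):
    """Return (promised GB, actually used GB, oversubscription ratio)."""
    promised = accounts * quota_gb
    used = accounts * avg_used_gb
    return promised, used, promised / used


# Hypothetical inputs: 1 million accounts using 1.5GB each on average.
promised, used, ratio = provisioning(accounts=1_000_000, quota_gb=15, avg_used_gb=1.5)

print(f"promised: {promised / 1e6:.1f} PB (decimal)")
print(f"in use:   {used / 1e6:.1f} PB (decimal)")
print(f"oversubscription: {ratio:.0f}x")
```

The provider only has to buy disks for the "in use" line plus some headroom, and can watch the ratio over time to decide when to add capacity or nudge quotas.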

This kind of high quality, high effort comment is why I love this site so much. Thanks for crunching the numbers and making me drop my jaw at the amount of data.

Just upvoting you doesn't suffice today.

Or to look at it another way, not even 20% of a single AWS snowmobile: https://aws.amazon.com/snowmobile/

Of course, they presumably need to duplicate it for redundancy too, so maybe a full 2/3s of one!
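The Snowmobile comparison as a sketch, using the 100PB-per-truck capacity AWS quotes and the 19.76PB estimate from above. The factor of three copies is an assumption to illustrate the redundancy remark, not something stated in the thread.

```python
SNOWMOBILE_PB = 100   # AWS quotes up to 100PB per Snowmobile
estimate_pb = 19.76   # two-year estimate from the comment above
copies = 3            # assumed replication factor, for illustration only

single = estimate_pb / SNOWMOBILE_PB
replicated = single * copies

print(f"one copy:     {single:.1%} of a Snowmobile")
print(f"{copies} copies:     {replicated:.1%} of a Snowmobile")
```

With three copies that comes out a bit under 60%, in the same ballpark as the "2/3s" guess.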