Hacker News
by dilyevsky 1006 days ago
> So if you pay for 1GB of storage, they probably actually store 5GB of data or more for you

The actual factor is most likely around 1.4-1.5x, and for sure can’t be more than 2.2x in this day and age. The dumbest possible implementation would be “only” 3x, so no, it’s nowhere close to 5 GB.

Edit: looks like it’s public, so I can actually tell you that Google uses RS 3,2, which gives a 1.5x replication factor. When I was there a few years ago, storage folks told me they had never lost a single stripe of data.
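
To make the 1.5x and 3x figures concrete, here is a minimal sketch of the overhead arithmetic. It assumes "RS 3,2" is read as 3 total blocks per stripe, 2 of them data; the function names are hypothetical and only for illustration.

```python
def storage_overhead(total_blocks: int, data_blocks: int) -> float:
    """Raw bytes stored per byte of user data for an (n, k) code.

    Assumption: n = total blocks per stripe, k = data blocks per stripe.
    """
    if data_blocks <= 0 or total_blocks < data_blocks:
        raise ValueError("need 0 < data_blocks <= total_blocks")
    return total_blocks / data_blocks


def tolerated_losses(total_blocks: int, data_blocks: int) -> int:
    """Blocks that can be lost per stripe before data becomes unrecoverable."""
    return total_blocks - data_blocks


print(storage_overhead(3, 2))  # 1.5  (RS 3,2 as read above)
print(storage_overhead(3, 1))  # 3.0  (plain three-way replication)
print(tolerated_losses(3, 2))  # 1
```

Under this reading, 1 GB of user data occupies about 1.5 GB of raw capacity, compared with 3 GB for three full copies.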

5 comments

Those numbers are for Colossus. Blobstore, which backs the cloud object store, is different, and used to be a lot higher.
Yeah, those aren’t public afaik. IIRC they adjusted replication for cold data.
Presumably Google also keeps backups, though, right?
And mirrors to at least one extra datacenter, since a datacenter can lose bandwidth when a fiber gets cut, become unreachable due to a networking snafu, or even burn down entirely.
Yes they also store to tape
That means 2X at a minimum. Then RAID overhead. Possibly other hot or warm copies ready to take over instantly.
And tape costs the same as SSDs/spindles in your calculations?
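
The back-and-forth above is really about how the factors compound. A rough sketch follows, assuming the overheads multiply across mirrored sites and that a tape copy is weighted by its cost relative to disk. The 0.2 tape cost ratio is a made-up placeholder, not a real price, and `disk_equivalent_factor` is a hypothetical helper.

```python
def disk_equivalent_factor(
    sites: int,
    per_site_overhead: float,
    tape_copies: int = 0,
    tape_cost_ratio: float = 0.0,
) -> float:
    """Cost of storing 1 byte of user data, in units of 'raw disk bytes'.

    sites             -- number of datacenters holding an online copy
    per_site_overhead -- erasure-coding / RAID overhead within each site
    tape_copies       -- number of offline tape copies
    tape_cost_ratio   -- assumed cost of a tape byte relative to a disk byte
    """
    online = sites * per_site_overhead
    offline = tape_copies * tape_cost_ratio
    return online + offline


# Two mirrored sites at 1.5x each: 3x raw disk, before any backups.
print(disk_equivalent_factor(sites=2, per_site_overhead=1.5))  # 3.0

# Same, plus one tape copy at a hypothetical 0.2x the cost of disk.
print(disk_equivalent_factor(2, 1.5, tape_copies=1, tape_cost_ratio=0.2))
```

If tape were priced the same as disk (ratio 1.0), the second case would come to 4x; the point of the question above is that it usually is not.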
A common problem would be throughput though. Storage capacity scales much faster than access speed. If you store an item only 3 times and, let's say, each storage location gives you 50,000 IOPS max, then you can only ever serve 150,000 IOPS for that item, which might not be enough.
Vast majority of data rarely (if ever) gets read so you use a cache for that
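
The throughput ceiling and the cache reply can be put into numbers. This is a sketch only: the 50,000 IOPS figure comes from the comment above, while the 1,000,000 IOPS demand and both function names are hypothetical.

```python
def max_read_iops(copies: int, iops_per_location: int) -> int:
    """Upper bound on reads/s for one item if every copy can serve reads."""
    return copies * iops_per_location


def min_cache_hit_rate(demand_iops: float, backend_ceiling: float) -> float:
    """Smallest cache hit rate that keeps backend load at or under its ceiling."""
    if demand_iops <= backend_ceiling:
        return 0.0
    return 1.0 - backend_ceiling / demand_iops


ceiling = max_read_iops(copies=3, iops_per_location=50_000)
print(ceiling)  # 150000

# Hypothetical hot item receiving 1,000,000 reads/s:
print(min_cache_hit_rate(1_000_000, ceiling))
```

With those assumed numbers the cache would need to absorb roughly 85% of reads for that one item, which is plausible for hot data and irrelevant for the cold majority.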
What does RS 3,2 stand for? Thanks
Reed-Solomon erasure coding: 2 data blocks, one parity. Basically RAID-5 (single parity), but distributed.
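
For intuition about "2 data blocks, one parity", here is a minimal sketch using plain XOR parity. This is a simplification standing in for real Reed-Solomon (which works over a finite field and generalizes to more parity blocks), and it is not how any particular storage system is implemented. All names are hypothetical.

```python
from typing import List, Optional, Tuple


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(d0: bytes, d1: bytes) -> List[bytes]:
    """Build a 3-block stripe: two data blocks plus one XOR parity block."""
    return [d0, d1, xor_blocks(d0, d1)]


def recover(stripe: List[Optional[bytes]]) -> Tuple[bytes, bytes]:
    """Return (d0, d1) from a stripe with at most one missing block (None)."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost block")
    if missing:
        # XOR of the two surviving blocks reproduces the lost one.
        survivors = [block for block in stripe if block is not None]
        stripe = list(stripe)
        stripe[missing[0]] = xor_blocks(survivors[0], survivors[1])
    return stripe[0], stripe[1]


stripe = encode(b"abcd", b"wxyz")
stripe[0] = None  # simulate losing the machine holding the first data block
print(recover(stripe))  # (b'abcd', b'wxyz')
```

Each of the three blocks would live on a different machine, so any single failure is survivable while storing only 1.5x the original data.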
But that's within one cell... Data will be stored in more than one cell to deal with scheduled and unscheduled downtime of the cell...
You have to pay more for that
Not with GCS, you don't.

https://cloud.google.com/storage/docs/availability-durabilit...

Are you thinking of Persistent Disk (PD) and Replicated PD?

(I work on storage at Google.)

I was thinking of GCP regions, in which case you do have to pay for it. For Colossus cells within a single region you obviously don't, but I don't know enough about how it maps out down there, or whether it just moves data around in the event of PCR.