by chickenpotpie 2438 days ago
I've actually been working on a library to help mitigate cloud storage lock-in. The idea is to treat cloud storage providers the way RAID treats disks. For example, say you have 3 separate cloud providers. Providers 1 and 2 have every other byte of data striped across them, and provider 3 holds parity data. To pull a file you only need 2 of the 3 providers. If you don't like how a provider is treating you or charging you, just pull from the other 2 and keep that one only as a backup in case one of them goes down. You can also remove it from the equation entirely, but then you have no redundancy if one of the others goes down. It gives you a lot of negotiating power to lower egress costs, because you can drop a provider at any time and reinstate it once you get better pricing.
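To make the striping idea concrete, here is a rough sketch of byte-level striping with XOR parity across three stores. This is not the library's actual code; the function names `stripe_with_parity` and `rebuild` are made up for illustration, and it assumes the original file length is kept somewhere as metadata.

```python
from typing import Optional, Tuple


def stripe_with_parity(data: bytes) -> Tuple[bytes, bytes, bytes]:
    """Split data into two byte-interleaved stripes plus an XOR parity stripe."""
    even = data[0::2]  # bytes 0, 2, 4, ... (would go to provider 1)
    odd = data[1::2]   # bytes 1, 3, 5, ... (would go to provider 2)
    # Pad the shorter stripe with zeros so the XOR lines up byte for byte.
    odd_padded = odd.ljust(len(even), b"\x00")
    parity = bytes(a ^ b for a, b in zip(even, odd_padded))  # provider 3
    return even, odd, parity


def rebuild(
    even: Optional[bytes],
    odd: Optional[bytes],
    parity: Optional[bytes],
    length: int,
) -> bytes:
    """Reassemble the original bytes from any two of the three stripes.

    Pass None for the stripe that is unavailable. `length` is the original
    byte count (assumed to be stored as metadata).
    """
    if sum(part is None for part in (even, odd, parity)) > 1:
        raise ValueError("need at least two of the three stripes")
    even_len = (length + 1) // 2
    odd_len = length // 2
    if even is None:
        # even = parity XOR odd
        odd_padded = odd.ljust(even_len, b"\x00")
        even = bytes(p ^ b for p, b in zip(parity, odd_padded))
    elif odd is None:
        # odd = parity XOR even, minus any padding byte
        odd = bytes(p ^ a for p, a in zip(parity, even))[:odd_len]
    out = bytearray(length)
    out[0::2] = even
    out[1::2] = odd
    return bytes(out)
```

Usage would look like `even, odd, parity = stripe_with_parity(blob)`, then `rebuild(even, None, parity, len(blob))` if provider 2 is the one you want to skip.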
4 comments

> every other byte of data striped across

I assume you didn't mean that literally, because I can't see how that will ever work out in terms of CPU cost. I think breaking it up into blocks the way RAID 4/5/6 do would be better, but it will still impact the performance of reads.

The performance of writes is going to be worse. Not because of the parity calculation but because you will be taking the max latency over all the cloud providers.

I can't see people trading off that much performance for better fault tolerance (in a world where S3 guarantees 11 nines) or ease of switching.

Yeah, RAID 4/5/6 are planned for the future. The plan is to offer all of them and let developers choose what works best for their application. RAID 0/2/3 are not CPU efficient, but are great for privacy and security. No cloud provider has the full picture, so none of them can spy on your data, and if one has a data leak nothing useful gets out. RAID 1 gives great fault tolerance with no extra latency (except on failures) and prevents vendor lock-in.
Pretty sure paying 3 cloud providers is more expensive than 1 cloud provider
It's actually cheaper when you want globally redundant storage. Cloud providers often charge twice as much for globally redundant data. RAID 2, 3, and 4 offer global redundancy, but only take up 1.5 times as much space. Instead of paying twice as much you only end up paying 1.5 times as much, because you can get away with locally redundant pricing. If you're large enough you'll actually save money by having more negotiating power, since you can walk away from a provider at any given time.
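A back-of-the-envelope version of that arithmetic, as a sketch. The 2x geo-redundancy multiplier is the rough figure from the comment above, the per-GB price is a made-up placeholder, and egress and per-request fees are ignored entirely.

```python
def parity_overhead(data_providers: int, parity_providers: int = 1) -> float:
    """Total stored bytes divided by original bytes for striping plus parity."""
    return (data_providers + parity_providers) / data_providers


def monthly_storage_cost(size_gb: float, price_per_gb: float, multiplier: float) -> float:
    """Storage-only cost; egress and request fees are deliberately left out."""
    return size_gb * price_per_gb * multiplier


# Hypothetical numbers, for illustration only.
SIZE_GB = 1000.0
LOCAL_PRICE_PER_GB = 0.02   # assumed locally redundant price
GEO_MULTIPLIER = 2.0        # "often charge twice as much" for geo-redundancy

# Two data providers plus one parity provider stores 1.5x the original data.
striped = monthly_storage_cost(SIZE_GB, LOCAL_PRICE_PER_GB, parity_overhead(2))
single_geo = monthly_storage_cost(SIZE_GB, LOCAL_PRICE_PER_GB, GEO_MULTIPLIER)
```

With those assumptions `striped` comes out at three quarters of `single_geo`. Real pricing differs per provider, so treat this as the shape of the argument rather than a quote.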
Sounds interesting. Link?

I'd probably use commodity VMs for this rather than big clouds if it is indeed resilient.

Are you doing this to replace a CDN? There are already 3rd party CDNs like CloudFlare.

If you are doing it as a replacement for traffic within an AWS region and availability zone, it seems like you will be both more expensive and have much higher latency.

Or is the application something else entirely?

It's something else entirely. It's a mixed cloud approach combining the storage offerings of Azure, Google Cloud, and S3 providers. The idea is not to trust one cloud provider to provide fair pricing and proper redundancy. Right now I'm mirroring RAID 0, 1, and 3. Applying RAID 3 to the cloud is going to give you higher latency and more processor and memory usage, because the file has to be reassembled on the client machine. However, if you apply RAID 1 to the cloud your latency is similar, because each cloud provider has the full file. In the case of RAID 1 the library will upload a full copy to each cloud provider and will download files by trying providers until one succeeds. If you only use two providers your pricing is usually about the same as one geo-redundant provider, because geo-redundancy in storage providers is often twice the cost and you're getting geo-redundancy built in by having multiple providers in different regions. RAID 3 is actually cheaper because you have geo-redundancy, but you're only storing 1.5 times as much data.
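The RAID 1 behaviour described here (full copy to every provider, download by trying providers until one succeeds) could be sketched roughly like this. The `Provider` interface, `InMemoryProvider`, `mirror_upload`, and `mirror_download` are all hypothetical names for illustration, not the library's API; a real version would wrap each vendor's SDK behind the same two methods.

```python
from typing import Dict, List, Protocol, Sequence


class Provider(Protocol):
    """Hypothetical minimal interface a storage backend would expose."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryProvider:
    """Stand-in for a real cloud bucket so the sketch runs without credentials."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        return self._blobs[key]


def mirror_upload(providers: Sequence[Provider], key: str, data: bytes) -> None:
    """Write a full copy of the object to every provider."""
    for provider in providers:
        provider.put(key, data)


def mirror_download(providers: Sequence[Provider], key: str) -> bytes:
    """Return the object from the first provider that answers."""
    errors: List[Exception] = []
    for provider in providers:
        try:
            return provider.get(key)
        except Exception as exc:  # broad on purpose: any failure means try the next one
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} providers failed for {key!r}")
```

Ordering the `providers` list by egress price would be one way to get the "pull from the cheaper ones first" behaviour, with the expensive provider only hit on failure.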
Sounds a little like gluster?
Sounds exactly like what Tahoe-LAFS is for.
Yeah, it's pretty similar. However, I'm focusing entirely on the cloud, keeping the package lightweight, and letting the consuming application decide how to store the data based on its needs.
Currently use this across cloud providers.