| HN Mirror



	by chickenpotpie 2438 days ago
	I've actually been working on a library to help mitigate cloud storage lock-in. The idea is to treat cloud storage providers the way disks are treated in RAID. For example, say you have 3 separate cloud providers. Providers 1 and 2 have every other byte of data striped across them, and provider 3 holds parity data. To pull a file you only need any 2 of the 3 providers. If you don't like how a provider is treating you or charging you, just pull from the other 2 and keep the third only as a backup in case one of them goes down. You can also remove it from the equation entirely, but then you have no redundancy if one of the others goes down. It gives you a lot of negotiating power to lower egress costs, because you can drop a provider at any time and reinstate it once you get better pricing.
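A minimal sketch of the 2-data-plus-1-parity layout described above, assuming XOR parity and byte-level interleaving. The function names `stripe` and `recover` are made up for illustration and are not the library's API; each returned stripe would be uploaded to a different provider.

```python
from typing import Optional, Tuple


def stripe(data: bytes) -> Tuple[bytes, bytes, bytes, int]:
    """Split data into two byte-interleaved stripes plus an XOR parity stripe.

    Returns (stripe_a, stripe_b, parity, original_length). The length is kept
    so that the padding byte can be dropped again on recovery.
    """
    length = len(data)
    padded = data + b"\x00" * (length % 2)  # pad to an even number of bytes
    stripe_a = padded[0::2]  # bytes 0, 2, 4, ...
    stripe_b = padded[1::2]  # bytes 1, 3, 5, ...
    parity = bytes(x ^ y for x, y in zip(stripe_a, stripe_b))
    return stripe_a, stripe_b, parity, length


def recover(
    stripe_a: Optional[bytes],
    stripe_b: Optional[bytes],
    parity: Optional[bytes],
    length: int,
) -> bytes:
    """Rebuild the original bytes from any two stripes (pass None for the missing one)."""
    if sum(s is None for s in (stripe_a, stripe_b, parity)) > 1:
        raise ValueError("need at least two of the three stripes")
    if stripe_a is None:
        stripe_a = bytes(x ^ y for x, y in zip(stripe_b, parity))
    elif stripe_b is None:
        stripe_b = bytes(x ^ y for x, y in zip(stripe_a, parity))
    out = bytearray(len(stripe_a) * 2)
    out[0::2] = stripe_a  # re-interleave the two data stripes
    out[1::2] = stripe_b
    return bytes(out[:length])
```

Losing any one provider leaves the file recoverable, and the three stripes together hold 1.5 times the (padded) size of the original data.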

4 comments

throwaway_bad 2438 days ago

> every other byte of data striped across

I assume you didn't mean that literally, because I can't see how that will ever work out in terms of CPU cost. I think breaking it up into blocks, the way RAID 4/5/6 do, would be better, but it will still hurt read performance.

The performance of writes is going to be worse. Not because of the parity calculation but because you will be taking the max latency over all the cloud providers.

I can't see people trading off that much performance for better fault tolerance (in a world where S3 advertises 11 nines of durability) or ease of switching.


chickenpotpie 2437 days ago

Yeah, RAID 4/5/6 are planned for the future. The plan is to offer all of them and let developers choose what works best for their application. RAID 0/2/3 are not CPU efficient, but they are great for privacy and security: no single cloud provider has the full picture, so none of them can spy on your data, and if one has a data leak it won't expose anything useful. RAID 1 gives great fault tolerance with no extra latency (except on failures) and prevents vendor lock-in.


Havoc 2437 days ago

Pretty sure paying 3 cloud providers is more expensive than paying 1 cloud provider.


chickenpotpie 2437 days ago

It's actually cheaper when you want globally redundant storage. Cloud providers often charge twice as much for globally redundant data. RAID 2, 3, and 4 offer global redundancy but only take up 1.5 times as much space. Instead of paying twice as much, you only end up paying 1.5 times as much, because you can get away with locally redundant pricing. If you're large enough you'll actually save even more by having more negotiating power, since you can walk away from a provider at any given time.
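The arithmetic behind the 1.5x figure, as a small sketch. The helper name `storage_overhead` is made up for illustration, and the 2x geo-redundancy multiplier is the rough figure from the comment above, not a quoted price from any provider.

```python
def storage_overhead(data_stripes: int, parity_stripes: int) -> float:
    """Bytes stored per byte of user data when striping with parity."""
    return (data_stripes + parity_stripes) / data_stripes


# Two data stripes plus one parity stripe, each on a different provider,
# all billed at locally redundant prices.
striped_cost = storage_overhead(2, 1)

# Assumption taken from the comment: a geo-redundant tier at a single
# provider costs roughly twice the locally redundant price.
geo_redundant_cost = 2.0

savings = 1 - striped_cost / geo_redundant_cost
```

Under those assumptions the striped layout stores 1.5 bytes per byte of data, which is a quarter less than paying the 2x geo-redundant rate. Egress and request charges are not included.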


Havoc 2437 days ago

Sounds interesting. Link?

I'd probably use commodity VMs for this rather than big clouds if it is indeed resilient.


planteen 2438 days ago

Are you doing this to replace a CDN? There are already third-party CDNs like Cloudflare.

If you are doing it as a replacement for traffic within an AWS region and availability zone, it seems like you will be both more expensive and have much higher latency.

Or is the application something else entirely?


chickenpotpie 2438 days ago

It's something else entirely. It's a mixed cloud approach combining the storage offerings of Azure, Google Cloud, and S3 providers. The idea is to avoid trusting a single cloud provider to provide fair pricing and proper redundancy. Right now I'm emulating RAID 0, 1, and 3. Applying RAID 3 to the cloud is going to give you higher latency and more processor and memory usage, because the file has to be reassembled on the client machine. However, if you apply RAID 1 to the cloud your latency is similar, because each cloud provider has the full file. In the case of RAID 1 the library will upload a full copy to each cloud provider and will download files by trying providers until one succeeds. If you only use two providers, your pricing is usually about the same as one provider's geo-redundant tier, because geo-redundancy is often twice the cost and you get it built in by having multiple providers in different regions. RAID 3 is actually cheaper, because you have geo-redundancy but you're only storing 1.5 times as much data.
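A rough sketch of the RAID 1 behavior described above: upload a full copy everywhere, then download by trying providers in order until one succeeds. `InMemoryProvider`, `mirror_upload`, and `mirror_download` are hypothetical stand-ins for illustration, not the library's API or any real cloud SDK.

```python
class InMemoryProvider:
    """Hypothetical stand-in for a cloud storage client (not a real SDK)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self.objects: dict = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.objects[key]  # raises KeyError if the object is missing


def mirror_upload(providers, key: str, data: bytes) -> None:
    """Store a full copy of the object with every provider."""
    for provider in providers:
        provider.put(key, data)


def mirror_download(providers, key: str) -> bytes:
    """Return the object from the first provider that can serve it."""
    errors = []
    for provider in providers:
        try:
            return provider.get(key)
        except (ConnectionError, KeyError) as exc:
            errors.append(f"{provider.name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In the healthy case the read costs one round trip to the first provider, which matches the "no extra latency except on failures" point; each failed provider adds one failed attempt before the fallback.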


thispbowden 2437 days ago

Sounds a little like gluster?


pepemon 2438 days ago

Sounds exactly like what Tahoe-LAFS is for.


chickenpotpie 2438 days ago

Yeah, it's pretty similar. However, I'm focusing entirely on the cloud, keeping the package lightweight, and leaving the decisions about how to store the data to the consuming application, based on its needs.


heavyset_go 2437 days ago

Currently use this across cloud providers.
