by klodolph 3585 days ago
I think these days we should by default think of storing blobs of data (like video files) in storage systems like S3 or the alternatives, and that ordinary filesystems should be thought of as a special case where you want to attach storage to an individual computer.

Edit: I'm going to elaborate, because people are calling me naïve. Full disclosure: I work at a cloud provider on a storage team.

For most people and applications, you simply don't get good value for your money by using filesystems and hard drives directly. We've tried to make things more reliable and durable with backup policies, RAID, and ZFS, but the fact is that all of these come with operational and capital expenditures that compare unfavorably with common cloud storage options. There are some good technical reasons why cloud storage is better: technologies like RAID and ZFS are attempts to make each layer of your storage stack completely durable and available, and that approach is not competitive with the way cloud storage is typically implemented, which is to build a reliable distributed service on top of cheap hardware.

Consider RAID 1, for example. This gives you N+1 redundancy at the drive level for an individual computer. This worked in the 1990s, but drives are bigger now and RAID failure modes suck with larger drives. It's worrying how common it is to see errors when rebuilding a degraded RAID array, and at N+1 a rebuild error means your data is lost from that computer. Essentially, with modern drive sizes (4+ TB seems pretty common these days), a RAID 1 array should always be considered N+0 instead of N+1.
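
To put a rough number on the rebuild risk, here is a minimal back-of-the-envelope sketch. It assumes an unrecoverable read error (URE) rate of 1 per 10^14 bits, a figure often quoted on consumer drive spec sheets, and it assumes errors are independent. Neither assumption comes from the comment above, and real drives may do considerably better, so treat the output as illustrative only. The function name is made up for this example.

```python
import math

def rebuild_error_probability(drive_tb: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one unrecoverable read error while reading a whole drive.

    Assumes independent bit errors at a fixed rate (a simplification).
    """
    bits_read = drive_tb * 1e12 * 8  # decimal terabytes -> bits
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    for size_tb in (0.5, 4, 8):
        p = rebuild_error_probability(size_tb)
        print(f"{size_tb} TB surviving mirror: {p:.1%} chance of a read error during rebuild")
```

Under those assumptions, reading a full 4 TB mirror during a rebuild hits at least one error roughly a quarter of the time, which is the sense in which N+1 starts to look like N+0.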

Cloud storage is implemented much more intelligently. If you have distributed storage, you can simply spread files across computers in different DCs and use error correction codes to increase the redundancy. You can get more nines of durability and availability for less money this way. You end up with something like 33% overhead on disk space instead of 300% overhead, and you're also off the hook for a big chunk of your capacity planning and various other operational expenditures.
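
The overhead comparison can be made concrete with a little arithmetic. This is only a sketch: the specific layouts below (6 data + 2 parity shards, and four full copies) are assumptions chosen because they reproduce the 33% and 300% figures, not layouts the comment says any provider uses.

```python
def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """Extra disk space, as a fraction of the payload, for a k+m erasure code."""
    return parity_shards / data_shards

def replication_overhead(copies: int) -> float:
    """Extra disk space, as a fraction of the payload, for N full copies."""
    return float(copies - 1)

if __name__ == "__main__":
    # Hypothetical 6+2 code: survives the loss of any 2 shards
    print(f"6+2 erasure code: {erasure_overhead(6, 2):.0%} overhead")
    # Hypothetical 4 full copies, e.g. mirrored pairs in two locations
    print(f"4 full copies:    {replication_overhead(4):.0%} overhead")
```

The point is that an erasure code spread across machines can tolerate multiple failures while storing far fewer extra bytes than whole-file copies.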

These days I would consider starting from "this file is in cloud storage, and we have a local cache" rather than "this file is in local storage, but we have a cloud backup". That's really all I'm saying.
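
As an illustration of that "cloud first, local cache second" arrangement, here is a minimal read-through cache sketch. Everything in it is hypothetical: `CachedBlobStore` is a made-up class, the "remote" is a plain dict standing in for an object store, and keys are assumed to be flat names without slashes. A real version would call an actual storage API and handle eviction and invalidation.

```python
from pathlib import Path
import tempfile

class CachedBlobStore:
    """Treats the remote store as the source of truth and local disk as a cache."""

    def __init__(self, remote: dict, cache_dir: Path):
        self.remote = remote          # stand-in for an object store client
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.remote_reads = 0         # counts cache misses, for demonstration

    def put(self, key: str, data: bytes) -> None:
        self.remote[key] = data                    # durable copy goes remote first
        (self.cache_dir / key).write_bytes(data)   # then warm the local cache

    def get(self, key: str) -> bytes:
        cached = self.cache_dir / key
        if cached.exists():                        # cache hit: no network needed
            return cached.read_bytes()
        data = self.remote[key]                    # cache miss: fetch from remote
        self.remote_reads += 1
        cached.write_bytes(data)
        return data

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = CachedBlobStore({"clip.mp4": b"video bytes"}, Path(tmp) / "cache")
        store.get("clip.mp4")   # first read goes to the remote
        store.get("clip.mp4")   # second read is served locally
        print("remote reads:", store.remote_reads)
```

Losing the cache directory costs only speed, since every blob can be fetched again from the remote.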

It also won't always be competitive. Sometimes cloud storage is more expensive than regular filesystems, depending on how you're using it. If you're a big company you can sometimes amortize the costs of doing it yourself better. That's all I mean by "default"—I'm going to put my data in cloud storage unless I have a compelling reason to store it some other way.

5 comments

That's awfully naive, especially for tasks like video editing that are significantly impacted by disk read/write speeds. Even a NAS on a gigabit network is going to be roughly 6x slower than a standard internal SATA III spinning disk.
I said "by default", the implication being that you'd do something else if your application needs it. But it's much easier from an operational perspective to start with a reliable system (replicated, networked storage) and cache locally for speed than to try to make local filesystems reliable and durable.
I agree. Network-wise we start at 10GbE. It's a lot more complicated than simple file storage on network though. Many needs and solutions. And I mean MANY.
I disagree. I think we should default to storing blobs of data in local storage to retain full legal and technological control of them. Storing them in 3rd party services under their EULAs, SLAs, and APIs should be a special case to improve attributes of the data like its availability or cost of distribution. That's the way most people and companies use them now. :)
Uhhh have you seen the internet data transfer costs for S3? That would become absurdly expensive quickly. Even with a dedicated cross connect.
S3 data transfer costs are an issue. That's why you can host minio yourself at any hosting company and save significantly (several times over) on data transfer and storage costs.
You're right, but I would reframe it along the lines of network filesystems (like NFS or OCFS2) vs. distributed object storage (S3). In that sense, certainly, the current "default" is to use the latter and avoid the former.

Local filesystems and/or volume managers won't go away anytime soon. Internally, a system like S3 needs unified access to the underlying storage, which is provided by the filesystem.

I think we are going to see the emergence of new filesystems that are much simpler in design compared to ZFS (as reliability is left to an upper layer in the stack) for use in the Cloud. Somewhat similar to the trend toward lightweight OSes built for the cloud (CoreOS, Project Atomic, etc.). Many features that were in the realm of the operating system are now delegated to upper layers in the stack.

Can you help me understand this statement better? Why should we do that?

I may sound like I'm playing dumb, but I'm really struggling to see what's compelling about this in its current state, aside from the fact that it's one tool as opposed to a RAID + filesystem + something to make the data available.