| > So I'm pretty sure you don't know what you are talking about. Trust me, I intimately know what I’m talking about. Without personal jabs, let me explain in a bit more detail: App in VM (kinda posix) -> ext4 (repackaging of data to fit into “blocks”) -> NVMe driver -> (Google’s virtualization/block device stack, aka Vanadium/PD) -> CFS. The moment data gets into ext4, it goes through a legacy stack that only exists because many years ago there were hardware devices that had 512-byte sectors (as an illustration, the upgrade to 4K took forever). All the repackaging and IO scheduling needed to work with the 4 KB block abstraction is wasted performance and cycles. From the customer's perspective, all they want is a VM with a scalable file system. With Kubernetes, etc., they don’t want to ever think about volume size, which is a major hurdle to size correctly and provision. BTW, both small and large customers run into volume sizing issues all the time. There are also internal customers that need posix-compliant storage “on borg” because they run oss lib/software. Anyway, the optimal stack in this case is to plug into the VM at the file system level. Now, is it a hard problem to solve? Yes. Would it eliminate PD? No, it is still required for legacy cases. Would it be enormously beneficial for modern containerized cloud workloads? Absolutely. |
If you have an app which needs a NoSQL interface, then you can do much better by using a cloud-native NoSQL service, as opposed to using Cassandra on your VM and then hoping you can get cross-zone reliability by using something like a Regional Persistent Disk. And sure, you could use Cassandra on top of cifs/smbfs or nfs, but the results will be disappointing. These are 20th century tools, and it shows.
If customers want Posix because they don't want to update their application to use Spanner, or Bigtable, or GCS, they certainly have every right to make that choice. But they will get worse price/performance/reliability as a result. You keep talking about ossification and people refusing to refactor the storage stack. Well, I'd like to submit to you that being wedded to a "posix file system" as the one true storage interface is another form of ossification. Storage stacks that feature NoSQL, relational database, and object storage WITHOUT an underlying Posix file system might be a much more radical approach and, ultimately, the "proper stack refactoring". A "modern containerized cloud workload" is better off using Cloud Spanner, Cloud Bigtable, or Cloud Storage, depending on the application and use case. Why stick with a 1970s posix file system with all of its limitations? (And I say this as an ext4 maintainer who knows about all of the warts and limitations of the Posix file interface.)
Of course, customers who insist on a Posix file system can use GCE PD or Amazon EBS for local file systems, or they can use GCE Cloud Filestore or Amazon EFS if they want an NFS solution. But it will not be as cost-effective or as performant as other cloud-native alternatives.
Finally, just because you are using "oss lib/software" does not mean that you need "Posix-compliant storage". Especially inside Google, while those internal customers do exist, they are a super-tiny minority. Most internal teams use a much smarter approach, even if that means that an adaptation layer is needed between some particular piece of OSS software and a more modern, scalable storage infrastructure. (And many OSS libraries don't need a Posix-compliant interface at all!)
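To make the "adaptation layer" idea concrete, here is a rough sketch of what I mean, not a description of any real Google or OSS component. Everything in it is hypothetical: `BlobStore`, `LocalDirStore`, `InMemoryObjectStore` and `archive_logs` are names made up for illustration, and the in-memory class merely stands in for an object-storage service (a real adapter would call that service's client library). The point is only that the application codes against whole-object put/get/list rather than against Posix file semantics.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol


class BlobStore(Protocol):
    """Hypothetical minimal interface: whole-object put/get/list, no Posix semantics."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list(self, prefix: str) -> List[str]: ...


class LocalDirStore:
    """Adapter that backs the interface with a directory on a Posix file system."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def list(self, prefix: str) -> List[str]:
        keys = (p.relative_to(self.root).as_posix()
                for p in self.root.rglob("*") if p.is_file())
        return sorted(k for k in keys if k.startswith(prefix))


class InMemoryObjectStore:
    """Stand-in for an object-storage service (assumption: flat key -> bytes map)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str) -> List[str]:
        return sorted(k for k in self._objects if k.startswith(prefix))


def archive_logs(store: BlobStore, day: str, lines: List[str]) -> str:
    """Application code: it only sees the interface, never a file descriptor."""
    key = f"logs/{day}.txt"
    store.put(key, "\n".join(lines).encode("utf-8"))
    return key


if __name__ == "__main__":
    # The same application function runs against either backend.
    mem = InMemoryObjectStore()
    print(archive_logs(mem, "2024-01-01", ["started", "stopped"]), mem.list("logs/"))
    with tempfile.TemporaryDirectory() as tmp:
        local = LocalDirStore(Path(tmp))
        print(archive_logs(local, "2024-01-01", ["started", "stopped"]), local.list("logs/"))
```

Once the software talks to something like `BlobStore`, swapping the Posix-backed adapter for an object-store-backed one is a deployment decision, not an application rewrite.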
Posix-compliant means sticking with an interface invented 50 years ago, with technological assumptions which may not be true today. Sometimes you might need to fall back to Posix for legacy software --- but we're talking about "modern containerized cloud workloads", remember?