I believe NVMe exposes multiple I/O queues, whereas SATA serializes access, and I think an NVMe-specific datastore would let you sidestep unnecessary abstractions like file systems and block-based access.
I’m also curious whether different, more performant data structures can be leveraged; if so, there may be downstream improvements for garbage collection, retrieval, and request parallelism.
Latency ought to be much better, since you're skipping several abstraction layers in the kernel.
But that's about it. And the latency is still worse than in-memory solutions.
Between that and the non-trivial effort needed to make this work in any sort of cloud setup (be it self-hosted k8s or AWS), it's a hard sell. If I really need latency above all, AWS gives me instances with 24TB RAM, and if I don't… why not just use existing kv-stores and accept the couple of µs extra latency?
Agreed. The classic reason is when you have latency needs, but your data set is large enough that RAM is cost-prohibitive, and random-access enough that spinning disk won’t work. The cost savings from switching to NVMe have to justify the higher non-recurring engineering (NRE) cost, and at the same time you have to be sensitive enough to latency that a regular filesystem-backed store won't do.
Individual NVMe drives are also rather small – the biggest I can find is 30TB, which is still more than what AWS offers me as RAM, but not much. Once you start adding custom algorithms to spread your data over multiple "raw" NVMe drives to get more capacity, the latency gap between your custom solution and existing, well-optimized file system stacks starts to erode. Might as well stick to existing kv stores on ZFS or something, rather than roll your own project that might be able to beat it, maybe.
While you can get 24TB of RAM, there is a pretty big cost difference. 2 TB of RAM costs roughly $10000, compared to $130 for 2 TB of NVMe storage (or $230 for 12 TB of a good hard drive). Sure, per TB the NVMe is ~3.5x more expensive than the hard drive, but the latency will be dramatically lower and the throughput dramatically higher. You can build a 24 TB RAM system, but at that point the cost of the server will be almost entirely the RAM. The reason for NVMe-based storage at this point is that at only ~3.5x the cost of a hard drive, you can switch all your storage over, and as long as you don't need tons of storage (i.e. less than 100TB), the SSDs will be a minority of the cost of the system.
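To make the arithmetic behind those ratios explicit, here is a rough per-TB calculation using only the prices quoted above. It assumes the $130 NVMe figure refers to a 2 TB drive, which the comment implies but does not state outright; the names `prices` and `cost_per_tb` are just for illustration.

```python
# Prices (USD) and capacities (TB) as quoted in the comment above.
# Assumption: the $130 NVMe price is for a 2 TB drive.
prices = {
    "RAM": (10_000, 2),
    "NVMe": (130, 2),
    "HDD": (230, 12),
}


def cost_per_tb(price_usd, capacity_tb):
    """Return the price per terabyte in USD."""
    return price_usd / capacity_tb


per_tb = {name: cost_per_tb(*spec) for name, spec in prices.items()}
nvme_vs_hdd = per_tb["NVMe"] / per_tb["HDD"]
ram_vs_nvme = per_tb["RAM"] / per_tb["NVMe"]

for name, value in per_tb.items():
    print(f"{name}: ${value:,.2f}/TB")
print(f"NVMe vs HDD: {nvme_vs_hdd:.1f}x")
print(f"RAM vs NVMe: {ram_vs_nvme:.0f}x")
```

With these numbers NVMe comes out at about 3.4x the per-TB price of the hard drive, in line with the ~3.5x above, while RAM is roughly 77x the per-TB price of NVMe.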
All that applies to regular kv stores abstracted through filesystems and block device layers just fine.
But when your latency requirements are so tight that you cannot possibly afford the latency penalty of a filesystem, you had better have a good business case to justify either developing a custom bare-metal NVMe kv store (which is $$$$$ and takes time) or getting a multi-TB RAM system, which is also $$$$$, but far more predictable, and can be put into production today, not 6+ months later when you finish developing your custom kv store.
For the other 99.999% of use cases, sure, just go with NVMe backing your regular virtualization/containerization infrastructure.
Yes, mostly on the durability side. NVMe actually has the relevant API to be sure that a write was flushed, while POSIX-like filesystem APIs usually do not handle it.
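For comparison, this is roughly what the filesystem-level path looks like: a minimal sketch of a "durable" write using `fsync` on a POSIX system. The helper name `durable_write` is made up for illustration. Note that `fsync` only asks the kernel to push the data to the device; whether it has actually reached stable media depends on the filesystem and on the drive honouring the flush, which is the gap being pointed at here.

```python
import os


def durable_write(path, data):
    """Write bytes to path and request a flush to the device (POSIX only)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to flush file data and metadata to the device.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also flush the containing directory so the new entry itself is persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A store talking to the NVMe device directly would issue the flush itself instead of relying on this chain of layers.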