


	by bestouff 980 days ago
	Naive question: are there really gains to be expected from addressing an NVMe disk natively, compared with using a regular key-value database on a filesystem?

4 comments

chaos_emergent 980 days ago

I believe that NVMe uses multiple I/O queues, compared to serialized access with SATA, and I think you’d be able to sidestep unnecessary abstractions like file systems and block-based access with an NVMe-specific datastore.

I’m also curious whether different and more performant data structures can be leveraged; if so, there may be downstream improvements for garbage collection, retrieval, and request parallelism.


lathiat 980 days ago

SATA also has multiple I/O queues. It’s called “NCQ”.

The exact semantics vary per protocol but it’s a feature of most protocols at least in the currently used revisions: https://en.wikipedia.org/wiki/Native_Command_Queuing


wtallis 980 days ago

That's one queue per drive. NVMe allows multiple queues per drive, commonly used to assign one queue per CPU core.
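A toy sketch of the difference, with no real I/O involved. The class and function names are hypothetical, and giving each per-core queue the same depth as NCQ's 32-command limit is an assumption made only to keep the comparison simple (NVMe queues can be much deeper).

```python
from collections import deque

NCQ_MAX_DEPTH = 32  # SATA NCQ: a single queue of up to 32 outstanding commands


class SingleQueueDrive:
    """Toy model: every core submits into the same queue."""

    def __init__(self, depth=NCQ_MAX_DEPTH):
        self.depth = depth
        self.queue = deque()

    def submit(self, core_id, request):
        if len(self.queue) >= self.depth:
            return False  # queue full, caller has to wait
        self.queue.append((core_id, request))
        return True


class PerCoreQueueDrive:
    """Toy model: one submission queue per CPU core."""

    def __init__(self, num_cores, depth=NCQ_MAX_DEPTH):
        self.depth = depth
        self.queues = [deque() for _ in range(num_cores)]

    def submit(self, core_id, request):
        queue = self.queues[core_id]  # only this core touches this queue
        if len(queue) >= self.depth:
            return False
        queue.append(request)
        return True


def submit_round_robin(drive, num_cores, per_core):
    """Each core tries to submit per_core requests; return how many were accepted."""
    accepted = 0
    for i in range(per_core):
        for core in range(num_cores):
            if drive.submit(core, f"read-{core}-{i}"):
                accepted += 1
    return accepted


if __name__ == "__main__":
    print(submit_round_robin(SingleQueueDrive(), num_cores=4, per_core=16))
    print(submit_round_robin(PerCoreQueueDrive(num_cores=4), num_cores=4, per_core=16))
```

In this model nothing is ever drained from the queues, so it only shows how many commands can be outstanding at once, not throughput.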


londons_explore 980 days ago

Most filesystems will make use of multiple I/O queues, i.e. if an application sends many different read requests, they may be satisfied out of order.


creshal 980 days ago

Latency ought to be much better, since you're skipping several abstraction layers in the kernel.

But that's about it. And the latency is still worse than in-memory solutions.

Between that and the non-trivial effort needed to make this work in any sort of cloud setup (be it self-hosted k8s or AWS), it's a hard sell. If I really need latency above all, AWS gives me instances with 24TB RAM, and if I don't… why not just use existing kv-stores and accept the couple of ns extra latency?


klodolph 980 days ago

Agreed. The classic reason is when you have latency needs, but your data set is large enough that RAM is cost-prohibitive, and random-access enough that disk won’t work. The cost savings from switching to NVMe have to justify the higher NRE cost, and simultaneously, you have to be sensitive to latency.


creshal 980 days ago

Individual NVMe drives are also rather small – the biggest I can find is 30TB, which is still more than what AWS offers me as RAM, but not much. Once you start adding custom algorithms to spread your data over multiple "raw" NVMe drives to get more capacity, the latency gap between your custom solution and existing, well-optimized file system stacks starts to erode. Might as well stick to existing kv stores on ZFS or something, rather than roll your own project that might be able to beat it, maybe.


adgjlsfhk1 979 days ago

While you can get 24 TB of RAM, there is a pretty big cost difference. 2 TB of RAM costs roughly $10,000, compared with $130 for the same capacity of NVMe storage (or $230 for 12 TB of a good hard drive). Sure, NVMe is ~3.5x more expensive per TB than the hard drive, but the latency will be dramatically lower and the throughput will be dramatically higher. Sure, you can build a 24 TB RAM system, but at that point the cost of the server will be entirely the RAM. The reason for NVMe-based storage at this point is that, at only ~3.5x the cost of a hard drive, you can switch all your storage over, and as long as you don't need tons of storage (i.e. less than 100 TB), the SSDs will be a minority of the cost of the system.
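The per-TB arithmetic behind those ratios, as a quick sketch. It uses only the prices quoted in the comment and assumes the $130 NVMe figure is for 2 TB, which the comment implies but does not state outright.

```python
# Prices (USD) and capacities (TB) as quoted in the comment.
prices = {
    "RAM": (10_000, 2),
    "NVMe": (130, 2),   # assumption: $130 buys 2 TB of NVMe
    "HDD": (230, 12),
}


def cost_per_tb(price_usd, capacity_tb):
    """Return the price in USD per terabyte."""
    return price_usd / capacity_tb


per_tb = {name: cost_per_tb(*pc) for name, pc in prices.items()}
nvme_vs_hdd = per_tb["NVMe"] / per_tb["HDD"]
ram_vs_nvme = per_tb["RAM"] / per_tb["NVMe"]

if __name__ == "__main__":
    for name, value in per_tb.items():
        print(f"{name}: ${value:,.2f} per TB")
    print(f"NVMe vs HDD: {nvme_vs_hdd:.2f}x")
    print(f"RAM vs NVMe: {ram_vs_nvme:.1f}x")
```

On these numbers NVMe comes out a bit under 3.4x the hard drive per TB, in line with the "~3.5x" in the comment, while RAM is roughly 77x NVMe.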


creshal 979 days ago

All that applies to regular kv stores abstracted through filesystems and block device layers just fine.

But when your latency requirements are so tight that you cannot possibly afford the latency penalty of a filesystem, you'd better have a good business case to justify either developing a custom bare-metal NVMe store (which is $$$$$ and takes time) or getting a multi-TB RAM system, which is also $$$$$, but far more predictable, and can be put into production today, not 6+ months later when you finish developing your custom kv store.

For the other 99.999% of use cases, sure, just go with NVMe backing your regular virtualization/containerization infrastructure.


threeseed 980 days ago

Significant gains if you want a distributed key-value database, because you can take advantage of NVMe-oF.


di4na 980 days ago

Yes, mostly on the durability side. NVMe actually has the relevant API to be sure that a write was flushed, while POSIX-like filesystem APIs usually do not handle it.
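For comparison, a minimal sketch of what asking for durability through the POSIX-style file API tends to look like in Python on Linux. The function name `durable_write` is hypothetical. Whether `fsync` really means the data is on stable media depends on the filesystem, the kernel, and the drive's write cache, which is presumably the gap being pointed at here.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path, then ask the OS to flush file and directory entry."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # os.write may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # flush file contents and metadata
    finally:
        os.close(fd)

    # The new directory entry needs its own fsync, or the file may
    # not be findable after a crash even though its data was flushed.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "record.bin")
        durable_write(target, b"key=value\n")
        with open(target, "rb") as f:
            print(f.read())
```

Note that this takes two separate flush calls for a single logical write, and a successful return only reports what the layers below claim to have done.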
