by bob1029 1830 days ago

Things I have learned about SSDs:

If you want to go fast & save NAND lifetime, use append-only log structures.

If you want to go even faster & save even more NAND lifetime, batch your writes in software (e.g. some ring buffer with a natural back-pressure mechanism) and then serialize them with a single writer into an append-only log structure. Many newer devices have something like this at the hardware level, but your block size is still a constraint when working in hardware. If you batch in software, you can hypothetically write multiple logical business transactions per block I/O. When your physical block size is 4 KB and your logical transactions average 512 bytes of data, you would otherwise be leaving a lot of throughput on the table.
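
A rough sketch of the packing arithmetic described above, in Python. The 2-byte length prefix, the zero padding, and the name `pack_blocks` are assumptions made for illustration; the comment does not describe an on-disk format.

```python
import struct

BLOCK_SIZE = 4096              # assumed physical block size, as in the comment
HEADER = struct.Struct("<H")   # hypothetical 2-byte length prefix per record


def pack_blocks(records, block_size=BLOCK_SIZE):
    """Pack length-prefixed records into zero-padded blocks of block_size bytes.

    A record never straddles a block boundary, so every block can be
    written as one aligned I/O.
    """
    blocks = []
    current = bytearray()
    for rec in records:
        framed = HEADER.pack(len(rec)) + rec
        if len(framed) > block_size:
            raise ValueError("record larger than one block")
        if len(current) + len(framed) > block_size:
            # Current block is full: pad it out and start a new one.
            blocks.append(bytes(current.ljust(block_size, b"\0")))
            current = bytearray()
        current += framed
    if current:
        blocks.append(bytes(current.ljust(block_size, b"\0")))
    return blocks


# 510-byte payload + 2-byte prefix = 512 bytes per framed transaction.
transactions = [bytes([i]) * 510 for i in range(16)]
batched = pack_blocks(transactions)
unbatched = [pack_blocks([t])[0] for t in transactions]
print(len(batched), "block writes batched vs", len(unbatched), "unbatched")
```

With sixteen framed transactions of 512 bytes each, batching produces 2 block-sized writes instead of 16.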

Going down 1 level of abstraction seems important if you want to extract the most performance from an SSD. Unsurprisingly, the above ideas also make ordinary magnetic disk drives more performant & potentially last longer.

8 comments

pclmulqdq 1830 days ago

I used to think the same thing, but now that I work on SSD-based storage systems, I'm not sure this holds up in today's storage stacks. Log structuring really helped with HDDs since it meant fewer seeks.

In particular, the filesystem tends to undo a lot of the benefits you get from log-structuring unless you are using a filesystem designed to keep your files log-structured. Using huge writes definitely still helps, though.

A paper that I really like goes deeper into this: http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf

Edit: I had originally said "designed for flash" instead of "designed to keep your files log-structured." F2FS is designed for flash, but in my testing does relatively poorly with log-structured files because of how it works internally.

Edit 2: de-googled the link. Thank you for pointing that out.

10000truths 1830 days ago

Achieving cutting-edge storage performance tends to require bypassing the filesystem anyways. Traditionally, that meant using SPDK. Nowadays, opening /dev/nvme* with O_DIRECT and operating on it with io_uring will get you most of the way there.

In either case, the advice given in the article and by the OP is filesystem agnostic.

nyanpasu64 1830 days ago

Will an end user downloading a video editing app (or similar) have a NVME drive, know how to give your app direct access to a NVME drive, and will your app not corrupt the rest of the files on the drive?

10000truths 1830 days ago

Extreme performance requires extreme tradeoffs. As with anything else, you have to evaluate your use cases and determine for yourself whether the tradeoffs are worth it. For a mass-market application that has to play nice with other applications and work with a wide variety of commodity hardware, it's probably not worthwhile. For a state-of-the-art high performance data store that expects low latencies and high throughput (à la ScyllaDB), it may very well be.

nyanpasu64 1830 days ago

Would high-performance data storage be easier to implement on commodity hardware if operating systems supplied an API to get a blob of bytes, segmented out of an entire disk (eg. a file), that presented low-level semantics like a full-fledged SSD partition or drive?

I feel that operating systems need to provide self-contained reliable APIs designed for atomically overwriting configuration files, without losing permissions or overwriting symlinks or such. Or perhaps supply more powerful primitives, like a faster/weaker fsync that serves as an ordering barrier rather than flushing to disk, or an API to replace a file without altering permissions. One issue I've heard is:

> I even had an issue with atomic writes over ssh that created the temp file but were not able to rename it, so the old one stayed.
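
For context, the manual pattern applications tend to use today for this (write a temp file, fsync, rename) can be sketched in Python as follows. The function name `atomic_write` and the `.tmp-` prefix are made up for the example. It preserves only the permission bits and follows symlinks; ownership, ACLs, and extended attributes are not handled, which is part of why a dedicated OS API would be attractive.

```python
import contextlib
import os
import stat
import tempfile


def atomic_write(path, data):
    """Replace the contents of path with data, or leave the old file intact."""
    target = os.path.realpath(path)       # write through symlinks, do not replace them
    directory = os.path.dirname(target)
    try:
        mode = stat.S_IMODE(os.stat(target).st_mode)
    except FileNotFoundError:
        mode = None                       # new file: keep mkstemp's default (0600)

    # The temp file must live in the same directory so the rename stays
    # within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # data is on disk before the rename
        if mode is not None:
            os.chmod(tmp, mode)
        os.replace(tmp, target)           # atomic rename on POSIX filesystems
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise

    # Persist the directory entry change as well.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```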

naikrovek 1830 days ago

at that point just use a RAM disk and periodically write that data to physical disk or SSD. no extreme tradeoff required, because RAM disks are WAY faster than SSDs.

manhandling /dev/nvme0 seems equally likely to corrupt data in the event of a power failure.

wtallis 1830 days ago

> manhandling /dev/nvme0 seems equally likely to corrupt data in the event of a power failure.

If we make the reasonable assumption that this subthread is discussing a server use case, then we can assume that the SSD is tolerant of power failures and has the capacitors necessary to finish any cached writes it has reported as complete. Thus, having fewer layers between the hardware and the application means there are fewer opportunities for some layer to lie to those above it about whether the data has made it to persistent storage.

Whether or not you're bypassing large parts of the operating system's IO stack, the application needs to have a clear idea of what data needs to be flushed to persistent storage at what times in order to properly survive unexpected power loss without unnecessary data loss or corruption.

10000truths 1830 days ago

> at that point just use a RAM disk and periodically write that data to physical disk or SSD. no extreme tradeoff required, because RAM disks are WAY faster than SSDs.

A storage application that needs to bypass the filesystem will already be implementing its own caching system anyways. The idea is to persist the data to maintain durability without sacrificing latency.

> manhandling /dev/nvme0 seems equally likely to corrupt data in the event of a power failure.

That is what the O_SYNC flag is for.
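
A minimal sketch of what that looks like from Python on a Unix-like system, using a regular file rather than a raw device so it is safe to try. The name `append_durably` is hypothetical. With O_SYNC, each write returns only after the kernel has pushed the data and metadata to the device; whether the device itself has persisted it still depends on the hardware honoring flushes.

```python
import os


def append_durably(path, record):
    """Append record to path using synchronous writes (O_SYNC)."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        # Behaves roughly as if every write were followed by fsync().
        return os.write(fd, record)
    finally:
        os.close(fd)
```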

natmaka 1830 days ago

Given enough RAM on a Linux machine one may use tmpfs, which maintains a RAM disk and at any moment only uses the amount of RAM needed, with a pre-defined limit.

On PostgreSQL create an adequately-capped tmpfs, create a TABLESPACE on it, then store temporary tables into this TABLESPACE. No SSD (I have access to) beats this. Hint: before shutting PG down you may DROP this TABLESPACE.

It also is useful for a blockchain, amazingly fast (and a relief for HDDs), in most cases alleviating the need for a SSD. Place the blockchain file(s) on the tmpfs mount. Before machine shutdown stop any blockchain-using software, then store a compressed copy of the blockchain file(s) on permanent storage (I use "zstd -T0 --fast"...), and upon reboot restore it on the tmpfs mount. If anything fails the blockchain-writing software will re-download any missing block.

quotemstr 1830 days ago

Why would you want to bypass the filesystem by talking to the block device directly? Doesn't O_DIRECT on a preallocated regular file accomplish the same thing with less management complexity and special OS permissions? Granted, the file extents might be fragmented a bit, but that can be fixed.
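
A hedged sketch of the "O_DIRECT on a preallocated regular file" approach, in Python on Linux. The 4096-byte block size and the helper names are assumptions. O_DIRECT needs the buffer, offset, and length to be aligned; an anonymous `mmap` is page-aligned, which is why it is used as the buffer here. Some filesystems reject O_DIRECT, so the sketch falls back to buffered I/O in that case.

```python
import mmap
import os

BLOCK = 4096  # assumed to be a multiple of the device's logical block size


def open_preallocated(path, size):
    """Preallocate size bytes, then try to reopen the file with O_DIRECT.

    Returns (fd, direct); direct is False if the filesystem refused O_DIRECT.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)   # reserve the space up front
    finally:
        os.close(fd)
    try:
        return os.open(path, os.O_RDWR | os.O_DIRECT), True
    except OSError:
        return os.open(path, os.O_RDWR), False


def write_block(fd, index, payload):
    """Write payload, zero-padded to one block, at block number index."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)            # page-aligned, zero-filled buffer
    try:
        buf[:len(payload)] = payload
        return os.pwrite(fd, buf, index * BLOCK)
    finally:
        buf.close()
```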

10000truths 1830 days ago

A "regular file" might reside in multiple locations on disk for redundancy, or might have a checksum that needs to be maintained alongside it for integrity. Or, as you say, its contents might not reside in contiguous sectors - or you might be writing to a hole in a sparse file. There's a lot of "magic" that could go on behind the scenes when operating on "regular files", depending on what filesystem you're using with what options. Directly operating on the block device makes it easier to reason about the performance guarantees, since your reads and writes map more cleanly to the underlying SCSI/ATA/NVME commands issued.

lazide 1830 days ago

If you understand your workload and the hardware well enough to understand how doing direct I/O on a file will help - then you’re going to generally do better against a direct block device because there are fewer intermediate layers doing the wrong optimizations or otherwise messing you up. From a pure performance perspective anyway. Extents are one part of the issue, flushes to disk (and how/when they happen), caching, etc.

Doesn’t mean it isn’t easier to deal with as a file from an administration perspective (and you can do snapshots, or whatever!), but LVM can do that too for a block device, and many other things.

quotemstr 1830 days ago

With O_DIRECT though you're opting out of the filesystem's caching (well, VFS's), forced flushes, and most FS level optimizations, so I'd expect it to perform on par with direct partition access.

Do you have numbers showing an advantage of going directly to the block device? Personally, I'd consider the management advantages of a filesystem compelling absent specific performance numbers showing the benefit of direct partition access.

lazide 1830 days ago

You do when the filesystem actually does that and respects the flag, which isn’t always the case. The point is that you have more layers. If you’re trying to be as direct as possible, more layers are unhelpful.

Since you get most of the same advantages management-wise with LVM while using the block interface (including snapshots, resizing, and all the other management goodies), you’re not exactly getting much extra functionality either.

Unklejoe 1829 days ago

I'm wondering if it's really necessary to get at the block device directly.

I'm able to saturate a PCIe 3.0 x4 link doing direct IO to an NVMe drive with a single 1.7 GHz PowerPC core without breaking a sweat. This is through ext4.

My accesses are sequential though. Maybe there's more of a penalty with random IO.

trulyme 1830 days ago

Degoogled link: http://pages.cs.wisc.edu/~jhe/eurosys17-he.pdf

gravypod 1830 days ago

This is the "secret sauce" behind LevelDB: https://github.com/google/leveldb#performance

bob1029 1830 days ago

This looks to be a similar technique.

In my testing of these ideas, I've been able to push over 2 million transactions per second (~1 KB per transaction) to a Samsung 960 Pro. For reference, it's rated for 2.1 GB/s sequential writes, so I've got it pretty much 100% saturated.

The implementation for something like this is actually really underwhelming when you figure out how to put all the pieces together. I assembled this prototype (also a key-value store) using .NET5, LMAX Disruptor, and a splay tree implementation I copied from google somewhere. The hardest part was figuring out how to wait for write completion on the caller side (multiple calling threads are ultimately serialized into a single worker thread via the Disruptor). Turns out, busy wait for a few thousand cycles followed by a yield to the OS is a pretty good trick. You just do a while(true) over a completion flag on the transaction object which is set en masse by the handling thread after the write goes to disk. Batch sizes are determined dynamically based on how long the previous batch took to write. In practice, I never observed a batch that took longer than 2-3 milliseconds on my 960 Pro. Max batch size is 4096, and it is permanently full when 100% loaded. A full batch = a nice big IO to disk.
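
The control flow described above can be sketched in Python. This is not the author's .NET/Disruptor code: a bounded `queue.Queue` stands in for the ring buffer, the spin budget is an assumed number, the dynamic batch sizing is left out, and the class and method names are invented. Python's GIL makes busy-waiting unattractive for real throughput, so treat it as an illustration of the single-writer group commit and the spin-then-yield wait only.

```python
import os
import queue
import threading
import time

MAX_BATCH = 4096        # maximum batch size mentioned in the comment
SPIN_ITERATIONS = 2000  # assumed spin budget before yielding to the OS


class Transaction:
    def __init__(self, payload):
        self.payload = payload
        self.done = False   # set by the writer once the batch is on disk


class GroupCommitLog:
    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        # Bounded queue: callers block when it is full (back-pressure).
        self._queue = queue.Queue(maxsize=MAX_BATCH)
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def commit(self, payload):
        """Enqueue payload and return once it has been written and fsynced."""
        txn = Transaction(payload)
        self._queue.put(txn)
        spins = 0
        while not txn.done:
            spins += 1
            if spins > SPIN_ITERATIONS:
                time.sleep(0)           # give up the CPU between checks
        return txn

    def _run(self):
        running = True
        while running:
            batch = [self._queue.get()]              # block until work arrives
            while len(batch) < MAX_BATCH:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if None in batch:                        # shutdown sentinel
                running = False
                batch = [t for t in batch if t is not None]
            if batch:
                view = memoryview(b"".join(t.payload for t in batch))
                while view:                          # handle short writes
                    view = view[os.write(self._fd, view):]
                os.fsync(self._fd)                   # one flush per batch
                for t in batch:
                    t.done = True                    # release all waiters

    def close(self):
        self._queue.put(None)
        self._writer.join()
        os.close(self._fd)
```

Several threads can call `commit()` concurrently; each returns only after the batch containing its record has been flushed, so one fsync is amortized over the whole batch.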

ww520 1830 days ago

LMDB has similar write characteristics where its b-tree is append-only. This gives LMDB amazing performance and very robust ACID transaction support as immutability is baked in.

fulafel 1830 days ago

This is quite common in traditional DBs too. E.g. PostgreSQL has its write-ahead log. Both LMDB and PostgreSQL then occasionally need to do some kind of compaction, checkpoint, or garbage collection (whatever it's called in various systems): the write-only log is reset and any live data in it is imported into the main DB data.

ww520 1830 days ago

I only have a cursory knowledge of LMDB (listening to a podcast while biking). Anyway, LMDB has no transaction log nor write-ahead log. There's no overwrite during update. Data page update is copy-on-write and b+tree index update is append only. The update on the b+tree pages is performed from the bottom of the tree to the root, linking newly appended pages to higher level pages. The transaction is committed when the new root page is appended. When there's a crash, the incomplete appended index pages have not been linked up to the root page yet and are not reachable from the previous valid root page. They can be just thrown away. Recovery just means searching for the last valid root index page. There's no need for a WAL and undo/redo of the transaction log.

Deleted pages and obsolete pages are actively put back into a free list (tracked by another b+tree), which will be reused for new page allocation. This avoids the long garbage collection phase to walk all the live pages for compaction (no vacuum is needed).

hyc_symas 1828 days ago

No. LMDB is copy-on-write, with double buffering/shadow pages for the root page updates. No searching for the last valid root page.

Looks like you have the other details right.
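
A toy illustration of the double-buffered root idea, in Python. It is not LMDB's actual page format: the 20-byte slot layout, the CRC32 (added here to detect a torn slot), and the function names are all assumptions. Two fixed slots are written alternately, and recovery reads both and takes the valid one with the higher transaction id, so there is nothing to search for.

```python
import struct
import zlib

BODY = struct.Struct("<QQ")    # hypothetical layout: txn id, root page number
CRC = struct.Struct("<I")      # CRC32 of the body (an addition for this sketch)
SLOT = BODY.size + CRC.size    # the two slots live at offsets 0 and SLOT


def encode_meta(txn_id, root_page):
    body = BODY.pack(txn_id, root_page)
    return body + CRC.pack(zlib.crc32(body))


def decode_meta(raw):
    """Return (txn_id, root_page), or None if the checksum does not match."""
    body = raw[:BODY.size]
    (crc,) = CRC.unpack(raw[BODY.size:SLOT])
    if zlib.crc32(body) != crc:
        return None
    return BODY.unpack(body)


def commit(store, txn_id, root_page):
    """Write the new root into the slot NOT used by the previous commit."""
    slot = txn_id % 2
    store[slot * SLOT:(slot + 1) * SLOT] = encode_meta(txn_id, root_page)


def recover(store):
    """Pick the valid slot with the highest txn id; no scanning needed."""
    metas = [decode_meta(bytes(store[i * SLOT:(i + 1) * SLOT])) for i in (0, 1)]
    valid = [m for m in metas if m is not None]
    return max(valid) if valid else None


store = bytearray(2 * SLOT)    # stands in for the first two pages of the file
commit(store, 1, 10)
commit(store, 2, 20)
commit(store, 3, 30)           # overwrites slot 1, leaves txn 2 in slot 0
store[SLOT + 3] ^= 0xFF        # simulate a torn write of txn 3's slot
print(recover(store))          # falls back to the previous root
```

Because a commit only ever overwrites the older slot, a crash in the middle of that write leaves the previous root untouched.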

fulafel 1829 days ago

Thanks for the explanation. Clever stuff, LMDB is taking advantage of not having to support multiple writers here.

remram 1830 days ago

Shouldn't the OS or libc take care of that? If I write and don't immediately flush()?

KMag 1830 days ago

I don't think most libc implementations take care to buffer to filesystem block/cluster boundaries.

AtlasBarfed 1830 days ago

This is basically the purpose of rocksdb, and to a lesser extent Cassandra

senderista 1829 days ago

Also: parallelize your writes. This is the biggest difference between SSDs and HDDs: internal parallelism. You’ll have a hard time saturating I/O bandwidth even with huge sequential writes if you don’t introduce some parallelism. Fortunately, io_uring makes this easy from a single thread.
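
A small sketch of keeping several writes in flight at once, in Python. io_uring is not available in the Python standard library, so a thread pool issuing `os.pwrite` calls at independent offsets is used here as a stand-in; the 1 MiB chunk size, the worker count, and the name `parallel_write` are assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20   # 1 MiB per request (assumed size)


def parallel_write(path, data, workers=8, chunk=CHUNK):
    """Write data to path with up to `workers` pwrite() calls in flight.

    Returns the total number of bytes the kernel reported as written;
    callers should compare it with len(data) to detect short writes.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        offsets = range(0, len(view), chunk)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Each chunk goes to its own offset, so ordering does not matter.
            written = sum(pool.map(
                lambda off: os.pwrite(fd, view[off:off + chunk], off),
                offsets))
        os.fsync(fd)
        return written
    finally:
        os.close(fd)
```

`os.pwrite` releases the GIL during the system call, so the requests can overlap at the device even though they are issued from Python threads.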

hypertele-Xii 1830 days ago

Buffering writes is fine if you're ok with losing your data. For some applications that's acceptable, but when I'm writing to disk, it's because I want persistence. "It'll get flushed to disk at some point as long as power doesn't go out" is hardly that.

scns 1830 days ago

Like this?

https://en.wikipedia.org/wiki/NILFS?wprov=sfla1