Hacker News
by notaplumber 1829 days ago
> Because flash does not overwrite anything, ever.

This is repeated multiple times in the article, and I refuse to believe it is true. If NVME/SSDs never overwrote anything, they would quickly run out of available blocks, especially on OSs that don't support TRIM.

6 comments

There's nuance to this; the deletes / overwrites are accomplished by bulk wiping entire blocks.

Rather than change the paint color in a hallway you have to tear down the house and build a new house in the vacant lot next door that's a duplicate of the original, but with the new hallway paint.

To optimize, you keep a bucket of houses to destroy, and a bucket of vacant lots, and whenever a neighborhood has lots of "to be flattened houses" the remaining active houses are copied to a vacant lot and the whole neighborhood is flattened.

So, things get deleted, but not in the way people are used to if they imagine a piece of paper and a pencil and eraser.

Just to add to the explanation, SSDs are able to do this because they have a layer of indirection akin to virtual memory. This means that what your OS thinks is byte 800000 of the SSD may change its actual physical location on the SSD over time, even in the absence of writes or reads to said location.

This is a very important property of SSDs and is a large reason why log structured storage is so popular in recent times. The SSD is very fast at appends, but changing data is much slower.

> The SSD is very fast at appends, but changing data is much slower.

No, it's worse than that. The fact that it's an overly subtle distinction is the problem.

SSDs are fast while write traffic is light. From an operational standpoint, the drive is lying to you about its performance. Unless you are routinely stress testing your system to failure, you may have a very inaccurate picture of how your system performs under load, meaning you have done your capacity planning incorrectly, and you will be caught out with a production issue.

Ultimately it's the same sentiment as people who don't like the worst-case VACUUM behavior of Postgres - best-effort algorithms in your system of record make some people very cranky. They'd rather have higher latency with a smaller error range, because at least they can see the problem.

Are there write-once SSDs? They would have a tremendous capacity. Probably good for long term backups or archiving. Also possibly with a log structured filesystem only.
Making them write-once doesn't increase the capacity; that's mostly limited by how many analog levels you can distinguish on the stored charge, and how many cells you can fit. The management overhead and spare capacity to make SSDs rewritable is –to my knowledge– in the single digit percentages.

(Also you need the translation layer even for write-once since flash generally doesn't come 100% defect free. Not sure if manufacturers could try to get it there, but that'd probably drive the cost up massively. And the translation layer is there for rewritable flash anyway... the cost/benefit tradeoff is in favor of just living with a few bugged cells.)

I suspect that hawki was assuming that a WORM SSD would be based on a different non-flash storage medium. I don't know any write once media that has similar read/write access times to an SSD.

FWIW, there are WORM microsd cards available but it looks like they still use flash under the hood.

I don't know enough specifics, so I didn't assume anything :) In fact I was not aware of non-flash SSDs.

Because of the Internet age there probably is not much place for write-once media anyway, even if it would be somewhat cheaper. But maybe for specialized applications, or if it would be much, much cheaper per GB.

> Making them write-once doesn't increase the capacity

It could theoretically make them cheaper. But I guess that there wouldn't be enough demand, so you'd be better off having some kind of OS enforced limitation on it.

I find this a super interesting question. I always assumed that long term stability of electronic non-volatile memory is worse than that of magnetic memory. When I think about it, I can't think of any compelling reason why that should be the case. Trapped electrons vs magnetic regions; I have no intuition which one of them is likely to be more stable.

There is a question on Super User about this topic [1] with many answers but no definitive conclusion. There seem to be some papers touching on the subject, but at a glance I couldn't find anything useful in them.

[1] https://superuser.com/questions/4307/what-lasts-longer-data-...

According to https://www.ni.com/en-no/support/documentation/supplemental/... (Seems kinda reputable at least)

"The level of charge in each cell must be kept within certain thresholds to maintain data integrity. Unfortunately, charge leaks from flash cells over time, and if too much charge is lost then the data stored will also be lost.

During normal operation, the flash drive firmware routinely refreshes the cells to restore lost charge. However, when the flash is not powered the state of charge will naturally degrade with time. The rate of charge loss, and sensitivity of the flash to that loss, is impacted by the flash structure, amount of flash wear (number of P/E cycles performed on the cell), and the storage temperature. Flash Cell Endurance specifications usually assume a minimum data retention duration of 12 months at the end of drive life."

> During normal operation, the flash drive firmware routinely refreshes the cells to restore lost charge. However, when the flash is not powered the state of charge will naturally degrade with time.

You have to be careful how you interpret this bit. "Normal operation" here assumes not just that the SSD is powered, but that it is actively used to perform IO. Writes to the SSD will eventually cause data to be refreshed as a consequence of wear leveling; if you write 1TB per month to a 1TB drive then every (in-use) cell will be refreshed approximately monthly, and data degradation won't be a problem.
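To put rough numbers on that turnover argument, here is a back-of-the-envelope sketch. It assumes wear leveling spreads writes perfectly evenly over the in-use cells, which real firmware only approximates; `turnover_months` is a made-up helper name and the 0.02 TB/month workload is a hypothetical figure.

```python
def turnover_months(in_use_tb: float, written_tb_per_month: float) -> float:
    """Rough months until wear leveling has rewritten every in-use cell once."""
    if written_tb_per_month <= 0:
        raise ValueError("no writes means no turnover from wear leveling")
    return in_use_tb / written_tb_per_month

# The 1 TB per month on a 1 TB drive case from the comment above.
print(turnover_months(1.0, 1.0))
# A hypothetical low-write workload: turnover stretches to years.
print(turnover_months(1.0, 0.02))
```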

If you have an extremely low-write workload, the natural turnover due to wear leveling won't keep the data particularly fresh and you'll be dependent on the SSD re-writing data when it notices (correctable) read errors, which means data that is never accessed could degrade without being caught. But in this scenario, you're writing so little to the drive that the flash stays more or less new, and should have quite long data retention even without refreshing stored data.

> When I think about it, I can't think of any compelling reason why that should be the case. Trapped electrons vs magnetic regions; I have no intuition which one of them is likely to be more stable.

My layman intuition (which could be totally wrong) is that trapped electrons have a natural tendency to escape due to pure thermal jitter. Whereas magnetic materials tend to stick together, so there's at least that. Don't know how much of this matches the actual electron physics/technology though...

Hmm I don't think this is conclusive. Thermal jitter makes magnetic boundaries change too, and of course you have to add to it that it is more susceptible to magnetic interference.

I don't have intuition either, but I don't think this explanation is sufficient

> Trapped electrons vs magnetic regions;

From the physics point of view, aren't both cases the same thing?

Aren't magnetic regions a state of the electric field? So if I move electrons in and out, the electric field should be changing as well.

No. A region of a piece of material is magnetized in a certain direction when its (ionized) atoms are mostly oriented in that direction; the presence of a constant magnetic field is (roughly speaking) only a consequence of that.

So flash memory is about the electrons, while magnetic memory is about the ions.

Modern multi-bit-per-cell flash has quite terrible data retention. It is especially low if it is stored in a warm place. You'd be lucky to see ten years without an occasional re-read + error-correct + re-write operation going on.
Any SSD you go through the trouble of building a max capacity disk image for, then dd'ing onto the disk before removing?

I mean... This is general purpose HW here. Write once SSD is a workflow more than an economically tenable use-case in terms of making massive size write once then burn the write circuit devices.

I don't think anyone would make literally write-once drives with flash memory; that's more optical disk territory. But zoned SSDs and host-managed SMR hard drives make explicit the distinction between writes and larger-scale erase operations, while still allowing random-access reads.
That would be magnetic tapes.
Append-only garbage-collected storage was used in data centers even when hard disks were (and are) popular, because it's more reliable and scalable.
inspired by that last sentence, the analogy could be rewritten as:

  - lines on page
  - pages of paper
  - whole notebooks
and might be easier for people to grok than the earlier houses/paint analogy.
I don’t know, I like the drama of copying a neighborhood and tearing down the old one xD
Reminds me of https://xkcd.com/1737/.

"When a datacenter catches fire, we just rope it off and rebuild one town over."

Speaking of xkcd, 2021 is the return of "All your bases". See alt-text on image.

https://xkcd.com/286/

I think the explanation is sound maybe (I am not that familiar) but the analogy gets a bit lost when you talk about buckets of houses and buckets of vacant lots.

Maybe there is a better analogy or paradigm to view this through.

I should have been a little more clear -- think of the urban planner managing the house building / copying and neighborhood destruction as the realtime controller. The rules are:

  1) You can build a house kinda quickly.
  2) You can't modify a house once it is built.
  3) You can only build a house on a vacant lot.
  4) You can change the "mailing address" (relative to the physical location) of the house.
  5) You can only knock down whole blocks of houses at once (not one at a time).
  6) Each time you flatten a block, more crap accumulates in that block, until after a while you can't build there anymore.
  7) The flatten / rebuild step may be quite slow (because you have lots of houses to build).
  8) You can lie and say you built a house before it is finished, if you don't have too many houses to build (if you've got an SSD with a capacitor / battery, or a tiny cache and reserved area for that cache).
  9) You've lied to the user, and you actually have 5-100% more build-able area than you've advertised.
  10) You have a finite area, so eventually the dead space accumulates to the point where you can no longer safely build.

So -- you keep track of vacant lots and "dead" houses (abandoned but not flattened); whenever you've got spare time you will copy blocks with some ratio of "live" to abandoned houses to new lots so the new block only has live houses.

These pending / anticipatory compaction / garbage collection operations are what I refer to as "buckets". Having to compact 300 blocks (neighborhoods) to achieve 300 writes is going to result in glacial performance because of the huge write amplification: behind the scenes, the drive is duplicating hundreds of MB or even GB of data to write a small amount of user modifications.
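For a feel of the magnitude, a quick worked calculation of that worst case. The page size and pages-per-block figures are assumptions picked for illustration, not numbers from any particular drive, and `write_amplification` is a hypothetical helper.

```python
PAGE = 4096                 # assumed page size in bytes
PAGES_PER_BLOCK = 256       # assumed erase-block size in pages

def write_amplification(host_pages: int, relocated_pages: int) -> float:
    """Flash pages actually programmed divided by pages the host asked for."""
    return (host_pages + relocated_pages) / host_pages

# Worst case from the comment: each of 300 small writes forces the drive
# to compact a block in which every other page is still live.
host = 300
relocated = 300 * (PAGES_PER_BLOCK - 1)
wa = write_amplification(host, relocated)

MIB = 2 ** 20
print(f"write amplification: {wa:.0f}x")
print(f"host data:     {host * PAGE / MIB:.1f} MiB")
print(f"flash traffic: {(host + relocated) * PAGE / MIB:.1f} MiB")
```

With these assumed sizes, roughly 1 MiB of user data turns into about 300 MiB of flash traffic, which is the "glacial" case described above.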

As you might imagine, there are lots of strategies for approaching this problem, some of which give you an SSD with extremely unpredictable performance when full, while others give a much more consistent but "slower" performance.

Spoiler alert - This is the plot to ‘The Prestige’.
It's true and untrue depending on how you look at it. Flash memory only supports changing/"writing" bits in one direction, generally from 1 to 0. Erase, as a separate operation, clears entire sectors back to 1, but is more costly than a write. (Erase block size depends on the technology but we're talking MB on modern flash AFAIK, stuff from 2010 already had 128kB.)
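A minimal sketch of those two operations, assuming the common convention that an erased byte reads as 0xFF. The function names are made up for illustration; this models the bit semantics only, not any real chip interface.

```python
ERASED = 0xFF  # an erased flash byte reads as all ones

def program(cell: int, value: int) -> int:
    """Programming can only clear bits (1 -> 0), so the result is a bitwise AND."""
    return cell & value

def erase_block(block: bytearray) -> None:
    """Erase works on the whole block and resets every byte to all ones."""
    for i in range(len(block)):
        block[i] = ERASED

block = bytearray([ERASED] * 4)
block[0] = program(block[0], 0b10100101)   # fresh cell: the value lands intact
block[0] = program(block[0], 0b11110000)   # a second write can only clear more bits
print(bin(block[0]))                       # 0b10100000, not 0b11110000
erase_block(block)                         # the only way to get the ones back
```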

So, the drives do indeed never "overwrite" data - they mark the block as unused (either when the OS uses TRIM, or when it writes new data [for which it picks an empty block elsewhere]), and put it in a queue to be erased whenever there's time (and energy and heat budget) to do so.

Understanding this is also quite important because it can have performance implications, particularly on consumer/low-end devices. Those don't have a whole lot of spare space to work with, so if the entire device is "in use", write performance can take a serious hit when it becomes limited by erase speed.

[Add.: reference for block sizes: https://www.micron.com/support/~/media/74C3F8B1250D4935898DB... - note the PDF creation date on that is 2002(!) and it compares 16kB against 128kB size.]

> Understanding this is also quite important because it can have performance implications

Security implications too. The storage device cannot be trusted to securely delete data.

If you write whole drive capacity of random data, you should be fine.
No. Say a particular model of SSD has over-provisioning of 10%, then even after writing the "whole" capacity of the drive, you can still be left with up to 10% of data recoverable from the Flash chips.
Right, so one better write 2x or 10x drive capacity of random data to it.
You should be running flash with self-encryption (and make sure you have a drive that implements that correctly).

To zap a drive you ask it to securely drop the self-encryption key. The data will still be there, but without the key it is indistinguishable from random noise.

For some family photos? Probably. For sensitive material or crypto keys? Absolutely not, due to overprovisioning as mentioned (which can be way higher than 10% for enterprise drives), but also due to controllers potentially lying to you, especially when drives have things like pSLC caches, etc.
By any reasonable definition they do overwrite data. It's just that they can't overwrite less than a block of data.
If a logical overwrite only involved bits going from 1 to 0, are any drives smart enough to recognize this and do it as an actual overwrite instead of a copy and erase?
On embedded devices, yes, this is actually used in file systems like JFFS2. But in these cases the flash chip is just dumb storage and the translation layer is implemented on the main CPU in software. So there's no "drive" really.
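The check itself is cheap. A sketch of how software managing raw flash could decide whether an update fits in place; `can_program_in_place` is a hypothetical helper, not a JFFS2 function.

```python
def can_program_in_place(old: bytes, new: bytes) -> bool:
    """True if going from old to new never needs a bit to flip from 0 to 1."""
    return len(old) == len(new) and all((o & n) == n for o, n in zip(old, new))

print(can_program_in_place(b"\xff\xf0", b"\x0f\x30"))   # True: only clears bits
print(can_program_in_place(b"\x0f", b"\xf0"))           # False: needs an erase first
```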

On NVMe/PC type applications with a controller driving the flash chips… I have absolutely no idea. I'm curious too, if anyone knows :)

I do know. Apparently you downvoted my sibling response to you as too simplistic, but I was clearly responding to someone where the embedded bare drive situation is irrelevant.

When it comes to what non bare flash drives do, you can start here: http://www.vldb.org/pvldb/vol13/p519-kakaraparthy.pdf

This paper is imperfect and the following citations are worth skimming. There's a cohort of similar papers chasing the same basic question in recent years that aren't densely cited amongst each other.

Go here next: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46... but note that's just a jumping off point to the more recent papers.

It's hard to gain a full understanding of this layer because it's the basis of intense competition, hence held closely by controller manufacturers.

I'm far from world expert on this, but have read a lot about it and can answer with what I know to the best of my ability.

> Apparently you downvoted my sibling response to you as too simplistic,

I didn't downvote your sibling response, but I did ignore it since it provided neither any sources nor any context for why I should trust your knowledge. Apparently others were less kind on your short statement.

With the additional information in this post, I'm much more willing to accept it into my head — thanks for answering this!

Yeah sorry that was unnecessarily grouchy of me.
Generally no, because the unit of write is a page.
Flash has a flash translation layer (FTL). It translates logical block addresses (LBA) into physical addresses ("PHY").

Flash can write blocks at a granularity similar to a memory page (cells, around 4-16 KB). It can erase only sets of blocks, at a much larger granularity (around 512-ish cell sized blocks).

The FTL will try to find free pages to write your data to. In the background, it will also try to move data around to generate unused erase blocks and then erase them.
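A toy model of that behaviour, to make the moving parts concrete. Everything here is an assumption for illustration: the class and method names are invented, there is exactly one spare block, victims are picked greedily, and survivors are buffered in RAM and written back into the freshly erased block. Real controller firmware is proprietary and far more involved.

```python
import random

class TinyFTL:
    """Toy FTL: out-of-place page writes, greedy garbage collection, one spare block."""

    def __init__(self, num_blocks=8, pages_per_block=8):
        self.ppb = pages_per_block
        self.capacity = (num_blocks - 1) * pages_per_block   # LBAs exposed to the host
        # owner[b][p]: LBA whose live data sits in that page, or None (erased or stale)
        self.owner = [[None] * pages_per_block for _ in range(num_blocks)]
        self.used = [0] * num_blocks       # pages programmed since the last erase
        self.l2p = {}                      # logical page -> (block, page)
        self.free_blocks = list(range(1, num_blocks))
        self.active = 0                    # block currently receiving writes
        self.host_writes = self.flash_writes = self.erases = 0

    def write(self, lba):
        if not 0 <= lba < self.capacity:
            raise ValueError("LBA out of range")
        self.host_writes += 1
        self.trim(lba)                     # the old copy goes stale; nothing is overwritten
        self._program(lba)

    def trim(self, lba):
        """Forget an LBA so garbage collection no longer has to preserve it."""
        if lba in self.l2p:
            b, p = self.l2p.pop(lba)
            self.owner[b][p] = None

    def _program(self, lba):
        if self.used[self.active] == self.ppb:          # active block is full
            if self.free_blocks:
                self.active = self.free_blocks.pop()
            else:
                self._collect()                         # leaves room in self.active
        b, p = self.active, self.used[self.active]
        self.owner[b][p] = lba
        self.used[b] += 1
        self.l2p[lba] = (b, p)
        self.flash_writes += 1

    def _collect(self):
        # Greedy victim: the block with the fewest live pages.
        victim = min(range(len(self.owner)),
                     key=lambda b: sum(o is not None for o in self.owner[b]))
        live = [o for o in self.owner[victim] if o is not None]
        self.owner[victim] = [None] * self.ppb          # erase the whole block at once
        self.used[victim] = 0
        self.erases += 1
        self.active = victim
        for lba in live:                                # copy the survivors back
            self._program(lba)

def run(pattern):
    ftl = TinyFTL()
    for lba in pattern(ftl.capacity):
        ftl.write(lba)
    return ftl.flash_writes / ftl.host_writes

rng = random.Random(0)
sequential = lambda cap: [i % cap for i in range(20 * cap)]
scattered = lambda cap: [rng.randrange(cap) for _ in range(20 * cap)]
print("sequential write amplification:", run(sequential))
print("random write amplification:    ", run(scattered))
```

In this toy, overwriting the LBA range in order always leaves some block completely stale, so garbage collection never has to copy anything. Scattered page-sized overwrites leave live pages sprinkled through every block, and the ratio of flash writes to host writes climbs accordingly.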

In flash, seeks are essentially free. That means that it no longer matters if blocks are adjacent. Also, because of the FTL, adjacent LBAs are not necessarily adjacent on the physical layer. And even if you do not rewrite a block, it may be that the garbage collection moves data around at the PHY layer in order to generate completely empty erase blocks.

The net effect is that positioning as seen from the OS no longer matters at all from the OS layer, and that the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS level operations cannot control what happens physically at the flash layer.

TRIM is a "blatant layering violation" in the Linus sense: It tells the disk "hardware" what the OS thinks it no longer needs. TRIM'ed blocks can be given up and will not be kept when the garbage collector tries to free up an erase page.

> In flash, seeks are essentially free. That means that it no longer matters if blocks are adjacent.

> The net effect is that positioning as seen from the OS no longer matters at all from the OS layer, and that the OS layer has zero control over adjacency and erase at the PHY layer. Rewriting, defragging, or other OS level operations cannot control what happens physically at the flash layer.

I don't agree with this. The "OS visible position" is relevant, because it influences what can realistically be written together (multiple larger IOs targeting consecutive LBAs in close time proximity). And writing data in larger chunks is very important for good performance, particularly in sustained write workloads. And sequential IO (in contrast to small random IOs) does influence how the FTL will lay out the data to some degree.

Disagree, because my understanding is that your OS-visible positions have zero relevance to what will actually be translated to PHYs.

If you feed your NVMe a stream of 1GB writes spread out at completely randomised OS visible places (LBAs), the FTL may very well write it sequentially and you get the solid sustained write performance.

Conversely, you may try to write 1GB of sequential LBAs, and your FTL may very well spread it out all across the physical blocks simply because that's what’s available.

What I'm saying is that sequential read and write workloads are good, but whether the OS considers them sequential or not in terms of LBAs is irrelevant. The controller ignores LBAs and abstracts everything away.

My understanding could be wrong, so please correct me if I am.

That may sometimes be true the first time you write the random data (but in my experience it's often not true even then, and only if you carefully TRIMed the whole filesystem and it was mostly empty). But on later random writes it's rarely true, unless your randomness pattern is exactly the same as in the first run. To make room, the FTL will (often in the background) need to read the non-written parts of the erase-block-sized chunks assigned in the previous runs, just to be able to write out the new random writes. At some point new writes need to wait for this, slowing things down.

Whereas with larger/sequential writes, there's commonly no need for read-modify-write cycles. The entire previous erase block sized chunks can just be marked as reusable with new content - the old data isn't relevant anymore.

This is pretty easy to see by just running benchmarks with sustained sequential and random write IO. But on some devices it'll take a bit - initially the writes are all in a faster area (e.g. using SLC flash instead of denser/cheaper mlc/tlc/qlc).

Of course, if all the random writes are >= erase block size, with a consistent alignment to multiples of the write size, then you're not going to see this - it's essentially sequential enough.

Thanks for this part, I feel like this was a crucial piece of information I was missing. It also explains my observations about TRIM not being as important as people claim it is; the firmware on modern flash storage seems more than capable of handling this without OS intervention.
The GC in the device cleans up.

TRIM is useful, it gives the GC important information.

TRIM is not that important as long as the device is not full (less than 80%, generally speaking, but it is very easy to produce pathological cases that are way off in either direction). Once the device fills up above that it is crucial.

The author clearly explains how this works in the sentence immediately following. "Instead it has internally a thing called flash translation layer (FTL)" ...
I unfortunately skimmed over this, isotopp's explanation helped clear things up in my head.
I just saw his post, it's a great explanation.

It might also help to keep in mind that both regular disk drives and solid state drives remap bad sectors. Both types of disks maintain an unaddressable storage area which is used to transparently cover for faulty sectors.

In a hard drive, faulty sectors are mapped during production and stored in the p-list, and are remapped to sectors in this extra hidden area. Sectors that fail at runtime are recorded in the g-list and are likewise remapped.
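Sketched as code, following the description above rather than any vendor's firmware. The class, its method names, and the sector counts are all invented for illustration.

```python
class RemappingDisk:
    """Toy model of transparent bad-sector remapping via a hidden spare area."""

    def __init__(self, visible_sectors, spare_sectors, factory_defects=()):
        # Spare sectors live past the end of the addressable range.
        self.spares = list(range(visible_sectors, visible_sectors + spare_sectors))
        self.p_list = {}   # defects mapped during production
        self.g_list = {}   # defects that appeared at runtime
        for sector in factory_defects:
            self.p_list[sector] = self.spares.pop(0)

    def mark_failed(self, sector):
        """Record a runtime failure; raises IndexError once spares run out."""
        self.g_list[sector] = self.spares.pop(0)

    def physical(self, sector):
        """Where a request for this sector really goes."""
        return self.g_list.get(sector, self.p_list.get(sector, sector))

disk = RemappingDisk(visible_sectors=100, spare_sectors=4, factory_defects=[7])
print(disk.physical(7))    # 100: remapped at the factory
print(disk.physical(8))    # 8: still in its original place
disk.mark_failed(8)
print(disk.physical(8))    # 101: silently moved to the spare area
```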

Writes may usually go to the same place in a hard drive, but it's not guaranteed there either.

This is not true anymore for many recent SMR HDDs. They have a translation layer, just like flash storage.

This is because for SMR HDDs, each block can either be SMR (higher density, EXTREMELY SLOW WRITES, like <10 MB/s possible; erases will remove multiple blocks, just like flash memory), or normal (standard density, normal write speeds).

The controller abstracts this away and does writes as normal, but while the drive is idle, the controller converts these standard blocks into SMR blocks in the background.

This is also why SMR HDDs support TRIM.

Thanks for the info, that makes a lot of sense. It looks like this tech has emerged in the time since I last did much work with disk drives.

Seems it's increasingly a bad idea to presume the implementation of internals.

Perhaps they mean it must erase an entire block before writing any data, unlike a disk that can write a single sector at a time?
The issue is that DDR4 is like that too. Not only the 64 byte cache line, but DDR4 requires a transfer to the sense amplifiers (aka a RAS, row access strobe) before you can read or write.

The RAS command eradicates the entire row, like 1024 bytes or so. This is because the DDR4 cells only have enough charge for one reliable read; after that the capacitors don't have enough electrons to know if a 0 or 1 was stored.

A row close command returns the data from the sense amps back to the capacitors. Refresh commands renew the 0 or 1 as the capacitor can only hold the data for a few milliseconds.
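As a toy model of the row open / row close cycle described above. `DramBank` and its methods are invented names, the 1024-byte row is the rough figure from the comment, and refresh and timing are left out entirely.

```python
class DramBank:
    """Toy DRAM bank: reading a row drains it, so it must be written back."""

    def __init__(self, rows=4, row_bytes=1024):
        self.cells = [bytearray(row_bytes) for _ in range(rows)]
        self.open_row = None
        self.sense_amps = None

    def activate(self, row):
        """RAS: move a whole row into the sense amplifiers (destructive read)."""
        if self.open_row is not None:
            self.precharge()
        self.sense_amps = self.cells[row]
        self.cells[row] = None             # capacitors no longer hold usable data
        self.open_row = row

    def read(self, col, n=64):
        """CAS read: served from the sense amplifiers, not from the cells."""
        return bytes(self.sense_amps[col:col + n])

    def write(self, col, data):
        self.sense_amps[col:col + len(data)] = data

    def precharge(self):
        """Row close: write the sense amplifier contents back into the cells."""
        self.cells[self.open_row] = self.sense_amps
        self.open_row = None
        self.sense_amps = None

bank = DramBank()
bank.activate(1)
bank.write(0, b"hello")
bank.activate(2)            # implicitly closes row 1 first
bank.activate(1)
print(bank.read(0, 5))      # b'hello'
```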

------

The CAS latency statistic assumes that the row was already open. It's a measure of the sense amplifiers and not of the actual data.

It's vaguely similar, but there's a huge difference in that flash needs to be erased before you can write it again, and that operation is much slower and only possible on much larger sizes. DDR4 doesn't care, you can always write, just the read is destructive and needs to be followed by a write.

I think this makes the comparison unhelpful since the characteristics are still very different.

The difference is that on DDR you have infinite write endurance and you can do the whole thing in parallel.

If flash was the same way, and it could rewrite an entire erase block with no consequences, then you could ignore erase blocks. But it's nowhere near that level, so the performance impact is very large.

That's a good point.

There are only 10,000 erase cycles per Flash cell. So a lot of algorithms are about minimizing those erases.
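One of the simplest such strategies, as a sketch: always hand out the free block with the fewest erases. The class name is made up, and the 10,000-cycle budget is just the figure from the comment above; actual endurance depends heavily on the flash type.

```python
import heapq

class WearLeveler:
    """Toy allocator that always hands out the least-erased free block."""

    def __init__(self, num_blocks, max_erases=10_000):
        self.max_erases = max_erases
        self.erase_count = [0] * num_blocks
        self.free = [(0, b) for b in range(num_blocks)]   # (erase count, block)
        heapq.heapify(self.free)

    def allocate(self):
        _, block = heapq.heappop(self.free)
        return block

    def erase_and_release(self, block):
        self.erase_count[block] += 1
        if self.erase_count[block] < self.max_erases:
            heapq.heappush(self.free, (self.erase_count[block], block))
        # otherwise the block is worn out and is never handed out again

wl = WearLeveler(num_blocks=4)
for _ in range(1000):
    wl.erase_and_release(wl.allocate())
print(wl.erase_count)   # the erases end up spread evenly across the blocks
```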

What does DDR have to do with NVMe?
You can't write a byte, or a word, either.

The "fact" that you can do it in your program without disturbing bytes around it is a convenient fiction that the hardware fabricates for you.

DDR4 is effectively a block device and not 'random access'.

Pretty much only cache is RAM proper these days (aka: all locations have equal access time... that is, you can access it randomly with little performance loss).

I’m confused. What’s the difference between a cache line and a row in RAM? They’re both multiples of bytes. You have data sharing per chunk in either case.

The distinction seems to be how big the chunk is not uniformity of access time (is a symmetrical read disk not a block device?)

Hard disk chunks are 512 bytes classically, and smaller than the DDR4 row of 1024 bytes !!

So yes. DDR4 has surprising similarities to a 512byte sector hard drive (modern hard drives have 4k blocks)

>> What’s the difference between a cache line and a row in RAM?

Well DDR4 doesn't have a cache line. It has a burst length of 8, so the smallest data transfer is 64 bytes. This happens to coincide with L1 cache lines.

The row is 1024 bytes long. It's pretty much the L1 cache on the other side, so to speak. When your CPU talks to DDR4, it needs to load a row (RAS all 1024 bytes) before it can CAS read a 64 byte burst length 8 chunk.
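The arithmetic behind those numbers, assuming a standard 64-bit (non-ECC) DDR4 channel and the 1024-byte row size quoted in the thread; real parts vary in row size.

```python
BUS_WIDTH_BITS = 64     # assumed data width of a standard DDR4 channel
BURST_LENGTH = 8        # transfers per burst
ROW_BYTES = 1024        # row size quoted in the thread

burst_bytes = BUS_WIDTH_BITS // 8 * BURST_LENGTH
print(burst_bytes)                  # 64, the smallest transfer
print(ROW_BYTES // burst_bytes)     # 16 bursts to stream out a whole open row
```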

-----------

DDR4, hard drives, and Flash are all block devices.

The main issue for Flash technologies is that the erase size is even larger than the read/write block size. That's why we TRIM for NVMe devices.

Of course it does [0]. It's just that it assigns writes as evenly as possible (to have as even wear as possible), so a log-like internal "file system" is the way to go.

[0] https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf