Hacker News new | ask | show | jobs
by engendered 4142 days ago
You do understand that flash erase blocks are around 128KB now, right?

How is that relevant to your wrong claim about database 128-byte writes? You word that as if you're correcting me.

I don't even know what point you're trying to make anymore, but you're trying to buttress your argument with what can best be described as diversions.

Most people write far less than they think. Databases aren't particularly magical, any more than the web server log file that is written to a line at a time. Yes, people "give a shit" about a "mere" 100GB of writes, because the vast majority of real projects, including at major corporations, write far less than that per day. So are we just talking about dick measuring now?

2 comments

A block might be 8KB, but the actual update you're making to the block might be much smaller. I imagine by '128 byte write' he's talking about a lot of random row updates, where each row is 128 bytes. Now, if you're not too unlucky, many updates will be combined on the same page per checkpoint, but that's not a given. On the other hand, it's reasonably likely that several updates will be combined per erase block per checkpoint. A heavily indexed table can exhibit some pretty random write patterns, though.

Additionally, the WAL has to be synced to disk every commit (unlike a web server log file), and WAL records can be very small. WAL is of course append-only, so you'd hope that a good SSD with a battery/cap backup would cache the writes and flush on the SSD erase block filling up.

A block might be 8KB, but the actual update you're making to the block might be much smaller.

All major databases will only deal in the 8KB increments (or whatever their block size is, whether larger or smaller, but never as small as originally claimed) though. They don't write less. Indeed, it's worth noting that most (every single major one) database systems actually write to a sequential transaction log (which they do not have to checkpoint every n-bytes), and only on commit do they actually then make a strategy for changing those source pages and extents, unrolling it from the log and checkpointing it, which by default includes coalescing and combining strategies. The idea that databases are randomly writing small bits of data all over the place is simply wrong, but is the entire foundation of almost every comment in this thread.

https://technet.microsoft.com/en-us/library/aa337560%28v=sql...

As one example. Oracle, pgsql, mysql, and others do the exact same thing.

They aren't randomly changing an int here and a bool there.

I worked on a financial system where we wrote just absurd amounts of data a day. We ran it on a FusionIO SLC device (with a mirror backup), and churned the data essentially around the clock. After three years the FusionIO little lifespan hadn't even moved.

tldr; people grossly overestimate the "magical" nature of databases.

I'm really not sure you read my post very thoroughly. I'm fairly intimately acquainted with the internals of database systems, and you don't seem to be replying to what I actually wrote - I wasn't attacking your point of view (I generally agree that a decently designed DB is unlikely to trash an SSD all that quickly), I was just hoping to shed light on the other poster's wording.

If you have an 8kb block and you change 128 bytes of it, the 'actual update' is much smaller than 8kb. Sure, you're reading/writing 8kb to disk, but everything outside of that 128 bytes is basically fat for the purposes of that change. As I said in my previous post, that can absolutely be mitigated by writes being combined through the checkpointing process, and one would hope that a decent SSD could cope easily with combining writes to an appending log.

A database can still be writing data all over the place. A heavily indexed table can cause quite varied write patterns, which can result in a lot of different pages getting touched. Fortunately, the reality is that well-designed DBs and SSDs are fairly capable of dealing with this.

My original comment on this whole discussion was that few databases write more than 100GB a day. I am not talking about whether you inserted n integers or updated so many varchar columns -- when you actually monitor its IO, it is extremely unlikely that your database exceeds 100GB a day of writes, and in all likelihood is a magnitude or two below this. Whenever anyone waves their hands and talks about databases as if they somehow imply massive use, they're just fearmongering -- actual empirical stats are your friend, and actual empirical stats show that most real-world databases barely register on the lifespan of most SSDs.

So now that we're in an understanding that we're talking about database writes to IO, the other matter is how it writes it. I've built a lot of systems on a lot of databases, and the write amplification has generally been very low. I've been building and running significant databases on SSDs for about 7 years now, and while everyone else is finally starting to realize that they're wasting their time if they aren't, we still see the sort of extreme niche fearmongering that makes other people clutch onto their magnetic storage (and I heard it the entire time. "OMG but don't databases kill flash???!?!?". No, certain volumes of writes and types of writes do. Only metering will tell you if that applies). Yes, some people do very odd things that can kill storage, but that is extremely rare. It almost certainly doesn't apply to the overwhelming percentage of HN readers.

The point is that even writing in 8kB chunks will lead to a lot of write amplification when the erase block size is at least 128kB and the flash page size is 16kB. 8kB writes are definitely less bad than 128B writes, but it's still not enough write combining to pass either of the relevant thresholds.
It's a 64x difference. And even that is grossly understating it because most database systems feature write coalescing.

It's actually "ironic" in that database systems were built to avoid random IO because it was glacially slow on spinning rust storage. They do everything possible to avoid it, and in actual, empirical metrics running databases on flash the write amplification has been very low, and hardly mirrors the claims throughout this thread.