Hacker News new | ask | show | jobs
by AlisdairO 4142 days ago
A block might be 8KB, but the actual update you're making to the block might be much smaller. I imagine by '128 byte write' he's talking about a lot of random row updates, where each row is 128 bytes. Now, if you're not too unlucky, many updates will be combined on the same page per checkpoint, but that's not a given. On the other hand, it's reasonably likely that several updates will be combined per erase block per checkpoint. A heavily indexed table can exhibit some pretty random write patterns, though.

Additionally, the WAL has to be synced to disk every commit (unlike a web server log file), and WAL records can be very small. WAL is of course append-only, so you'd hope that a good SSD with a battery/cap backup would cache the writes and flush on the SSD erase block filling up.

1 comments

A block might be 8KB, but the actual update you're making to the block might be much smaller.

All major databases will only deal in the 8KB increments (or whatever their block size is, whether larger or smaller, but never as small as originally claimed) though. They don't write less. Indeed, it's worth noting that most (every single major one) database systems actually write to a sequential transaction log (which they do not have to checkpoint every n-bytes), and only on commit do they actually then make a strategy for changing those source pages and extents, unrolling it from the log and checkpointing it, which by default includes coalescing and combining strategies. The idea that databases are randomly writing small bits of data all over the place is simply wrong, but is the entire foundation of almost every comment in this thread.

https://technet.microsoft.com/en-us/library/aa337560%28v=sql...

As one example. Oracle, pgsql, mysql, and others do the exact same thing.

They aren't randomly changing an int here and a bool there.

I worked on a financial system where we wrote just absurd amounts of data a day. We ran it on a FusionIO SLC device (with a mirror backup), and churned the data essentially around the clock. After three years the FusionIO little lifespan hadn't even moved.

tldr; people grossly overestimate the "magical" nature of databases.

I'm really not sure you read my post very thoroughly. I'm fairly intimately acquainted with the internals of database systems, and you don't seem to be replying to what I actually wrote - I wasn't attacking your point of view (I generally agree that a decently designed DB is unlikely to trash an SSD all that quickly), I was just hoping to shed light on the other poster's wording.

If you have an 8kb block and you change 128 bytes of it, the 'actual update' is much smaller than 8kb. Sure, you're reading/writing 8kb to disk, but everything outside of that 128 bytes is basically fat for the purposes of that change. As I said in my previous post, that can absolutely be mitigated by writes being combined through the checkpointing process, and one would hope that a decent SSD could cope easily with combining writes to an appending log.

A database can still be writing data all over the place. A heavily indexed table can cause quite varied write patterns, which can result in a lot of different pages getting touched. Fortunately, the reality is that well-designed DBs and SSDs are fairly capable of dealing with this.

My original comment on this whole discussion was that few databases write more than 100GB a day. I am not talking about whether you inserted n integers or updated so many varchar columns -- when you actually monitor its IO, it is extremely unlikely that your database exceeds 100GB a day of writes, and in all likelihood is a magnitude or two below this. Whenever anyone waves their hands and talks about databases as if they somehow imply massive use, they're just fearmongering -- actual empirical stats are your friend, and actual empirical stats show that most real-world databases barely register on the lifespan of most SSDs.

So now that we're in an understanding that we're talking about database writes to IO, the other matter is how it writes it. I've built a lot of systems on a lot of databases, and the write amplification has generally been very low. I've been building and running significant databases on SSDs for about 7 years now, and while everyone else is finally starting to realize that they're wasting their time if they aren't, we still see the sort of extreme niche fearmongering that makes other people clutch onto their magnetic storage (and I heard it the entire time. "OMG but don't databases kill flash???!?!?". No, certain volumes of writes and types of writes do. Only metering will tell you if that applies). Yes, some people do very odd things that can kill storage, but that is extremely rare. It almost certainly doesn't apply to the overwhelming percentage of HN readers.