The growth in hard drive space is called Kryder's Law [0] (like Moore's law). There's a paper from 2012 on the cost of long-term storage [1], and here's a quote from the researcher's blog [2]:
Here is a graph I got from Dave Anderson [director of strategic planning at Seagate] years ago. It shows that what looks like a smooth Kryder's Law curve is actually the superposition of a series of S-curves, one for each successive technology generation. Naturally, because the easy transitions get done first, the cost of each successive transition increases, perhaps even exponentially. Since margins are constrained and so, these days, are volumes, to generate a return on the investment in each transition requires that the technology be kept in the market longer. The longer interval between transitions translates to a lower Kryder rate.
FWIW, I worked in Kryder's lab at Hammerschlag Hall as an undergraduate in the 90s: he's one of the best people I've met in the sciences. The E.O. Lawrence of hard drive technology.
and in turn, demand for storage is slowed by lack of need for storage.
The real question should be: Where's my 4K video? My 10-bit-channel images? My lossless audio?
Yes they are here but not nearly as common as content delivered in ancient formats from over a decade ago. For example go on a wallpapers subreddit, the vast majority is still in 8-bit 1920x1080 JPEG. Video is still mostly 1080. The leading music services still deliver in 2-channel lossy formats. And this is 2016.
Most of this I suppose is because of the general slowness of the internet and the usage caps in many areas of the world.
So maybe: improve internet service -> create more detailed content and let people save it -> people will demand more storage to hold it all.
>Yes they are here but not nearly as common as content delivered in ancient formats from over a decade ago. For example go on a wallpapers subreddit, the vast majority is still in 8-bit 1920x1080 JPEG. Video is still mostly 1080. The leading music services still deliver in 2-channel lossy formats. And this is 2016.
Ever experienced YouTube buffering on an average connection? And that's for the metropolitan US -- consider the developing world, or even rural US, the mobile internet caps, etc.
Not to mention that even if you have enough speed, still you don't get much benefit from 4K video for 99% of stuff out there. Diminishing returns. I should know, I have projected good 1080p video to movie theaters, and nobody would call it bad or inadequate. On an average 15-27" monitor? It's not even an issue...
The rest of the world doesn't matter. If there is sufficient demand for 4K video and faster connection speeds, there will eventually be a product for them.
For a lot of things, that extra resolution is not really any benefit. Particularly for content that wasn't ever recorded at that level of fidelity anyway - I don't need 4k versions of Seinfeld re-runs, or SpongeBob SquarePants for the kiddos. Most of that stuff doesn't even need to be 720p.
Besides, pirated video content was really the only thing that most normal people could fill up their hard drives with (okay, maybe GoPro people or people with huge Steam libraries, too), but Netflix, YouTube, and Amazon Prime have taken a huge chunk out of that.
Also people who shoot raw photos. They're like 35MB each, an order of magnitude bigger than jpegs. They can eat up gigabytes (though to be fair, not terabytes) pretty fast. I have about 525GB of photos and videos that I've taken over years. Something like 300GB is jpegs. If I'd exclusively shot raw files I would need 3-4TB to store everything.
And yet, have you seen the 4k content out there? Breaking Bad was awesome, but is that really better in 4k than, say BSG or Planet Earth?
I bought a 4k TV to use as a monitor several months back and started researching content - Most of it is eyebrow-raising, to say the least.
I remember when 720/1080 were first introduced, there was an expected dearth of content, but at least what was out there was worth high-def (flyovers of tourist destinations, PBS nature documentaries, etc).
For example - Here's some of Netflix's offerings in 4k:
Series:
House of Cards
Marco Polo
Breaking Bad
The Blacklist
Movies and Documentaries
Smurfs 2
Philadelphia
Jerry Maguire
Crouching Tiger, Hidden Dragon
Oceans (documentary)
Forests (documentary)
Flowers (documentary)
Yes that's an important point. There are diminishing returns with better resolutions. Twice the resolution doesn't mean it appears twice as good, and at higher resolutions it can even be hard to tell. At some point you exceed the resolution of the human eye and ear.
But I think storage space still matters on mobile. I haven't come close to filling up my laptop after 4 years of use, but I have to be careful about putting stuff on my iPod. Still it can store dozens of hours of audio, so it's not the biggest issue.
Sure, perhaps there isn't strong consumer demand. But with analysis and value-extraction from big data becoming easier, and machine learning becoming more accessible, I think there will be plenty of commercial and industrial demand which can drive up need.
>and in turn, demand for storage is slowed by lack of need for storage.
Are you talking about consumer storage? Because need for enterprise storage (which I sell), is only increasing. What is decreasing is the revenue/profit per TB (naturally).
Once content producers make more higher-quality content, and enough consumers start consuming it, then won't the content providers need the extra storage too to serve that content from?
Note in the early 1990s you could read hard drive content pretty easily with a scanning probe microscope, in fact the bits were lozenge shaped and if you looked close you might find the edges of bits that had been written before and weren't perfectly aligned. (i.e. scientists had tools in their labs that were better at reading a hard drive than the read head)
Back then it was possible you could smash the plates and somebody could reassemble some of the data.
Then by 2005 or so the density of the data was high enough that the scanning probe microscope wasn't much better than the read heads, and at that point extreme methods of data extraction got much much harder.
> To date I have found no example of any instance in which digital data recorded on a hard disk drive and subsequently overwritten was recovered from such a drive since 1985, when about 15% of the overwritten data was claimed to have been recovered from an modified frequency modulation (MFM) disk drive.
It cites "Overwriting Hard Drive Data: The Great Wiping Controversy" at http://www.vidarholen.net/~vidar/overwriting_hard_drive_data... which gives a best case example of a pristine hard drive, written once and then wiped once, and where you know beforehand where the data is located. Even then nearly all of the data had disappeared. If the drive was not pristine, it was not possible to recover the data. Quoting from it (emphasis mine):
> The purpose of this paper was a categorical settlement to the controversy surrounding the misconceptions involving the belief that data can be recovered following a wipe procedure. This study has demonstrated that correctly wiped data cannot reasonably be retrieved even if it is of a small size or found only over small parts of the hard drive. Not even with the use of a MFM or other known methods. The belief that a tool can be developed to retrieve gigabytes or terabytes of information from a wiped drive is in error.
> Although there is a good chance of recovery for any individual bit from a drive, the chances of recovery of any amount of data from a drive using an electron microscope are negligible. Even speculating on the possible recovery of an old drive, there is no likelihood that any data would be recoverable from the drive. The forensic recovery of data using electron microscopy is infeasible. This was true both on old drives and has become more difficult over time. Further, there is a need for the data to have been written and then wiped on a raw unused drive for there to be any hope of any level of recovery even at the bit level, which does not reflect real situations. It is unlikely that a recovered drive will have not been used for a period of time and the interaction of defragmentation, file copies and general use that overwrites data areas negates any chance of data recovery. The fallacy that data can be forensically recovered using an electron microscope or related means needs to be put to rest.
Thanks. Yes, I should have included the "that had been written before" in my quote.
I didn't look too into the question of how to recover the contents of a hard disk with microscopy because I figured it would be possible, but expensive. Looking now, I quickly found a MS thesis at http://escholarship.org/uc/item/26g4p84b which recovered data from a disk using MFM. While the performance was poor, the author attributes that to the experimental setup.
Massive hard drives are only useful for archival purposes, under this argument: if your hard drive is being used for live queries, you want to access all the data on it. Even if you have all the pieces in place to stream 1 GB/s off of the drive, a 1 PB drive would still take 1 million seconds = ~11.5 days to read.
So in practice in production it's more useful to have smaller hard drives in more places to work on the data in parallel. And in the truly archival cases there are other concerns (like redundancy) that mean there isn't as much demand for a single massive drive.
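The arithmetic behind that estimate can be checked with a minimal sketch. It assumes decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and a perfectly sustained sequential read, which no real drive delivers; the function name is made up for illustration.

```python
# Time to read an entire drive once at a sustained sequential rate.
# Assumptions: decimal units, no seeks, no protocol overhead.
PB = 10**15
GB = 10**9
SECONDS_PER_DAY = 86_400

def full_read_time_days(capacity_bytes, throughput_bytes_per_s):
    """Days needed to stream every byte of the drive exactly once."""
    seconds = capacity_bytes / throughput_bytes_per_s
    return seconds / SECONDS_PER_DAY

days = full_read_time_days(1 * PB, 1 * GB)
print(f"1 PB at 1 GB/s: {days:.1f} days")

# Splitting the same data across N smaller drives read in parallel
# divides the wall-clock time by N (ignoring coordination overhead).
print(f"Across 100 drives: {days / 100 * 24:.1f} hours")
```

The second print is the "smaller drives in more places" point: the total bytes read are the same, but the wall-clock time shrinks with the number of spindles.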
> Massive hard drives are only useful for archival purposes, under this argument: if your hard drive is being used for live queries, you want to access all the data on it. Even if you have all the pieces in place to stream 1 GB/s off of the drive, a 1 PB drive would still take 1 million seconds = ~11.5 days to read.
Your premise is that anything you don't use on a regular basis belongs in archival storage, presumably in some kind of central archive.
Suppose there is 1PB of static data of which you access a different 50GB every day. You can call it archival if you like but it's still going to save you 50GB/day of network traffic to have a local copy.
So for example, Netflix could make a box that came loaded with all their content and new content is added using IP multicast or P2P during off-peak hours. The peak hours bandwidth savings would be immense and you would be completely immune to crappy or unreliable network connections.
I've toyed with the idea that if you buy a 1PB hard disk it will come pre-loaded with, say, a copy of archive.org and search indices, or a large collection of movies. Also, if everyone had the same reference data set then (handwaving!) it could make for some fantastic compression methods.
I don't think the economics works out, since at 1G/s it probably takes too long to load the data, and as this essay points out most people will stream what they want on demand. I also doubt there will be a standard content set which is around long enough to assure that my imagined on-the-fly compression model-building-by-corpus-reference will take root.
> (handwaving!) it could make for some fantastic compression methods.
You could have indexes of hashes and store any chunks of anything in a big flat address space. You wouldn't even need to know what you have. Just a massive amount of archived chunks of storage. (OK, that is more than hand-waving, maybe arm-waving?).
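To make the "index of hashes over a flat space of chunks" idea a bit more concrete, here is a toy sketch of a content-addressed chunk store. Everything in it (the `ChunkStore` class, the fixed chunk size, the in-memory dict) is a hypothetical illustration, not a description of any real system.

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: each chunk is keyed by its SHA-256 digest."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes (the "big flat address space")

    def put(self, data):
        """Split data into fixed-size chunks and return the list of digests."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Identical chunks are stored only once, whatever file they came from.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a list of digests."""
        return b"".join(self.chunks[digest] for digest in recipe)

store = ChunkStore(chunk_size=4)
recipe = store.put(b"abcdabcdxyz")
print(len(recipe), "chunk references,", len(store.chunks), "unique chunks stored")
```

If everyone shipped with the same reference chunks, a file could in principle be transmitted as its list of digests plus whatever chunks the receiver lacks. Real systems tend to use content-defined rather than fixed-size chunking, which this sketch does not attempt.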
The problem is that new content is being generated at an increasing rate.
There was an estimate in 2011 that total storage of everything everywhere by everyone was more than 250 Exabytes, increasing by around 25% annually.
There's going to be a lot of duplication in that, and a lot of it won't be public. So as a ballpark guess a complete collection of public-only sources - including all available commercial content of all possible kinds ever recorded, academic papers, Wikis, news sites, forums, and such - is going to need 25-50 Exabytes, with maybe 25% compound of new content every year.
So you could get the entire Internet delivered by a truck or two, but you probably wouldn't have anywhere to put it.
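For a sense of how quickly those figures move, a small compound-growth sketch. It takes the 2011 estimate of 250 EB and the 25% annual rate quoted above at face value and assumes the rate stays constant, which is a simplification.

```python
def projected_exabytes(base_eb, annual_growth, years):
    """Compound growth: base * (1 + rate) ** years."""
    return base_eb * (1 + annual_growth) ** years

BASE_YEAR = 2011
BASE_EB = 250        # estimate quoted in the comment above
GROWTH = 0.25        # 25% per year, assumed constant

for year in (2011, 2016, 2021):
    eb = projected_exabytes(BASE_EB, GROWTH, year - BASE_YEAR)
    print(f"{year}: ~{eb:,.0f} EB")
```

At 25% a year the total roughly triples every five years, so any pre-loaded snapshot would fall behind fast.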
I seem to recall reading some worries that current sizes are already pushing the limits of RAID setups. It takes so long to rebuild a drive over current interfaces that you risk another drive in the array failing during the process, which makes recovery impossible.
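A rough way to put numbers on that worry, as a sketch only: rebuild time is about capacity divided by sustained rebuild rate, and the chance of another failure in that window can be approximated by assuming independent failures at a constant rate. The 6 TB, 100 MB/s, 7 surviving drives, and 3% annual failure rate below are illustrative assumptions, not measurements.

```python
import math

HOURS_PER_YEAR = 365 * 24

def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Hours to rewrite a whole drive at a sustained rebuild rate (decimal units)."""
    return capacity_tb * 1e12 / (rebuild_mb_per_s * 1e6) / 3600

def p_another_failure(surviving_drives, annual_failure_rate, window_hours):
    """Probability that at least one surviving drive fails during the window.

    Assumes independent drives with a constant failure rate, which is
    optimistic: drives from the same batch tend to fail together.
    """
    rate_per_hour = annual_failure_rate / HOURS_PER_YEAR
    return 1 - math.exp(-surviving_drives * rate_per_hour * window_hours)

hours = rebuild_hours(capacity_tb=6, rebuild_mb_per_s=100)
risk = p_another_failure(surviving_drives=7, annual_failure_rate=0.03, window_hours=hours)
print(f"rebuild window: {hours:.1f} h, chance of another failure: {risk:.4%}")
```

The window grows linearly with capacity while interface speeds stay roughly flat, so the risk term grows with every capacity bump. Rebuilds under production load are usually much slower than the idealized rate used here.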
... which is why many big storage systems keep 3 replicas of each chunk of data stored across different servers. A failed device's chunks can be rebuilt in parallel across the entire cluster.
Or, if you are capacity-constrained, you can use cross-server erasure coding. And even call it cross-server RAID, if you like.
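The capacity trade-off between the two approaches is easy to sketch. The 10+4 erasure-coding layout below is a hypothetical example chosen for illustration; real systems pick their own parameters.

```python
def replication_overhead(replicas):
    """Raw bytes stored per byte of user data with N full replicas."""
    return float(replicas)

def erasure_overhead(data_shards, parity_shards):
    """Raw bytes stored per byte of user data with k data + m parity shards."""
    return (data_shards + parity_shards) / data_shards

usable_pb = 10  # hypothetical amount of user data
print(f"3x replication: {usable_pb * replication_overhead(3):.0f} PB raw, survives 2 lost copies")
print(f"10+4 erasure:   {usable_pb * erasure_overhead(10, 4):.0f} PB raw, survives 4 lost shards")
```

Erasure coding buys the lower overhead at the cost of more network traffic and computation when reconstructing a lost shard, which is why replication remains common where rebuild speed matters most.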
It's not just a worry. I've interacted with users at multiple sites who have actually lost data because the rebuild windows are so long that they had three effectively-simultaneous failures within a RAID-6 set.
This is an important point. The Google white paper does address it, though only in the abstract: They want to maximize both capacity and IO bandwidth. One can imagine large disks with multiple independently actuated head assemblies. But quite possibly you're right that it's not worth the bother.
The interesting thing about Google's storage infrastructure was that teams were optimizing IOPS per drive and talking to thousands of them over a gigabit link. I had an interesting conversation with Sean about that one day, asking him: if he got 100 IOPS per drive, and had 10,000 drives, and a gigabit ethernet port, how much of the data on the drives could be part of any service being provided over that gigabit link? Plot that against RPS (generic 'requests per second').
In my case I was trying to get him to sign off on powering down some of the drives that could not be reached, to save power. But even with the data staring him in the face he could not go there. Network bandwidth gets better, and that exposes more data to the pipeline, but if you want < 500 ms request response you have to balance the system.
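The imbalance in that thought experiment can be sketched numerically. The 100 IOPS per drive, 10,000 drives, and gigabit link come from the comment; the 4 KiB transfer per I/O is my assumption, and protocol overhead is ignored.

```python
def link_limited_iops(link_bits_per_s, io_size_bytes):
    """Maximum I/Os per second the network link can carry at a given transfer size."""
    return link_bits_per_s / 8 / io_size_bytes

def drives_to_saturate(link_bits_per_s, io_size_bytes, iops_per_drive):
    """Number of drives whose combined IOPS already fill the link."""
    return link_limited_iops(link_bits_per_s, io_size_bytes) / iops_per_drive

GIGABIT = 1e9
IO_SIZE = 4096          # bytes per I/O; assumed, not from the comment
IOPS_PER_DRIVE = 100
TOTAL_DRIVES = 10_000

busy = drives_to_saturate(GIGABIT, IO_SIZE, IOPS_PER_DRIVE)
print(f"link carries ~{link_limited_iops(GIGABIT, IO_SIZE):,.0f} IOPS")
print(f"~{busy:,.0f} of {TOTAL_DRIVES:,} drives ({busy / TOTAL_DRIVES:.1%}) saturate it")
```

Under these assumptions only a few percent of the drives can be doing useful work for the link at any moment, which is the case for powering the rest down.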
W3 Total Cache generally works fine. I use it with memcached and PHP's new built-in caching module (replacement for APC) and get my site blitzed from time to time due to major mentions in the media with few problems. SSD on Digital Ocean probably helps with disk caching too.
I use Apache but falcolas' point of nginx (often in combination with Varnish and potentially hhvm; I've used this combo before with great success) is worth considering as well.
No replacement for good configuration of your database (try MySQLtuner.pl if you use MySQL/MariaDB) of course.
Excellent post. Great information. I have a question about SSDs, though. Google has published their information about hard drive and SSD survival in their data centers. It can be viewed here: http://www.datacenterdynamics.com/servers-storage/googles-ss... So my question is how we mere mortals can deal with all the maintenance. It may well be that spinning hard drives are best for us.
I'm not sure spinning hard drives are more reliable outside of data centers - at least not for laptops because you are carrying them around (anecdotally, I recently had an employee break a hard drive in a laptop after dropping it).
It probably doesn't really matter as much as your intuition might suggest anyway because each drive can and will fail. And they fail at similar rates (as opposed to an exponential difference of a factor of 10x or more). So you need to take similar precautions for each type of drive.
For personal use - I use exclusively SSDs because they are much faster. Then I put all the information I don't want to lose in Dropbox.
For servers, all data that is important goes in a database cluster (Cassandra) with a replication factor of 3. Those drives are backed up daily offsite. For data that cannot be lost at all (even a day's worth), I also copy each record to Amazon S3 every time it is changed. - I'm sure there are many other ways to tackle this problem.
"Then I put all the information I don't want to lose in Dropbox." I don't think it's healthy to consider cloud storage a backup. For example, I am pretty sure you could lose everything to ransomware.
I know Dropbox sells Extended Version History [1] as an add-on, but it'd be awfully nice if Pro users had maybe 60 days of file history, versus 30 days for free. Just a thought.
You assume HDs will die. Keep 3 copies of all data, at least 1 far away, and diff regularly. Whether a HD lasts 2 years or 10 on average doesn't really change the best practice for keeping your data.
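The "diff regularly" step is the part people tend to skip, so here is a minimal sketch of one way to do it: hash every file in two copies and report what is missing or different. The function names are made up for this example, and it reads every byte, so it is slow on large archives.

```python
import hashlib
from pathlib import Path

def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB blocks so large files do not need to fit in memory.
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests

def diff_copies(primary, backup):
    """Compare two directory trees by content hash."""
    a, b = tree_digests(primary), tree_digests(backup)
    return {
        "missing_in_backup": sorted(a.keys() - b.keys()),
        "extra_in_backup": sorted(b.keys() - a.keys()),
        "content_differs": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Calling `diff_copies("/data", "/mnt/backup/data")` (both paths hypothetical) returns three lists. A non-empty `content_differs` on files you have not edited is the signal that one copy is silently rotting.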
Just to elaborate a bit, it's pretty easy to keep three copies of stuff you care about. First, you have the working copy, on the machine itself. A big backup disk is pretty straightforward. I use Time Machine, which isn't super reliable, but it's very very easy.
Now, when it comes down to it, do you really need to backup the OS? or your installed software? if the machine and the backups fail, you're going to be reinstalling anyway (probably). So, for the third copy I rely on 3rd parties. Different people have different needs; you might want to do something fancy in house.
It pretty much boils down to finding a service for your stuff. I have a couple of private GitHub repos. Photos on iCloud and whatever Alphabet is calling Picasa these days. 20 gigs of music to Amazon or Alphabet (or both). Administrative stuff, like taxes, I just email to myself. It's probably smarter to keep that in Dropbox or something along those lines.
The key point is, there are the things you make or capture that are irreplaceable; save those in lots of places. There's a bunch of other crap on your computer to make it useful. That stuff is trivial to reinstall. Well, ok, it might cost you a day or two to redownload and reconfigure emacs just so - but with a little planning you can put that config in git, so it's easy to restore or set up on a new machine.
It's almost better to think in terms of: if I had to upgrade tomorrow, what would I need to copy over? That's the stuff to be really fussy about.
> Now, when it comes down to it, do you really need to backup the OS? or your installed software? if the machine and the backups fail, you're going to be reinstalling anyway (probably)
I'd counter this with what I do with my laptop. The OS is considerably smaller than the data I actually care about (<20GB), it takes almost no time to back up, and so it leaves me with a very quick way to restore the system to a known good state in the event of some kind of failure. I don't do constant backups of it, but maybe once a month I'll update the backup I have of the OS.
Yeah, I was trying to point out you don't really need 3 copies of everything, and really 1 is enough for some stuff that's easy to replace. But the stuff that matters, you should have lots of copies of that. 2 backup disks is another way to go, just swap them, say, weekly. Photos from a once in a lifetime trip? Make a bunch of copies, local and remote.
He is wrong about 6TB being the biggest hard drive in the market. Now that SSDs have become the standard they are taking up the curve. Samsung announced a 16TB SSD last year. I don't expect people to devote as much effort to making spinning disks bigger.
> As the pace of magnetic disk development slackens, an alternative storage medium is coming on strong. Flash memory, a semiconductor technology, has recently surpassed magnetic disk in areal density; Micron Technologies reports a laboratory demonstration of 2.7 terabits per square inch. And Samsung has announced a flash-based solid-state drive (SSD) with 15 terabytes of capacity, larger than any mechanical disk drive now on the market. SSDs are still much more expensive than mechanical disks—by a factor of 5 or 10—but they offer higher speed and lower power consumption. They also offer the virtue of total silence, which I find truly golden.
> Is this notion of merging memory and storage an attractive prospect or a nightmare? I’m not sure. There are some huge potential problems. For safety and sanity we generally want to limit which programs can alter which documents. Those rules are enforced by the file system, and they would have to be re-engineered to work in the memory-mapped environment.
This was done back in the 80s in http://www.cis.upenn.edu/~KeyKOS/ . A favorite demo reportedly was to pull the plug on a running computer then start up again. They took the need to redesign security as an opportunity to make it better.
First done as far as I can remember with Multics in the 1960s. The people who worked on IBM's canceled successor to the System/360 used it for System/38 in 1979, with capabilities, a feature which was dropped for the successor AS/400/iSeries/System i.
I'm not sure his idea about "merging memory and storage" really makes sense. He says that he wants load instructions to be able to hit the disk in order to avoid "calls to input/output routines in the operating system." But you can't avoid the input/output routines --- he's effectively saying that we should hardcode our filesystems into a single machine instruction and let the processor figure it out. If anything, we're moving farther from this model, since VMs give us virtual address spaces inside virtual address spaces.
I don't think this is just a security issue; it really breaks all of the assumptions that we like to make in modern programming languages.
> saying that we should hardcode our filesystems into a single machine instruction and let the processor figure it out
I think he was rather saying that the OS could do it: persistent virtual memory as the primary abstraction. In Unix, files and processes are different kinds of things; in KeyKOS there were only processes; RAM was effectively a cache. As Unix directories have links to files, KeyKOS processes could be given capabilities to invoke other processes (passing capabilities and data as arguments). The different security model makes this analogy misleading, but you can see how you could emulate a filesystem.
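On a conventional OS, the closest everyday approximation to "storage addressed like memory" is a memory-mapped file. The sketch below is only an analogy, not how KeyKOS worked: it keeps an 8-byte counter in a file and updates it with plain loads and stores into the mapping. The function name and file layout are made up for illustration.

```python
import mmap
import os
import struct

def bump_persistent_counter(path):
    """Increment an 8-byte little-endian counter that lives in a file.

    The file is mapped into the address space, so the read and the write
    are ordinary memory accesses rather than explicit read()/write() calls.
    """
    if not os.path.exists(path) or os.path.getsize(path) < 8:
        with open(path, "wb") as f:
            f.write(b"\x00" * 8)  # an empty file cannot be mapped
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 8) as mem:
            (value,) = struct.unpack_from("<Q", mem, 0)   # "load"
            struct.pack_into("<Q", mem, 0, value + 1)     # "store"
            mem.flush()  # ask the OS to write the dirty page back
            return value + 1
```

Calling it repeatedly on the same path returns 1, 2, 3, and so on across process restarts. Note that the OS is still doing I/O underneath via page faults and writeback, and file permissions still gate who can map the file, which is roughly the parent's objection and the reply's answer at the same time.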
Could the reality be that storage needs for the average user have plateaued for a combination of reasons (network/internet speed limitations, the rise of cloud computing, the rise of streamed movie subscriptions such as Netflix), thus muting the financial incentive to develop new technology that keeps drive capacity growing exponentially? I think once we have a reason to grow our storage again, we could see storage technology pick up again.
One way you could fill up a hard drive orders of magnitude larger than the ones we have now is with fully immersive recorded experiences, of which the recent 'holoportation' demo from Microsoft gave an early example. Look at how much space regular ultra-HD video consumes, then imagine further increasing the resolution and recording from not just one viewpoint, but the entire 3D scene.
This is similar to how a cellphone camera of today can shoot video in a single minute that would completely fill hard drives from the 90s. (And even a single high-res image from today's digital cameras is larger than the install size of Windows 3.1.) Back then it would have been difficult to imagine these uses for storage.
The reason spinning disks still stick around is not just the price. Often you do not need 200-500MBit/s, and ~100+ is enough. I have my OS on a (small) SSD and my spinning disks filled with my documents, music, movies. None of this needs high speed, so it would be wasted money to buy an SSD for it. I'd rather buy twice the storage and do RAID1 (which I did).
It's a mystery to me that hybrid SSD+HD drives aren't more ubiquitous. I can guess what data is going to be read off the drive more frequently, but the computer can collect statistics and make a way more accurate prediction than I can.
I don't think statistics will make better predictions than I can. For example, consider a moderately large file (eg. 100MB), which is only ever read occasionally, by sequential low speed streaming. It seems reasonable to place it on the HD, but it's music, and I want it on the SSD so I can have the HD powered down for lower background noise when I'm listening to it. And I could have another almost identical file which is a podcast instead, and that should go on the HD because I don't care about minimum background noise when listening to podcasts, so even looking at the file type won't help. The correct place for a file depends on what you intend to do with it, not any measurable property.
I'd assert that it could make a better prediction for the majority of the files stored on your drive. For example, even if it was technically feasible to do with separate drives, do you know which files in the global assembly cache should go on SSD versus HD? Which pieces of your registry hive files? Which bits of your browser cache?
You say browser cache always goes on the SSD (I do that too, whenever there's no configuration to put it into RAM, which I prefer). Aren't you concerned about wear? If you stream high resolution video files from, say, youtube or twitch, you write gigabyte after gigabyte into that SSD.
I suppose the demand for high-capacity consumer disks will depend on future network quality. With low latency and high speed, users won't need big local disks; services like cloud storage and streaming will be enough. Then again, we will always find ways to use higher-capacity disks, e.g. high-def VR content, or biochemistry data of our bodies for personal healthcare/fitness.
I often think people are underestimating the demand for storage. Let's pick an example: if Apple were to offer 30GB of free iCloud space to every iOS customer, 650 million customers would equal roughly 20EB of data. That is 40 Dropboxes! And you will need multiple copies to safeguard it.
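A quick check of that figure, as a sketch. The 650 million users and 30 GB each are the comment's numbers; the three copies are my assumption for what "multiple copies" might mean, and the estimate assumes every user fills their quota.

```python
def total_exabytes(users, gb_per_user, copies=1):
    """Raw storage in exabytes (decimal units: 1 EB = 10**18 bytes)."""
    return users * gb_per_user * 1e9 * copies / 1e18

USERS = 650_000_000
GB_PER_USER = 30

single = total_exabytes(USERS, GB_PER_USER)
tripled = total_exabytes(USERS, GB_PER_USER, copies=3)  # copies=3 is an assumption
print(f"one copy: {single:.1f} EB, three copies: {tripled:.1f} EB")
```

In practice most users would leave much of a free quota empty, so this is an upper bound on the demand from such an offer.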
You could also ask what ended the kink. The timing looks like it matches up with the Thai floods, which killed price decreases for years afterwards. The experience curve means you need bulk to achieve improvements, and the floods shattered a key part of the supply chain.
The main speed boost back then was innovations landing in usable drives. We are currently in an innovation phase with a whole new space to explore. Once the winners of that innovation get picked, refining happens, cost savings are found, etc., I think we'll probably see some return to the expectations.
That being said, because of the slashdotting, I can't see how bad the graph is.
The concept that a petabyte is a lot of storage is very strange.
At 50 GB per hour for a cinema-screen-quality setup (most houses in the next 20 years...), a petabyte would be 20,000 hours of entertainment. Meh, I might want access to that in a lifetime.
Also remember that we might be heading towards an environment where we record everything at all times.
Certainly currently I'm buying a hard disk every year as quality goes up and it's easier than throwing stuff out.
Why did he write '\(2\frac{1}{2}\) or \(3\frac{1}{2}\)' (which presumably uses JavaScript to be rendered), when he could have just written '2½ or 3½'? It'd have taken 8 bytes to store instead of 36 — and that's not even counting the size of the JavaScript!
As long as folks keep on reinventing the wheel, only bigger, hard drives are going to have to keep increasing in size.
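The byte counts are easy to verify. One wrinkle: the plain version is 8 characters, but ½ (U+00BD) takes two bytes in UTF-8, so it is 8 bytes only in a single-byte encoding such as Latin-1.

```python
# The TeX markup versus the plain-text version, as quoted above.
tex = r"\(2\frac{1}{2}\) or \(3\frac{1}{2}\)"
plain = "2\u00bd or 3\u00bd"  # "2½ or 3½", written with escapes for portability

print(len(tex.encode("utf-8")))      # 36
print(len(plain))                    # 8 characters
print(len(plain.encode("latin-1")))  # 8 bytes
print(len(plain.encode("utf-8")))    # 10 bytes
```

Either way the plain version is less than a third of the size, before counting the rendering script.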
If you already have the maths rendering script in place for more complicated uses, and you don't already know a quick way to insert the ½ character, then it's just the path of least resistance.
I was wondering what you're talking about but then realized that the US keyboard layout does not have the ½ symbol right there in the keyboard like we have in Europe.
I know. I'm from Finland. Look, I'm not arguing that everyone has the ½ in their keyboards. On the contrary, most people don't. It's just funny that some people might think ½ being hard to type and some people think it's easy and it's really dependent on their keyboard layouts. :)
Besides laziness, perhaps there's a slight semantic justification: I imagine there exist at least a few mathematical solvers that can ingest data in TeX form and work with it, but probably fewer that understand how to ingest Unicode codepoints.
http://4.bp.blogspot.com/-bkuDDrBpcZE/TpMsLTEspsI/AAAAAAAAA9...
0. https://en.wikipedia.org/wiki/Mark_Kryder
1. http://www.lockss.org/locksswp/wp-content/uploads/2012/09/un...
2. http://blog.dshr.org/2012/10/storage-will-be-lot-less-free-t...