Hacker News
by cpg 3762 days ago
I just love how Google datacenters are somehow "the real world". Nice, cool and controlled temperatures, batch ordering from vendors knowing they are shipping to Google, stable/repeatable environments, not much mention about the I/O load, etc.

And then there's

> Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number. ...

> Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher

meaningless, it is.

The real world has wild temperature ranges, wilder temperature _changes_, mechanical variations well above and beyond datacenter use, and possibly wild loads (e.g. viruses, antiviruses, all sorts of updates, etc. etc.).

It is easier to do stats this way, though.


>I just love how Google datacenters are somehow "the real world"

They totally are though, especially to the HN crowd where a lot of us may be putting hardware in data centers.

I agree that this isn't going to give us a clear picture of what to expect out of an SSD in, say, a netbook or something. On the other hand, data from a million SSDs reported by one company in a controlled environment is a hell of a control group if you want to go test factors like temperature, etc.

Google is quite famous for running their data centers hotter than others.

* The title makes it obvious, but 95°F: http://www.geek.com/chips/googles-most-efficient-data-center...

* Increased DNS request failures (likely due to said heat) and bad routing caused internal iGoogle services to request this guy's stuff: https://www.youtube.com/watch?v=aT7mnSstKGs

To save others the bother: 95°F = 35°C. Warm!
For the sake of anecdote, during an Australian summer my drives can get to 50°C, and they usually sit around 30°C to 40°C for the rest of the year.

For fun, my GPU hovers around 50-70°C, occasionally hitting 80°C. The CPU sits around 50°C, and the rest of the machine is a mystery!

Thanks to some utterly awful cooling, my laptop CPU idles at about 70°C, peaking in the 80s. I don't know if that's within spec for an i7 or just dumb luck, but it's still running just fine.

  > https://www.youtube.com/watch?v=aT7mnSstKGs
Interesting talk.

I did find that <takes a drink> later in his talk he kept <takes a drink> taking a drink every 10 seconds or so <takes a drink>, which ended up being more than a <takes a drink> little <takes a drink> irritating to watch and listen <takes a drink> to.

> I just love how Google datacenters are somehow "the real world". ... It is easier to do stats this way, though.

From the paper's abstract (you did read the abstract, right?):

"... While there is a large body of work based on experiments with individual flash chips in a controlled lab environment under synthetic workloads, there is a dearth of information on their behavior in the field. This paper provides a large-scale field study covering many millions of drive days, ten different drive models, different flash technologies (MLC, eMLC, SLC) over 6 years of production use in Google’s data centers. We study a wide range of reliability characteristics and come to a number of unexpected conclusions. For example, raw bit error rates (RBER) grow at a much slower rate with wear-out than the exponential rate commonly assumed and, more importantly, they are not predictive of uncorrectable errors or other error modes. The widely used metric UBER (uncorrectable bit error rate) is not a meaningful metric, since we see no correlation between the number of reads and the number of uncorrectable errors. We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes. Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors." [0][1]

I guess it's easier to draw incorrect, snarky conclusions based on inaccurate summaries of papers than it is to take a moment to read a paper's abstract to double-check the work of a tech journalist. shrug :(

[0] https://www.usenix.org/conference/fast16/technical-sessions/...

[1] http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f...

> > Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number. ...

> > Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher

> meaningless, it is.

The first sentence is referring to "specs". It's the manufacturer's claims that are useless.

The second point was referring to the actual measured error rates, which are obviously more meaningful.

How are datacenters not "the real world"? They're the largest users of data storage devices!
According to this, http://www.kitguru.net/components/hard-drives/anton-shilov/s... 125 million HDDs shipped in Q1 2015.

Does anyone have any idea how many went into desktops and laptops vs. data centers?

Largest single users of drives, sure. When they have tens of millions of them, I'd trust their reliability studies a lot more than Aunt Mike who has maybe two or three.
> cool and controlled temperatures

Controlled, but likely not cool:

http://www.geek.com/chips/googles-most-efficient-data-center...

Google's data centers are definitely not cool (in the temperature sense). That was one of the big reveals of the original disk paper: since Google runs things hotter, the drives experienced a much hotter environment, but their bit error rates were not hugely affected.
>The real world has wild temperature ranges, wilder temperature _changes_, mechanical variations well above and beyond datacenter use, and possibly wild loads (e.g. viruses, antiviruses, all sorts of updates, etc. etc.).

I'm not sure what you consider "the real world" (SSDs in ToughBooks for research expeditions in the Amazon?), but seeing that HN is a startup/pro IT social site, most of us are interested in running them in data centers...

I see your points, but the reality is that there is no real data on how an SSD behaves and fails outside of torture tests from the manufacturer or review sites.

Data centre data is still good data to help us better understand SSD lifetime and failures.

I think it's very informative, and over time we'll see if this is representative of the normal world. Maybe temperature changes are not that important, maybe they are.

At the same time, they have Chrome OS. Most of those laptops have an SSD, and all are connected to the cloud. How do they perform? When they have a problem, is it recorded by Google? I'm not 100% sure I would want that, but it doesn't seem like such a bad idea for a computer that is already 100% in the cloud.

The Chrome OS numbers would be less useful. A drive failure that totally destroys the ability to report the result back to Google is indistinguishable from a device that simply never turned on again, and over the course of years I'd expect that case to dominate major drive failures.
In terms of the things that are going to make SSDs — or pretty much any other piece of hardware you can imagine — fall over, Google's environment is realer than anything you could possibly conceive.