Hi David
Post by David Mathog
Post by Eugen Leitl
http://labs.google.com/papers/disk_failures.pdf
serial and parallel ATA consumer-grade hard disk drives,
ranging in speed from 5400 to 7200 rpm
Not quite clear what they meant by "consumer-grade", but I'm assuming
that it's the cheapest disk in that manufacturer's line. I don't
typically buy those kinds of disks, as they have only a 1-year
warranty; rather, I purchase those with 5-year warranties, even
for workstations.
Seagates.
Post by David Mathog
So I'm not too sure how useful their data is. I think everyone here
Quite useful IMO. I know it would be PC, but I (and many others) would
like to see a clustering of the data, specifically to see whether any
hyperplanes separate the disks by vendor, model, interface, etc. CERN
had a study up about this, which I had read and linked to, but it now
seems to be gone, and I did not download a copy for myself.
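As an aside, the kind of vendor/model separation I mean can be sketched with even the simplest linear classifier. Everything below is hypothetical: the feature names and data points are invented for illustration, not taken from the Google or CERN data.

```python
# A minimal perceptron sketch: given per-disk feature vectors, look for a
# hyperplane w.x + b = 0 separating two vendors. The data here is invented
# purely to illustrate the idea; a real analysis would use per-disk fields
# (age, reallocation counts, scan errors, ...) from the actual study data.

def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Return (w, b) such that sign(w.x + b) matches labels (+1/-1),
    assuming the classes are linearly separable."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Invented toy data: (power-on years, reallocated sectors / 100)
vendor_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]   # label +1
vendor_b = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]   # label -1
samples = vendor_a + vendor_b
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(samples, labels)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x in samples]
```

If no such (w, b) exists because the classes overlap, the perceptron never converges, and that would itself be a useful negative result: no clean vendor hyperplane.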
Post by David Mathog
would have agreed without the study that a disk reallocating blocks and
throwing scan errors is on the way out. Quite surprising about the
"Tic tic tic whirrrrrrr" scares the heck out of me now :(
Post by David Mathog
lack of a temperature correlation though. At the very least I would
have expected increased temps to lead to faster loss of bearing
lubricant. That tends to manifest as a disk that spun for 3 years
not being able to restart after being off for half an hour.
Presumably you've all seen that. If they have great power and systems
management at their data centers the systems may not have been
down long enough for this to be observed.
With enough disks, their sampling should be reasonably good, albeit
biased towards their preferred vendor(s) and model(s). Would like to
see that data. CERN compared SCSI, IDE, SATA, and FC. They found (as I
remember, quoting from a document I no longer can find online) that
there really weren't any significant reliability differences between them.
I would like to see this sort of analysis here, to see whether the real
data (not the estimated MTBFs) shows a signal. I am guessing that we
could build a pragmatic, time-dependent MTBF based on the time rate of
change of the AFR. I think the Google paper was basically saying that
they wanted to do something like this using the SMART data, but found
that it was insufficient by itself to render a meaningful predictive
model. That is, in and of itself, quite interesting. If you could read
back a reasonable set of parameters from a machine and estimate the
likelihood of it going south, that would be quite nice (or annoying) for
admins everywhere.
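To make the AFR-to-MTBF relationship concrete, here is a small sketch of the standard conversion, assuming exponentially distributed failures (the constant-rate model). The function names and the numbers are mine, not from either paper.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365

def empirical_afr(failures, disk_years):
    """Pragmatic annualized failure rate straight from field counts:
    failures observed per disk-year of operation."""
    return failures / disk_years

def afr_to_mtbf_hours(afr):
    """Implied MTBF under a constant failure rate, where
    AFR = 1 - exp(-HOURS_PER_YEAR / MTBF)."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)

# e.g. 3 failures over 100 disk-years gives AFR = 0.03, for an implied
# MTBF of roughly 287,600 hours -- well below the ~1M-hour figures that
# datasheets tend to quote.
afr = empirical_afr(3, 100)
mtbf = afr_to_mtbf_hours(afr)
```

A time-dependent version would simply recompute the empirical AFR over a sliding window of disk-years and watch how the implied MTBF drifts as the fleet ages.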
Also good in terms of tightening down real support costs and the value
of warranties, default and extended.
Post by David Mathog
Regards,
David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615