Hi David
Post by David Mathog
Post by Eugen Leitl
http://labs.google.com/papers/disk_failures.pdf
serial and parallel ATA consumer-grade hard disk drives,
ranging in speed from 5400 to 7200 rpm
Not quite clear what they meant by "consumer-grade", but I'm assuming
that it's the cheapest disk in that manufacturer's line. I don't
typically buy those kinds of disks, as they have only a 1-year
warranty; rather, I purchase those with 5-year warranties, even
for workstations.
Seagates.
Post by David Mathog
So I'm not too sure how useful their data is. I think everyone here
Quite useful IMO. I know it would be PC, but I (and many others) would
like to see a clustering of the data, specifically to see whether any
hyperplanes separate the disks by vendor, model, interface, etc. CERN
had a study up about this, which I had read and linked to, but it now
seems to be gone, and I did not download a copy for myself.
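As an aside, the kind of vendor/model separation I mean can be sketched with even the simplest linear classifier. Everything below is hypothetical: the feature names and data points are invented for illustration, not taken from the Google or CERN data.

```python
# A minimal perceptron sketch: given per-disk feature vectors, look for a
# hyperplane w.x + b = 0 separating two vendors. The data here is invented
# purely to illustrate the idea; a real analysis would use per-disk fields
# (age, reallocation counts, scan errors, ...) from the actual study data.

def train_perceptron(samples, labels, epochs=100, lr=0.1):
    """Return (w, b) such that sign(w.x + b) matches labels (+1/-1),
    assuming the classes are linearly separable."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Invented toy data: (power-on years, reallocated sectors / 100)
vendor_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]   # label +1
vendor_b = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]   # label -1
samples = vendor_a + vendor_b
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(samples, labels)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x in samples]
```

If no such (w, b) exists because the classes overlap, the perceptron never converges, and that would itself be a useful negative result: no clean vendor hyperplane.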
Post by David Mathog
would have agreed without the study that a disk reallocating blocks and
throwing scan errors is on the way out. Quite surprising about the
"Tic tic tic whirrrrrrr" scares the heck out of me now :(
Post by David Mathog
lack of a temperature correlation though. At the very least I would
have expected increased temps to lead to faster loss of bearing
lubricant. That tends to manifest as a disk that spun for 3 years
not being able to restart after being off for half an hour.
Presumably you've all seen that. If they have great power and systems
management at their data centers the systems may not have been
down long enough for this to be observed.
With enough disks, their sampling should be reasonably good, albeit
biased towards their preferred vendor(s) and model(s). Would like to
see that data. CERN compared SCSI, IDE, SATA, and FC. They found (as I
remember, quoting from a document I no longer can find online) that
there really weren't any significant reliability differences between them.
I would like to see this sort of analysis here, to see whether the real
data (not the estimated MTBFs) shows a signal. I am guessing that we
could build a pragmatic, time-dependent MTBF based on the time rate of
change of the AFR. I think the Google paper was basically saying that
they wanted to do something like this using the SMART data, but found
that it was insufficient by itself to render a meaningful predictive
model. That is, in and of itself, quite interesting. If you could read
back a reasonable set of parameters from a machine and estimate the
likelihood of it going south, that would be quite nice (or annoying) for
admins everywhere.
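To make the AFR-to-MTBF relationship concrete, here is a small sketch of the standard conversion, assuming exponentially distributed failures (the constant-rate model). The function names and the numbers are mine, not from either paper.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365

def empirical_afr(failures, disk_years):
    """Pragmatic annualized failure rate straight from field counts:
    failures observed per disk-year of operation."""
    return failures / disk_years

def afr_to_mtbf_hours(afr):
    """Implied MTBF under a constant failure rate, where
    AFR = 1 - exp(-HOURS_PER_YEAR / MTBF)."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)

# e.g. 3 failures over 100 disk-years gives AFR = 0.03, for an implied
# MTBF of roughly 287,600 hours -- well below the ~1M-hour figures that
# datasheets tend to quote.
afr = empirical_afr(3, 100)
mtbf = afr_to_mtbf_hours(afr)
```

A time-dependent version would simply recompute the empirical AFR over a sliding window of disk-years and watch how the implied MTBF drifts as the fleet ages.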
Also good in terms of tightening down real support costs and the value
of warranties, default and extended.
Post by David Mathog
Regards,
David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: ***@scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615