Summary of "Failure Trends in a Large Disk Drive Population"

This is a summary of the paper of the above title, published by Google in February 2007. The authors are Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso of Google Inc.

The original paper can be found at: http://labs.google.com/papers/disk_failures.pdf


Summary

- [Disk] failures do not increase when the average temperature
increases. In fact, there is a clear trend showing that *lower*
temperatures are associated with higher failure rates.... What stands
out are the 3- and 4-year-old drives, where the trend for higher
failures with higher temperature is much more constant and also more
pronounced.

- After the first scan error, drives are 39 times more likely to fail
within 60 days than drives without scan errors.

- After their first reallocation, drives are over 14 times more likely
to fail within 60 days than drives without reallocation counts, making
the critical threshold for this parameter also one.

- After the first *offline* reallocation, drives have over 21 times
higher chances of failure within 60 days than drives without offline
reallocations; an effect that is again more drastic than total
reallocations.

- The critical threshold for probational counts is also one: after the
first event, drives are 16 times more likely to fail within 60 days
than drives with zero probational counts.
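
Since the critical threshold for all four signals is one, a monitoring rule built on these findings reduces to "flag any drive with a non-zero count". The following is a minimal sketch of such a rule, not anything taken from the paper. The field names are hypothetical and would have to be mapped to whatever a given SMART tool reports, and the multipliers are simply the figures quoted above (two of them are "over" values, so they are lower bounds). The paper excerpts here do not say how to combine several signals, so the sketch only reports which ones fired.

```python
from dataclasses import dataclass
from typing import List

# Relative likelihood of failure within 60 days after the first event,
# as quoted in the summary above. 14 and 21 are "over" figures.
RISK_MULTIPLIER = {
    "scan_errors": 39,
    "reallocations": 14,
    "offline_reallocations": 21,
    "probational_counts": 16,
}


@dataclass
class SmartCounts:
    # Hypothetical field names; real SMART attribute names vary by tool and vendor.
    scan_errors: int = 0
    reallocations: int = 0
    offline_reallocations: int = 0
    probational_counts: int = 0


def triggered_signals(counts: SmartCounts) -> List[str]:
    """Return the names of the signals that reached the threshold of one."""
    return [name for name in RISK_MULTIPLIER if getattr(counts, name) >= 1]


def is_at_risk(counts: SmartCounts) -> bool:
    """A drive is flagged as soon as any of the four signals is non-zero."""
    return bool(triggered_signals(counts))


if __name__ == "__main__":
    drive = SmartCounts(scan_errors=2, probational_counts=1)
    for name in triggered_signals(drive):
        print(name, "-> about", RISK_MULTIPLIER[name], "x more likely to fail within 60 days")
```

A rule like this flags only drives that show a signal; a drive with all four counts at zero passes silently.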

The Bad News:

- Out of all failed drives, over 56% of them have no count in any of the
four strong SMART signals, namely scan errors, reallocation count,
offline reallocation, and probational count. In other words, models
based only on those signals can never predict more than half of the
failed drives.

- Even when we add all remaining SMART parameters (except temperature)
we still find that over 36% of all failed drives had zero counts on
all variables. [Note that] this population includes seek error rates,
which we have observed to be widespread in our population (> 72% of
our drives have it) which further reduces the sample size of drives
without any errors.

- In our study, we did not find much correlation between failure rate
and either elevated temperature or utilization. This is the most
surprising result of the study.

- Our annualized failure rates were generally higher than those reported
by vendors, and more consistent with other user experience studies.
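
The coverage limit in the first two points above can be made concrete. A predictor that fires only on non-zero counts can, at best, catch the failed drives that actually showed a count. The sketch below computes that upper bound for a list of failed drives. The record layout, the signal names, and the sample data are all made up for illustration; the sample is sized only to mirror the 56% figure and is not data from the paper.

```python
from typing import Dict, Iterable, Sequence

# Hypothetical names for the four strong signals.
STRONG_SIGNALS = (
    "scan_errors",
    "reallocations",
    "offline_reallocations",
    "probational_counts",
)


def recall_upper_bound(
    failed_drives: Iterable[Dict[str, int]],
    signals: Sequence[str] = STRONG_SIGNALS,
) -> float:
    """Fraction of failed drives with at least one non-zero signal.

    No model that uses only these signals can predict a larger share
    of the failures, because the remaining drives look healthy to it.
    """
    drives = list(failed_drives)
    if not drives:
        raise ValueError("need at least one failed drive")
    flagged = sum(
        1 for drive in drives if any(drive.get(name, 0) > 0 for name in signals)
    )
    return flagged / len(drives)


if __name__ == "__main__":
    # Synthetic sample: 56 failed drives with no counts, 44 with one scan error.
    sample = [{} for _ in range(56)] + [{"scan_errors": 1} for _ in range(44)]
    print(recall_upper_bound(sample))
```

With 56% of failed drives showing no count, the bound is 44%, which is why the paper says such models can never predict more than half of the failures.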

I wish I could see their data correlated by vendor/model!