Highly Reliable Systems: Removable Disk Backup & Recovery


SMART hard drive failure status – “OK (Prefail)” may simply mean “old”

By Darren McBride

Modern SATA hard drives keep internal counters for soft errors, hours run, power cycles, and so on. These counters were originally intended to help predict hard drive failure. The SMART (Self-Monitoring, Analysis and Reporting Technology) data can be accessed through our NetSWAP interface, and may be available through various open source tools for direct-attached drives. Sometimes the SMART status or logs cause concern simply because a drive has been power cycled or run for a certain number of hours. While it’s true that a drive with lots of hours has a higher chance of failure than a newer drive, age alone is not grounds for replacing a drive under warranty. Specifically, I want to document a message seen by some of our customers: “OK (Prefail)” (see nearby screenshot).

One of my favorite articles on this is from Robin Harris. Robin discusses a Google study that provides insight into drive failures. I highly recommend the article, which you can find here. Robin points out that SMART isn’t very smart, “as Google found, and many in the industry already knew. SMART (Self-Monitoring, Analysis, and Reporting Technology) captures drive error data to predict failure far enough in advance so you can back up. Yet SMART focuses on mechanical failures, while a good deal of a disk drive is electronic, so SMART misses many sudden drive failure modes, like power component failure. The Google team found that 36% of the failed drives did not exhibit a single SMART-monitored failure. They concluded that SMART data is almost useless for predicting the failure of a single drive.”

So sometimes SMART fails to predict a drive failure, and sometimes it gives an alarming message like “OK (Prefail)” when the drive is simply old and has been powered on for a certain number of hours. To me, it’s a bit like having a big red warning light in a car that says “100,000 miles Prefail”, as if to tell the driver “This car has 100K miles, so it may be more prone to failure – consider buying a new one”. It would be better if the red light were telling us to perform some sort of routine maintenance. Unfortunately, there is no maintenance to perform, and as far as I’m aware, this SMART data can’t be reset by an end user (in my car analogy, the red warning light can’t be turned off).

Using the NetSWAP “Smart Info” button we can drill into the counters of the drive, as shown in the nearby screenshot. The counters are somewhat cryptic (to learn more I recommend this article), but what we can easily see is that for this 750GB drive, the reason for the “warning” is primarily that the hard drive is considered old and has had a certain number of soft errors. Yet the drive continues to work and does not cause problems. Such a drive can’t be replaced under warranty, extended warranty, or dynamic support contract, because the SMART counters are simply saying the drive is old, not that there are specific prefailure indications. When in doubt, we recommend you run a diagnostic on the hard drive that verifies that data can be written and read back with no errors. If the drive fails the diagnostic, it’s time to replace it so that your backups stay reliable.
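To make the “write it, then read it back” idea concrete, here is a minimal Python sketch. It is not the diagnostic we ship and not a substitute for a drive vendor’s test tool; the function name `write_read_verify`, the block size, and the block count are all assumptions chosen for illustration. It writes random blocks to a scratch file on the drive’s mounted filesystem, reads them back, and reports any block whose checksum no longer matches.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 1024 * 1024  # 1 MiB per block (arbitrary choice for this sketch)


def write_read_verify(directory, blocks=4, block_size=BLOCK_SIZE):
    """Write random blocks to a scratch file in `directory`, read them back,
    and return the indexes of any blocks whose contents changed.

    Hypothetical helper for illustration only.
    """
    digests = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="verify_", suffix=".bin")
    try:
        # Write phase: remember a SHA-256 digest of every block written.
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                data = os.urandom(block_size)
                digests.append(hashlib.sha256(data).hexdigest())
                f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device

        # Read phase: recompute each digest and compare with what was written.
        bad = []
        with open(path, "rb") as f:
            for index, expected in enumerate(digests):
                data = f.read(block_size)
                if hashlib.sha256(data).hexdigest() != expected:
                    bad.append(index)
        return bad
    finally:
        os.remove(path)  # always clean up the scratch file


if __name__ == "__main__":
    # Point this at a directory on the drive under test; the system temp
    # directory is used here only so the example runs anywhere.
    mismatches = write_read_verify(tempfile.gettempdir())
    print("mismatched blocks:", mismatches)
```

An empty list means every block read back intact. Keep in mind that the operating system may serve the reads from its cache rather than from the platters, and a few megabytes cover only a tiny part of the disk, so a clean result from a sketch like this is weaker evidence than a full-surface diagnostic.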

Darren McBride

About Darren McBride

CEO, Highly Reliable Systems, Inc.
