Highly Reliable Systems: Removable Disk Backup & Recovery

Why RAID-5 Stops Working in 2009 – Not Necessarily

By Darren McBride

Could you write and then read an entire 3TB drive five times without an error?

Suppose you were to run a burn-in test on a brand-new Seagate 3TB SATA drive, writing 3TB and then reading it back to confirm the data. Our standard is that if a drive fails during 5 cycles we won’t ship it. Luckily, all 20 of the 20 drives we tested last night passed. In fact, most of the 3TB drives we test every week pass this test. Why is that a big deal? Because there is a calculation floating around out there showing that when reading a full 3TB drive there is a 21.3% chance of getting an unrecoverable read error. Clearly the commonly used probability equation isn’t modeling reality. To me this raises red flags about previous work discussing the viability of both standalone SATA drives and large RAID arrays.
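If the spec-sheet math were right, a result like last night’s would be nearly impossible. Here is a quick sanity check, sketched in Python; the 21.3% per-read figure is the one derived later in this article, and I’m assuming independent errors across reads and drives:

```python
# Sanity check: if the 10^14 BER spec really implied a 21.3% chance of an
# unrecoverable read error per full 3TB read, how likely is it that all
# 20 drives would pass a 5-cycle burn-in? (Assumes independent errors.)

p_read_error = 0.213                        # per full-drive read, per the spec
p_drive_passes = (1 - p_read_error) ** 5    # one drive survives 5 clean reads
p_all_20_pass = p_drive_passes ** 20        # all 20 drives pass

print(f"one drive passes 5 cycles: {p_drive_passes:.1%}")   # roughly 30%
print(f"all 20 drives pass:        {p_all_20_pass:.1e}")    # vanishingly small
```

Under the spec-sheet model, a clean 20-for-20 night should essentially never happen, yet we see it routinely.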

It’s been five years since Robin Harris pointed out that the sheer size of RAID-5 volumes, combined with the manufacturer’s Bit Error Rate (how often an unrecoverable read error occurs while reading a drive), made it more and more likely that you would encounter an error while trying to rebuild a large (12TB) RAID-5 array after a drive failure. Robin followed up his excellent article with another, “Why RAID-6 stops working in 2019,” based on work by Adam Leventhal. Since RAID-5 is still around, it seems Mark Twain’s quote “The reports of my death are greatly exaggerated” is appropriate.

Why hasn’t it happened? Certainly RAID-6 has become more popular in server storage systems. But RAID-5 is still used extensively, and on the 12TB and larger volumes that Robin predicted wouldn’t recover well from drive failures. Before I get into some mind-numbing math, let me give away what I think might be an answer: because the Bit Error Rate (BER) for some large SATA drives is clearly better than what the manufacturer says. The spec is expressed as a worst-case scenario, and real-world experience is different.

Seagate’s BER on 3TB drives is stated as 10^14, but may be understated. Hitachi’s bit error rate on their 4TB SATA drives is 10^15, and in my experience the two drives perform similarly from a reliability perspective. That order of magnitude makes a big difference in the calculations of expected read errors. Let me set the stage by going back over the probability equation used by Robin Harris and Adam Leventhal.

The probability equation they use for a successful read of all bits on a drive is (1 − 1/b)^a, where:
“b” = the Bit Error Rate (BER), also known as the Unrecoverable Read Error (URE) rate
“a” = the number of bits read (the amount of data on an entire volume or drive)

We can use sectors, bytes or bits for this calculation as long as we stay consistent. In his article Leventhal uses sectors, which I think just complicates the calculation, but let’s confirm his numbers. He calculates that a 100GB disk array has 200 million 512-byte sectors, so a = 2.0×10^8. He uses b = 24 billion (2.4×10^10) because the stated bit error rate of 10^14 gets divided by 512 bytes per sector and 8 bits per byte. He determines the chance of array failure during a rebuild is only 0.8 percent. From his article: “(1 − 1/(2.4 × 10^10))^(2.0 × 10^8) = 99.2%. This means that on average, 0.8 percent of disk failures would result in data loss due to an uncorrectable bit error.”

Now we can confirm whether we get the same results by typing the values into the calculator at http://web2.0calc.com/. I wanted to flip this equation from the probability of success to the probability of a failure directly, and then express it as a percentage, so I just subtract the success rate from 1 and multiply by 100:


Cut and paste the string below into the website calculator and hit the equal key:
100*(1-(1-1/(2.4E10))^(2E8))       = 0.83% (I’ve rounded off)

This is the same number the author got, indicating less than 1% chance of failure while reading an entire 100GB array or volume.
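For anyone who prefers a script to the web calculator, the same sector-based check takes a few lines of Python (all of the values below are Leventhal’s, from the passage above):

```python
# Reproduce Leventhal's sector-based calculation:
# chance of at least one URE while reading a 100GB array.

b = 2.4e10   # BER expressed in sectors: 10^14 bits / (512 bytes * 8 bits)
a = 2.0e8    # number of 512-byte sectors in a 100GB array

p_failure = 100 * (1 - (1 - 1 / b) ** a)   # failure probability as a percent
print(f"{p_failure:.2f}%")                  # about 0.83%
```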

Personally I think it is easier to do all this by leaving all the numbers in bits, because the hard drive vendors express “b” (the BER) in bits rather than sectors. For example, this Seagate data sheet shows for Barracuda SATA drives that the number for “Nonrecoverable Read Errors per Bits Read, Max” is 10^14. This is the same number the author used above. We express this in scientific notation for the variable “b” as 1E14 (1 times 10 to the 14th); 10^14 bits is about 12.5 terabytes. The value for “a” on a 100GB volume can be written as 100 followed by 9 zeros, multiplied by 8 bits per byte to get the number of bits.

Probability of a read error while reading all of a 100GB volume using SATA drives
100*(1-(1-1/(1E14))^(100E9*8))      = 0.80% (rounded off)

So we’re getting about the same answer using bits instead of sectors (subject to some rounding), and I think using bits is a little less confusing, don’t you?

Robin Harris did the calculation on a 12TB array and got a whopping 62% chance of data loss during a RAID rebuild. Can we confirm his math using bits instead of sectors and the same formula? As before, copy and paste this into web2.0calc.com:

100*(1-(1-1/(1E14))^(12000E9*8))      = 61.68%

Yep, at least our math is tracking. Now that we’re getting the same results as the experts, we’re ready to try our own calculations using real-world hard drives. Let’s not even talk about RAID. Let’s just take a standalone Seagate 3TB drive and see what the probability is that we’ll get a single non-recoverable read error if we fill and read the whole drive.

Probability of a read error while reading all of a Seagate 3TB SATA drive
100*(1-(1-1/(1E14))^(3000E9*8))      = 21.3% (rounded off)

So after all this math I get the number I started this article with: a 21.3% chance of a single read failure. But wait a minute! Does that sound right to you? Doesn’t that mean that if I fill a 3TB drive and read all the bits about 5 times, I will likely encounter an error that the drive can’t recover from? If that were true, 3TB drives would be unusable! Just for grins, let’s try a Hitachi 4TB drive with its slightly better BER of 10^15:

Probability of a failure while reading all of a 4TB Hitachi SATA drive

100*(1-(1-1/(1E15))^(4000E9*8))      = 3.14% (honest, I didn’t try to get Pi)

Well, that’s better! But if my 4TB drive is going to fail 3% of the time when I read the whole thing, I’m still pretty concerned.
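All four calculations so far use the same bit-based formula, so it’s worth wrapping it in one small function. This is a sketch in Python; the helper name and the 1TB = 1000GB convention are my choices, while the sizes and BERs are the ones used in this article:

```python
def read_error_pct(size_bytes, ber_bits):
    """Chance (%) of at least one URE while reading an entire volume,
    using the article's formula: 100 * (1 - (1 - 1/b)^a)."""
    bits_read = size_bytes * 8
    return 100 * (1 - (1 - 1 / ber_bits) ** bits_read)

cases = [
    ("100GB volume @ 10^14 BER", 100e9,   1e14),  # about 0.80%
    ("12TB array   @ 10^14 BER", 12000e9, 1e14),  # about 61.7%
    ("3TB Seagate  @ 10^14 BER", 3000e9,  1e14),  # about 21.3%
    ("4TB Hitachi  @ 10^15 BER", 4000e9,  1e15),  # about 3.14%
]
for label, size_bytes, ber in cases:
    print(f"{label}: {read_error_pct(size_bytes, ber):.2f}%")
```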

I previously pointed out that our burn-in test alone disproves the calculated failure rates. We use these drives (both 3TB and 4TB) extensively in a mirroring backup system where drives are swapped nightly and re-mirrored. Since the mirroring is done at the block level, the entire drive is ALWAYS read to create a new copy with its mirror partner. That means we should see some of these read errors on a regular basis. In fact, our media often go for years without a problem.

Similarly, our failure rates when rebuilding large 8TB RAIDPacs are nowhere near what this probability formula suggests (about 6%). So should we believe real-world results or the math? Maybe a hard drive expert can suggest why the formula isn’t properly modeling the real world. I’m not the only one to notice that it doesn’t map to real-world results: http://www.raidtips.com/raid5-ure.aspx

There is no question that the probability of a read error during rebuilds, along with some of the other concerns about large arrays taking a while to rebuild due to limitations of the interface and drive speed, should play into future planning and product design. Right now our three-drive RAID-5 RAIDPacs rebuild at about 250-300 gigabytes per hour, so an 8TB RAIDPac can take about 26 hours. Luckily, we use RAID for backup media rather than primary storage, so rebuild speed isn’t as large an issue for our clients. Drive failures are relatively infrequent (about 3 drives out of 100 per year), and multiple RAIDPac backup appliances are used to duplicate data. Even when we raise the size of our RAIDPac to 12TB using three 6TB drives the issues will be manageable, though rebuild time will grow to about 1.67 days. We’ll continue to watch this issue and continue to make backup reliable, but for now it’s safe to say RAID-5 is alive and well.
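The rebuild-time arithmetic above is simple enough to script. A minimal sketch, assuming the observed 250-300 GB/hour rate quoted in this article (the helper name and the 300 GB/hour default are my choices for illustration):

```python
# Back-of-envelope rebuild times from the observed RAIDPac rebuild rate.

def rebuild_hours(capacity_gb, rate_gb_per_hr=300):
    """Hours to rebuild a volume at a given sequential rebuild rate."""
    return capacity_gb / rate_gb_per_hr

print(f"8TB RAIDPac:  {rebuild_hours(8000):.0f} hours")     # ~27h at 300 GB/h
print(f"12TB RAIDPac: {rebuild_hours(12000)/24:.2f} days")  # ~1.67 days
```

At the slower end of the range (250 GB/hour), the 8TB rebuild stretches to about 32 hours, which is why the article quotes "about 26 hours" as a best case.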

2015 UPDATE: Using RAID-5 Means the Sky is Falling!

7 Responses to Why RAID-5 Stops Working in 2009 – Not Necessarily

  1. John Dorian says:

    I can’t see any problem with the math, but I always assumed manufacturers would make their products look as good as possible. So it’s hard to believe Seagate would list a 10^14 BER if in reality it was closer to 10^15.

  2. Johannes says:

    But can you lose only one bit? To fix errors the disk controller uses CRC data at the sector level. Most errors can be corrected, but if it can’t fix the read error we get a URE (unrecoverable read error). Then the whole 512-byte sector is lost, and we lose 4,096 bits.

    To sustain a BER of 10^14 we need to read 10^14 bits until the next error. That’s a lot of data.

  3. Tim says:

    I think everyone is forgetting that these large drives are not using 512-byte sectors; they use the Advanced Format sector size of 4K.
    I’m not sure how this affects the math, though, and would love to see a new evaluation.

  4. Cody says:

    The math also changes considerably if this is an all-flash system. My eMLC drives (1.6 TB) have an Unrecoverable Bit Error Rate of 10^16 (SanDisk) or 10^17 (HGST). These flash drives also rebuild so fast that there’s less of a window for an error to occur (or for another drive to fail due to the stress of taking days or weeks to rebuild). An all-flash array can rebuild a drive in under an hour…

    Dell doesn’t even recommend using double redundancy for an all-flash array for the reasons above. They recommend double redundancy only for mechanical drives that are above 900 GB.

  5. Zenaan says:

    TLDR: if your filesystem does not detect flipped bits, you are backing up flipped bits and pretending that flipped bits don’t exist; this is not good!

    Long version: The tests in this blog article are insufficient:

    What filesystem was used?

    What errors does your hardware return, and in what way? E.g., in what situations does a URE return a “whole sector not readable”, and in what situations does a flipped bit get through to the filesystem layer?

    The author seems to imply that he believes UREs will always produce “hardware errors” – this is not true, and is a dangerous assumption!

    In other words in the latter case of a flipped bit getting through to the filesystem, how do you know you are detecting the flipped bit at all, and are you simply backing up error bits without realizing they are errors?

    That’s why you need ZFS (or BTRFS if you trust it yet) to checksum the data itself. If you can’t rely on the hardware (I believe we cannot rely on the disks, and that there is no hardware raid which guarantees flipped bits don’t arrive at the filesystem) then you must have software (the filesystem) to have any hope of detecting most of the flipped bits that occur! It might be rare, but you won’t know if you’re not detecting them, and “high reliability” does not exist otherwise…

    If you test disks with a burn-in bench test, say by writing fixed patterns to the disk (which ultimately must test every single bit on the disk, and therefore requires a minimum of two full writes and reads of the disk, and possibly more to properly cover metadata sectors/sector portions), then you can simply read back each bit on the disk (in each of its 0 and 1 states, depending on which test run you are doing) and confirm that it is as it was supposed to be written.

    This MAY give you an idea of the actual URE rate for that particular drive or batch of drives.

    BUT this can NOT give you protection against flipped bits which go undetected by the drive and drive controller hardware.

    I am disappointed that Darren appears to either not have a proper grasp of these (to me, simple) concepts, or glosses over them in a way that appears dangerous and misleading. Although I do prefer empirical data to “manufacturer estimated” URE data. But in the real world, knowing even the empirical URE is seriously inadequate. And from this blog, without further information, we cannot even be sure of any empirical URE data regarding the drives supposedly tested.


  6. Cody says:

    Zenaan, I agree with you completely. That’s actually one reason why when I set up new disks in a system I don’t choose MBR. I go with GPT since it stores CRC values. GPT has built-in redundancy.

    The future is even brighter when you combine GPT and slap a good file system on it.

    That was actually changed in VMFS. That’s why VMFS supports disks above 2 TB now. They simply changed the underlying partition style from MBR to GPT.

  7. David Lusty says:

    Reading back bits that were written today doesn’t produce as many errors as reading data written 5 years ago. To truly model this you’d need to write data today to lots of hard drives, wait 5 years, and then read it back bit for bit against what was written.

    @Cody the rebuild time on SSD is irrelevant here, and that is exactly the issue. If you’ve already had a drive fail, you MUST read all data without error to recover the RAID set. If you have one error, you’ve lost data. With RAID 6 you’d have a second parity, usually diagonal parity, which you can then use to recover following a lost disk and a read error.
