QNAP Data Recovery Case Study: RAID-6 Failure
A RAID (redundant array of independent disks) is a way to string individual hard drives together so that they behave like a single giant hard drive. These drives are usually housed in a network-attached storage (NAS) device or a storage area network (SAN). The drives in a RAID can be linked together in many different ways, known as RAID levels. Each level has its own strengths and weaknesses.
People tend to rely on RAID arrays to both expand their data storage capacity and provide a cushion of support in case of drive failure. Some RAID levels, such as RAID-3 and RAID-4, have one drive act as a container for all of the special “parity” data which the RAID controller can use to reconstruct any missing data. Other levels, such as RAID-5 and RAID-6, distribute this parity information across all hard drives in the array.
Different RAID levels have different tolerances for drive failure. For example, RAID-0 will fail if one drive fails, RAID-5 and a two-drive RAID-1 mirror will fail if two or more drives fail, and RAID-6 will fail if three or more drives fail. As soon as a single drive in a RAID-5 or RAID-6 array fails, it is prudent to replace it before any subsequent failure puts the safety of the data in jeopardy. The RAID controller will use its parity data to reconstruct the missing contents onto the new drive, integrate it into the array, and continue to function normally. This is called “rebuilding” the RAID.
But a RAID array is also at its most vulnerable while it is being rebuilt. Performance decreases dramatically during a rebuild, which can put a big strain on an organization’s daily operations. Furthermore, the data is in danger if more hard drives fail during the rebuilding process.
In this QNAP data recovery scenario, the client replaced one failed drive as soon as it went offline and began to rebuild the RAID. At this point, no data had been lost, and the RAID-6 array could still handle losing another drive without failing. But during the rebuilding process, disaster struck. First one drive failed. Then another drive failed, and then another. These were more failures than RAID-6 could bear, and the entire array went down.
What is QNAP?
QNAP Systems is a technology company based in Taiwan that offers enterprise-grade network-attached storage (NAS) appliances. QNAP products have applications in storage management, surveillance, file sharing, and virtualization. QNAP appliances run a proprietary operating system called QTS, which combines the functionality of a file system and a volume manager to improve the user experience.
RAID-6: A Brief Overview
At Gillware, many of the RAID arrays we receive for data recovery are RAID level 5 arrays. RAID-5 arrays can handle losing a single drive by using “exclusive OR” (XOR) logic. Using XOR logic, if you have a set of values plus their XOR parity and one value goes missing, the missing value can be reconstructed from the rest of the set.
If a single drive in a RAID-5 array fails, the controller uses the XOR function to take the parity data on the other drives and recreate the data on the missing drive. The RAID controller has to keep looking back at the parity data to access what had been on the failed drive. The array will keep chugging along, but its performance will be decreased. XOR on its own, however, only provides a single layer of insulation against data loss. This is why a RAID-5 array fails if more than one hard drive in it fails.
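The XOR reconstruction described above can be illustrated with a minimal Python sketch. The three four-byte "drive" blocks and the helper name xor_blocks are made up for illustration; a real controller works on much larger blocks and rotates parity across drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks from one stripe, plus their parity block.
data = [b"GILL", b"WARE", b"RAID"]
parity = xor_blocks(data)

# Simulate losing the second drive: XOR the surviving blocks with the parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # the contents of the missing block
```

Because XOR cancels itself out, combining every surviving block with the parity block leaves only the missing block. If two blocks are missing, this single equation no longer has a unique solution, which is why XOR alone protects against only one failure.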
RAID-6 arrays are very similar to the RAID-5 arrays our RAID data recovery engineers are used to seeing. But RAID-6 arrays can lose two drives and still function. RAID-6 does this with an extra layer of parity data.
This extra layer of redundancy is referred to as “double parity”. In order to provide double parity, RAID-6 relies on XOR coding for one layer of parity, like RAID-5. The second layer of parity is provided by Reed-Solomon encoding.
What is Reed-Solomon Encoding?
Reed-Solomon encoding is a core concept for ensuring accuracy in complex data storage solutions. For a mathematically complete definition, you can visit the Wikipedia page for Reed-Solomon Error Correcting Code. If high-level mathematics is not your forte, all you need to know is that Reed-Solomon encoding is typically used as an error-correction mechanism in data storage and data transmission technologies.
Reed-Solomon (RS) encoding is one of the reasons a scanner can read a bar code or QR code even if part of the code is damaged. In a RAID-6 array, this second layer of parity data means that even if a second hard drive fails, the RAID controller can still pick up the slack and reconstruct the missing pieces from the remaining drives.
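For readers who want to see how two layers of parity can recover two lost drives, here is a small Python sketch. It assumes the common "P+Q" scheme: P is plain XOR parity, and Q is a Reed-Solomon-style syndrome computed in the finite field GF(2^8) with reduction polynomial 0x11d and generator 2. Those parameters match what is documented for Linux software RAID-6, but whether this particular QNAP unit used them is an assumption, and all function names here are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Multiplicative inverse: a^254 equals 1/a for any nonzero byte."""
    return gf_pow(a, 254)

def compute_pq(data):
    """Return (P, Q): P is the XOR of all blocks, Q weights block i by 2^i."""
    p, q = bytearray(len(data[0])), bytearray(len(data[0]))
    for i, block in enumerate(data):
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

def recover_two(data, x, y, p, q):
    """Rebuild missing data blocks x and y from the survivors plus P and Q."""
    pxy, qxy = bytearray(p), bytearray(q)
    for i, block in enumerate(data):
        if i in (x, y):
            continue  # these blocks are lost; their slots are ignored
        coeff = gf_pow(2, i)
        for j, byte in enumerate(block):
            pxy[j] ^= byte
            qxy[j] ^= gf_mul(coeff, byte)
    # Now pxy = Dx ^ Dy and qxy = 2^x * Dx ^ 2^y * Dy: two equations, two unknowns.
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    denom_inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(gf_mul(gy, pb) ^ qb, denom_inv) for pb, qb in zip(pxy, qxy))
    dy = bytes(pb ^ db for pb, db in zip(pxy, dx))
    return dx, dy

# Four hypothetical data blocks; lose blocks 1 and 3, then rebuild them.
data = [b"disk-0..", b"disk-1..", b"disk-2..", b"disk-3.."]
p, q = compute_pq(data)
damaged = [data[0], None, data[2], None]
dx, dy = recover_two(damaged, 1, 3, p, q)
print(dx, dy)
```

The key idea is that P and Q give two independent equations over the same data, so two unknown blocks can be solved for. A third missing block would leave more unknowns than equations.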
The QNAP Data Recovery Process
Drives that have been connected in a RAID-6 array are useless on their own. When data is written to a RAID, the RAID controller stripes it across the disks. Each block tends to be 128 sectors, or 64 kilobytes, in size. Everything written to the disks is chopped up into these fragments and pieced back together by the RAID controller. Unlike with single hard drives, there are no “used areas” to target. We must get everything we can possibly get off of the client’s drives, and then pass the drive images along to our RAID data recovery engineers to piece the data back together.
Of the failed drives, one had suffered a failure of its electrical components, one had suffered a failure of its firmware, and one had suffered a failure of both its read/write heads and its firmware. Fortunately, our cleanroom data recovery engineers were able to image 99.9% of the sectors on the failed hard drives. Our RAID engineer Cody handled the QNAP data recovery process from this point on.
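To make the striping idea concrete, here is a simplified Python sketch that chops a buffer into 64-kilobyte blocks (128 sectors of 512 bytes), deals them across drives, and pieces them back together. It is deliberately a simplification: it ignores the parity blocks and the parity rotation a real RAID-6 layout uses, and the six-disk count and function names are assumptions for illustration.

```python
SECTOR_SIZE = 512                              # bytes per sector
SECTORS_PER_BLOCK = 128
BLOCK_SIZE = SECTOR_SIZE * SECTORS_PER_BLOCK   # 65,536 bytes = 64 KB

def stripe(payload, num_disks):
    """Deal fixed-size blocks across disks in round-robin order."""
    disks = [[] for _ in range(num_disks)]
    for n, offset in enumerate(range(0, len(payload), BLOCK_SIZE)):
        disks[n % num_disks].append(payload[offset:offset + BLOCK_SIZE])
    return disks

def destripe(disks):
    """Interleave the per-disk blocks back into the original byte order."""
    out = bytearray()
    depth = max(len(blocks) for blocks in disks)
    for row in range(depth):
        for blocks in disks:
            if row < len(blocks):
                out += blocks[row]
    return bytes(out)

# A hypothetical 1 MB file striped across six data disks.
payload = bytes(range(256)) * 4096
disks = stripe(payload, 6)
restored = destripe(disks)
print(len(disks[0]), restored == payload)
```

Any single disk here holds only every sixth block of the file, which is why an individual member drive is useless on its own and why every drive must be imaged as completely as possible before reassembly.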
Recovery Case Overview:
In this QNAP data recovery case, our client had eight three-terabyte hard drives arranged in a RAID-6 configuration using a NAS device and formatted with the Linux Ext3 filesystem. Several of the hard drives in this RAID array had failed, rendering all of the client’s data inaccessible.
QNAP TS-859U-RP Data Recovery Case Study: RAID-6 Failure During Rebuild
Total Capacity: 18 TB
RAID Level: 6
Device Brand: QNAP
Device Model: TS-859U-RP+
File System: Ext3 (Linux)
Situation: Multiple hard drive failures during rebuild of RAID-6 array
Type of Data Recovered: Documents, Images, Video
Case Rating: 9
The file recovery procedure for this failed RAID-6 array went fairly smoothly. Cody was able to piece the failed array back together and recover around 17 terabytes’ worth of the client’s critical data. Since a RAID-6 array with eight three-terabyte drives has about 18 terabytes of usable space, this particular array was full to bursting! Not all of the hard drives in the array had been 100% imaged, so there was a small amount of data loss. The vast majority of the recovered data, however, appeared to be fully functional. We rated this QNAP data recovery case a 9 on our ten-point scale.
Our CEO Brian Gill has written an article laying out some best practices for setting up your RAID-5 array to reduce the chances of data loss, and much of his advice holds true for RAID-6 arrays as well. A great many people think of their RAID arrays as sufficient backup. Ultimately, though, the only way to make certain that your data is protected, and that you can get it back quickly and reliably following a RAID disaster, is to keep and maintain an offsite backup of your critical data.
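The 18-terabyte figure follows from the fact that RAID-6 spends two drives' worth of space on parity. A quick Python sketch of that arithmetic, using an illustrative helper name and treating the "around 17 terabytes" recovered as exactly 17:

```python
def raid6_usable_tb(num_drives, drive_tb):
    """Usable RAID-6 capacity: two drives' worth of space holds parity."""
    if num_drives < 4:
        raise ValueError("RAID-6 needs at least four drives")
    return (num_drives - 2) * drive_tb

usable = raid6_usable_tb(8, 3)       # eight 3 TB drives
recovered_fraction = 17 / usable     # approximate share of capacity recovered

print(usable, round(recovered_fraction, 2))
```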
How to Avoid RAID-6 Failure?
RAID-0 is the only non-fault-tolerant RAID storage solution. Every other RAID configuration has some level of fault tolerance: hard drives can fail and the array still functions. However, if every drive in your RAID system is the same model, produced on the same day (potentially within minutes of each other), those hard drives are far more likely to have the same life-span. When you account for the possibility of two or more drives failing at essentially the same time, it may not matter that you are operating a fault-tolerant RAID configuration. It is somewhat inconvenient to acquire enterprise-grade hard drives of different models, but Gillware strongly recommends using different hard drives in your RAID system.