RAID-5 Setup Tips: How to Avoid the Data Recovery Lab

By Brian Gill, CEO Gillware Inc. Data Recovery

Having run one of the world’s most successful data recovery labs for almost a decade, I’ve seen thousands of RAID-5 data loss situations that probably could have been avoided by following these simple guidelines. This is not intended to be a comprehensive explanation of what RAID-5 is, but rather practical IT tips for setting one up. (For a more detailed breakdown of RAID-5 and how it works, see Gillware’s explanation.) Briefly, a RAID-5 uses a number of hard drives working together and sacrifices one drive’s worth of its total potential capacity for redundancy. For example, a RAID-5 setup of 3x3TB drives would have a total capacity of 6TB, instead of 9TB. A RAID-5 can lose any single drive in the array, run in a degraded state, and still read and write data. If you notice this is happening, you can replace the failed drive and rebuild to restore redundancy. If two drives in the array die, the data will be inaccessible and likely gone forever without serious expertise. These are tips to prevent that scenario from happening to you.
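To make the capacity arithmetic concrete, here is a minimal Python sketch. It assumes every drive in the array is the same size, and the function name is purely illustrative, not part of any RAID tool.

```python
def raid5_usable_capacity_tb(drive_count, drive_size_tb):
    """Usable capacity of a RAID-5 built from equal-sized drives.

    One drive's worth of space is consumed by parity, so the usable
    capacity is (drive_count - 1) * drive_size_tb.
    """
    if drive_count < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    return (drive_count - 1) * drive_size_tb


# The 3x3TB example from the text: 9TB raw, 6TB usable.
print(raid5_usable_capacity_tb(3, 3))  # 6
```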

 

Use a Variety of Drive Manufacturers

Use drives from various manufacturers. This is getting harder as the hard drive manufacturers continue to consolidate. If you can’t get different manufacturers, at the very least pick drives with significantly different manufacturing dates; I’d recommend at least a month variance. Drives in a RAID live almost identical lives as far as number of shutdowns, startups, runtime, data read and written, environment, etc. If they are the same model and were manufactured the same day, they may have very similar life-spans, similar manufacturing defects, or similar reactions to power surges, sudden power losses, and other environmental events. RAID-5 gives you one drive’s worth of redundancy, so we definitely don’t want them dying the same day or same week. Using drives that are similar in capacity and speed but different in make and model will help avoid some dual drive death situations. (If you are running Seagate drives and can’t figure out when your drives were made, this may help.)

 

Write Down the RAID Configuration Information When You Set It Up

Most RAID cards can be set up in a variety of ways. You’d be surprised how many calls we get from IT folks who send us a box of healthy drives simply because the RAID card exploded. All the configuration lived exclusively on the RAID card and they have absolutely no memory of the setup. Are you running RAID-5, RAID-6, RAID-1? What’s the stripe size on your RAID: 64KB, 128KB, 1 sector? What’s the rotation? Are there multiple volume groups or just one? If there’s more than one, which drives are in which group? Offsets? Which drive is the hot-spare? What firmware version is your RAID card or software RAID running?
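One low-tech way to "write it down" is a small structured record that you save somewhere off the array. The sketch below is only an illustration: the field names, the rotation label, and all the sample values (serial numbers, controller model, firmware string) are made-up placeholders, not output from any real controller.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class RaidConfigRecord:
    """Answers to the questions above, captured at setup time."""
    raid_level: str                 # e.g. "RAID-5"
    stripe_size_kb: int             # e.g. 64 or 128
    rotation: str                   # parity rotation as named by your controller
    drive_order: list               # drive serial numbers, in array order
    hot_spares: list = field(default_factory=list)
    volume_groups: dict = field(default_factory=dict)  # group name -> serials
    data_offset_sectors: int = 0
    controller_model: str = ""
    firmware_version: str = ""


def to_json(record):
    """Serialize the record so it can be printed or stored off the array."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values for illustration only.
example = RaidConfigRecord(
    raid_level="RAID-5",
    stripe_size_kb=64,
    rotation="left-symmetric",
    drive_order=["SERIAL-A", "SERIAL-B", "SERIAL-C"],
    hot_spares=["SERIAL-D"],
    volume_groups={"group0": ["SERIAL-A", "SERIAL-B", "SERIAL-C"]},
    controller_model="ExampleCard 1000",
    firmware_version="1.2.3",
)
print(to_json(example))
```

Keep the printout (or the file) anywhere except on the array it describes.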

 

Make Sure the RAID Card Stores the RAID Configuration on the Card and on the Drives

If this is the case and your RAID card dies, there’s a decent chance that simply ordering another one with the same firmware and plugging the drives back in will allow the array to remount. This is because each drive has some meta-information stored somewhere (usually the first few sectors at the front or back of the drive) that explains its place in the universe. The order in the array, the stripe size, the data offset, what physical group it’s in, etc. actually live on the drives, allowing the new card to re-detect the array settings.

 

6 Drives Max

I’d recommend a maximum of 6 drives in a RAID-5. I’ve seen setups where folks have used significantly more than 10, but this is to be avoided. Simple math says the more drives you run, the higher the probability of a double failure, which is what we’re obviously always trying to avoid. If you’re building a RAID for huge capacity needs, I’d highly recommend running RAID-6 and probably having at least one hot-spare.
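Here is a rough sketch of that "simple math." It estimates the chance that at least one of the surviving drives fails while the array is degraded. The 3% annual failure rate and the 7-day exposure window are assumed numbers chosen for illustration, not measurements, and the model treats drive failures as independent. As discussed above, drives from the same batch tend to fail together, so real-world odds are likely worse than this.

```python
def second_failure_probability(drive_count,
                               annual_failure_rate=0.03,
                               exposure_days=7.0):
    """Chance that at least one surviving drive fails while degraded.

    Assumes independent failures and a constant per-drive failure rate;
    both are simplifications.
    """
    if drive_count < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    survivors = drive_count - 1
    # Per-drive probability of failing within the exposure window.
    p_one = 1.0 - (1.0 - annual_failure_rate) ** (exposure_days / 365.0)
    # Probability that at least one of the survivors fails in that window.
    return 1.0 - (1.0 - p_one) ** survivors


for n in (3, 6, 12):
    print(n, "drives:", round(second_failure_probability(n) * 100, 3), "%")
```

Whatever rate you plug in, the risk climbs with every drive you add, and a larger array also sees first failures more often.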

 

Set Up Notifications

While I’ve never seen an official study, I’d say that probably more than half of the small businesses out there running RAID-5 have not properly set up the RAID controller notifications. When a drive is taken offline by the RAID controller you absolutely must have it email you or text you so you can promptly replace the failed drive and perform the necessary rebuild to restore redundancy. I’d say probably more than 90% of small businesses and consumers running NAS (Network Attached Storage) RAID-5 units haven’t set up any notification. When a drive fails and goes offline, the storage array will continue to function (the whole point of RAID-5) and will “emulate” data read from and written to the dead drive using parity calculations on all the other drives. You might get lucky and notice a 20-30% slow-down in data access times, and think “Gee, my NAS is running a little slow, I wonder if I lost a drive?” But honestly, most users would never notice this. Someone might wander by the unit and notice a little crimson LED on a drive instead of a green one, but chances are they won’t know what it means or say anything.
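For the curious, the “emulation” comes down to XOR. The sketch below is a simplified illustration of one stripe on a 4-drive array (three data blocks plus one parity block). It ignores parity rotation and everything else a real controller does, and the block contents are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks, plus the parity block computed from them.
data = [b"DRIVE-0!", b"DRIVE-1!", b"DRIVE-2!"]
parity = xor_blocks(data)

# Drive 1 dies. Its block is recomputed from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])  # True
```

Every read that touches the dead drive costs a read from all the others, which is where the slow-down comes from. Lose a second drive and there is no longer enough information to solve for the missing blocks.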

So, if you’re running one of those NAS units in your small business, go grab the manual, connect to it via the little “website” it hosts, and configure the notifications. If you’re running a small traditional server in your office or home, check the RAID BIOS settings next time you boot and peek at the configurations tab. Test the notifications (it should have a simple button to test it) to make sure you get that page or email. I’d recommend sending to an email group and not a single person, and make sure the message isn’t eaten by the junk mail filtering.
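If your array happens to be Linux software RAID (md) rather than a hardware card or a NAS appliance, a degraded array shows up in the status text from /proc/mdstat as an underscore in the member map, for example [UU_]. The sketch below is one hedged way to spot that. The function name and the sample text are illustrative, it only parses text you hand it, and it is no substitute for the notification features built into your controller or NAS.

```python
import re

# Illustrative sample of mdstat-style text with one failed member.
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdd1[2](F) sdc1[1] sdb1[0]
      1953260544 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member map contains a missing drive."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member map looks like [UUU] when healthy, [UU_] when degraded.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))  # ['md0']
```

On a real system you would read the text from /proc/mdstat on a schedule and send the result to that email group.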

 

Use “Enterprise” Class Drives

While the guts of most drives are very similar, almost every manufacturer has distinctly different firmware on their enterprise series drives when compared to consumer class drives. For example, a consumer class drive may be set up to do “offline” scans; it is scanning for sector-level platter defects while the drive is not currently “in use.” A consumer class drive may actually spin down the motor and “go to sleep” to save power when not “in use.” In a single drive consumer system these may be optimal behaviors. However, when the RAID controller attempts to “talk” to a drive in these conditions, there may be an “unacceptable” latency in its response. The RAID controller may be configured to take a drive offline after a certain timeout, and now you’re running degraded even though the offline drive is actually healthy. If 2 or more drives meet this condition, you’re dead in the water. Enterprise class drives are designed to alter their behavior to meet the performance and latency requirements of the average RAID controller. Enterprise class drives also go through a much more comprehensive quality assurance process and use higher quality components during manufacturing. As such, enterprise drives are typically rated for much longer lives in general. Enterprise series drives of course will cost more and can be harder to source (you aren’t going to find them at most local consumer electronics stores), but the extra money and time to source the appropriate equipment is money well spent.

 

Beware the Convenience of the RAID-5 NAS Device

As I mentioned previously RAID-5 NAS devices are typically not configured to notify anyone when they have a drive failure.  This is because people remove them from the box in the networking closet, plug them in, switch them on, and everyone in the office magically sees a new logical volume on the local network.    Then the victorious installer pats themselves on the back and gets on with their day, sometimes discarding the box and manual in the trash.

As convenient as these devices are, I’d say they are roughly ten times more failure prone than a legitimate RAID-5 in a big boy server. Most of these NAS units are shipped with whatever drives were cheapest that morning, regardless of manufacturer. Usually the drives will be one serial number apart, built within seconds of each other. They certainly aren’t going to put expensive enterprise class drives in popular consumer NAS devices; they are competing primarily on price. They are portable and easily stolen. They don’t have anywhere near the independent fan power of a real server. They probably live in a closet and not in a server room. One more important failure point compared to a big boy server: A NAS device must boot its own proprietary device operating system (usually a one-off Linux) in order to mount the data up to the network. On a big boy server you’ll be running a real version of Linux or Windows that you have the disks for and understand how to troubleshoot. When a NAS takes a dirt nap it may allow you to attempt to “repair” the operating system or “flash the firmware,” but these options may or may not involve the annihilation of all your data. Scary stuff.

When a NAS does take a dirt nap there’s a very high probability you’ll be sending it to Gillware or one of our competitors for data recovery if you didn’t have a solid backup. All data recovery software needs access to the logical array containing the data in order to scan for file signatures, inodes, directory structure and so on. When a RAID-5 NAS is a brick it’s truly a brick; there’s nothing to mount. Even if you can figure out how to properly access the data volume, you won’t like what you find with data recovery software. These devices typically run a proprietary flavor of Linux, sometimes with a fairly standard Linux file system like XFS, but sometimes the file system will be fully proprietary (there isn’t any data recovery software for proprietary file systems, unless the person who wrote the file system was kind enough to write one or publish the spec). We’ve seen some NAS device manufacturers that use standard file systems but actually encrypt the data (whether or not the consumer asked for it), and we’ve seen others that reverse the bit order on a sector level, so we had to write software to untwist it. Essentially, as long as a NAS presents a network file system on the network, the manufacturer can and will do whatever they want under the covers. They will not explain how the device operates under the covers on their website or in the manual.

NAS devices are really, really convenient though!

 

A RAID-5 is Not a Backup

I commonly hear IT people tell me that they don’t need another backup; they are running RAID-5. These misinformed people make excellent customers for our data recovery lab. (Keep doing what you’re doing and please bookmark our website when you get a chance.) A RAID-5, or any RAID for that matter, is still subject to numerous failures that will lead to data loss. A RAID-5 will not protect your data from fires, floods, thefts, virus attacks, human error, malicious employee behavior or multiple drive failure. It only protects you from data loss from a single hard drive failure when a technician is paying attention and can replace it promptly. Running a RAID-5, coupled with a cloud backup for critical data, is a very solid and cost-effective solution for most small businesses. Shameless plug: Gillware remote backup is our solution and you can quickly and easily configure it to automatically encrypt and transmit your critical data up to your slice of our cloud. For a small fee we’ll actually continuously monitor the account to make sure all critical data is being transmitted on a routine basis and that all critical data has been properly configured to get moved up to the cloud.

 

Ensure You Have a Complete Backup before Adding Storage or Flashing Firmware

A lot of data loss can happen when doing “routine” maintenance on an array.  If the meta-information about the array (drive order/rotation, stripe-size, offline drives, hot-spares, physical volume grouping) is lost during a flash you’ll be dead in the water.  Perhaps the array is full and you want to add more drives and a new volume group.  Perhaps there’s new firmware for your device that you think will add features or increase performance.  It’s always a good idea to ensure your backups are current and 100% complete before doing this type of maintenance.  Many an IT professional has been fired for doing routine maintenance without verification of the backup first.
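What “verification of the backup” looks like depends on your backup product. As one hedged illustration, if both the live data and a backup copy are reachable as ordinary directories, a checksum comparison like the sketch below will list anything missing or different. The function names are made up for this example, and it does not cover permissions, open files, or databases.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_mismatches(source_root, backup_root):
    """Relative paths that are missing from the backup or differ in content."""
    source_root, backup_root = Path(source_root), Path(backup_root)
    problems = []
    for src in sorted(source_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_root)
        dst = backup_root / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(str(rel))
    return problems
```

An empty list from `backup_mismatches(live_dir, backup_dir)` is what you want to see before you touch the firmware.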

 

Summary

A properly set up and continuously monitored RAID-5 array will protect you from a single-drive failure costing you all your data. If improperly set up or not monitored at all, RAID-5 can give you a false sense of security and you’ll probably be sending the array to us for data recovery someday. A RAID-5 in and of itself is not a backup. A single RAID in a single location will never protect you from fires, floods, thefts, malicious employees, human error or virus attacks.

 


About

CEO at Gillware, Inc.
After a successful IT consulting career I founded Gillware Inc. in Madison, WI to provide data recovery services from failed electronic media. Gillware is now one of the world's most successful data recovery labs, currently recommended by Dell and Western Digital.
