Redundancy, Backups, and RAID: The story of lost data and downtime.
I have wanted to write this article for a while because, with the progression of technology, some of these concepts are getting muddied, especially by some “agile, big data, cloud native, blockchain based” marketing. While these concepts are inextricably related, they are also separate concepts that can be implemented in a variety of ways with foreseeably “unexpected” results. That is why I want to start this topic off with the redundant department of redundancy and all the quirks that come along with it.
Most are familiar with the concept of redundancy, whether by being declared redundant in your job or through that systems engineering class you had to take in college. Designing systems that are truly redundant is a difficult task because it requires looking at failure modes and a little clairvoyance (through past experiences or through advanced modeling and testing… your choice). Redundancy is an entire field in itself and there are a ton of interesting concepts and techniques (if you’re interested, look at: Error Detection and Correction, Redundancy Engineering, and Triple Modular Redundancy) used in systems requiring differing levels of reliability. However, since I am not writing a thesis paper, I only want to bring in one additional related concept: fault-tolerance.
A fault-tolerant system is one where something (a component, part, computer, etc.) can fail but the system is still able to complete its main task (though you may lose some functionality). An example of this is the circuit breaker (or fuse box) in your house or apartment. When you overload one outlet you trip the breaker, causing the circuit to break and go “open”. This typically results in a whole room or a portion of a room going dark (where all outlets and lights go out); however, the rest of the house/apartment (system) continues to have power and you can simply walk to another room to get power. If it were not fault tolerant, one overload in a part of the house would take down the whole house (or neighborhood).

Why is this important for redundancy? Well, there are a few places in information technology and software development where this comes into play, and it is especially important in “critical” systems (whether that means “this makes us money” or “people will die if this goes down”). When sysadmins hear this, they typically reach for and implement RAID… so we will start with that.
RAID (Redundant Array of Independent Disks). It is important for me to make clear right here that RAID is not a magic bullet and is one (of many) components to making a redundant system. It is often forgotten that there are still numerous single points of failure in modern servers, and just because the drives are redundant does not mean the other components are. It also depends on the level of RAID you implement, as there are quite a few quirks with the different methods. So, I am going to go over some of them with my opinions and recommendations:
RAID 0:
Well, just like it says: “Zero”, nothing, nada… it is what you get back if you have a single failure: nothing. When would you use this? When you do not care about the data, it is not critical or even important, but you need the performance benefits, as you can basically add the read/write speeds of the drives together, and the same goes for the size of the array. Just remember this has no redundancy or fault tolerance, and any URE (Unrecoverable Read Error) on any drive will result in corruption or failure; it is like lighting yourself on fire… you should have a plan to put it out before you implement it.
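A quick back-of-the-envelope sketch of the RAID 0 trade-off: capacity and throughput add up, but so does the chance of total loss, because losing ANY one drive loses everything. The drive specs and annual failure rate below are illustrative assumptions, not figures from any particular vendor.

```python
# RAID 0 math: capacity and throughput scale with drive count,
# but the array only survives if EVERY drive survives.
drives = 4
capacity_tb = 1.0            # per-drive capacity (assumed)
read_mbps = 550.0            # per-drive sequential read speed (assumed)
annual_failure_rate = 0.015  # ~1.5% AFR, an illustrative figure (assumed)

total_capacity = drives * capacity_tb
total_read = drives * read_mbps

# Probability the array makes it through a year = all drives survive.
p_survive = (1 - annual_failure_rate) ** drives

print(f"capacity: {total_capacity} TB, read: {total_read} MB/s")
print(f"chance of surviving one year: {p_survive:.3%}")
```

Note how the survival probability drops as you add drives: each disk multiplies the odds of a wipeout.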
RAID 1:
Everything is copied to the other drives in the array and the size of the array is the size of the smallest drive. For example, if you have three 1TB SSDs the size of the array is 1TB. Huh!? What is the benefit of that? Well, first off, if one of the drives fails you do not lose everything (at least not right away; do not leave failed drives in an array… it’s just asking for problems) and you get a bit of a performance boost to read speeds, as it can read from any of the drives in the array (since all of the data “should” be the same). However, write speed stays the same, as it takes the same amount of time to write to all the drives.
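The mirror math can be sketched the same way: usable size is the smallest member, and you only lose data if every copy dies. The failure rate is again an assumed illustrative number, and this ignores correlated failures (same batch, same power event), which matter in practice.

```python
# RAID 1 mirror math: usable size is the SMALLEST drive, and the
# array only dies if EVERY mirror member dies.
drive_sizes_tb = [1.0, 1.0, 1.0]  # three 1 TB SSDs, as in the example above
annual_failure_rate = 0.015       # ~1.5% AFR, illustrative (assumed)

usable_tb = min(drive_sizes_tb)

# All copies must fail for data loss (ignoring correlated failures).
p_all_fail = annual_failure_rate ** len(drive_sizes_tb)

print(f"usable: {usable_tb} TB")
print(f"chance all mirrors fail in a year: {p_all_fail:.7f}")
```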
RAID 5:
This is one of the first clever RAID implementations: “block-level striping with distributed parity”. It is interesting because it has features of RAID 0 but adds fault-tolerance through parity, which when implemented correctly allows recovery from a single drive failure (but if you lose two drives then you lose everything). As for performance, it is a compromise on all fronts: it is worse than RAID 0 on writes but better than RAID 1, and for reads it is worse than both RAID 0 and RAID 1 but better than a single drive, while being more reliable than RAID 0. There is one quirk you should be aware of, though. When a drive fails, the controller rebuilds the missing data when you replace the drive by using the other drives’ data and the parity blocks. This is a very read-intensive process, and if you already had one drive fail and the others are the same age (and have the same wear level)… there is a good chance that at some point during the rebuild you will have a URE (Unrecoverable Read Error), which may result in corrupted data or a complete failure to rebuild (there is an interesting calculator for this). You are probably starting to see how RAID is not a backup solution.
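The rebuild-URE risk above can be estimated with a few lines. This is a rough model, assuming independent bit errors and the commonly published consumer-drive spec of less than one URE per 10^14 bits read; the drive count and size are illustrative assumptions.

```python
# Rough odds of hitting at least one URE while rebuilding a degraded
# RAID 5 array: every surviving drive must be read end to end.
ure_per_bit = 1e-14   # URE probability per bit read (common consumer spec)
drive_tb = 4          # per-drive capacity in TB (assumed)
drives = 4            # total drives in the RAID 5 array (assumed)

bits_per_drive = drive_tb * 1e12 * 8
bits_read_during_rebuild = (drives - 1) * bits_per_drive

# Chance every single bit reads back cleanly.
p_clean_rebuild = (1 - ure_per_bit) ** bits_read_during_rebuild

print(f"chance of at least one URE during rebuild: {1 - p_clean_rebuild:.1%}")
```

With these numbers the model gives a better-than-even chance of hitting a URE mid-rebuild, which is exactly why large arrays of big consumer drives make RAID 5 a nervous proposition.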
Nested RAID:
This is the practice of combining different levels of RAID into one array, like RAID 5 + RAID 0. This can get mathematically complex, as the failure modes become more complicated and dependent on the specific implementation… I generally recommend staying away from all nested RAID implementations (unless you know exactly what you are doing and why) except RAID 1+0 (AKA RAID 10). If you are interested, check out this document on modeling the reliability of RAID sets; however, it does not take UREs into account, and that just makes this solution untenable.
RAID 10:
This is my second choice of RAID level as it provides the most fault-tolerance and good performance; however, it does require a minimum of four drives and you lose ~50% of the storage to mirroring. In the best-case scenario (losing drives 1 & 3 or drives 2 & 4 in a simple 4-drive setup) you can lose two drives and still have all the data… but in the worst-case scenario you can only lose one (as losing drives 1 & 2 or drives 3 & 4 in a simple 4-drive setup would result in data loss).
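The best-case/worst-case distinction can be made concrete by enumerating every two-drive failure in the simple 4-drive setup, assuming drives 1 & 2 form one mirror pair and drives 3 & 4 the other (matching the scenarios described above):

```python
from itertools import combinations

# RAID 10 as two mirrored pairs, striped together:
# pair A = drives 1 & 2, pair B = drives 3 & 4.
mirrors = [{1, 2}, {3, 4}]

def survives(failed_drives):
    """Data survives only if no mirror pair loses BOTH members."""
    return all(not pair <= failed_drives for pair in mirrors)

for failed in combinations([1, 2, 3, 4], 2):
    status = "OK" if survives(set(failed)) else "DATA LOST"
    print(f"drives {failed} fail -> {status}")
```

Four of the six possible two-drive failures survive; the two that land on the same mirror pair do not. Which two drives fail is not up to you, hence the "worst case: one drive" rule of thumb.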
Software RAID (like ZFS RAID Z2):
I am a fan of RAID Z2 and it is my first choice, as it provides the ability to lose ANY two drives in an array vs. RAID 10, where only in the best-case scenario can you lose two (otherwise you can only lose one drive). Another benefit is it only requires three drives vs. the four RAID 10 requires. There is, however, a performance penalty, as double parity (similar to RAID 5 but with an additional parity block) must be calculated; but with good spinning drives you can easily saturate a 1 Gb/s link (and, with a good processor, while compressed and encrypted), and with NVMe drives you can easily saturate a 10 Gb/s link. ZFS also has some interesting features like self-healing and checksumming which add additional error detection and data recovery options.
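A quick comparison of usable capacity helps here: RAID Z2 gives up exactly two drives' worth of space to parity regardless of array width, while RAID 10 always gives up half. The 4 TB drive size is an illustrative assumption, and this ignores ZFS metadata overhead.

```python
def usable_tb(n_drives, drive_tb=4):
    """Return (raidz2, raid10) usable capacity in TB for n equal drives.

    RAID Z2 loses two drives' worth to double parity; RAID 10 loses
    half its raw capacity to mirroring.
    """
    raidz2 = (n_drives - 2) * drive_tb
    raid10 = (n_drives // 2) * drive_tb
    return raidz2, raid10

for n in (4, 6, 8):
    z2, r10 = usable_tb(n)
    print(f"{n} drives: RAID Z2 {z2} TB usable vs RAID 10 {r10} TB usable")
```

At four drives the two come out even, and RAID Z2 pulls further ahead with every drive you add, while still tolerating the loss of any two.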
This is just about the time some people are thinking “ooohh, I should implement RAID and then I won’t have to back up my drives as they are mirrored (or whatever)”… PLEASE NO, STOP, BAD, DOWN, OFF, NO! This is where I introduce the concept of backups being different than redundant drives. As I pointed out earlier, drives are only one point of failure and there are numerous other components that can fail, such as CPUs, RAM, PSUs, etc. This is also the point where some enterprising engineer comes up with the idea “just have two; and we’ll sync data between them”. While this is redundant, as you have basically created a RAID 1 array of computers, it is not necessarily a backup, because if files are deleted on one they are then deleted on the other. I understand that some of these systems have shadow copy enabled, or hold deleted files for a few days/months after they have been deleted, but this is basically a “backup lite” that only works in specific scenarios (it can be especially useful against crypto-lockers), and I want something more robust. I am going to admit how I learned this lesson the hard way…
I had a nice NAS set up with RAID 10; it was perfect: reliable, redundant network links, encrypted, the works. One specific quirk of this setup was that while it was encrypted, the key for the encryption was stored on a separate boot drive to automatically decrypt the drives on bootup; this is important for later… Well, I needed to update the OS with some of the new security features and packages. When it booted back up, the array was locked. I was sure there was a glitch, so I checked the boot drive and found that it had failed. No problem, that’s why my boot drive was mirrored (RAID 1), so I replaced the drive and rebuilt the array, but the main RAID 10 array was still locked… weird… At this point I decided to switch to manual and try to unlock it with the key file on the boot drive; it failed over and over. I tried using recovery headers and trailers and checking the file system, everything, but I am guessing at some point the key got corrupted. No big deal, I have a backup of the key saved on one of my thumb drives… That did not work either (this also speaks to the old adage of testing your backups and processes regularly). I tried recovery options to try to get in, still no good, and everything with the drives was checking out as fully healthy; the data was matching and passing parity checks. Since it was encrypted with AES-128, it would take me a long time to crack it (approximately the current age of the universe multiplied by some large number), so it was basically an inaccessible bit-soup. While I got kind of lucky, as I had stored most of my important data on my desktop as well, I still lost a lot of files forever.
Having multiple copies of my data is what saved me from complete failure, and it is one of the important cornerstones of backup implementations. I expect sites to go down, drives to fail, and rooms to flood because maintenance precariously placed a bucket above your server to catch rainwater dripping through the ceiling and never emptied it (yeah, it happens). Having multiple copies of data distributed around, both online and offline (if you can), is important.
The next warning is: don’t let automation kill you. Just like that engineer who came up with the idea of two systems that sync as a backup: when one got crypto-locked, it synced all the changes to the other… Understanding your threats, whether environmental, malware, or human actors (as well as the way your system works, in this case), is often overlooked when looking at a backup solution. Is it secure in transit and at rest (and in use, if you want the full gamut, though that is not really as important as the first two)? Mistakes will be made, and the idea is to have layered security be your saving grace, not your downfall.
I also want to bring up location (or, as they say in real estate: location, location, location) as it is an important component of a backup system. One of the most common mistakes I see is buying an external drive, copying all your family photos and important documents onto it, and then putting it in a drawer right next to the computer… This is NOT a backup; I do not care if you make five copies and store them all over the house, it is still NOT a true backup. All it takes is one house fire, flood, tornado, earthquake, natural disaster, or bout of civil unrest and you still lose everything. About this time in the conversation I get the question “what if we store it in a neighbor’s house or down the street in this lock box”… well, here’s the problem: if your place is affected by the “disaster”, something down the street or even across town is probably affected as well, especially if it’s a hurricane, flood, or earthquake. Which is why I recommend your backups are stored in multiple disparate geographic regions.
This is where I turn to cloud providers; it is easier today to get access to a completely different region for storing data than it ever was before. You can go with one of the major cloud providers such as AWS (S3 or Glacier), Azure, or Google Cloud, which all provide datacenters in different regions. There are downsides to this, as some of these bigger companies have a bit of a nickel-and-dime problem and costs can be hard to calculate unless you know a lot of your storage metrics. My personal choice is Backblaze, as it is pretty easy to get your data stored with them now (in the beginning I had to write a script and run a BSD jail that integrated with one of their early beta APIs to upload my data to B2) and they have a few slick options for downloading data and some for uploading (similar to AWS Snowball). However, I understand if you are doing this for personal use or a startup this can get expensive and be a financial burden. You can use something like Dropbox, OneDrive, or iCloud; however, these are closer to a live file system (with some backup features) than a true backup and should be treated as such. As a reminder, crypto-lockers often attack anything mounted and sometimes specifically look for cloud drives to destroy.
There are a few things I caution users of cloud services about. First, do not put all your eggs in one basket; I recommend using at least two different cloud providers. Why, you may ask? (They all claim super high uptime and low data loss.) For the reason that they are businesses run by people, and mistakes do happen (AWS Says It’s Never Seen a Whole Data Center Go Down); from a reliability standpoint, I don’t want to find out in the middle of a restore that they lost my data. Not becoming dependent on one piece of software, API, or company is also good business practice; policies, politics, and security requirements change, sometimes forcing technical changes, and having multiple providers already set up allows you to pivot more easily. Please be sure to review your provider’s policies and take a look at the uptime SLA and data reliability SLA, and make sure you are looking at the contract page, not what they market (as they often market something like five or seven 9s and in the contract it’s only three 9s…)
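To put the marketing-versus-contract gap in perspective, here is what "N nines" of availability actually allows in downtime per year; the arithmetic is simple enough to check yourself:

```python
# How much downtime per year each level of "nines" permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes(nines):
    """Allowed downtime per year at N nines of availability."""
    return MINUTES_PER_YEAR * 10 ** -nines

for nines in (3, 5, 7):
    print(f"{nines} nines -> {downtime_minutes(nines):.3f} minutes/year allowed")
```

Three nines permits roughly 8.8 hours of downtime a year, while five nines permits about five minutes; that is an enormous difference to discover buried in a contract.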
You can also roll your own. If you have multiple sites or an offsite location, putting in a NAS (Network Attached Storage) that is a dedicated backup device, and is hardened as such, is a viable option; just realize you have to provide protection for that as well, and if it is in the same geographic area it is likely susceptible to the same environmental factors. If you do decide that rolling your own is the way to go, remember you need redundancy, as having a Raspberry Pi with a single external drive act as your backup is neither redundant nor reliable. While this may work in a personal situation, having a single spinning platter of rust (regardless of platform) as your backup is just asking for failure (while I use this currently, I do NOT consider it a backup appliance, and treat it as such). There is also the option of offline backups, which work great if you have good policies for regularly backing up and ensuring they are truly offline and redundant. With offline backups you do lose some availability of your data, especially if you need to get a drive back onsite to perform a restore (hopefully it’s not just a single drive you send off site). Another offline option is tape, which can be messy and complex (which is why a lot of corporations are moving away from it), but there are benefits to storing data on tape, such as price and security (mostly due to being offline).
You also have the option to implement a hybrid approach, and this can often result in a better outcome, as a mixed backup strategy often mitigates some of the issues with each of these solutions. While the upside is that your backup solution is more robust, the downside is often cost and potentially complexity.
As always have some redundancy, backup your stuff, and do not get raided.