Introduction

We all store more and more of our lives in digital form; spreadsheets, résumés, wedding speeches, novels, tax information, schedules, and of course digital photographs and video. All of this data is easy to store, transmit, copy, and share, but how easy is it to get back?

All of this data can be a harsh reminder that computers are not without fault. For years, storage costs have been dropping while at the same time the amount of storage in any one computer has been increasing almost exponentially. We are at a point where a single hard drive can contain multiple terabytes of information, and with a single mishap, lose it all forever. Everyone knows someone who has had the misfortune of having a computer stop working and wanting their information back.

It’s always been possible to safeguard your data, but now it’s not only necessary thanks to the explosion of personal data, it’s also more affordable than ever. When you think of the costs of backing up your data, just remember what it would cost you if you were to ever lose it all. This guide will walk you through saving your data in multiple ways, with the end goal being to have a backup system that is simple, effective, and affordable. In this day and age, you really can have it all.

It’s prudent at this point to define what a backup is, because there are a lot of misconceptions out there which can cause much consternation when the unthinkable happens, and people who thought they were protected find out they were not.

Backups are simply duplicates of data which are archived, and which can be restored to a previous point in time. The key is the data must be duplicated, and you have to be able to go back to an earlier time. Anything that doesn’t meet both of those requirements is not a backup.

As an example, many people trust their data to network storage devices with RAID (Redundant Array of Independent Disks). Without going into the intricacies of various forms of RAID, none of these Network Attached Storage (NAS) devices are any sort of a backup on their own. RAID is designed to protect a system from a hard disk failure and nothing more. Depending on the RAID level, it either duplicates disks, or uses a calculation to create a parity of the data which can be used to calculate the original value of the data if any part of the data is missing from a failed disk. While RAID is an excellent mechanism to keep a system operational in the event of a disk failure, it is not a backup because if a file is changed or deleted, it is instantly updated or removed on all disks, and therefore there is no way to roll back that change. RAID is excellent for use as a file share, and can even be effectively utilized as the target for backups, but it still requires a file backup system if important data is kept on the array.

Another similar example is cloud storage. Properly configured, cloud storage can be a backup target, and different services can even properly perform backups, but the average person with the average Google Drive or OneDrive account can’t copy their files there and hope they are protected. As with RAID, it is a more robust file storage than any single hard drive, but if you delete a file, or copy over another, it can be difficult or impossible to go back to a previous version.

Both RAID and cloud storage suffer from the same problem – you can’t go back to an earlier time, and therefore are not a true backup. True backups will allow you to recover from practically any scenario – fire, flood, theft, equipment failure, or the inevitable user error. This guide will walk you through several methods of performing backups starting at simple and moving up to elaborate systems that will truly protect your data. These methods work for home and business alike, just the type of equipment will likely differ.

There is some common terminology used in backups that should be defined before we start discussing the intricacies of backups:

  • Archive Flag: A bit setting on all files which states whether or not the file has been modified since the last time the flag was cleared.
  • Full Backup: A backup of all files which resets the archive flag.
  • Differential Backup: A backup of all files with the archive flag set, but it does not clear the archive flag.
  • Incremental Backup: A backup of all files with the archive flag set which resets the archive flag.
  • Image or System Based Backup: A complete disk level backup which would allow you to image a machine back to a previous state.
  • Deduplication: A software algorithm which removes all duplicate file parts to reduce the amount of storage required.
  • Source Deduplication: removing duplicate file information from files on the client end. This requires more CPU and memory usage on the client, but allows for a much smaller file size to be transferred to the backup target.
  • Target Deduplication: removing duplicate file information from files on the target end. This saves client CPU and memory usage, and is used to reduce the amount of storage space required on the backup target.
  • Block Level: A backup or system process which accesses a sequence of bytes of data directly on the disk.
  • File Level: A backup or system process which accesses files by querying the Operating System for the entire file.
  • Versioning: A list of previous versions of a file or folder.
  • Recovery Point Objective (RPO): The amount of time since the last backup deemed safe to lose in a disaster scenario. For example, if you perform backups nightly, your RPO would be the previous night’s backups. Anything created in between backups is assumed to be recoverable through other methods, or an acceptable loss.
  • Recovery Time Objective (RTO): The amount of time deemed acceptable between the loss of data and the recovery of data. For home use, there’s really no RTO but many commercial companies will have this defined either with in-house IT or with a Service Level Agreement (SLA) to a support company.
 
Plan Your Backups
POST A COMMENT

131 Comments

View All Comments

  • patrickjchase - Thursday, May 22, 2014 - link

    ZFS RAIDZ implements strong checksums within each drive, such that it can reliably detect if a drive is returning bad data and ignore it. In some respects it's actually stronger than RAID6 in terms of its ability to deal with silent corruption. That's why NonSequitor wrote "double parity OR checkums" (RAID6 is double-parity, RAIDZ is single-parity augmented with strong per-disk, per-chunk checksums).

    If you're halfway competent then it's not "extremely likely" that you'll encounter unreadable sectors during a RAID5 rebuild. There's a reason why both good RAID controllers and ZFS implement scrubbing (i.e. they can periodically read every disk end to end and remap any unreadable sectors). If you do that every couple days then the likelihood of encountering a new (since the last scrub) unreadable sector may or may not be high depending on your rebuild time.

    For example I have a 5-disk RAID5 array that I use for "cold" storage. I scrub it daily, and rebuilding to a hot-spare takes 6 hours (I've tested it several times, verifying the results against separate copies of the same files), which means that the maximum delay between the most recent scrub and the end of a rebuild is 30 hrs. The scrubs have only found one bad sector in ~2 years, so I respectfully submit that the likelihood of an additional failure within 30 hours of a scrub is pretty darned low.
    Reply
  • beginner99 - Thursday, May 22, 2014 - link

    exactly. anything above 2 TB drives becomes really problematic in this regard. With 4 TB drives it's almost guaranteed a RAID-5 rebuild will fail. IMHO if you do RAID, do RAID-1. Reply
  • patrickjchase - Thursday, May 22, 2014 - link

    RAID1 has exactly the same problems as RAID5 - In the case of silent corruption it can't determine which disk is bad, and it's vulnerable to a single disk failure during a rebuild. The likelihood of such a failure is obviously lower (now you only have to worry about 1 other disks instead of 2 or more) but not hugely so. RAID6/RAIDZ2 is statistically much better until you get up to really high drive counts.

    The "big boys" with truly mission-critical data do N-way replication, i.e. all critical data is replicated (n>=3) times on different systems.
    Reply
  • jimhsu - Wednesday, May 21, 2014 - link

    +1 to crashplan. For my most important data, I have n+2 backups: n being the number of computers I have (meaning that Onedrive automatically syncs them); 2 being a crashplan online subscription as well as a local crashplan backup. I also have restore previous versions running, and use it on occasion, but don't consider it a backup per se. Reply
  • pjcamp - Wednesday, May 21, 2014 - link

    I plowed through all the competitors a few years ago, and Crashplan was the one I selected. It had the cheapest unlimited storage with version history, and the (for me) killer feature that there exists a Linux client. I have a FreeNAS box that I use for media storage. I can mount it as a drive on my Linux machine and the Crashplan client will back it up just as if it were a local drive. There is also an Android client that gives you access to all your files, functioning as a sort of personal Dropbox, without sharing but with better security.

    I've had occasion to use my backups a couple of times and found it easy and speedy, much more so than I expected for a cloud service.
    Reply
  • cknobman - Wednesday, May 21, 2014 - link

    Everything in my house goes to a personally built server onto dedicated RAID storage drives. No accounts other than my personal administrator account have access to do anything but read.

    Those drives are then backed up to the cloud via CrashPlan.

    Simple, effective, and as foolproof as I can get for now.
    Reply
  • uhuznaa - Wednesday, May 21, 2014 - link

    One thing to note with Time Machine: You don't need to use the fancy interface for restoring files. Just browse your backup disk with the Finder (or on the command line), there's a directory for every backup from which you can just copy things over.

    To CrashPan: I used that for a while, but found it to be utterly uncontrollable. The log files are a joke, the status mails arrive at random times (or not at all) and are useless ("Files: 117k, Backed Up: 99.9%") and often enough when a backup didn't run for some reason it's impossible to debug because there's no real error reporting. It may work somehow, but it has all the marks of something I don't want to rely on.
    Reply
  • NCM - Wednesday, May 21, 2014 - link

    Yes, that's indeed a big point for Apple's TimeMachine — that and it being included free with every Mac. If necessary you can just go digging into a TM archive and pull out what you need. Reply
  • SkateboardP - Wednesday, May 21, 2014 - link

    Hi Brett,

    great guide but can you or others please check if the OneDrive backupsolution is throttled and capped at 355kb/s under win8.1 (Desktop)? I read that many people complaining about that. Thanks.
    Reply
  • plext0r - Wednesday, May 21, 2014 - link

    I've been using Duplicati for a few years and it has been good. "It works with Amazon S3, Windows Live SkyDrive, Google Drive (Google Docs), Rackspace Cloud Files or WebDAV, SSH, FTP (and many more)." It uses rsync under the covers and I use it to backup to my RAID-5 NAS in the basement. I also perform offsite backups (rotate 1TB disks to and from my workplace). Reply

Log in

Don't have an account? Sign up now