Andrew J. Nelson
Published: 2 August 2011
The objective is to create a series of daily, snapshot style, incremental backups in which each backup has the following properties: it will be date stamped; it will reflect the changes to the file structure since the prior backup; and it will appear as a complete file structure. Additionally, the entirety of the backup will occupy only slightly more disk space than a single complete backup, even though each daily backup gives the appearance of a full backup. Finally, the entire backup process will be automated.
The incremental snapshot backup method with rsync was first published by Mike Rubel. It was an innovative concept, and he deserves all the credit for it. This document expands and refines the methods originated by Mike Rubel.
When learning rsync myself, one of the more confusing aspects was that many online tutorials used inconsistent terms. We will define keywords now, and use them consistently throughout the document.
"Rsync is a fast and extraordinarily versatile file copying tool." Rsync can copy files either locally (files stay on the same computer) or remotely (files are copied from one computer to another computer via a network; the network can be a LAN or the internet). Rsync has a tremendous number of options that allow the user to control every aspect of the copy process.
The beauty of rsync is that it uses an algorithm to compare the source directory against the target directory, and only copies to the target what is different in the source. It doesn't even copy over an entire file if only a part of it has changed; it copies only those bytes which are different. This dramatically reduces your overhead for transferring files.
Assume a working directory called Alpha. On day one you create a full backup of Alpha called Alpha_Full. On day two you would create an incremental backup called Alpha_2. This backup only contains those files from the working directory Alpha which are different from the full backup, Alpha_Full. Because only what has changed is copied over, it is much faster and uses far less disk space than a full backup.
On day three, another backup is made named Alpha_3. This time, only what has changed in Alpha compared to the prior day's incremental backup (Alpha_2) is copied over, instead of what has changed in Alpha compared to the full backup (Alpha_Full). This is the key concept in incremental backups: only what has changed since the prior day is backed up.
Each day, a differential backup copies over what is different in the source directory as compared against the original full backup. As the amount of time since the full backup increases, the number of changes between the working source directory and the full backup correspondingly increases; therefore, the size of the daily backup increases.
Continuing from our prior example, if we were making differential backups Alpha_10 would consist of all the changes in the source directory in the last 10 days. This tutorial utilizes the incremental approach instead of the differential, but it is best to understand both forms.
It is easiest for most people to utilize a backup if it looks like the full file structure, instead of perhaps just a few files which have changed since yesterday sitting alone in a directory. Indeed, if the backup process accounts for deleted files, it is harder to identify that a file has been deleted without the context of the full file structure. It is also harder to differentiate files backed up because they were deleted instead of backed up because they were changed.
The most natural way to use a backup is for a user to see the backup as a snapshot: a picture of the way the file structure looked at a certain point in time.
We accomplish this with rsync by hard linking to all unchanged files from the incremental backup. The hard link creates the appearance of a separate copy of the unchanged file existing in the incremental backup; however, no second copy exists. It is just another link to the same file. Thus, the appearance of a full daily backup is created without the overhead in time, network resources, or disk space.
The concept of hard links is integral to this process. The next section delves into hard links in detail, so that we may grok it fully before continuing.
This section will explain what a hard link is and walk through a couple of exercises to reinforce the concept for the user. I know that I intellectually understood it, but didn't really get it (or grok it, if you will) until I had played around with some commands looking at it.
An inode is an index of a file's attributes, such as file permissions, owner, group, file size, number of hard links, and times of last access, modification, and status change. (This is not an exhaustive list.) Typically, an inode is associated with exactly one directory entry. However, it is possible to associate an inode with more than one directory entry by creating a hard link.
Indeed, a file name is not the file itself, but a hard link to the file. So, let's play with hard links for a bit. Create a new, empty file named foo.txt, for example with touch foo.txt.
We can now use the stat command to examine the properties of that file. Run stat foo.txt.
The output will show the inode number and the number of links to the inode. Now, let's create a hard link to foo.txt using the ln command:
ln foo.txt bar.txt
Now run stat on each filename. You will see that the output is exactly the same; they share the same inode number and have an equal number of links. You can also use ls -i to view just the inode number; foo.txt and bar.txt are the same thing.
So what happens if you operate on them? If you make changes to one, the same changes are made to the other. If you modify ownership or any other attribute of one, the same modifications are made to the other. However, if you use rm on one, you do not delete both. The rm command actually removes the hard link to the file; the other link will remain. The file itself will not be deleted until the number of links to it reaches zero.
Go ahead and edit the file, change its permissions, etc. After each change, run stat on each file until you are convinced in your soul that foo.txt and bar.txt are the same thing. Then use rm on one, stat the remainder, then rm it.
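The exercise above can also be run end to end in a scratch directory. This is a minimal sketch, assuming GNU coreutils (for stat -c); the scratch directory created by mktemp and the variable names are illustrative only.

```shell
#!/bin/sh
# Hard link walkthrough in a throwaway directory (assumes GNU stat).
WORK=$(mktemp -d)
cd "$WORK" || exit 1

# Create a file, then a second directory entry for the same inode.
echo "hello" > foo.txt
ln foo.txt bar.txt

# %i is the inode number, %h is the number of hard links.
FOO_INODE=$(stat -c %i foo.txt)
BAR_INODE=$(stat -c %i bar.txt)
LINKS_BEFORE=$(stat -c %h foo.txt)

# A change made through one name is visible through the other.
echo "world" >> bar.txt
FOO_LINES=$(wc -l < foo.txt)

# Removing one name only lowers the link count; the data survives.
rm foo.txt
LINKS_AFTER=$(stat -c %h bar.txt)

echo "inodes: $FOO_INODE $BAR_INODE"
echo "links before rm: $LINKS_BEFORE, after rm: $LINKS_AFTER"
```

The two inode numbers printed should match, and the link count should drop by one after the rm while bar.txt keeps both lines of content.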
We make use of this with rsync to create the illusion of a full copy of a file structure, when in fact all we have done is hard linked. All the data is accessible as if the full structure were there, without the hard disk and network overhead of doing a full copy. But I'm getting ahead of myself; more on that later.
The basic usage of rsync is very simple:
rsync [options] [source directory] [target directory]
This compares the source directory against the target directory, and copies over everything that is different in the source to the target.
There is one point of syntax in this that gets a lot of new rsync users hung up. Given a source directory bravo containing file widget.txt (breaking conventions by not using foo and bar!), and a target directory charlie, then
rsync -a /bravo /charlie
is not the same thing as
rsync -a /bravo/ /charlie
The first command copies the directory bravo, and its contents, into charlie, resulting in /charlie/bravo/widget.txt. The second command copies only the contents of bravo into charlie, resulting in /charlie/widget.txt. So remember, a trailing slash on the source directory means that the source directory's contents are copied, not the directory itself.
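The trailing slash rule is easy to check for yourself. The sketch below recreates bravo and widget.txt inside a scratch directory instead of at the filesystem root; the charlie1 and charlie2 directory names are made up for the comparison, and rsync is assumed to be installed.

```shell
#!/bin/sh
# Trailing slash demo in a scratch directory; requires rsync on the PATH.
WORK=$(mktemp -d)
mkdir -p "$WORK/bravo" "$WORK/charlie1" "$WORK/charlie2"
echo "widget" > "$WORK/bravo/widget.txt"

# No trailing slash: the directory bravo itself is copied into the target.
rsync -a "$WORK/bravo" "$WORK/charlie1"

# Trailing slash: only the contents of bravo are copied into the target.
rsync -a "$WORK/bravo/" "$WORK/charlie2"

# List what ended up where.
find "$WORK/charlie1" "$WORK/charlie2" -type f
```

The listing should show widget.txt nested under charlie1/bravo in the first case and sitting directly in charlie2 in the second.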
Obviously, detailing the massive number of options that rsync has available is beyond the scope of this document. We will talk about the specific ones used in our method.
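The first option we use is -a, which stands for archive mode. Archive mode is a compilation of other switches (it is equivalent to -rlptgoD), which amount to preserving almost everything: it recurses into directories and preserves symbolic links, permissions, modification times, group, owner, and device and special files.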
So you can see that by using archive mode, the files are copied in such a way that they could be restored to the source directory transparently; everything vital to their use is maintained.
The second option we utilize is -v, which as usual requests verbosity in reporting. It's always nice to see what is going on, and export it to a log if you so choose.
The third option is -h, which requests that transfer amounts are expressed in human readable format. Instead of 1256842136 bytes transferred, you will see 1.257 GB, or something like that.
The fourth option is --delete, which deletes files from the target directory that have been deleted in the source directory. So not only are files which are different in the source directory copied to the target directory, files which don't exist in the source but do exist in the target are deleted from the target.
Our last option is --link-dest=[link directory], and this takes a bit of explaining.
Rsync typically evaluates against the target directory. The --link-dest option instructs rsync to use the specified link destination directory as an additional file structure to evaluate against; files that are unchanged in either the target directory or the link destination directory are not sent. Additionally, unchanged files are hard linked from the link destination directory to the target directory.
Let's break that down, continuing our example. Our working directory, Alpha, has a full backup called Alpha_Full created on day one. On day two, we want to create our incremental backup in a new directory called Alpha_2. If we created a new empty directory called Alpha_2 and ran:
rsync -avh --delete /Alpha/ /Alpha_2
A new full backup of Alpha would be made in Alpha_2. That isn't incremental, and doesn't do us any good. By using the --link-dest option like this:
rsync -avh --delete --link-dest=/Alpha_Full /Alpha/ /Alpha_2
We are instructing rsync to back up the contents of Alpha into Alpha_2, evaluating for differences against Alpha_Full. It is as if Alpha_2 were Alpha_Full for the purpose of checking for changes. Rsync transfers those files that have changed, and hard links everything else from Alpha_Full to Alpha_2. The hard links create the appearance of the files existing twice, once in each of Alpha_Full and Alpha_2, but there is actually only one copy of each file. Finally, files that were deleted in Alpha simply do not appear in Alpha_2, while their historical versions remain in Alpha_Full.
Thus, we have created a snapshot, incremental backup. The next day we would do the same thing, creating Alpha_3, but instead of evaluating against Alpha_Full for changes, we would check against Alpha_2.
A concern you may have is: what happens to a file that is originally stored in Alpha_Full, then on day 10 is changed? If Alpha_10 only has a hard link to the original file, and that file is changed, does rsync change the original, thus ruining the snapshot effect?
The answer is no. Rsync is "wicked smart", as they say in my neck of the woods. When evaluating for changes against a hard link, if the source file is different, the destination file is unlinked before transfer. A new version of the file is created in the target directory, and the historical version of the file is preserved in earlier backups. You have to give explicit options for rsync to do otherwise, and the man pages hint rather broadly that it might be a bad idea.
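To see both behaviours at once (unchanged files hard linked, changed files stored separately), the example can be replayed in a scratch directory. This is a sketch under a few assumptions: rsync and GNU stat are available, everything lives on one filesystem so hard links are possible, and the file names same.txt and edited.txt are invented for the demonstration.

```shell
#!/bin/sh
# --link-dest demo in a scratch directory; requires rsync and GNU stat.
WORK=$(mktemp -d)
mkdir -p "$WORK/Alpha"
echo "stable" > "$WORK/Alpha/same.txt"
echo "version 1" > "$WORK/Alpha/edited.txt"

# Day one: full backup.
rsync -a "$WORK/Alpha/" "$WORK/Alpha_Full"

# Day two: one file changes in the working directory.
echo "version 2, now longer" > "$WORK/Alpha/edited.txt"

# Incremental snapshot, hard linking unchanged files to Alpha_Full.
rsync -a --delete --link-dest="$WORK/Alpha_Full" "$WORK/Alpha/" "$WORK/Alpha_2"

# Unchanged file: the same inode should appear in both snapshots.
SAME_FULL=$(stat -c %i "$WORK/Alpha_Full/same.txt")
SAME_DAY2=$(stat -c %i "$WORK/Alpha_2/same.txt")

# Changed file: separate inodes, with the old content preserved.
EDIT_FULL=$(stat -c %i "$WORK/Alpha_Full/edited.txt")
EDIT_DAY2=$(stat -c %i "$WORK/Alpha_2/edited.txt")
OLD_CONTENT=$(cat "$WORK/Alpha_Full/edited.txt")

echo "same.txt inodes:   $SAME_FULL $SAME_DAY2"
echo "edited.txt inodes: $EDIT_FULL $EDIT_DAY2"
echo "Alpha_Full still holds: $OLD_CONTENT"
```

Note that the --link-dest path here is absolute; a relative path would be interpreted relative to the target directory.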
All of our examples thus far have used rsync to copy files locally. Rsync can also be used to back up to another computer across a network, whether that network is a LAN or the internet. You can specify which remote shell to use with -e, followed by the shell name. The syntax for the source directory remains the same. The syntax for the target directory works much like scp: the user name, an @ sign, the host name, a colon, and then the path on the remote machine.
Continuing our example, if we were making our backups of Alpha to a remote computer at a domain named pluto.org, it would look like this:
rsync -avh -e ssh --delete /Alpha/ username@pluto.org:/Alpha_2
Substitute username with an account that has the appropriate rights. You would then be prompted for the password. You can obviate the need for passwords by using preshared keys, or setting up an rsync daemon server. Both of those options fall outside the scope of this document.
The remainder of the tutorial assumes that the backups are made to a network share or a separate partition on the same computer, not a remote machine. Even so, it is important to understand that rsync has this capability.
We now understand the methods and theory behind creating a snapshot style, incremental backup with rsync. The next step is implementation. Other methods use a backup naming convention such as backup1, backup2, backup3. The script used to automate the method then deletes backup3, renames backup2 to backup3, and likewise renames backup1 to backup2. It then creates a new backup1, using backup2 as the link destination directory.
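For comparison, the numbered rotation just described might look roughly like this. It is only a sketch of that other approach, not the method this tutorial builds; rotate_backups is a hypothetical helper name, and the rsync step is left as a comment.

```shell
#!/bin/sh
# Sketch of the numbered rotation described above (not the date stamp method).
# rotate_backups is a hypothetical helper; ROOT holds backup1..backup3.
rotate_backups() {
    ROOT=$1
    # Drop the oldest snapshot, then shift the others up by one.
    rm -rf "$ROOT/backup3"
    [ -d "$ROOT/backup2" ] && mv "$ROOT/backup2" "$ROOT/backup3"
    [ -d "$ROOT/backup1" ] && mv "$ROOT/backup1" "$ROOT/backup2"
    # A new backup1 would now be created with rsync, using
    # --link-dest="$ROOT/backup2" as the link destination directory.
    return 0
}
```

Every additional backup kept on hand adds another mv line, which is the bookkeeping the date stamp approach avoids.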
That successfully creates a rotating backup structure. However, I found the steps required to do the process somewhat convoluted, and the number of steps increases with the number of backups you want to keep on hand.
I thought it would be simpler if the script could simply recognize yesterday's backup by name. I also thought it would be easier to recognize and find the backup you need by date, rather than by an abstract identifier. We can accomplish both these goals by using date stamps for the backup names.
Let's walk through the steps we would want our automated script to do.
So the first step in making this work is getting a shell script to be able to find today's date, yesterday's date, and a third date from x number of days ago. If you want to keep backups for 28 days, x would equal 29. I'll talk about why step three is optional later.
When we talk about date stamping, we mean using a date format such as 2011-07-24 to represent July 24, 2011. This format is unambiguous, universally recognizable, and guarantees that files named in that format will accurately sort by date ascending or descending.
The man page documentation for date is varied, and sadly, often incomplete. I am going to show some of the hard ways of doing things before showing the easier ways. (I am a firm believer that you never learn anything by doing it right.)
If we want the system to output today's date in date stamp format, one way of doing it is to use the date command with modifiers that explicitly state the format, such as date +%Y-%m-%d.
This is also known as an ISO-8601 compliant format. What is not well known about the date command is that it has a switch to output in ISO-8601: -I. So, date -I accomplishes the same thing with far fewer symbols.
To use this in our script, we can create a variable, DAY0, and assign it that value by running the date command within backticks.
We now have a variable which will always be equal to today's date in date stamp format. Useful to our objective, no? Next, we need a variable equal to yesterday's date in date stamp format.
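Put together, the date stamp steps might look like the sketch below. It assumes a date command that supports -I (GNU date does); LONG_FORM and SHORT_FORM are illustrative names, while DAY0 is the variable the backup script uses for today's date.

```shell
#!/bin/sh
# Two ways to print today's date as a date stamp (assumes date supports -I).
LONG_FORM=$(date +%Y-%m-%d)
SHORT_FORM=$(date -I)

# Assign the date stamp to a variable; the backticks run the command
# and substitute its output.
DAY0=`date -I`

echo "explicit format: $LONG_FORM"
echo "ISO-8601 switch: $SHORT_FORM"
echo "DAY0 variable:   $DAY0"
```

All three lines should print the same date.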
I went through quite the process on this before I found a simpler method. The hard way was that I converted today's date to epoch time (the number of seconds since midnight, January 1, 1970) and assigned that value to a variable. I then subtracted one day's worth of seconds from that variable and assigned that value to a new variable. I then converted the new variable back into date stamp format. All of this required some esoteric commands.
Now for the easy way. The date command also has an under-documented function where it will tell you a date in the past or future with simple syntax. For instance, if today's date is 2011-07-27, then the following code will output 2011-07-26 and assign it to a variable:
DAY1=`date -I -d "1 day ago"`
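For the curious, here is roughly what the hard way and the easy way look like side by side. This is a reconstruction under assumptions, not the exact commands originally used: it relies on GNU date (for -d and the @epoch input form), and NOW, THEN, HARD_WAY, and EASY_WAY are illustrative names.

```shell
#!/bin/sh
# Yesterday's date stamp two ways (assumes GNU date for -d and @epoch input).

# The hard way: convert to epoch seconds, subtract one day, convert back.
NOW=$(date +%s)
THEN=$((NOW - 86400))
HARD_WAY=$(date -d "@$THEN" +%Y-%m-%d)

# The easy way: let date do the arithmetic.
EASY_WAY=$(date -I -d "1 day ago")

echo "hard way: $HARD_WAY"
echo "easy way: $EASY_WAY"
```

The two normally agree, although subtracting a fixed 86400 seconds can land on a different calendar day when run near midnight around a daylight saving change.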
Let's flesh out the rest of the script. Make these assumptions: we are backing up a website to a directory named /backup located on a separate partition. The full initial backup was made manually to the directory with a backup name of 2011-07-01. We will set variables equal to our source directory, our target directory, our link destination directory, and our options. We will then execute the rsync command. Create a script called website_bak.sh. (Make sure you make it executable.)
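#!/bin/sh
#Website Backup Script
#Todays date in ISO-8601 format:
DAY0=`date -I`
#Yesterdays date in ISO-8601 format:
DAY1=`date -I -d "1 day ago"`
#The source directory (assumed example path; substitute your website's directory):
SRC="/var/www/htdocs/"
#The target directory:
TRG="/backup/website/$DAY0"
#The link destination directory:
LNK="/backup/website/$DAY1"
#The rsync options:
OPT="-avh --delete --link-dest=$LNK"
#Execute the backup
rsync $OPT $SRC $TRG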
If the initial full backup is made on 2011-07-01, you can have this script executed as a cron job starting on 2011-07-02. Each day it will make an incremental date stamped backup, using the prior day's backup as the link destination directory. There is no need to rename backups or move them around. You will also note that the rsync command itself creates the target directory. This will work for one directory level; however, if the directory /backup/website did not exist, such that rsync had to create both /backup/website and the $DAY0 directory inside it, the backup would fail. Rsync would exit with an error code.
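One hedged way to guard against a missing parent directory is to create it before rsync runs. The sketch below is an optional addition, not part of the script above; ensure_backup_root is a hypothetical helper name, and the path in the usage comment is the one assumed throughout this tutorial.

```shell
#!/bin/sh
# Sketch: make sure the parent of the dated target exists before rsync runs.
# ensure_backup_root is a hypothetical helper, not part of the original script.
ensure_backup_root() {
    # mkdir -p creates any missing levels and succeeds if they already exist.
    mkdir -p "$1"
}

# Example use in the backup script (path assumed from the text):
#   ensure_backup_root /backup/website || exit 1
#   rsync $OPT $SRC $TRG
```

Rsync then only has to create the final, date stamped level itself.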
The last optional step is to remove older backups. The reason I say this is optional is because it depends on how much data you are working with and the degree of change it experiences. This method is very efficient at reducing overhead; you could conceivably keep incremental backups for years without running out of space if your data doesn't change drastically.
However, let's say that you want to only keep the last 28 days of data. Add the following lines to the script.
#29 days ago in ISO-8601 format
DAY29=`date -I -d "29 days ago"`
#Delete the backup from 29 days ago, if it exists
if [ -d /backup/website/$DAY29 ]
then
rm -rf /backup/website/$DAY29
fi
Voila, a script that does everything we need for a backup. Our final step is to automate its use.
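One caveat before automating: deleting only the directory from exactly 29 days ago means that if the script misses a day, the backup that should have been removed on that day stays behind. A possible alternative, offered only as a sketch, is to remove every date stamped directory older than a cutoff. It leans on the fact that date stamps in this format sort correctly as plain text. The function name prune_backups and its two arguments are illustrative, and GNU date is assumed.

```shell
#!/bin/sh
# Sketch: remove every date stamped backup older than a cutoff.
# prune_backups and its arguments are illustrative, not from the original script.
prune_backups() {
    ROOT=$1      # directory holding the dated backups, e.g. /backup/website
    KEEP=$2      # how many days back to keep, e.g. 28
    CUTOFF=$(date -I -d "$KEEP days ago")
    for DIR in "$ROOT"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
        # Skip the unexpanded pattern and anything that is not a directory.
        [ -d "$DIR" ] || continue
        STAMP=$(basename "$DIR")
        # Date stamps sort as text, so a string comparison orders them by date.
        if expr "$STAMP" \< "$CUTOFF" >/dev/null; then
            rm -rf "$DIR"
        fi
    done
}

# Example use (path assumed from the text):
#   prune_backups /backup/website 28
```

Directories whose names do not look like date stamps are left alone, and a backup dated exactly 28 days ago is kept.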
Linux has a scheduling utility called cron which will run operations at a fixed time. Going into how cron works and editing its settings is beyond the scope of this document. However, if you just want a script to run daily, it is simply a matter of placing it in the correct directory.
So, copy the backup script to the cron daily directory. In Slackware, this is:
cp -v website_bak.sh /etc/cron.daily/
After a long journey, we have reached our goal of having automated, snapshot style incremental backups using rsync. Along the way we learned the difference between incremental and differential backups, what hard links are, how rsync works, and how to pull it all together with a shell script. I hope that this article helps you in your endeavors.