The Linux Page

Copying large directories between computers

Twin Windows — perfect duplication

Copying Data between Computers

Since I'm moving to a new server, I need to copy a few hundred Gb of data from my old computer to the new one. This is mainly three folders:

/home
/mnt/cvs
/var/www

There are two main problems here:

1. the files have various permissions and ownership which I do not want to lose, especially for the websites and the CVS repositories (I still have CVS, but that folder also includes SVN and Git repositories)

2. the files on the source computer require various permissions to be read; namely, I have to be root to make sure I can read all of those files (a quick check of that is shown below).
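
As a quick, hedged check of that second point, GNU find (with its -readable test) can show what a regular, non-root account is not even allowed to read:

# run as your normal user; counts the files you cannot read (GNU find required)
find /var/www ! -readable 2>/dev/null | wc -l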

In many cases, to copy files across multiple computers one uses a tool such as `scp`. In our case, though, we need something that works better. That is, a solution which (1) allows root to do the transfer, (2) does not lose ownership information, and (3) can transfer the data without having to make a copy on my old computer (my drives are quite full already, so I most likely would not have enough room for a full copy of my data).

So scp is no good. It works recursively, but it completely ignores ownership and it can't be run as root (unless you let your root user use ssh and then copy the files into your root account; I guess that's okay on a set of local computers).
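
To make that concrete, here is a hedged sketch of the scp command one might be tempted to use (the destination path is only an example); -r recurses and -p keeps modes and timestamps, but every copied file ends up owned by the account you connect as:

# -r recurses, -p preserves modes and times, but ownership is lost
scp -rp /var/www destination:/tmp/www-copy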

A solution is to use that good old tar tool. You can run tar across an SSH tunnel and save the output on the other side, or even extract it immediately over there. I have one problem, though: I'm 99.9% sure that I don't want to extract immediately, because I'm pretty sure I have users and groups on my old server that do not yet exist on the new server. So I want to make sure the extraction works right, and for that I need to get the whole file first and extract those names. Then I'll be able to verify that they all exist on my new computer.
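
Once the archive has arrived on the new computer (the www-data.tar.bz2 file created below), a hedged way to pull out all the owner and group names, so they can be compared against /etc/passwd and /etc/group, is something like:

# list every owner/group pair stored in the archive (second column of tar's verbose listing)
tar tvjf www-data.tar.bz2 | awk '{print $2}' | sort -u

Any name that does not exist yet on the new server should be created first; otherwise tar falls back to the numeric IDs stored in the archive.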

So far, my home folder is some 400Gb and my www folder is over 15Gb... (still going) So as you can imagine, it's taking time to transfer everything.

Here is the command line I used:

sudo tar cjf - /var/www/ | ssh destination 'cat > www-data.tar.bz2'

As we can see, I use sudo to run tar. Since there is a pipe in between, the ssh command is not affected by sudo; it runs as my regular user.

tar is asked to send its output to a file (the f option) and that file is stdout (-). That's what gets piped. It also compresses everything using bzip2 (the j option) so the transfer is a bit faster, although that pegs one of the processors on my source computer... (I only have 4 here!)
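
If the single letter options look cryptic, the same pipeline can be spelled out with GNU tar's long options (a sketch equivalent to the command above):

# --create an archive, compress it with --bzip2, and write it to stdout (--file=-)
sudo tar --create --bzip2 --file=- /var/www/ | ssh destination 'cat > www-data.tar.bz2'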

The ssh command opens the secure connection and then executes the command shown between the quotes. Here I want to save the data to a tar file, so I just use cat to copy its standard input (the data coming over the connection) to a file. On my new computer, I have a ton of space (i.e. I'm going from 2Tb to 22Tb, so I think I'll be fine for a little while, but I'll be doing videos, so it will fill up fast anyway.)

Note that you can check the file using another ssh connection, or directly on the destination with a simple

ls -l www-data.tar.bz2

to see it growing.
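
If you prefer to stay on the source computer, a hedged way to keep an eye on the archive from there (assuming the same destination alias as above) is:

# one-shot check over another ssh connection
ssh destination ls -l www-data.tar.bz2
# or refresh the listing every 10 seconds (watch needs a terminal, hence -t)
ssh -t destination watch -n 10 ls -l www-data.tar.bz2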

If you'd like to extract immediately (i.e. you know that it is safe to do so), then use the tar tool on the other side too:

... | ssh destination 'tar xjf -'

You may want to look at additional options such as -C <dir> (to extract in a specific directory) and the various preserve flags (-p / --preserve-permissions, --same-owner) to make sure permissions and ownership are kept as expected.
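
For example, a hedged version of the full pipeline with immediate extraction could look like this (the target directory /tmp/restore is just an example; ownership is only restored when the extracting tar runs as root, as noted below):

# -C switches to the target directory before extracting; -p keeps the permission bits
sudo tar cjf - /var/www/ | ssh destination 'tar xjf - -C /tmp/restore -p'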

Note that if you need to be root to create the tar, you will also need to be root to extract it if you want to restore ownership, and that may be complicated to do in one go. So you may need a lot more space on the destination computer: first receive the compressed archive, then extract it to the right folder.
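
A hedged sketch of that two step approach (the paths match the earlier example) would be:

# step 1, on the old computer: ship the archive
sudo tar cjf - /var/www/ | ssh destination 'cat > www-data.tar.bz2'
# step 2, on the new computer: extract as root, restoring permissions and owners
sudo tar xpjf www-data.tar.bz2 -C / --same-owner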

In case you wanted to copy a folder from a remote computer to your local computer, you can also use the following:

ssh remote-source tar czf - path/to/files >remote-files.tar.gz
ssh remote-source tar cjf - path/to/files >remote-files.tar.bz2

Usually, bzip2 compresses better than gzip, so you will use less bandwidth with that compression format. Even better is xz, in which case the following can be used:

ssh remote-source 'tar cf - path/to/files | xz -9' >remote-files.tar.xz

Notice the quotes around the command. In this case they are necessary because we pipe the tar output to the xz tool on the remote computer. Note that I like to use the -9 option. Know, however, that it is very slow; it reduces your bandwidth usage a bit more, but probably not that much more. If you are in a hurry, you may want to use a much smaller number such as -3.
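
For example, the faster variant only changes the compression level:

# -3 is much faster than -9 and still compresses reasonably well
ssh remote-source 'tar cf - path/to/files | xz -3' >remote-files.tar.xz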

Warning: Side Effects

One important aspect of copying large swaths of files this way is the impact on the source computer's file system, CPU, and memory.

With the xz utility, a very large amount of memory is going to be used. In my case, I saw about 700Mb of memory used between tar and xz.

The compression is going to use 100% of one of your CPUs.

And of course, reading all those files from disk generates a lot of I/O while the transfer is running.

If you somehow have a server with one single CPU, this job is going to bring it to its knees. If your drives are good old rotational hard drives, then your other I/O is likely going to slow down. Finally, the network link is going to be loaded with data, meaning that other users may not have much bandwidth left to play with.
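
If the source computer has to keep serving users while the copy runs, one hedged way to soften the blow is to lower the priority of the tar process (the compressor runs as its child and inherits the setting); nice lowers the CPU priority and ionice, on Linux, lowers the I/O priority:

# run the archiving at the lowest CPU and I/O priority
sudo nice -n 19 ionice -c3 tar cjf - /var/www/ | ssh destination 'cat > www-data.tar.bz2'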