Discussion:
rsync with multiple threads
Bill Yang via luv-main
2017-02-16 01:12:44 UTC
Hi there


I need to transfer 200+ TB of data from one storage server (Red Hat Linux
based) to another (FreeBSD). I am planning to use rsync with multiple
threads in a script. There are a number of suggestions on the Internet
(find + xargs + rsync), but none of them has worked well so far. I also
need a reliable way to check whether all files/directories from the source
server have been copied to the destination server. Any suggestions/help
would be appreciated.


Regards

Bill
Russell Coker via luv-main
2017-02-16 03:08:50 UTC
Post by Bill Yang via luv-main
I need to transfer 200+ TB of data from one storage server (Red Hat Linux
based) to another (FreeBSD). I am planning to use rsync with multiple
threads in a script. There are a number of suggestions on the Internet
(find + xargs + rsync), but none of them has worked well so far. I also
need a reliable way to check whether all files/directories from the source
server have been copied to the destination server. Any suggestions/help
would be appreciated.
If the files are reasonably large and can be relied on not to change file
data without changing metadata, then checking is easy via a final run of
rsync -va without the -c option. If the files are small, then a lot of the
rsync time will be taken up by seeking for metadata, so that might not be
viable (e.g. before SSDs became popular you couldn't just run something
like a find / on a large mail server).
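
A minimal sketch of such a verification pass (the paths and host below are
hypothetical). With -n (dry-run) nothing is copied, and because -c is
omitted the comparison uses size and mtime only, so any files listed in the
output are ones that still differ:

# dry-run: report files that differ by size/mtime without transferring them
rsync -van /data/ user@destserver:/data/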

As for the multiple threads, the common way of doing this is copying by
parent directory. For example, when copying a server you might copy /var
and /usr separately. That has the obvious problem that the sizes are often
significantly different.

If you have lots of files in one directory you could transfer
/directory/[a-k]* in one process and /directory/[l-z]* in another. This
wouldn't support deleting directories that have been removed from the
source, but that can be easily fixed with a later pass of rsync -va
--delete as long as the files are reasonably large.
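
A rough sketch of that split (directory, user, and host are hypothetical),
with a final sequential pass to verify everything and handle deletions:

# two parallel transfers over disjoint halves of one large directory
rsync -a /directory/[a-k]* user@dest:/directory/ &
rsync -a /directory/[l-z]* user@dest:/directory/ &
wait
# one final pass: catches anything missed and removes files deleted at
# the source
rsync -va --delete /directory/ user@dest:/directory/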

Maybe it would help if you attached the scripts you tried using with xargs
etc. so we could see what you tried.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Nic Baxter via luv-main
2017-02-16 05:01:49 UTC
Post by Bill Yang via luv-main
Hi there
I need to transfer 200+ TB of data from one storage server (Red Hat Linux
based) to another (FreeBSD). I am planning to use rsync with multiple
threads in a script. There are a number of suggestions on the Internet
(find + xargs + rsync), but none of them has worked well so far. I also
need a reliable way to check whether all files/directories from the source
server have been copied to the destination server. Any suggestions/help
would be appreciated.
This may help: http://moo.nac.uci.edu/~hjm/parsync/
It's a set of scripts running on top of rsync with some crude load
balancing and throttling.
Nick Evans via luv-main
2017-02-17 03:15:26 UTC
Hi Bill,

I see you mention that you have tried the find | xargs rsync option without
a lot of luck, so I will just put this here in case it is different from
what you have tried. We are using this quite successfully:

# one rsync per top-level entry in <source>, <no_of_threads> at a time
cd <source>; find . -maxdepth 1 -mindepth 1 ! -path './.*' -print0 \
  | xargs -0 -n1 -P<no_of_threads> -I% rsync -irlt % <destination>/.



I am, however, keen to look at parsync, linked by Nic. Thanks for the link.


Nick Evans
Post by Bill Yang via luv-main
Hi there
I need to transfer 200+ TB of data from one storage server (Red Hat Linux
based) to another (FreeBSD). I am planning to use rsync with multiple
threads in a script. There are a number of suggestions on the Internet
(find + xargs + rsync), but none of them has worked well so far. I also
need a reliable way to check whether all files/directories from the source
server have been copied to the destination server. Any suggestions/help
would be appreciated.
Post by Nic Baxter via luv-main
This may help: http://moo.nac.uci.edu/~hjm/parsync/
It's a set of scripts running on top of rsync with some crude load
balancing and throttling.
Joel W. Shea via luv-main
2017-02-17 07:25:38 UTC
Post by Bill Yang via luv-main
I need to transfer 200+ TB of data from one storage server (Red Hat Linux
based) to another (FreeBSD).
I'm curious about a few things:

How much does this data change over time?
What is the speed/distance of the link you are transferring over?
Are you maxing out your disk/network bandwidth already?

"Never underestimate the bandwidth of a station wagon full of tapes
hurtling down the highway." — Tanenbaum, Andrew S. (1989)
Craig Sanders via luv-main
2017-02-21 05:58:52 UTC
Post by Joel W. Shea via luv-main
Are you maxing out your disk/network bandwidth already?
This is key, IMO, to whether running multiple rsyncs in parallel is
worth it or not. Almost all of the time, rsync is going to be I/O
bound (disk and network) rather than CPU bound - so adding more rsync
processes is just going to slow them all down even more. A single rsync
process can saturate the disk and I/O bandwidth of most common disk
subsystems and network connections.

about the only time more rsync processes might help is if you're
transferring between two servers with SSD storage arrays via a
direct-connect 10+Gbps link....and even then, only if the disk + network
throughput is at least a few multiples of what a single rsync job (incl.
child processes for ssh and/or compression if any) can cope with.

or if the source AND destination of each of the multiple rsyncs are
on completely separate disks/storage-arrays so they don't compete
with each other for disk i/o. e.g. rsync from server1/disk1 to
server2/disk1 can run at the same time as an rsync from server1/disk2
to server2/disk2...especially if you can use separate network interfaces
for each rsync.
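
a sketch of that pairing (disk paths, user, host, and addresses are all
hypothetical); ssh's -b option binds each connection to a different local
source address, one per network interface:

# disk1-to-disk1 over the interface at 10.0.1.1, disk2-to-disk2 over 10.0.2.1
rsync -a -e 'ssh -b 10.0.1.1' /disk1/ user@server2:/disk1/ &
rsync -a -e 'ssh -b 10.0.2.1' /disk2/ user@server2:/disk2/ &
wait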


splitting up the transfer into multiple smaller rsync jobs to be
run consecutively, not simultaneously, can be useful... especially
if you intend to run the transfers multiple times to get
new/changed/deleted/etc files since the last run. There's a lot
of startup overhead (and RAM & CPU usage) with rsync on every run,
comparing file lists and file timestamps and/or checksums to figure
out what needs to be transferred. Multiple smaller transfers (e.g. of
entire subdirectory trees) tend to be noticeably faster than one
large transfer.
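
for example, a minimal consecutive-run sketch (source and destination
paths are hypothetical):

# one rsync per top-level subtree, run one after another; ${d%/} strips
# the trailing slash so rsync recreates the subdirectory itself at the
# destination
for d in /data/*/; do
    rsync -a --delete "${d%/}" user@dest:/data/
done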

in other words, running multiple rsyncs in parallel is usually a false
optimisation.

craig

--
craig sanders <***@taz.net.au>
Russell Coker via luv-main
2017-02-21 06:43:03 UTC
Post by Craig Sanders via luv-main
Post by Joel W. Shea via luv-main
Are you maxing out your disk/network bandwidth already?
This is key, IMO, to whether running multiple rsyncs in parallel is
worth it or not. Almost all of the time, rsync is going to be I/O
bound (disk and network) rather than CPU bound - so adding more rsync
processes is just going to slow them all down even more. A single rsync
process can saturate the disk and I/O bandwidth of most common disk
subsystems and network connections.
If you have a RAID-1 array then you should be able to benefit from having
as many processes as there are mirrors of the data for reading (i.e. on the
transmitting end, and on the receiving end when it reads previous data to
update it).

If you have a RAID-5 array then you should get some benefit from multiple
readers, but it's not as easy to predict. The same applies to command
queuing in a single device, but for a much smaller benefit.

Linux does some queuing of requests and it's theoretically possible to get
some benefits from multiple processes accessing a single disk at one time. But
the benefits will probably be small.

If you have a process that does some CPU operations as well as some IO,
there is potential for a performance improvement from running multiple
processes at once if nothing else is using the disk. For example, if the
process is using 10% CPU time and 90% iowait, then you could get roughly a
10% performance increase by using a second process, as there will almost
always be a process blocked on disk IO and the disk will no longer sit
idle while CPU work is done.

Apart from the case of two processes reading from a RAID-1 device, the
benefits from all of these are small. But if, for example, you want to
transition a server to new hardware or a new DC in an 8-hour downtime
window and the transfer looks like it will take 9 hours, these are things
you really want to do.
Post by Craig Sanders via luv-main
splitting up the transfer into multiple smaller rsync jobs to be
run consecutively, not simultaneously, can be useful... especially
if you intend to run the transfers multiple times to get
new/changed/deleted/etc files since the last run. There's a lot
of startup overhead (and RAM & CPU usage) with rsync on every run,
comparing file lists and file timestamps and/or checksums to figure
out what needs to be transferred. Multiple smaller transfers (e.g. of
entire subdirectory trees) tend to be noticeably faster than one
large transfer.
Yes, especially if you are running out of dentry cache.
Post by Craig Sanders via luv-main
in other words, running multiple rsyncs in parallel is usually a false
optimisation.
The thing that concerns me most about such things is the potential for
mistakes. For everything you do there is some probability of stuffing it up.
Is the probability of a stuff-up a reasonable trade-off for a performance
improvement?
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
Tim Connors via luv-main
2017-02-22 00:46:13 UTC
Post by Craig Sanders via luv-main
Post by Joel W. Shea via luv-main
Are you maxing out your disk/network bandwidth already?
This is key, IMO, to whether running multiple rsyncs in parallel is
worth it or not. Almost all of the time, rsync is going to be I/O
bound (disk and network) rather than CPU bound - so adding more rsync
processes is just going to slow them all down even more. A single rsync
process can saturate the disk and I/O bandwidth of most common disk
subsystems and network connections.
about the only time more rsync processes might help is if you're
transferring between two servers with SSD storage arrays via a
direct-connect 10+Gbps link....and even then, only if the disk + network
throughput is at least a few multiples of what a single rsync job (incl.
child processes for ssh and/or compression if any) can cope with.
or if the source AND destination of each of the multiple rsyncs are
on completely separate disks/storage-arrays so they don't compete
with each other for disk i/o. e.g. rsync from server1/disk1 to
server2/disk1 can run at the same time as an rsync from server1/disk2
to server2/disk2...especially if you can use separate network interfaces
for each rsync.
Not quite. It matters on read (which can be on both sides when you're
rewriting data), and not just for arrays with multiple spindles.

A single rsync issues one read, the array does its seek and finds the
relevant spindles, and the reading rsync then sends that to the remote
side, which can cache and reorder as necessary when writing.

If you have multiple independent rsyncs, then one rsync blocks on read, a
second rsync blocks on read but its required data is closer to the heads,
and a third rsync finds another disk in the array that the other rsyncs
haven't made busy yet. You can get a benefit from running more rsyncs than
the number of spindles, because your block scheduler/RAID controller/disk
controller knows that one bit of data is closer than another if there are
multiple in-flight SCSI commands.

For writes, you get no benefit unless rsync issues blocking fsyncs (I
can't remember if it does - if I had to optimise for data transfer, I'd
investigate this and consider using libeatmydata, with the caveat that I'd
need to manually rerun rsync in the event of a hardware fault soon after
any transfers were run).
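
A minimal sketch of that approach (paths, user, and host are hypothetical).
The eatmydata wrapper from libeatmydata LD_PRELOADs a shim that turns
fsync() and related calls into no-ops:

# run rsync with fsync()/sync() stubbed out; faster if rsync syncs, but
# data can be lost if the machine crashes before its caches are flushed
eatmydata rsync -a /data/ user@dest:/data/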
--
Tim Connors