Discussion:
[lopsa-tech] Backing up sparse files ... VM's and TrueCrypt ... etc
Edward Ned Harvey
2010-02-17 03:43:55 UTC
Permalink
Does nobody backup sparse files? I can't believe there's no good way to do
it. Of particular interest, I would like to backup:

- TrueCrypt sparse files in Windows (TrueCrypt calls this "Dynamic")

- VirtualBox or VMware Workstation sparse ("expanding") virtual disks in
Windows

- VMware Fusion or Parallels sparse virtual disks on the Mac



I would like to back these up frequently, and efficiently. If I have a 50G
container file that occupies 200M on disk, the backup should be close to
200M, and when I modify 1M in the middle of the file and then save, I don't
want the incremental backup trying to send the whole 50G again.



On the Mac, the sparse bundle concept solves this problem. It's just like a
TrueCrypt image, but it's broken up into a whole bunch of little 8M chunks.
So when I modify 1M in the middle of the volume and save, my next backup
will send one updated 8M chunk for backup. A little bit of waste, but well
within reason.



I currently have virtual machines and TrueCrypt images excluded from the
regular Time Machine and Acronis True Image backups of people's laptops.
But I'm not comfortable simply neglecting the VMs and TrueCrypt volumes, as
if they're not important.



I haven't found anything satisfactory yet. The closest I've found so far is
CrashPlan. It does "byte pattern differential" and "continuous real-time
backup," which means it can detect blocks changing in the middle of a file
and only send the changed blocks of a sparse file during incrementals,
instead of sending the whole 50G again. Unfortunately, CrashPlan can't
restore a sparse file. D'oh!!! :-( Actually, that's a fib. It can
restore sparse files, but they won't be sparse anymore. So ... IMHO ...
that's not useful.



I've also tried rsync. People all over the place say it should do well, but
in practice, I found that doing a single incremental takes 2x longer than
doing the whole image. So again, IMHO, not useful. Unless I am simply
using it wrong. But I put plenty of effort into making sure I was using it
right, so I'm really pretty sure I didn't get that wrong.



Anybody doing anything they're happy with, to backup sparse files on a
regular basis, quickly, efficiently, frequently?



Thanks.
Brian Mathis
2010-02-17 04:05:08 UTC
Permalink
Post by Edward Ned Harvey
I've also tried rsync. People all over the place say it should do well, but
in practice, I found that doing a single incremental takes 2x longer than
doing the whole image. So again, IMHO, not useful.
Anybody doing anything they're happy with, to backup sparse files on a
regular basis, quickly, efficiently, frequently?
rsync can handle sparse files by using the -S (--sparse) option. However,
it can take longer as it's doing a bunch of processing instead of blindly
sending everything over the wire. You are trading off bandwidth use for CPU
use. On a local network, this tradeoff may not be worth it. Personally I
use rsync on the local network since it also gives me the ability to
resume, preserve ownership, etc...
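
For example, an invocation along these lines (paths and hostnames made up)
preserves the holes on the first copy, and then updates the existing copy in
place on later runs so only the changed regions get rewritten:

    # first copy: recreate the holes on the destination
    rsync -av --sparse /vms/disk.img backuphost:/backups/vms/

    # later runs: update the existing copy in place
    # (note: many rsync versions refuse --sparse together with --inplace)
    rsync -av --inplace /vms/disk.img backuphost:/backups/vms/

Without --inplace, rsync rebuilds the whole destination file in a temporary
copy even when only a few blocks changed, which by itself can take as long
as a full copy of a big container.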

Have you looked into piping your backups through gzip? Any sparse file
would get compressed down to almost nothing, though it also might take a
little more time.
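
A quick way to try that (filenames made up); the long runs of zeros in a
sparse container compress away to almost nothing, though the whole file
still has to be read and written every time:

    # gzip the container directly
    gzip -c /vms/disk.img > /backups/disk.img.gz

    # or let tar detect the holes at archive time (-S) and compress (-z)
    tar -cSzf /backups/disk.img.tar.gz /vms/disk.img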
Edward Ned Harvey
2010-02-17 13:56:51 UTC
Permalink
Post by Brian Mathis
rsync can handle sparse files by using the -S (--sparse) option.
However, it can take longer as it's doing a bunch of processing
instead of blindly sending everything over the wire. You are trading
off bandwidth use for CPU use. On a local network, this tradeoff may
not be worth it. Personally I use rsync on the local network since it
also gives me that ability to resume, preserve ownership, etc...
That's the problem. When I used rsync to backup the files, the incremental
took over 2x longer than the initial. Why would anybody ever do an
incremental in that case?



The goal is to do incrementals, and minimize the blocks sent, down to
something in the vicinity of the number of blocks changed. And to do this
in a length of time which is reasonably short, so you're comfortable doing
it frequently.
Post by Brian Mathis
Have you looked into piping your backups through gzip? Any sparse file
would get compressed down to almost nothing, though it also might take
a little more time.
Yes, I've done this, and you're right it does compress all the serial 0's
down to essentially zero size. However, you're still sending the whole
file, not just a subset of changed blocks. So it still takes the time of a
full backup, not just an incremental.
Brian Mathis
2010-02-17 14:19:48 UTC
Permalink
Post by Edward Ned Harvey
Post by Brian Mathis
rsync can handle sparse files by using the -S (--sparse) option.
However, it can take longer as it's doing a bunch of processing
instead of blindly sending everything over the wire. You are trading
off bandwidth use for CPU use. On a local network, this tradeoff may
not be worth it. Personally I use rsync on the local network since it
also gives me that ability to resume, preserve ownership, etc...
That’s the problem. When I used rsync to backup the files, the incremental
took over 2x longer than the initial. Why would anybody ever do an
incremental in that case?
The goal is to do incrementals, and minimize the blocks sent, down to
something in the vicinity of the number of blocks changed. And to do this
in a length of time which is reasonably short, so you’re comfortable doing
it frequently.
rsync *is* minimizing the number of blocks sent, that's why it takes longer
-- it needs to figure out which blocks are the ones that changed. But in
one sentence you're talking about time, and the next you're talking about
minimizing blocks sent (bandwidth use). You need to figure out which one
you want.

The amount of time it takes is not the only indicator you should be looking
at. An rsync copy is going to take significantly less bandwidth than
copying the whole file. If bandwidth is not your primary concern (like on a
LAN), then this trade-off may not be worth it for you, but don't confuse the
"amount of time it takes" with the "amount of bandwidth it uses".
Post by Edward Ned Harvey
Post by Brian Mathis
Have you looked into piping your backups through gzip? Any sparse file
would get compressed down to almost nothing, though it also might take
a little more time.
Yes, I’ve done this, and you’re right it does compress all the serial 0’s
down to essentially zero size. However, you’re still sending the whole
file, not just a subset of changed blocks. So it still takes the time of a
full backup, not just an incremental.
Only other thing I can think of is doing a split or something, but you're
right that's not the best way either. You might also be able to create a
binary diff, but I'm not sure how long that would take.
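
For instance, if xdelta3 happens to be installed, a binary diff between two
versions of a container can be produced and applied like this (filenames
made up):

    xdelta3 -e -s disk.img.old disk.img.new disk.img.vcdiff    # encode delta
    xdelta3 -d -s disk.img.old disk.img.vcdiff disk.img.out    # apply delta

Both versions have to be readable locally, though, so in a backup scenario
you would need to keep the previously sent copy (or a snapshot of it) around.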
Edward Ned Harvey
2010-02-19 02:43:18 UTC
Permalink
Post by Brian Mathis
rsync *is* minimizing the number of blocks sent, that's why it takes
longer -- it needs to figure out which blocks are the ones that
changed.  But in one sentence you're talking about time, and the next
you're talking about minimizing blocks sent (bandwidth use).  You need
to figure out which one you want.
Both.
The ultimate goal is to have the job completed quickly. But that can only
be done if the number of blocks sent is minimized. Presently, rsync reads
the whole local file, and also reads the whole remote file to diff them, and
sends only the changed blocks. "Read the entire remote file" is the fault
here. You could write the entire remote file faster, and with less traffic,
than by reading it and sending changes.

If rsync, during the initial send, stored checksums of the internal blocks
of a file, then on subsequent sends, rsync would only need to read the local
file and recalculate checksums to see which blocks needed to be sent. This
would occur entirely at local disk speeds, with little or no network
traffic, and certainly no need to read the entire remote file.

This still leaves room for improvement (it cannot compete with ZFS
incremental sends), but the point is: you're wrong if you think "minimizing
the time" and "minimizing the blocks sent" are mutually exclusive.
Brad Knowles
2010-02-19 04:10:06 UTC
Permalink
Post by Edward Ned Harvey
The ultimate goal is to have the job completed quickly. But that can only
be done if the number of blocks sent is minimized. Presently, rsync reads
the whole local file, and also reads the whole remote file to diff them, and
sends only the changed blocks. "Read the entire remote file" is the fault
here. You could write the entire remote file faster, and with less traffic,
than by reading it and sending changes.
If you have rsyncd running on the remote machine instead of mounting it as
a remote filesystem on the local client, then the rsync local client will
communicate with the remote daemon, and they will each calculate their own
respective checksums, which can then be compared.
Post by Edward Ned Harvey
If rsync, during the initial send, stored checksums of the internal blocks
of a file, then on subsequent sends, rsync would only need to read the local
file and recalculate checksums to see which blocks needed to be sent. This
would occur entirely at local disk speeds, with little or no network
traffic, and certainly no need to read the entire remote file.
But rsync has no place to store the block checksums, so there would be no
way to keep that information across invocations. If you wanted to store the
block checksums, you'd have to add a whole database module to handle the
storage requirements for that, and then you get into the issue of how you
store and sync this rsync-internal metadata.

--
Brad Knowles <***@shub-internet.org>
LinkedIn Profile: <http://tinyurl.com/y8kpxu>
Phil Pennock
2010-02-19 10:37:58 UTC
Permalink
Post by Brad Knowles
If you have rsyncd running on the remote machine instead of mounting it as
a remote filesystem on the local client, then the rsync local client will
communicate with the remote daemon, and they will each calculate their own
respective checksums, which can then be compared.
Actually, I believe this happens anyway.

The normal mode of operation, where you use ssh/rsh to connect to a
remote host, invokes { rsync --server --sender } on the remote side.

I'm not aware of any mode of operation in which rsync pulls the entire
contents of a file from remote in order to minimise what is sent to the
remote; that seems counter-intuitive, but I'm no expert in the workings
of rsync. (I looked into it a little a few years ago, to have restricted
rsync-over-ssh via command="rsync --server [...]" in authorized_keys, to
permit only very restricted rsync access with a dedicated ssh key;
worked nicely.)
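
(For the curious: the rsync distribution also ships a helper script, rrsync,
that makes this kind of restriction less fiddly. A rough illustration of an
authorized_keys entry, with the install path, key, and directory all made
up:

    command="/usr/share/rsync/scripts/rrsync -ro /backups",no-pty,no-port-forwarding,no-agent-forwarding ssh-rsa AAAA... backup-key

The forced command overrides whatever rsync command the client asks for, and
rrsync restricts it to the named directory.)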

-Phil
Edward Ned Harvey
2010-02-19 13:14:42 UTC
Permalink
Post by Phil Pennock
Post by Brad Knowles
If you have rsyncd running on the remote machine instead of mounting
it as a remote filesystem on the local client, then the rsync local
client will communicate with the remote daemon, and they will each
calculate their own respective checksums, which can then be compared.
Actually, I believe this happens anyway.
The normal mode of operation, where you use ssh/rsh to connect to a
remote host, invokes { rsync --server --sender } on the remote side.
This was my understanding as well: that when I use rsync over ssh, it
creates a new single-use rsync server at the remote side, which should
theoretically allow the local rsync to read the local file and the remote
rsync to read the remote file, so the two can be diff'd at DAS speed without
transmitting the whole file across the network. However ...

When I did this, my initial rsync to send the whole file took 30 minutes.
Then I changed 1M in the middle of the sparse file, and did an incremental.
I expected it to complete in 6-7 minutes. I waited an hour and cancelled
the process because obviously it wasn't working. I tried several variations
of switches, and consulted the rsync discussion list, google, man pages,
etc. So far I haven't had any success at this...

During those failed tests, which take 2x longer or more than doing a full
image, I only checked the time. I did not monitor the network to see if the
file was flying across the wire. So I don't know what it was doing or why
it was taking so long.

It seems worthwhile to try updating to the latest rsync, and to try
starting rsyncd at the remote side, to see if it behaves better than what
I've seen. Thanks for the suggestions; I'll plan to give that a try.
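
For reference, a minimal daemon setup on the receiving side looks something
like this (module name and paths made up):

    # /etc/rsyncd.conf on the backup host
    [vmbackups]
        path = /backups/vms
        read only = no
        use chroot = yes

    # start the daemon, then push to the module with the double-colon syntax
    rsync --daemon
    rsync -av --inplace /vms/disk.img backuphost::vmbackups/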
Brian Mathis
2010-02-19 14:52:01 UTC
Permalink
Post by Edward Ned Harvey
This was my understanding as well: that when I use rsync over ssh, it
creates a new single-use rsync server at the remote side, which should
theoretically allow the local rsync to read the local file and the remote
rsync to read the remote file, so the two can be diff'd at DAS speed without
transmitting the whole file across the network. However ...
It seems worthwhile to try updating to the latest rsync, and to try
starting rsyncd at the remote side, to see if it behaves better than what
I've seen. Thanks for the suggestions; I'll plan to give that a try.
Running rsync as a daemon would give you the same results as running it via
ssh, as you are right that the remote side runs a copy of rsync and they
talk to each other.

The other usage scenario Brad was talking about is if you had the remote
filesystem mounted via NFS or something on the local machine. If you were
then to run rsync in that scenario, rsync would be reading the remote file
as if it were a local file over NFS, and that would cause the whole file to
be transferred.
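
In other words (paths made up), these two look similar but behave very
differently:

    # remote rsync runs on the far end; only the deltas cross the wire
    rsync -av /vms/disk.img backuphost:/backups/vms/

    # destination is an NFS mount; the "remote" copy lives across the
    # network, so the whole file ends up crossing the wire anyway
    rsync -av /vms/disk.img /mnt/backuphost/backups/vms/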

As I've already said, the goal of rsync is to get lower bandwidth use at the
expense of CPU and possibly time. If that is not your goal, then rsync may
not be the right tool for this.

Ted Cabeen
2010-02-17 18:43:51 UTC
Permalink
Is there any current file system or software for the OSs in question
that maintains a list of what blocks in a sparse file were modified and
when? If not, there's no real way to do what you want, as some program
is going to have to walk the entire file to find any changes that have
occurred since the last backup.

--Ted
Robert Hajime Lanning
2010-02-17 19:04:29 UTC
Permalink
Post by Ted Cabeen
Is there any current file system or software for the OSs in question
that maintains a list of what blocks in a sparse file were modified and
when? If not, there's no real way to do what you want, as some program
is going to have to walk the entire file to find any changes that have
occurred since the last backup.
The only thing I can think of is putting the files on a separate
filesystem, then doing an image-based backup as opposed to a file-based
backup.

I know Acronis can backup the filesystem and do a fairly quick
incremental/differential. That would keep it down to specific changes
and keep the sparse file intact as a sparse file.
--
END OF LINE
--MCP
Edward Ned Harvey
2010-02-19 02:27:14 UTC
Permalink
Post by Robert Hajime Lanning
The only thing I can think of is putting the files on a separate
filesystem, then doing an image-based backup as opposed to a file-based
backup.
I know Acronis can backup the filesystem and do a fairly quick
incremental/differential. That would keep it down to specific changes
and keep the sparse file intact as a sparse file.
Oh - I use Acronis -

I'd like to clarify this. Acronis does intelligent
incremental/differential, down to the granularity of which files have
changed. But it relies on the file timestamps, and it sends the whole file.
This means Acronis has two problems:

- TrueCrypt images intentionally leave the timestamps unchanged, so Acronis
never sends the updated TC file during incrementals.

- Whenever a full image is sent, Acronis is not able to handle sparse files
intelligently. The whole file will be sent, it will take forever, and it
cannot be restored sparsely, even if you try.

The same is true for Time Machine.
Robert Hajime Lanning
2010-02-19 02:56:07 UTC
Permalink
Post by Edward Ned Harvey
Oh - I use Acronis -
I'd like to clarify this. Acronis does intelligent
incremental/differential, down to the granularity of which files have
changed. But it relies on the file timestamps, and it sends the whole file.
I was not talking about file-based backup. Yes, every file-based backup
does this. (That I know of.)

Backup the block device. Acronis knows NTFS very well. And will do the
incremental/differential backup at the block level, just fine.

This is why I said "put the files on a separate filesystem."

To restore, you can mount the .tib as a virtual drive and copy back.
--
END OF LINE
--MCP
Edward Ned Harvey
2010-02-19 03:08:47 UTC
Permalink
Post by Robert Hajime Lanning
Backup the block device. Acronis knows NTFS very well. And will do the
incremental/differential backup at the block level, just fine.
How do you do that? In my TrueImage Home 2010, I just have checkmarks next
to the partitions I want to backup. I have selected the whole disk (all
partitions.) There is a checkbox for "backup sector by sector" but I didn't
check this, because it would backup all the unused space in the disk, as
well as the files. Is that what you're talking about?
Post by Robert Hajime Lanning
This is why I said "put the files on a separate filesystem."
I think you're saying ... Create a new partition. Put the sparse files into
that partition. Backup that partition using sector-by-sector. I must
admit, I have not tried this.
Post by Robert Hajime Lanning
To restore, you can mount the .tib as a virtual drive and copy back.
Unfortunately, when you mount the .tib image, the only tool available for
you to copy a file out of there is Windows Explorer ... And unfortunately,
WE doesn't know how to copy a sparse file. Yes you can restore a file from
the mounted TIB file, but it will not be sparse anymore after it comes out.
Ask me how I know. :-(

I did submit that as a request with Acronis support, but they are useless.
Person after person after person couldn't understand what I was talking
about, didn't get the word "sparse."
Robert Hajime Lanning
2010-02-19 03:22:58 UTC
Permalink
Post by Edward Ned Harvey
Post by Robert Hajime Lanning
Backup the block device. Acronis knows NTFS very well. And will do the
incremental/differential backup at the block level, just fine.
How do you do that? In my TrueImage Home 2010, I just have checkmarks next
to the partitions I want to backup. I have selected the whole disk (all
partitions.) There is a checkbox for "backup sector by sector" but I didn't
check this, because it would backup all the unused space in the disk, as
well as the files. Is that what you're talking about?
Yes that is the option. It knows which blocks are unused and deals with
them.
Post by Edward Ned Harvey
Post by Robert Hajime Lanning
This is why I said "put the files on a separate filesystem."
I think you're saying ... Create a new partition. Put the sparse files into
that partition. Backup that partition using sector-by-sector. I must
admit, I have not tried this.
Post by Robert Hajime Lanning
To restore, you can mount the .tib as a virtual drive and copy back.
Unfortunately, when you mount the .tib image, the only tool available for
you to copy a file out of there is Windows Explorer ... And unfortunately,
WE doesn't know how to copy a sparse file. Yes you can restore a file from
the mounted TIB file, but it will not be sparse anymore after it comes out.
Ask me how I know. :-(
I did submit that as a request with Acronis support, but they are useless.
Person after person after person couldn't understand what I was talking
about, didn't get the word "sparse."
hrm... I guess barring finding a copy utility that understands sparse
files, you would be left with the restore partition option.

Have you tried robocopy to copy from the mounted .tib? (I haven't tried it.)
--
END OF LINE
--MCP
Edward Ned Harvey
2010-02-19 03:44:32 UTC
Permalink
Post by Robert Hajime Lanning
Post by Edward Ned Harvey
partitions.) There is a checkbox for "backup sector by sector" but I didn't
check this, because it would backup all the unused space in the disk, as
well as the files. Is that what you're talking about?
Yes that is the option. It knows which blocks are unused and deals with
them.
Ummm... Maybe I'm misunderstanding what you're saying, because the way I
got it, you're not making any sense. You're saying to do sector-by-sector,
but you're also saying Acronis knows which blocks are unused and will skip
them.

The point of the "sector by sector" option is to tell Acronis, "I don't want
you to think about or care about NTFS or anything. Eliminate all your
intelligence, and simply copy every byte from the device." This means "I
want you to backup unused space," and it means "Even if there's an unknown
filesystem in there, which is not NTFS or anything you recognize, back it up
anyway, every single byte."

Normally a sector-by-sector backup is only done for unknown filesystems, or
filesystems which are suspected of corruption, or if you have some reason
you think there's valuable information stored in the unused space. For
example, if a virus did a quick format on your hard drive ... All the data
still exists, but the filesystem is gone so you can't access any of it. So
then you would want some utility to scan all the bits, saying "these blocks
look like they might be a jpg image ... and these blocks look like they
might be a word doc ..." and so on, attempting to reconstruct your deleted
files. If you're paranoid, you might do a sector-by-sector backup of the
disk before you allow any utility in the world to start reading from it or
working on it.
Post by Robert Hajime Lanning
hrm... I guess barring finding a copy utility that understands sparse files,
you would be left with the restore partition option.
Have you tried robocopy to copy from the mounted .tib? (I haven't tried it.)
There are plenty of copy utilities that recognize sparse files. You can "cp
--sparse=always" or something like that ... and you can "tar cf - somefile |
(cd /destination ; tar xf - --sparse)" and various other incantations ...
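
Concretely, something like this preserves the holes (GNU cp and GNU tar;
with tar, sparseness is detected when the archive is created, and extraction
restores the holes; paths made up):

    cp --sparse=always bigfile.img /destination/bigfile.img

    tar -cSf - bigfile.img | ( cd /destination && tar -xf - )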

Yes, I tried this. Again, "ask me how I know." :-( At one point, I did in
fact restore a 50G file that was supposed to be sparse, just to get one tiny
txt file out of it. It only took overnight.

However, when you mount a TIB file, it does not become a drive letter. You
can't browse there with cygwin, or winzip, or any other application. The
TIB mounter is a Windows Explorer extension, so only WE is able to browse
through the image to grab things out. WE is the weak point here ... Simply
dragging and dropping a file is your only option, and that doesn't do
sparse.
Robert Hajime Lanning
2010-02-19 05:01:24 UTC
Permalink
Post by Edward Ned Harvey
Ummm... Maybe I'm misunderstanding what you're saying, because the way I
got it, you're not making any sense. You're saying to do sector-by-sector,
but you're also saying Acronis knows which blocks are unused and will skip
them.
The point of the "sector by sector" option is to tell Acronis, "I don't want
you to think about or care about NTFS or anything. Eliminate all your
intelligence, and simply copy every byte from the device." This means "I
want you to backup unused space," and it means "Even if there's an unknown
filesystem in there, which is not NTFS or anything you recognize, back it up
anyway, every single byte."
Normally a sector-by-sector backup is only done for unknown filesystems, or
filesystems which are suspected of corruption, or if you have some reason
you think there's valuable information stored in the unused space. For
example, if a virus did a quick format on your hard drive ... All the data
still exists, but the filesystem is gone so you can't access any of it. So
then you would want some utility to scan all the bits, saying "these blocks
look like they might be a jpg image ... and these blocks look like they
might be a word doc ..." and so on, attempting to reconstruct your deleted
files. If you're paranoid, you might do a sector-by-sector backup of the
disk before you allow any utility in the world to start reading from it or
working on it.
Sorry, you are right... I was getting a couple of options mixed up.
http://www.acronis.com/backup-recovery/comparison.html

I am really talking about the "Block level image backup", not the
"Sector-by-sector backup".

I used to be a developer at a small company (that went under) that made
an appliance used as the target for the backup files. We had automation
to encrypt and transfer the .tib files to a secured datacenter as a DR
service for the SMB market.

We had a client side piece that was a wrapper around Acronis.

Since this was designed for DR with bare metal restore, we used block
level backups.
Post by Edward Ned Harvey
Post by Robert Hajime Lanning
hrm... I guess barring finding a copy utility that understands sparse files,
you would be left with the restore partition option.
Have you tried robocopy to copy from the mounted .tib? (I haven't tried it.)
There are plenty of copy utilities that recognize sparse files. You can "cp
--sparse=always" or something like that ... and you can "tar cf - somefile |
(cd /destination ; tar xf - --sparse)" and various other incantations ...
Yes, I tried this. Again, "ask me how I know." :-( At one point, I did in
fact restore a 50G file that was supposed to be sparse, just to get one tiny
txt file out of it. It only took overnight.
However, when you mount a TIB file, it does not become a drive letter. You
can't browse there with cygwin, or winzip, or any other application. The
TIB mounter is a Windows Explorer extension, so only WE is able to browse
through the image to grab things out. WE is the weak point here ... Simply
dragging and dropping a file is your only option, and that doesn't do
sparse.
Ya, I haven't messed with this side much, though now there are utilities
that will convert .tib into virtual drives (.vhd, .vmdk, ...).
--
END OF LINE
--MCP
Edward Ned Harvey
2010-02-19 13:17:29 UTC
Permalink
Post by Robert Hajime Lanning
Sorry, you are right... I was getting a couple of options mixed up.
http://www.acronis.com/backup-recovery/comparison.html
Also, it seems, we're talking about different products. Backup & Recovery
... versus TrueImage. Not the same thing. I'll look into it ...
Doug Hughes
2010-02-17 23:52:01 UTC
Permalink
Post by Ted Cabeen
Is there any current file system or software for the OSs in question
that maintains a list of what blocks in a sparse file were modified and
when? If not, there's no real way to do what you want, as some program
is going to have to walk the entire file to find any changes that have
occurred since the last backup.
There's zfs, but it may not suit your other needs, specifically. It will
do incremental snapshot sends from primary to secondary very easily and
efficiently without any find. It knows what has changed.
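
The incremental send is a one-liner once snapshots exist (pool/dataset names
made up):

    zfs snapshot tank/vms@2010-02-16
    # ... a day of changes later ...
    zfs snapshot tank/vms@2010-02-17
    zfs send -i tank/vms@2010-02-16 tank/vms@2010-02-17 | \
        ssh backuphost zfs receive -F backup/vms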
Edward Ned Harvey
2010-02-19 02:28:37 UTC
Permalink
Post by Doug Hughes
There's zfs, but it may not suit your other needs, specifically. It will
do incremental snapshot sends from primary to secondary very easily and
efficiently without any find. It knows what has changed.
I love ZFS, and use it regularly. But alas, not on Windows or the Mac. Yes,
I know, some people will say you can use it on the Mac, but I beg to differ.
Edward Ned Harvey
2010-02-19 02:22:39 UTC
Permalink
Post by Ted Cabeen
Is there any current file system or software for the OSs in question
that maintains a list of what blocks in a sparse file were modified and
when? If not, there's no real way to do what you want, as some program
is going to have to walk the entire file to find any changes that have
occurred since the last backup.
There are two answers to this question.



#1 Yes. I forget what the underlying function call or API or whatever is
called, but there is *some* method available to monitor the filesystem
activity, and notice which blocks change in some file or files. I presume
this is what CrashPlan is using, because they claim they're able to notice
in real-time when blocks are changing, and then back up using byte
differential. Again, CrashPlan seems to do a good job of creating the
incremental backups of sparse files, but they have no option to restore them
sparsely. I am conversing with their support team, hoping they'll somehow
rectify this, but who knows.



#2 Even with something less intelligent, an acceptable, incremental
improvement could be made over the backup solutions I'm currently aware of.
Today, the only backup option I know of is to do a full image, every time.
For example, tar and rsync can both efficiently create full images of
sparse files and then restore them sparsely, but they have no way to do
incrementals on subsequent runs.



Suppose there's a tool which works like this:

- On the first run, the whole file is sent. Meanwhile, a checksum is
calculated for lots of little chunks, and stored somewhere.

- On a subsequent run, the whole file must be read locally and the chunks
all get checksummed again, but the unchanged chunks don't need to be sent.



Reading and checksumming the file is much faster than sending the whole
file to the destination every time. Although this leaves obvious room for
improvement, it is a huge improvement over what I'm currently able to find.
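
A rough sketch of that idea in shell, with made-up paths and 8M chunks: the
checksum list from the previous run is kept locally, and only chunks whose
checksum changed get shipped (here as one file per chunk on the far side):

    #!/bin/sh
    FILE=/vms/disk.img                 # file to back up (made up)
    SUMS=/var/backups/disk.img.md5     # chunk checksums from the last run
    CHUNK=$((8 * 1024 * 1024))         # 8M chunks
    SIZE=$(stat -c %s "$FILE")         # GNU stat; 'stat -f %z' on a Mac
    NCHUNKS=$(( (SIZE + CHUNK - 1) / CHUNK ))

    i=0
    : > "$SUMS.new"
    while [ "$i" -lt "$NCHUNKS" ]; do
        # md5sum is from coreutils; on a Mac, 'md5 -q' does the same job
        new=$(dd if="$FILE" bs="$CHUNK" skip="$i" count=1 2>/dev/null | md5sum | awk '{print $1}')
        old=$(sed -n "$((i + 1))p" "$SUMS" 2>/dev/null)
        echo "$new" >> "$SUMS.new"
        if [ "$new" != "$old" ]; then
            # changed (or new) chunk: ship just this piece
            dd if="$FILE" bs="$CHUNK" skip="$i" count=1 2>/dev/null |
                ssh backuphost "cat > /backups/disk.img.d/chunk.$i"
        fi
        i=$((i + 1))
    done
    mv "$SUMS.new" "$SUMS"

Changed chunks get read twice here (once to checksum, once to ship), but
that is still far cheaper than pushing the whole file every time, and the
first run degenerates into a full send because there are no stored checksums
yet.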



I benchmarked this, because I was curious. On my mac, I have a 40G virtual
machine, which is 18G used. It took about 30 minutes to backup the whole
image across the LAN. It took about 6 minutes to md5sum it. If I were able
to create an incremental in 6-7 minutes, I would do it regularly. Once
every couple of days. But when it takes half an hour ... I'll only do it
once every 2-4 weeks, at most.



Actually, this makes perfect sense. The SATA disk reads at about 500Mbit/s,
which is 5x the speed of the 100Mbit LAN. So the job ends up about 5x
faster, and my file reads in 6 minutes instead of 30.