Discussion:
Questions about ZFS
J Metro
2012-07-27 21:03:20 UTC
Permalink
Hi all, Just some questions about ZFS in general.

1. Is there any way to provide interoperability between a Windows 2008
server and a ZFS pool [on a linux server?] to store the files? [email, file
hosting, active directory, anything a windows server would provide]

2. I've read a lot of posts that say ZFS is not horizontally scalable, is
this still true? It seems like it is not true, since you can still add
disks to a ZFS pool [instead of just replacing a disk with a larger volume
disk] and it will function properly...

3. Is there any way to prove that my ZFS pool is actually running ZFS?
Like, without a doubt prove?

4. It seems like ZFS auto-stripes, at least, according to zfsbuild.com. Is
this true? If I take 8 10GB disks, and set up 2 raidz's with 4 disks each,
is that really an 8 disk raidz? Is it smart enough to put data on drive 1
then 2 then 3 then 4 then 5 then 6 then 7 then 8, and not 1 2 3 4 1 2 3
4.... until full, then 5 6 7 8?

5. In question 4, I have 2 raidz's with 4 disks each. Is that a Raid 50?
How do I setup a Raid 51? What is ZFS smart enough to do on its own? I just
need examples like "4 groups of disks, each group has 2 disks setup as
mirror" or "2 groups of disks, each group has 4 disks as a raidz" not full
page explanations.

I had more questions, but I thought about them in the shower before work
today, and of course, forgot them by now. Maybe tomorrow's shower will
bring them back. I dreamt about ZFS last night.
Bryn Hughes
2012-07-27 21:31:53 UTC
Permalink
Post by J Metro
Hi all, Just some questions about ZFS in general.
1. Is there any way to provide interoperability between a Windows 2008
server and a ZFS pool [on a linux server?] to store the files? [email,
file hosting, active directory, anything a windows server would provide]
You can export files hosted in a ZFS filesystem using Samba, just like
any other filesystem. All the 'normal' Linux stuff applies. Samba has
some abilities regarding Active Directory and other such things but I'd
recommend reading the Samba documentation in detail to see what does and
doesn't work these days.

If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
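As a rough sketch of that zvol/iSCSI route (the pool, volume and target
names here are made up, and targetcli from the LIO stack is only one of
several iSCSI targets you could use):

# create a 100 GB zvol to hand to the Windows server
zfs create -V 100G tank/winvol

# export it as an iSCSI LUN (ACL/portal setup omitted for brevity)
targetcli /backstores/block create name=winvol dev=/dev/zvol/tank/winvol
targetcli /iscsi create iqn.2012-07.org.example:winvol
targetcli /iscsi/iqn.2012-07.org.example:winvol/tpg1/luns create /backstores/block/winvol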
Post by J Metro
2. I've read a lot of posts that say ZFS is not horizontally scalable,
is this still true? It seems like it is not true, since you can still
add disks to a ZFS pool [instead of just replacing a disk with a
larger volume disk] and it will function properly...
If you mean "add a single disk to an existing raidz/raidz2/raidz3 pool"
then no, you can't do that. If you mean "add another new entire
raidz/raidz2/raidz3 vdev of disks to an existing ZFS pool" then yes you
can. I do not believe there is any mechanism under ZFS to redistribute
existing data across the additional disks though (NetApp can do that for
example with a background job). I believe new files get distributed
across all while existing data stays as it is. So if you are adding
spindles for performance you aren't going to get the results you are
looking for, while if you just want straight capacity it is no problem.
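To illustrate (pool and device names here are made up), growing a pool by a
whole vdev is a single command, and the existing data is not rebalanced
onto it:

# add a second 4-disk raidz vdev to an existing pool
zpool add tank raidz sde sdf sdg sdh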
Post by J Metro
3. Is there any way to prove that my ZFS pool is actually running ZFS?
Like, without a doubt prove?
If you don't load the ZFS kernel modules, you won't have any ZFS
filesystems. That proves it pretty well... :) It is pretty easy to
tell which mountpoint belongs to which back-end device, be it ZFS, LVM or
any other Linux-native stuff.
Post by J Metro
4. It seems like ZFS auto-stripes, at least, according to
zfsbuild.com. Is this true? If i take 8 10gb disks, and set up 2
raidz's with 4 disks each, is that really an 8 disk raidz? is it smart
enough to put data on drive 1 then 2 then 3 then 4 then 5 then 6 then
7 then 8, and not 1 2 3 4 1 2 3 4.... until full, then 5 6 7 8?
I'm 97% sure that it does in fact stripe across all vdevs. So if you
have a pool with 2x4 raidz vdevs, it will stripe across both vdevs,
which will in turn have 'raid5-like' properties internally, yielding
something like RAID50. I say "something like" since other RAID systems
don't have any knowledge of the data itself, whereas raidz most
definitely knows where every individual file is and does checksums and
parity at the file level, rather than just the block level.
Post by J Metro
5. In question 4, i have 2 raidz's with 4 disks each. Is that a Raid
50? How do I setup a Raid 51? What is ZFS smart enough to do on its
own? I just need examples like "4 groups of disks, each group has 2
disks setup as mirror" or "2 groups of disks, each group has 4 disks
as a raidz" not full page explanations.
You can do a 'raid50' like configuration by just adding two 4-disk raidz
groups in your example there. If you wanted 'raid51' like then you
would use the 'mirror' command when creating your pool to mirror across
two raidz vdevs.
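For the 'raid50' case, a minimal sketch (pool and device names here are
hypothetical) would be:

# two 4-disk raidz vdevs, striped together at the pool level
zpool create tank raidz sda sdb sdc sdd raidz sde sdf sdg sdh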
Post by J Metro
I had more questions, but I thought about them in the shower before
work today, and of course, forgot them by now. Maybe tomorrow's shower
will bring them back. I dreamt about ZFS last night.
In those examples above, if I had 8 disks that I wanted to use I'd most
likely build a single raidz2 vdev rather than messing with raid50 or
raid51. You will have better data security than either raid50 or raid51
and better performance than raid51. With an 8-device raidz2 pool you can
lose any two disks without compromising your data. RAID50 can lose 1
disk each in the RAID5 groups, but two disks in one RAID5 group would
result in total data loss. RAID51 would allow you to lose any two
drives without losing data, but at the cost of much reduced performance
since you would only have 4 spindles' worth of IOPS. Capacity for an
8-disk raidz2 vdev would be identical to a raid50 volume made from the
same disks.
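For reference, creating that single raidz2 vdev is one command (pool and
device names here are hypothetical):

# one 8-disk raidz2 vdev: any two disks can fail without data loss
zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh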

While auto hot-spares aren't working in ZFS on Linux currently, ZFS
rebuilds do have the potential to be quite a lot faster than traditional
RAID5 / RAID50 / RAID51 which is something else to consider. Since ZFS
uses file-level RAID, it only needs to resilver the actual data during a
rebuild. RAID5/50/51 need to rebuild the free space as well since they
only know about blocks on disk, and have no knowledge of what is
actually in those blocks or whether the blocks contain valid data, free
space, deleted files, etc. So an 8TB volume with 2TB of data on
traditional RAID will take much longer to rebuild than an 8TB ZFS pool,
since ZFS only needs to worry about the 2TB of actual data.

Since ZFS is actually aware of the data on disk, and is creating
checksums and parity data for each individual file, data security is
already better than it would be with any of the other 'traditional' RAID
arrays. You can have flipped bits or other errors in traditional RAID
that won't ever be caught since the filesystem doesn't know it is
serving up bad data, and the RAID controller or software MD layer
doesn't know since it is just returning whatever is written. Read about
the "raid5 write hole" to get what I'm talking about.
e-t172
2012-07-27 21:59:38 UTC
Permalink
Post by Bryn Hughes
If you mean "add a single disk to an existing raidz/raidz2/raidz3 pool"
then no, you can't do that. If you mean "add another new entire
raidz/raidz2/raidz3 vdev of disks to an existing ZFS pool" then yes you
can. I do not believe there is any mechanism under ZFS to redistribute
existing data across the additional disks though (NetApp can do that for
example with a background job). I believe new files get distributed
across all while existing data stays as it is.
Not quite. You're correct in that ZFS won't rebalance existing data
(this would require the famous "block pointer rewrite" vaporware
feature). However, if you add a new, empty top-level vdev, ZFS will use
it preferentially for new writes because the allocator always tries to
equalize used space in all vdevs. See:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/metaslab.c#L1439
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
e-t172
2012-07-27 22:06:03 UTC
Permalink
Post by Bryn Hughes
If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
To clarify: you can snapshot zvols, and the zvol blocks are in fact
checksummed by ZFS. It's just that ZFS isn't aware of the internal
structure of the zvol (i.e. NTFS), so you can't use these features with
file-level granularity.
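For example (pool and zvol names here are hypothetical):

# snapshot, inspect and roll back a zvol just like a filesystem dataset
zfs snapshot tank/winvol@before-update
zfs get checksum,used,referenced tank/winvol
zfs rollback tank/winvol@before-update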
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
J Metro
2012-07-28 00:46:59 UTC
Permalink
OK, I think I got some things clarified, but it brought up more questions
regarding your answers (!)

1. Has anyone here ever used a Windows 2008 server in a production
environment [I'm thinking voice, email, data, AD, web servers] and ported
over to ZFS? Or put any of those functions in such a manner that they were
being supported by ZFS? If so, how did you get it to work? or maybe it
would just be easier to backup the data to ZFS and keep the data itself on
the server and just perform a nightly backup?
I'm thinking that might be the easiest solution instead of trying to get a
bunch of Windows clients already on Windows servers to hook up to a Linux
server, or to try and do that zvol -> iSCSI thing... or use Samba and deal
with issues every day [most likely].

2. is solved. ZFS is horizontally expandable as far as I'm concerned.

3. What I mean is, if you type "mount" and it shows "zfs" that could just
be something specifying it as ZFS... there's no visible proof that you're
using ZFS. I'm likening it to a paranoid person thinking that all the stuff
saying "zfs" is just smoke and mirrors disguising normal operation.
Maybe if you put a 1GB file on the pool and could see parts of the file on
each individual drive? I'm not sure how you could even prove it.

4. is answered. ZFS auto-stripes. yay.

5. This one is still unclear. It seems like since ZFS auto-stripes, you can
definitely set up a raid 10 by making, say, 5 groups of disks, each being a
'mirror' with two disks in it. But you couldn't do a raid 51 [z1?] because
you can't designate one of the mirror groups to be a parity group.
Post by e-t172
Post by Bryn Hughes
If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
To clarify: you can snapshot zvols, and the zvol blocks are in fact
checksummed by ZFS. It's just that ZFS isn't aware of the internal
structure of the zvol (i.e. NTFS), so you can't use these features with
file-level granularity.
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
RB
2012-07-28 01:26:07 UTC
Permalink
3. What i mean is, if you type "mount" and it shows "zfs" that could just be
something specifying it as ZFS... theres no visible proof that youre using
ZFS. I'm likening it to a paranoid person thinking that all the stuff saying
"zfs" is just smoke and mirrors disguising normal operation.
Maybe if you put a 1gb file on the pool and could see parts of the file on
each individual drive? Im not sure how you could even prove it.
I'm unsure what you're trying to prove. If you set ZFS volumes'
"mountpoint" attribute, they mount at points other than the pool's
root (traditionally under the root filesystem with the same name as the
pool) and you see "zfs" as the filesystem type:

[***@test ~] sudo zfs get mountpoint z/home
NAME PROPERTY VALUE SOURCE
z/home mountpoint /home local
[***@test ~] mount | grep /home
z/home on /home type zfs (rw,xattr)

Otherwise, you won't see them in /etc/mtab (the source of "mount"
output) but you do in /proc/mounts (the actual source):

[***@test ~] grep ^z /proc/mounts
z/home /home zfs rw,relatime,xattr 0 0
z /z zfs rw,relatime,xattr 0 0

To go deeper than that you're going to have to get in and forensically
prove it to yourself, which is going to be hard since there aren't any
forensic tools for ZFS. Look at the output of 'zpool status' to get a
good view of the actual structure of your pools. Hell, if it's smoke
and mirrors, it's a great fake and one that I'm happy to use.
J Metro
2012-07-28 03:21:29 UTC
Permalink
Ah alright, I wasn't sure what was going on there. I just wanted a way to
prove that ZFS was running, and the commands you gave should do that
sufficiently.
Post by RB
Post by J Metro
3. What I mean is, if you type "mount" and it shows "zfs" that could just
be something specifying it as ZFS... there's no visible proof that you're
using ZFS. I'm likening it to a paranoid person thinking that all the stuff
saying "zfs" is just smoke and mirrors disguising normal operation.
Maybe if you put a 1GB file on the pool and could see parts of the file on
each individual drive? I'm not sure how you could even prove it.
I'm unsure what you're trying to prove. If you set ZFS volumes'
"mountpoint" attribute, they mount at points other than the pool's
root (traditionally under the root filesystem with the same name as the
pool) and you see "zfs" as the filesystem type:
NAME PROPERTY VALUE SOURCE
z/home mountpoint /home local
z/home on /home type zfs (rw,xattr)
Otherwise, you won't see them in /etc/mtab (the source of "mount"
output) but you do in /proc/mounts (the actual source):
z/home /home zfs rw,relatime,xattr 0 0
z /z zfs rw,relatime,xattr 0 0
To go deeper than that you're going to have to get in and forensically
prove it to yourself, which is going to be hard since there aren't any
forensic tools for ZFS. Look at the output of 'zpool status' to get a
good view of the actual structure of your pools. Hell, if it's smoke
and mirrors, it's a great fake and one that I'm happy to use.
Stephane Chazelas
2012-07-28 05:49:09 UTC
Permalink
Post by e-t172
Post by Bryn Hughes
If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
To clarify: you can snapshot zvols, and the zvol blocks are in fact
checksummed by ZFS. It's just that ZFS isn't aware of the internal
structure of the zvol (i.e. NTFS), so you can't use these features
with file-level granularity.
[...]

Speaking of which.

Does anybody know of any preferably opensource way to freeze a
filesystem on Windows prior to taking the zfs snapshot (short of
shutting down or suspending the Windows machine).

Some commercial backup solutions for VMs seem to be able to do
that and make use of Windows VSS system, but I didn't find any
simple tool like Linux' fsfreeze to do that.

What do you guys use to snapshot your iSCSI zvols?
--
Stephane
e-t172
2012-07-28 08:38:27 UTC
Permalink
Post by Stephane Chazelas
Does anybody know of any preferably opensource way to freeze a
filesystem on Windows prior to taking the zfs snapshot (short of
shutting down or suspending the Windows machine).
Some commercial backup solutions for VMs seem to be able to do
that and make use of Windows VSS system, but I didn't find any
simple tool like Linux' fsfreeze to do that.
What do you guys use to snapshot you iSCSI zvols?
You don't need to use anything. Just snapshot it. ZFS is smart enough to
make the snapshot consistent in regard to sync() semantics, and will
take care of the freezing itself. If the OS and applications you're
running on top of the zvol know how to use sync() and cache flushes
(which is true for Windows, NTFS, and most well-written applications
like databases), then they will recover cleanly from a snapshot taken at
any point in time, even during heavy writes to the zvol.

See: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zil.c#L1826
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Jason Pfingstmann
2012-07-28 16:44:56 UTC
Permalink
e-t172,

One thing I'm confused about, mostly because the link you sent was to zil.c
- does ZFS require a ZIL in order to do effective snapshotting of a ZVOL?
I've been under the impression that it was not needed and only useful in
case of a power disruption to save data in-flight, but now I'm wondering if
I missed something.
Post by e-t172
Post by Stephane Chazelas
Does anybody know of any preferably opensource way to freeze a
filesystem on Windows prior to taking the zfs snapshot (short of
shutting down or suspending the Windows machine).
Some commercial backup solutions for VMs seem to be able to do
that and make use of Windows VSS system, but I didn't find any
simple tool like Linux' fsfreeze to do that.
What do you guys use to snapshot you iSCSI zvols?
You don't need to use anything. Just snapshot it. ZFS is smart enough to
make the snapshot consistent in regard to sync() semantics, and will take
care of the freezing itself. If the OS and applications you're running on
top of the zvol know how to use sync() and cache flushes (which is true for
Windows, NTFS, and most well-written applications like databases), then
they will recover cleanly from a snapshot taken at any point of time, even
during heavy writes to the zvol.
See: https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zil.c#L1826
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Stephane Chazelas
2012-07-28 17:40:03 UTC
Permalink
Post by Jason Pfingstmann
e-t172,
One thing I'm confused about, mostly because the link you sent was to zil.c
- does ZFS require a ZIL in order to do effective snapshotting of a ZVOL?
I've been under the impression that it was not needed and only useful in
case of a power disruption to save data in-flight, but now I'm wondering if
I missed something.
[...]

My understanding is that there's always a zil. That zil can be
put on its own device, but it's not necessary, just better from
a performance standpoint.
--
Stephane
e-t172
2012-07-28 19:25:22 UTC
Permalink
Post by Jason Pfingstmann
One thing I'm confused about, mostly because the link you sent was to
zil.c - does ZFS require a ZIL in order to do effective snapshotting of
a ZVOL? I've been under the impression that it was not needed and only
useful in case of a power disruption to save data in-flight, but now I'm
wondering if I missed something.
To clarify: there is one ZIL per dataset. Every write operation on a
dataset goes to the ZIL for this dataset (unless it has
"sync=disabled"). Note that I'm talking in-memory ZIL here. I'm not
writing anything to disk for the moment. Let me repeat that: the ZIL is
first and foremost an in-memory structure, which *sometimes* (not
always) gets written to disk. Alas, some people say "ZIL" when they
really mean "on-disk ZIL", which tends to cause some confusion.

At any point in time, the user of the dataset can request a ZIL commit
(this is usually triggered by fsync() for filesystems or SYNCHRONIZE
CACHE requests for zvols). This will write the entire in-memory ZIL to
disk, flush the disk caches to make sure it's really written, erase the
in-memory ZIL, and start a new one. The on-disk ZIL is never read unless
ZFS is recovering from a crash, in which case the contents of the
on-disk ZIL are replayed. When ZFS completes a TXG sync (which happens
roughly every five seconds), the ZIL (both in-memory and on-disk) is
discarded because its contents are now redundant with the contents of
the pool itself. This basically means that the lifetime of a ZIL is no
more than five seconds or so. It's just a transient construct designed
to speed up synchronous writes. If the ZIL didn't exist, ZFS would have
to trigger a TXG sync instead of a ZIL commit each time a user requests
it, which would result in very poor performance.
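Two commands related to the above (dataset and device names here are
hypothetical): the per-dataset "sync" property controls whether ZIL commits
are honoured, and a dedicated log device moves the on-disk ZIL off the main
vdevs:

# see or change how a dataset handles ZIL commits (standard, always, disabled)
zfs get sync tank/vm-disk
zfs set sync=disabled tank/scratch
# put the on-disk ZIL on a dedicated (ideally fast, low-latency) device
zpool add tank log /dev/disk/by-id/ata-FAST-SSD-part1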

When a snapshot is being taken, several things happen. First, ZFS syncs
the TXG (thus discarding both the in-memory and on-disk ZIL). Second,
the snapshot is written. The reason why ZFS does that is to avoid having
to keep the on-disk ZIL for the snapshot, because ZIL blocks are not
supposed to linger around for long periods of time (it's not space
efficient).

This procedure is implemented in the piece of code I linked to. I used
it to prove my point about ZFS making sure snapshots are consistent in
regard to fsync()/synchronize cache semantics.
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Stephane Chazelas
2012-07-28 17:33:35 UTC
Permalink
Post by e-t172
Post by Stephane Chazelas
Does anybody know of any preferably opensource way to freeze a
filesystem on Windows prior to taking the zfs snapshot (short of
shutting down or suspending the Windows machine).
Some commercial backup solutions for VMs seem to be able to do
that and make use of Windows VSS system, but I didn't find any
simple tool like Linux' fsfreeze to do that.
What do you guys use to snapshot you iSCSI zvols?
You don't need to use anything. Just snapshot it. ZFS is smart
enough to make the snapshot consistent in regard to sync()
semantics, and will take care of the freezing itself. If the OS and
applications you're running on top of the zvol know how to use
sync() and cache flushes (which is true for Windows, NTFS, and most
well-written applications like databases), then they will recover
cleanly from a snapshot taken at any point of time, even during
heavy writes to the zvol.
[...]

Hi Etienne,

I've got no doubt that zfs freezes the zvol consistently (true
snapshot); my question was not really a ZFS question, but more
like an MS Windows question asked to ZFS users.

When you do a snapshot of the disk under a live system, you've
got no guarantee that the data on disk is consistent, as there's
data not flushed to disk by the applications and by the FS on
top of the zvol.

When I've got a Linux VM with a disk on a host LVM logical
volume, prior to taking a snapshot on the host, if there's mysql
running on the guest, I do a "FLUSH TABLES WITH READ LOCK"
(which tells mysql to flush everything to disk and block its
clients from committing anything new), and then a fsfreeze which
tells Linux VFS to flush everything to disk and block every
application from writing anything new to the FS. Then, I can
take the snapshot and after thaw the FS and release the mysql
lock. This way, I know the FS on the LVM snapshot is consistent
and the database as well.
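For reference, the freeze/snapshot/thaw part of that sequence looks roughly
like this (volume names and paths are made up, and the MySQL lock has to be
held in a client session that stays open across the snapshot):

# guest: flush dirty data and block new writes to the filesystem
fsfreeze -f /var/lib/mysql
# host: take the block-level snapshot of the guest's backing volume
lvcreate -s -L 10G -n guest-snap /dev/vg0/guest-disk
# guest: thaw so applications can write again
fsfreeze -u /var/lib/mysql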

MS Windows has a nice feature called VSS to do that same thing
which was introduced to be able to do consistent backups as
well. It's even got a plugin mechanism where every application
(database, DNS server...) can specify how they flush their own
data.

In Windows, it's linked to volume shadow copy (some sort of
copy-on-write snapshot mechanism within NTFS), but I believe it
can be used in cases like ZFS where something under the OS can
snapshot the drive. That's what is done by NetApp and many
commercial enterprise Virtualisation solutions.

I was wondering: what do ZFS users do?

Of course, I can just do the snapshot without bothering freezing
the FS (after all, it's like taking a snapshot of a machine's
hard disk just after a power cut or a system crash and MS Windows
has to be robust to those), but I'm looking for a cleaner
solution if only to avoid having the snapshot FS marked unclean.
--
Stephane
e-t172
2012-07-28 18:35:52 UTC
Permalink
Post by Stephane Chazelas
When you do a snapshot of the disk under a live system, you've
got no warranty that the data on disk is consistent, as there's
data not flushed to disk by the applications and by the FS on
top of zvol.
Actually, yes you do, if the applications are written correctly. That's
why modern systems are able to reboot cleanly after a sudden power
failure, for example.

Well-written applications (e.g. Windows, databases) are well-aware that
the data they are sending to disk is not guaranteed to be persistent or
even consistent until they issue a flush (e.g. using the fsync() system
call under Linux).
Post by Stephane Chazelas
When I've got a Linux VM with a disk on a host LVM logical
volume, prior to taking a snapshot on the host, if there's mysql
running on the guest, I do a "FLUSH TABLES WITH READ LOCK"
(which tells mysql to flush everything to disk and block its
clients from committing anything new), and then a fsfreeze which
tells Linux VFS to flush everything to disk and block every
application from writing anything new to the FS. Then, I can
take the snapshot and after thaw the FS and release the mysql
lock. This way, I know the FS on the LVM snapshot is consistent
and the database as well.
You shouldn't need to do that. In the specific case of MySQL, it will
itself issue cache flushes in order to ensure durability (the "D" in
ACID) if it needs to, and it will not tell MySQL clients the data is
securely written until the cache flush succeeds.

For example, if you're using transactions with your MySQL database, then
the MySQL server will issue a cache flush when an application uses the
COMMIT command. It will not return "OK" to the commit command until the
cache flush returns successfully.

In the end there are two possible endgames:
- Either the crash/reboot/power failure/snapshot happened *before* the
commit succeeded, in which case the transaction data is lost, but that's
okay because well-written applications know that a commit can fail ;
- Or the crash/reboot/power failure/snapshot happened *after* the commit
succeeded, in which case we are sure the data is still there because the
cache was flushed before the aforementioned event happened.

So, in both cases, everything is still consistent. If your application
is able to survive a power failure, it is likewise guaranteed to play
well with on-the-fly snapshots, because that's basically the same thing
from the point of view of the guest.

This is nothing new. Cache flushes have been used by filesystems,
databases, applications, etc. to ensure consistency for a long time. In
most cases this is implemented using a log: for example, when the MySQL
server receives a commit, it will append the transaction to a log file,
then fsync() the log file (which, at the filesystem level, will trigger
a metadata write followed by a cache flush). It considers the
transaction committed as soon as the fsync() completes, not before. If
the server is recovering from a crash/power failure/snapshot, it will
replay the contents of the transaction log file before accepting new
transactions from clients. It's fully designed to handle this kind of
situation.

If recovering from a snapshot breaks your data, the problem is with your
applications, not with the snapshot.
Post by Stephane Chazelas
MS Windows has a nice feature called VSS to do that same thing
which was introduced to be able to do consistent backups as
well. It's even got a plugin mechanism where every application
(database, DNS server...) can specify how they flush their own
data.
Again, with well-written applications, this shouldn't be necessary. It
could, however, be more efficient, because it could allow applications
to empty their log files before the backup, which would result in
(slightly) smaller and (slightly) faster backups, and avoids the log
replay when recovering. That's nitpicking for most people.
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Stephane Chazelas
2012-07-28 19:10:16 UTC
Permalink
2012-07-28 20:35:52 +0200, e-t172:
[...]
Post by e-t172
Post by Stephane Chazelas
MS Windows has a nice feature called VSS to do that same thing
which was introduced to be able to do consistent backups as
well. It's even got a plugin mechanism where every application
(database, DNS server...) can specify how they flush their own
data.
Again, with well-written applications, this shouldn't be necessary.
[...]

I see your point. There are still a few things that bother me:

- All commercial backup solutions do it (quiesce the FS before
snapshotting); there has to be a reason.
- If you've got two applications that work together but have
different in-memory caches, what's on disk at any given point in
time may be in a consistent state for each of them taken
separately, but combined together it won't necessarily be true.
- If you snapshot a live ext4, you won't be able to mount the
snapshot unless you make the snapshot writable (with the device
mapper or by cloning it) as it wants to replay the log and write
to disk and complains if it can't.
- NTFS FS will be marked unclean, which means a fsck when you
restore the backup.

A side (Windows again) question: what would be the best approach
to avoid backing up the swap and hibernation files?

For the moment, I'm using ZFS on a backup server. To back up
Windows machines, I use vshadow to make a snapshot of the NTFS
FS, then export the block device via a cygwin nbd-server in
copy-on-write mode, then have the backup server connect that nbd
device, remove the swap and hiber files and use a patched
ntfsclone to transfer the modified clusters (and also do some
BLKDISCARDs for unallocated data) since the last backup. Not
very bandwidth nor disk I/O efficient, but space efficient on
the backup server and that means I've got full backups that can
be used to boot a VM out of them.
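On the backup-server side, the nbd/ntfsclone part of that workflow is
roughly the following (host name, port and paths are made up, and this
leaves out the custom patches mentioned above):

modprobe nbd
# attach the exported shadow copy from the Windows box
nbd-client windows-host 10809 /dev/nbd0
# stream the allocated NTFS clusters into an image on the ZFS pool
ntfsclone --save-image --output /tank/backups/winhost.img /dev/nbd0
nbd-client -d /dev/nbd0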

Now, I'm trying to do something similar but doing live snapshots
of zvols that are actually Windows VM disks. If I just do
snapshots, I back up the swap and hiber files and also deleted
data (unless I can figure out how to have Windows cause kvm to do
BLKDISCARDs?). I could do the same as above (ntfsclone to
another zvol) but I was wondering if there was any clever way to
avoid it. Probably not.
--
Stephane
e-t172
2012-07-28 19:38:15 UTC
Permalink
Post by Stephane Chazelas
- All commercial backup solutions do it (quiesce the FS before
snapshotting), there has to be reason.
Well, I outlined potential reasons at the end of my post. I think it has
to do with performance, space efficiency and "cleanliness", not consistency.
Post by Stephane Chazelas
- If you've got two applications that work together, but have
different in-memory cache. At any given point in time, what's on
disk may be in a consistent state for each one of them taken
separately, but combined together, it won't necessarily be true
Such applications should synchronize with each other. If they don't, it
means you can't guarantee consistency in case of a crash, which is a
much bigger problem than snapshots. Again, fix the applications.
Post by Stephane Chazelas
- If you snapshot a live ext4, you won't be able to mount the
snapshot unless you make the snapshot writable (with the device
mapper or by cloning it) as it wants to replay the log and write
to disk and complains if it can't.
That's ext4's problem. Besides, I'm not sure why having to clone the
dataset would be an issue, as it doesn't cost you anything.
Post by Stephane Chazelas
- NTFS FS will be marked unclean, which means a fsck when you
restore the backup.
Again, with well-written applications, this shouldn't be necessary. It
could, however, be more efficient, because it could allow applications
to empty their log files before the backup, which would result in
(slightly) smaller and (slightly) faster backups, and avoids the log
replay when recovering.
In this case, a fsck is a log replay at the filesystem level.
Post by Stephane Chazelas
A side (Windows again) question: what would be the best approach
to avoid backing up the swap and hibernation files?
Store them on a different volume?
Post by Stephane Chazelas
(unless I can figure out to have Windows cause kvm to do
BLKDISCARDs?)
qemu doesn't support it, unless you use the IDE driver, the
"discard_granularity" ide-hd device option (set it to 512), and this
patch: http://lists.gnu.org/archive/html/qemu-devel/2011-11/msg01659.html
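Following that description, the invocation would look something like this
(paths and ids are made up, and it only has an effect with that patch
applied):

qemu-kvm \
  -drive file=/dev/zvol/tank/winvol,if=none,id=drive0,format=raw,cache=none \
  -device ide-hd,drive=drive0,discard_granularity=512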
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Ryan How
2012-07-29 11:26:49 UTC
Permalink
Wow that is impressive :).

I've been looking for a good backup solution; that would be the best
I've seen in terms of space efficiency and consistency (not performance).

The only other thing I've done is pause and save VM state, snapshot,
then resume the VM. But that is dangerous coz you need to restore it on
the same hardware to resume. Although even that few seconds to pause and
save the VM state is too long for some servers and causes timeouts for
clients :(

I think your VSS solution and exporting is the best you are going to get
unless you move to something more commercial and Windows-orientated.

And saying "fix the application" doesn't help the millions of Windows
Servers for which VSS is the standard for live backup :).

Ryan
Post by Stephane Chazelas
For the moment, I'm using zfs on a backup server. To backup
Windows machines, I use vshadow to make a snapshot of the NTFS
FS, then export the block device by a cygwin nbd-server in copy
on write mode, then have the backup server connect that nbd
device, remove the swap and hiber files and use a patched
ntfsclone to transfer the modified clusters (and also do some
BLKDISCARDs for unallocated data) since the last backup. Not
very bandwidth nor disk I/O efficient, but space efficient on
the backup server and that means I've got full backups that can
be used to boot a VM out of them.
Fajar A. Nugraha
2012-07-29 23:03:06 UTC
Permalink
The only other thing I've done is pause and save VM state, snapshot, then
resume the VM.
Since you already know that Windows (and most applications it runs)
can survive a power loss (thus can survive snapshots as well), you
could always just do the snapshot without pause/unpause.
But that is dangerous coz you need to restore it on the same
hardware to resume. Although even that few seconds to pause and save the VM
state is too long for some servers and causes timeouts for clients :(
Exactly.

You're trying to avoid the fs being marked unclean, but introduce a
new set of (IMHO more severe) problems. In this case, KISS is best.
I think your vss solution and exporting is the best you are going to get
unless you move to something more commercial and windows orientated.
True.

But it also requires extra, possibly time-consuming steps, compared to a
simple zfs snapshot, where a "restore" or "clone" is basically just a
"zfs clone" plus assigning the new zvol to a VM/server.

Which one is "best" might be different for each person, I guess.
--
Fajar
Manuel Amador (Rudd-O)
2012-07-28 00:52:46 UTC
Permalink
Post by Bryn Hughes
If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
This is incorrect. Data checksums and snapshots are supported for
ZVols.
J Metro
2012-07-28 01:05:25 UTC
Permalink
I see no mount points of type "zfs" when I type "mount" in a terminal, yet
zfs list and zpool status show it's running... Screenshot in link:
http://imageshack.us/photo/my-images/528/capture2uu.png/


On Fri, Jul 27, 2012 at 7:52 PM, Manuel Amador (Rudd-O)
Post by Bryn Hughes
If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
This is incorrect. Data checksums and snapshots are supported for ZVols.
e-t172
2012-07-28 08:43:18 UTC
Permalink
Post by J Metro
I see no mount points of type "zfs" when I type "mount" in terminal
yet zfs list and zpool status show its running...Screenshot in link.
http://imageshack.us/photo/my-images/528/capture2uu.png/
That's strange. The datasets should be mounted by default. Try:

# zfs mount -a

Also, check /proc/mounts which is more reliable than the mount command.
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
Bryn Hughes
2012-07-30 03:20:43 UTC
Permalink
Post by Manuel Amador (Rudd-O)
Post by Bryn Hughes
If you mean "Can I provision space that a Windows server will use for
its own storage" for those types of applications, then again yes - you
can create zvols and export them with iSCSI, which can then in turn be
mounted by the Windows server as if it were a direct attached disk.
However when you do that you won't have nice happy friendly snapshots
and things, nor do you get file-level checksums or any of the other ZFS
goodness. On the Linux side it would essentially appear as a single
file, with Windows itself handling all the filesystem stuff using NTFS
inside of that.
This is incorrect. Data checksums and snapshots are supported for ZVols.
ZFS can do checksums for the zvol itself (like I said above), but you
don't get checksums for individual files within the zvol. If the zvol's
filesystem (NTFS in the example here) becomes confused about the
contents of the zvol, there is nothing you can do with ZFS to fix that.
As long as the data matches what ZFS thinks it should, the zvol's
checksum will match. If the data is internally corrupt, if the
filesystem within the zvol became corrupt, or if the application wrote a
whole string of random bits to the zvol, there would be no way for ZFS to
know. ZFS will at least know if the data written to disk changed - i.e.
bit flips on the hard drive or random read errors, etc. - but you won't
have file-level checksums like you would if you used ZFS "natively".

Likewise with snapshots - you can definitely snapshot a zvol, but that
doesn't mean that any filesystem within the zvol was in a consistent
state when you took the snapshot. Much like how LVM snapshots are only
so useful - a snapshot taken at the block level needs a mechanism to
ensure that the application data is written completely to disk and the
copy on disk means something. If you snapshot a database part way
through committing a transaction, the data is not in a valid state. A
filesystem in a zvol is just another kind of database, so you need to
treat it the same way.

Bryn
e-t172
2012-07-30 07:34:00 UTC
Permalink
Post by Bryn Hughes
ZFS can do checksums for the zvol itself (like I said above), but you
don't get checksums for individual files within the zvol. If the zvol's
filesystem (NTFS in the example here) becomes confused about the
contents of the zvol, there is nothing you can do with ZFS to fix that.
As long as the data matches what ZFS thinks it should, the zvol's
checksum will match. If the data internally is corrupt, if the
filesystem within the zvol became corrupt or if the application wrote a
whole string of random bits to the zvol there would be no way for ZFS to
know. ZFS will at least know if the data written to disk changed - ie
bit flips on the hard drive or random read errors, etc, but you won't
have file-level checksums like you would if you used ZFS "natively".
ZFS protects against hardware errors. This means the only way to corrupt
a filesystem inside a zvol is if there is a bug in the filesystem.
Indeed, there might be bugs inside NTFS. But there might be bugs inside
ZFS, too. So I don't see your point.
Post by Bryn Hughes
Likewise with snapshots - you can definitely snapshot a zvol, but that
doesn't mean that any filesystem within the zvol was in a consistent
state when you took the snapshot.
With any decent, modern, journalling filesystem like NTFS or ext3, yes
it does.
Post by Bryn Hughes
If you snapshot a database part way
through committing a transaction, the data is not in a valid state.
Yes it is. Database servers can handle crashes and power failures.
Likewise they can handle volume snapshots. Please read other posts in
this thread for more thorough explanations.
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
e-t172
2012-07-27 21:50:49 UTC
Permalink
Post by J Metro
1. Is there any way to provide interoperability between a Windows 2008
server and a ZFS pool [on a linux server?] to store the files? [email,
file hosting, active directory, anything a windows server would provide]
Well, you can use Samba. I believe there are some issues with ACL
interoperability, however, as ZoL currently does not support extended ACLs,
which are used by Samba.

See: https://github.com/zfsonlinux/zfs/issues/170
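A minimal smb.conf share of a ZFS dataset (path and share name here are
made up) looks the same as for any other filesystem; the xattr-based ACL
support is where the issue above bites:

[shared]
    path = /tank/shared
    read only = no
    ; NT ACLs would normally be stored in xattrs, e.g. via 'vfs objects = acl_xattr'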
Post by J Metro
2. I've read a lot of posts that say ZFS is not horizontally scalable,
is this still true? It seems like it is not true, since you can still
add disks to a ZFS pool [instead of just replacing a disk with a larger
volume disk] and it will function properly...
I'm not quite sure what you mean exactly by "horizontally scalable" in
this context.
Post by J Metro
3. Is there any way to prove that my ZFS pool is actually running ZFS?
Like, without a doubt prove?
What do you mean? You can't have a ZFS pool which is "not running ZFS".
If you have a zpool, then by definition, you're using ZFS to access it.
If "mount" shows a mount point with type "zfs", then you can be 100%
sure it's stored in a zpool and you're using ZFS to access it.
Post by J Metro
4. It seems like ZFS auto-stripes, at least, according to zfsbuild.com.
Is this true? If i take 8 10gb disks, and set up 2 raidz's with 4 disks
each, is that really an 8 disk raidz? is it smart enough to put data on
drive 1 then 2 then 3 then 4 then 5 then 6 then 7 then 8, and not 1 2 3
4 1 2 3 4.... until full, then 5 6 7 8?
Well, a raidz is a top-level vdev, and data is striped across top-level
vdevs. In your example, if ZFS has to write 2 blocks, the first block
will go to the first raidz, and the second block will go to the second
raidz.

In any case ZFS never does "concatenation", i.e. it will never wait for
a device to become full before moving to the next one. The allocator
algorithm actually does the opposite: it tries to make sure all devices
are equally full.
Post by J Metro
5. In question 4, i have 2 raidz's with 4 disks each. Is that a Raid 50?
Yes.
Post by J Metro
How do I setup a Raid 51? What is ZFS smart enough to do on its own? I
just need examples like "4 groups of disks, each group has 2 disks setup
as mirror" or "2 groups of disks, each group has 4 disks as a raidz" not
full page explanations.
mirrors and raidz cannot be nested either way, i.e. you can't have RAID
51 nor RAID 15. You have to choose between mirror and raidz (well,
strictly speaking you can have both mirrors and raidz as top-level
vdevs, but that would be a strange configuration). Also, you can do RAID
50 (see above), but you can't do RAID 05.

So to summarize, here's what you can do with ZFS:
- RAID 0
- RAID 1
- RAID 5
- RAID 6
- RAID 10
- RAID 50
- RAID 60
- Some weird, asymmetric combination like RAID (1,5)0 or RAID (1,6)0

That's all. There are ways to work around these limitations, for example
using Linux software RAID in combination with ZFS, but that's kinda ugly
and you lose some ZFS features in the process.
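As concrete sketches of some of the layouts above (device names are
hypothetical):

# RAID 10: a stripe of mirrors
zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf
# RAID 60: a stripe of two raidz2 vdevs
zpool create tank raidz2 sda sdb sdc sdd sde sdf raidz2 sdg sdh sdi sdj sdk sdl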
Post by J Metro
I had more questions, but I thought about them in the shower before work
today, and of course, forgot them by now. Maybe tomorrow's shower will
bring them back. I dreamt about ZFS last night.
It's sure worth dreaming about ;)
--
Etienne Dechamps / e-t172 - AKE Group
Phone: +33 6 23 42 24 82
J Metro
2012-08-01 15:24:10 UTC
Permalink
Did anyone reply to my original question? This is the most important one --
Has anyone here ever used a Windows 2008 server in a production environment
[I'm thinking voice, email, data, AD, web servers] and ported over to ZFS?
Or put any of those functions in such a manner that they were being
supported by ZFS? If so, how did you get it to work? or maybe it would just
be easier to backup the data to ZFS and keep the data itself on the server
and just perform a nightly backup?
I'm thinking that might be the easiest solution [a nightly backup to a ZFS
volume] instead of trying to get a bunch of Windows clients already on
Windows servers to hook up to a Linux server, or to try and do that zvol ->
iSCSI thing.
Post by J Metro
Hi all, Just some questions about ZFS in general.
1. Is there any way to provide interoperability between a Windows 2008
server and a ZFS pool [on a linux server?] to store the files? [email, file
hosting, active directory, anything a windows server would provide]
2. I've read a lot of posts that say ZFS is not horizontally scalable, is
this still true? It seems like it is not true, since you can still add
disks to a ZFS pool [instead of just replacing a disk with a larger volume
disk] and it will function properly...
3. Is there any way to prove that my ZFS pool is actually running ZFS?
Like, without a doubt prove?
4. It seems like ZFS auto-stripes, at least, according to zfsbuild.com.
Is this true? If i take 8 10gb disks, and set up 2 raidz's with 4 disks
each, is that really an 8 disk raidz? is it smart enough to put data on
drive 1 then 2 then 3 then 4 then 5 then 6 then 7 then 8, and not 1 2 3 4
1 2 3 4.... until full, then 5 6 7 8?
5. In question 4, i have 2 raidz's with 4 disks each. Is that a Raid 50?
How do I setup a Raid 51? What is ZFS smart enough to do on its own? I just
need examples like "4 groups of disks, each group has 2 disks setup as
mirror" or "2 groups of disks, each group has 4 disks as a raidz" not full
page explanations.
I had more questions, but I thought about them in the shower before work
today, and of course, forgot them by now. Maybe tomorrow's shower will
bring them back. I dreamt about ZFS last night.
Björn Kahl
2012-08-01 16:19:04 UTC
Permalink
Post by J Metro
Did anyone reply to my original question? This is the most important one --
Has anyone here ever used a Windows 2008 server in a production environment
[im thinking voice, email, data, AD, web servers] and ported over to ZFS?
Or put any of those functions in such a manner that they were being
supported by ZFS?
Honestly, I do not understand your question.

What do you mean by "... [im thinking voice, email, data, AD, web
servers] and ported over to ZFS?"
or by "Or put any of those functions in such a manner that they were
being supported by ZFS"?

ZFS is just a file system (with integrated volume management), just as
NTFS, Ext3/4, FAT etc. are just file systems. File systems do not
"support" "voice, email, data, AD, web servers". They just store
files. (Of course, different file systems may be more or less
suitable to a given workload, i.e. regarding max. file size,
scalability, fault tolerance etc., but that's a different question.)

So what is your question? And most important: Why do you want to use
ZFS? ZFS is a great file system, suitable for many different
workloads, but I simply fail to see the connection between the
services you mentioned and your question of being supported by ZFS.

And just in case it hasn't been mentioned so far: There is no native
ZFS implementation for Windows. You cannot have your Windows Server
2008 use a ZFS pool as a local disk.
Post by J Metro
If so, how did you get it to work? or maybe it would just
be easier to backup the data to ZFS and keep the data itself on the server
and just perform a nightly backup?
I'm thinking that might be the easiest solution [a nightly backup to a ZFS
volume ] instead of trying to get a bunch of Windows clients already on
Windows servers to hook up to a Linux server, or to try and do that Zvol ->
Iscsi thing
--
| Bjoern Kahl +++ Siegburg +++ Germany |
| "googlelogin@-my-domain-" +++ www.bjoern-kahl.de |
| Languages: German, English, Ancient Latin (a bit :-)) |
RB
2012-08-01 17:00:48 UTC
Permalink
Post by J Metro
Did anyone reply to my original question? This is the most important one --
Has anyone here ever used a Windows 2008 server in a production environment
[im thinking voice, email, data, AD, web servers] and ported over to ZFS? Or
put any of those functions in such a manner that they were being supported
by ZFS? If so, how did you get it to work? or maybe it would just be easier
to backup the data to ZFS and keep the data itself on the server and just
perform a nightly backup?
I'm thinking that might be the easiest solution [a nightly backup to a ZFS
volume ] instead of trying to get a bunch of Windows clients already on
Windows servers to hook up to a Linux server, or to try and do that Zvol ->
Iscsi thing
I'm running virtual 2k8 servers on Linux KVM, and their disk images
are stored on a Linux ZFS server via NFS. Hourly/daily snapshots,
etc. The application owner is still doing their own backups because
of the criticality of the data, but guess what - the backup is on a
ZFS/CIFS share. It's not a high-volume or large environment, but it
works fine.
Christ Schlacta
2012-08-01 19:06:28 UTC
Permalink
I have a WIMP development server running on a ZFS zvol.
Post by J Metro
Did anyone reply to my original question? This is the most important one --
Has anyone here ever used a Windows 2008 server in a production
environment [im thinking voice, email, data, AD, web servers] and
ported over to ZFS? Or put any of those functions in such a manner
that they were being supported by ZFS? If so, how did you get it to
work? or maybe it would just be easier to backup the data to ZFS and
keep the data itself on the server and just perform a nightly backup?
I'm thinking that might be the easiest solution [a nightly backup to
a ZFS volume ] instead of trying to get a bunch of Windows clients
already on Windows servers to hook up to a Linux server, or to try and
do that Zvol -> Iscsi thing
Hi all, Just some questions about ZFS in general.
1. Is there any way to provide interoperability between a Windows
2008 server and a ZFS pool [on a linux server?] to store the
files? [email, file hosting, active directory, anything a windows
server would provide]
2. I've read a lot of posts that say ZFS is not horizontally
scalable, is this still true? It seems like it is not true, since
you can still add disks to a ZFS pool [instead of just replacing a
disk with a larger volume disk] and it will function properly...
3. Is there any way to prove that my ZFS pool is actually running
ZFS? Like, without a doubt prove?
4. It seems like ZFS auto-stripes, at least, according to
zfsbuild.com <http://zfsbuild.com>. Is this true? If i take 8 10gb
disks, and set up 2 raidz's with 4 disks each, is that really an 8
disk raidz? is it smart enough to put data on drive 1 then 2 then
3 then 4 then 5 then 6 then 7 then 8, and not 1 2 3 4 1 2 3 4....
until full, then 5 6 7 8?
5. In question 4, i have 2 raidz's with 4 disks each. Is that a
Raid 50? How do I setup a Raid 51? What is ZFS smart enough to do
on its own? I just need examples like "4 groups of disks, each
group has 2 disks setup as mirror" or "2 groups of disks, each
group has 4 disks as a raidz" not full page explanations.
I had more questions, but I thought about them in the shower
before work today, and of course, forgot them by now. Maybe
tomorrow's shower will bring them back. I dreamt about ZFS last night.