Discussion:
Freeing unused space in thin provisioned zvols
Lutz Schumann
2010-02-26 19:42:00 UTC
Hello list,

ZFS can be used for both file-level (zfs) and block-level access (zvol). When using zvols, those are always thin provisioned (space is allocated on first write). We use zvols with Comstar to do iSCSI and FC access - and excuse me in advance - but this may also be more of a Comstar-related question.

When reading from a freshly created zvol, no data comes from disk. All reads are satisfied by ZFS, and Comstar returns 0's (I guess) for all reads.

Now if a virtual machine writes to the zvol, blocks are allocated on disk. Reads are then served partly from disk (for all written blocks) and partly from the ZFS layer (for all unwritten blocks).

If the virtual machine (which may be VMware / Xen / Hyper-V) deletes blocks / frees space within the zvol, this also means a write - usually in the metadata area only. Thus the underlying storage system does not know which blocks in a zvol are really used.

So reducing the size of zvols is really difficult / not possible. Even if one deletes everything in the guest, the blocks stay allocated. If one zeroes out all blocks, even more space is allocated.

For this purpose TRIM (ATA) / UNMAP (SCSI) have been introduced. With these commands the guest can tell the storage which blocks are not used anymore. Those commands are not available in Comstar today :(

However, I had the idea that Comstar could get the same result the way VMware did it some time ago with "VMware Tools".

Idea:
- If the guest writes a block containing only 0's, the block is freed again
- If someone reads this block again, they will get the same 0's they would have gotten if the 0's had actually been written
- The checksum of an all-zero block can be hard-coded for SHA-256 / Fletcher, so the check "is this an all-zero block?" is easy.

With this in place, a host wishing to free thin-provisioned zvol space can easily fill the unused blocks with 0's using simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.

Does anyone know why this is not incorporated into ZFS ?
--
This message posted from opensolaris.org
Tomas Ögren
2010-02-26 19:48:20 UTC
Post by Lutz Schumann
Hello list,
ZFS can be used for both file-level (zfs) and block-level access (zvol). When using zvols, those are always thin provisioned (space is allocated on first write). We use zvols with Comstar to do iSCSI and FC access - and excuse me in advance - but this may also be more of a Comstar-related question.
When reading from a freshly created zvol, no data comes from disk. All reads are satisfied by ZFS, and Comstar returns 0's (I guess) for all reads.
Now if a virtual machine writes to the zvol, blocks are allocated on disk. Reads are then served partly from disk (for all written blocks) and partly from the ZFS layer (for all unwritten blocks).
If the virtual machine (which may be VMware / Xen / Hyper-V) deletes blocks / frees space within the zvol, this also means a write - usually in the metadata area only. Thus the underlying storage system does not know which blocks in a zvol are really used.
So reducing the size of zvols is really difficult / not possible. Even if one deletes everything in the guest, the blocks stay allocated. If one zeroes out all blocks, even more space is allocated.
For this purpose TRIM (ATA) / UNMAP (SCSI) have been introduced. With these commands the guest can tell the storage which blocks are not used anymore. Those commands are not available in Comstar today :(
However, I had the idea that Comstar could get the same result the way VMware did it some time ago with "VMware Tools".
- If the guest writes a block containing only 0's, the block is freed again
- If someone reads this block again, they will get the same 0's they would have gotten if the 0's had actually been written
- The checksum of an all-zero block can be hard-coded for SHA-256 / Fletcher, so the check "is this an all-zero block?" is easy.
With this in place, a host wishing to free thin-provisioned zvol space can easily fill the unused blocks with 0's using simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.
Does anyone know why this is not incorporated into ZFS ?
What you can do in the meantime is enable compression (like lzjb) on the
zvol, then do your dd dance in the client, and then disable the
compression again.
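
For the archive, a minimal sketch of that workflow - dataset and file names here are made up, adjust to your setup:

  # On the ZFS/Comstar server: turn on cheap compression for the zvol
  zfs set compression=lzjb tank/vols/vm01

  # Inside the guest / on the initiator: zero-fill the free space, then delete the file
  dd if=/dev/zero of=/MYFILE bs=1M
  rm /MYFILE
  sync

  # Back on the server: switch compression off again if you don't want it for normal writes
  zfs set compression=off tank/vols/vm01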

/Tomas
--
Tomas Ögren, ***@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Lutz Schumann
2010-02-26 19:55:43 UTC
That would be an option, and I have thought about it. However, I see the following problems:

1) using deduplication

This will reduce the on-disk size, but the DDT will grow forever, and deleting zvols will then take a lot of time and work (see other threads on the list regarding DDT memory issues)

2) compression

As I understand it, if I do zfs send/receive (which we do for DR), the data grows back to its original size on the wire. This makes it difficult.

Regards,
Robert
--
This message posted from opensolaris.org
Richard Elling
2010-02-26 22:22:20 UTC
Post by Lutz Schumann
1) using deduplication
This will reduce the on-disk size, but the DDT will grow forever, and deleting zvols will then take a lot of time and work (see other threads on the list regarding DDT memory issues)
Use compression and deduplication. If you are mostly concerned about the zero
fills, then the zle compressor is very fast and efficient.
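Roughly, with a made-up zvol name (zle only collapses runs of zeros, so it costs almost nothing):

  zfs set compression=zle tank/vols/vm01
  zfs set dedup=on tank/vols/vm01    # optional - mind the DDT memory cost discussed above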
Post by Lutz Schumann
2) compression
As I understand it, if I do zfs send/receive (which we do for DR), the data grows back to its original size on the wire. This makes it difficult.
uhmmm... compress it, too :-)
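For example, a sketch of a compressed replication stream - snapshot, host and dataset names are made up:

  zfs snapshot tank/vols/vm01@dr-2010-02-26
  zfs send tank/vols/vm01@dr-2010-02-26 | gzip -1 | \
      ssh backuphost "gunzip | zfs receive backup/vols/vm01"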
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 16-18, 2010)
Lutz Schumann
2010-05-08 12:04:09 UTC
I have to come back to this issue after a while because it just hit me.

I have a VMware vSphere 4 test host. I have various machines on it for performance tests and other stuff, so a lot of I/O benchmarks are run and a lot of data is created during these benchmarks.

The vSphere test machine is connected via FC to the storage box (Nexenta 3.0). Currently the data visible within the VMs is ~10 GB (mostly OS only). On the ESX host it is a lot more (no thin provisioning). On the Nexenta box even more is used (~500 GB), because that space has been written to at some point.

The test machine does not have a lot of memory because I don't need it, so enabling deduplication is not an option. Benchmarks have also shown that performance really suffers with it.

So compression is an option, yes, and I get GREAT compression ratios (> 10x).

However, on the wire the full 500 GB are transferred. There is currently NO way to shrink the volumes. I could compress the stream with gzip, but this would be VERY inefficient:


(SOURCE) disk read (compressed) -> decompress into memory -> compress for the wire -> (TARGET) decompress on the other side -> compress for disk -> disk write (compressed)

This means a lot of CPU is used. Not nice.

Now, if there were a "write all zeros, free the block again" feature in ZFS (which is what I suggest in this thread), I could do the following:

Fill the disks within the VMs with zeros (dd if=/dev/zero of=/MYFILE bs=1M ...). This effectively writes a lot of all-zero blocks through Comstar to ZFS. ZFS could then free those blocks, and I could then send the zvol to backup. This would mean 10 GB used in the VM -> a zvol size of ~10 GB -> 10 GB transferred.

Doesn't this sound better ?

Of course we could also wait for TRIM/UNMAP support, tune the whole stack and wait until all components in such a setup support TRIM ... but I believe this will take years. So the approach mentioned above sounds like a pragmatic solution.

Where can proposals like this be placed ? bugs.opensolaris.com ?
--
This message posted from opensolaris.org
Brandon High
2010-05-08 14:23:07 UTC
On Sat, May 8, 2010 at 5:04 AM, Lutz Schumann
Post by Lutz Schumann
Fill the disks within the VMs with zeros (dd if=/dev/zero of=/MYFILE bs=1M ...). This effectively writes a lot of all-zero blocks through Comstar to ZFS. ZFS could then free those blocks, and I could then send the zvol to backup. This would mean 10 GB used in the VM -> a zvol size of ~10 GB -> 10 GB transferred.
If you set compression=zle, you'll only be compressing strings of 0s.
It's very low overhead for ZFS to maintain.

Doing a dd like you suggest works well to reclaim most freed space
from the zvol. You can unlink the file that you're writing to before
the dd is finished so that the VM doesn't see the disk as full for
more than a split second.
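
Something like this inside the guest, assuming a Linux VM and a made-up mount point:

  # Start the zero-fill in the background and unlink the file right away;
  # the open file descriptor keeps dd writing until the disk is full,
  # so the filesystem only appears full for an instant.
  dd if=/dev/zero of=/mnt/data/zerofill bs=1M &
  sleep 1
  rm /mnt/data/zerofill
  wait    # dd exits once it hits "no space left on device"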

Windows VMs can use SDelete
(http://technet.microsoft.com/en-us/sysinternals/bb897443.aspx) to
zero-fill their disks to free space on the server.

-B
--
Brandon High : ***@freaks.com
Bill Sommerfeld
2010-02-26 19:53:14 UTC
Post by Lutz Schumann
- If the guest writes a block containing only 0's, the block is freed again
- If someone reads this block again, they will get the same 0's they would have gotten if the 0's had actually been written
- The checksum of an all-zero block can be hard-coded for SHA-256 / Fletcher, so the check "is this an all-zero block?" is easy.
With this in place, a host wishing to free thin-provisioned zvol space can easily fill the unused blocks with 0's using simple tools (e.g. dd if=/dev/zero of=/MYFILE bs=1M; rm /MYFILE) and the space is freed again on the zvol side.
You've just described how ZFS behaves when compression is enabled -- a
block of zeros is compressed to a hole represented by an all-zeros block
pointer.
Post by Lutz Schumann
Does anyone know why this is not incorporated into ZFS ?
It's in there. Turn on compression to use it.
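
An easy way to see this for yourself - a sketch with a hypothetical test dataset (default mountpoint assumed; a zvol treats all-zero blocks the same way):

  zfs create -o compression=lzjb tank/ztest
  dd if=/dev/zero of=/tank/ztest/zeros bs=1M count=1024
  sync
  zfs list tank/ztest    # USED stays tiny - the zero blocks became holes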


- Bill
Marc Nicholas
2010-02-26 20:06:37 UTC
On Fri, Feb 26, 2010 at 2:42 PM, Lutz Schumann
Post by Lutz Schumann
Now if a virtual machine writes to the zvol, blocks are allocated on disk.
Reads are then served partly from disk (for all written blocks) and partly
from the ZFS layer (for all unwritten blocks).
If the virtual machine (which may be VMware / Xen / Hyper-V) deletes blocks
/ frees space within the zvol, this also means a write - usually in the
metadata area only. Thus the underlying storage system does not know which
blocks in a zvol are really used.
You're using VMs and *not* using dedupe?! VMs are almost the perfect
use-case for dedupe :)

-marc
Datnus
2013-02-10 09:57:21 UTC
I run dd if=/dev/zero of=testfile bs=1024k count=50000 inside the iSCSI VMFS
from ESXi and then rm testfile.

However, the zpool list usage doesn't decrease at all. In fact, the used
storage increases when I do the dd.

FreeNas 8.0.4 and ESXi 5.0
Help.
Thanks.
Koopmann, Jan-Peter
2013-02-10 12:01:44 UTC
Why should it?

Unless you shrink the vmdk and use a ZFS variant with SCSI UNMAP support (I believe currently only Nexenta, but correct me if I am wrong), the blocks will not be freed, will they?

Kind regards
JP


Sent from a mobile device.
Post by Datnus
I run dd if=/dev/zero of=testfile bs=1024k count=50000 inside the iSCSI VMFS
from ESXi and then rm testfile.
However, the zpool list usage doesn't decrease at all. In fact, the used
storage increases when I do the dd.
FreeNas 8.0.4 and ESXi 5.0
Help.
Thanks.
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Darren J Moffat
2013-02-12 10:25:51 UTC
Post by Koopmann, Jan-Peter
Why should it?
Unless you shrink the vmdk and use a ZFS variant with SCSI UNMAP support (I believe currently only Nexenta, but correct me if I am wrong), the blocks will not be freed, will they?
Solaris 11.1 has ZFS with SCSI UNMAP support.
--
Darren J Moffat
Stefan Ring
2013-02-12 11:30:53 UTC
Post by Darren J Moffat
Post by Koopmann, Jan-Peter
Unless you shrink the vmdk and use a ZFS variant with SCSI UNMAP
support (I believe currently only Nexenta, but correct me if I am wrong),
the blocks will not be freed, will they?
Solaris 11.1 has ZFS with SCSI UNMAP support.
Freeing unused blocks works perfectly well with fstrim (Linux)
consuming an iSCSI zvol served up by oi151a6.
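
For reference, the initiator-side steps look roughly like this (device and mount-point names are made up):

  mkfs.ext4 /dev/sdX                     # the iSCSI LUN backed by the zvol
  mount -o discard /dev/sdX /mnt/zvol    # online discard on delete, or...
  fstrim -v /mnt/zvol                    # ...batch-trim the free space on demand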
Thomas Nau
2013-02-12 15:07:26 UTC
Darren
Post by Darren J Moffat
Post by Koopmann, Jan-Peter
Why should it?
Unless you shrink the vmdk and use a ZFS variant with SCSI
UNMAP support (I believe currently only Nexenta, but correct me if I am
wrong), the blocks will not be freed, will they?
Solaris 11.1 has ZFS with SCSI UNMAP support.
Seems I have skipped that one... Are there any related tools, e.g. to
release all "zero" blocks or the like? Of course it is then up to the admin
to know what all this is about - or else they may wreck the data.

Thomas
Darren J Moffat
2013-02-12 15:36:19 UTC
Post by Thomas Nau
Darren
Post by Darren J Moffat
Post by Koopmann, Jan-Peter
Why should it?
Unless you shrink the vmdk and use a ZFS variant with SCSI
UNMAP support (I believe currently only Nexenta, but correct me if I am
wrong), the blocks will not be freed, will they?
Solaris 11.1 has ZFS with SCSI UNMAP support.
Seems I have skipped that one... Are there any related tools, e.g. to
release all "zero" blocks or the like? Of course it is then up to the admin
to know what all this is about - or else they may wreck the data.
No tools needed - ZFS does it automatically when freeing blocks, provided
the underlying device advertises the functionality.

ZFS ZVOLs shared over COMSTAR advertise SCSI UNMAP as well.
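
From a Linux initiator you can check whether the LUN really advertises it - a sketch with a made-up device name (lsblk is from util-linux, sg_vpd from sg3_utils):

  lsblk -D /dev/sdX               # non-zero DISC-GRAN / DISC-MAX means discard is supported
  sg_vpd --page=lbpv /dev/sdX     # Logical Block Provisioning VPD page, shows the unmap bits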
--
Darren J Moffat
C***@oracle.com
2013-02-12 15:45:00 UTC
Post by Darren J Moffat
No tools needed - ZFS does it automatically when freeing blocks, provided
the underlying device advertises the functionality.
ZFS ZVOLs shared over COMSTAR advertise SCSI UNMAP as well.
If a system was running something older, e.g. Solaris 11, the "free"
blocks will not be marked as such on the server, even after the system
upgrades to Solaris 11.1.

There might be a way to force that by disabling compression, then creating
a large file full of NULs, and then removing it. But you need to check
first that this actually has an effect before you even try.

Casper
Sašo Kiselkov
2013-02-12 16:11:00 UTC
Post by Koopmann, Jan-Peter
Why should it?
I believe currently only Nexenta but correct me if I am wrong
The code was mainlined a while ago, see:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/io/comstar/lu/stmf_sbd/sbd.c#L3702-L3730
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/zvol.c#L1697-L1754

Thanks should go to the guys at Nexenta for contributing this to the
open-source effort.

Cheers,
--
Saso

Jim Klimov
2013-02-10 13:13:20 UTC
Post by Datnus
I run dd if=/dev/zero of=testfile bs=1024k count=50000 inside the iSCSI VMFS
from ESXi and then rm testfile.
However, the zpool list usage doesn't decrease at all. In fact, the used
storage increases when I do the dd.
FreeNas 8.0.4 and ESXi 5.0
Help.
Thanks.
Did you also enable compression (any non-"off" kind) for the ZVOL
which houses your iSCSI volume?

The zero-writing procedure does logically allocate the blocks requested
in the sparse volume. If this volume is stored on ZFS with compression
(active at the moment when you write these blocks), then ZFS detects an
all-zeroes block and uses no space to store it, only adding a block
pointer entry to reference its emptiness. This way you get some growth
in metadata, but none in userdata for the volume.
If by doing this trick you "overwrite" the non-empty but logically
"deleted" blocks of the VM's filesystem housed inside the iSCSI ZVOL,
then the backend storage should shrink by releasing those non-empty
blocks. However, if you use snapshots, those released blocks stay
referenced by the snapshots of the ZVOL; so in order to get usable free
space on your pool, you'd have to destroy all the older snapshots
(those between the creation and deletion times of the no-longer-useful
blocks).

If you have reservations about compression for VMs (performance-wise
or otherwise), take a look at the "zle" compression mode, which only
compresses consecutive runs of zeroes.

Also, I'd reiterate: a compression mode only takes effect for blocks
written after the mode was set. For example, if you prefer to store
your datasets generally uncompressed for any reason, you can enable a
compression mode, zero-fill the VM disk's free space as you did, and
then re-disable compression on the volume for any further writes.
Also note that if you "zfs send" or otherwise copy the data off the
dataset into another (backup) one, only the compression method last
set on the target dataset is applied to the new writes into it,
regardless of the absence or presence (and type) of compression on
the original dataset.
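
A sketch of the snapshot-cleanup part mentioned above (zvol and snapshot names are made up):

  zfs list -t snapshot -r tank/vols/vm01       # see which snapshots still hold the old blocks
  zfs destroy tank/vols/vm01@before-zerofill   # destroy the ones you can spare to get the space back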

HTH,
//Jim Klimov
Koopmann, Jan-Peter
2013-02-10 13:55:46 UTC
I forgot about compression. Makes sense. As long as the zeroes find their way to the backend storage this should work. Thanks!



Kind regards
JP