Discussion:
[CFT] Improved ZFS metaslab code (faster write speed)
Martin Matuska
2010-08-22 15:15:01 UTC
Dear FreeBSD community,

many of our [2] (and Solaris [3]) users today are complaining about slow
ZFS writes. One of the causes of these slow writes is the selection of
the proper allocation method for new blocks [3] [4]. Another issue is a
write slowdown during TXG sync times.

Solaris 10 (and OpenSolaris up to November 2009) use the following
scheme:

- pool has more than 30% free space: use first fit method [1]
- pool has less than 30% free space: use best fit method [1]

This causes a major slowdown of writes if we go below 30% free space.
On large pools, 30% may be terabytes of free space.
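
To illustrate the difference, here is a minimal userland sketch of the
two strategies over a sorted list of free extents (only an illustration;
the real allocator works on per-metaslab space maps [5]):

#include <stddef.h>
#include <stdint.h>

typedef struct extent {
        uint64_t start;                 /* offset of the free run */
        uint64_t size;                  /* length of the free run */
        struct extent *next;
} extent_t;

/* First fit: take the first free run that is large enough -- cheap. */
static extent_t *
first_fit(extent_t *freelist, uint64_t want)
{
        extent_t *e;

        for (e = freelist; e != NULL; e = e->next)
                if (e->size >= want)
                        return (e);
        return (NULL);
}

/*
 * Best fit: find the smallest free run that still fits.  Tighter
 * packing, but the search gets expensive exactly when the pool is
 * nearly full and the free runs are many and small.
 */
static extent_t *
best_fit(extent_t *freelist, uint64_t want)
{
        extent_t *e, *best = NULL;

        for (e = freelist; e != NULL; e = e->next)
                if (e->size >= want && (best == NULL || e->size < best->size))
                        best = e;
        return (best);
}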

OpenSolaris has changed this in November 2009 and the Oracle Storage
Appliances also included the new code in Q1/2010 [1].

The source [1] states that with this change they achieved a speedup
of: "50% Improved OLTP Performance, 70% Reduced Variability, 200%
Improvement on MS Exchange"

I would like to issue a Call For Testing for the following 9-CURRENT patch:
http://people.freebsd.org/~mm/patches/zfs/zfs_metaslab.patch

To apply the patch against 8-STABLE, you need to apply the v15 update first:
http://people.freebsd.org/~mm/patches/zfs/v15/stable-8-v15.patch

The patch includes the following OpenSolaris onnv revisions:
10921 (partial), 11146, 11728, 12047

And covers the following Bug IDs:
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently
6917066 zfs block picking can be improved
6918420 zdb -m has issues printing metaslab statistics

References:
[1] http://blogs.sun.com/roch/entry/doubling_exchange_performance
[2] http://forums.freebsd.org/showthread.php?t=8270
[3]
http://blogs.everycity.co.uk/alasdair/2010/07/zfs-runs-really-slowly-when-free-disk-usage-goes-above-80/
[4] http://blogs.sun.com/bonwick/entry/zfs_block_allocation
[5] http://blogs.sun.com/bonwick/entry/space_maps
Olivier Smedts
2010-08-22 15:44:52 UTC
Post by Martin Matuska
Dear FreeBSD community,
many of our [2] (and Solaris [3]) users today are complaining about slow
ZFS writes. One of the causes of these slow writes is the selection of
the proper allocation method for new blocks [3] [4]. Another issue is a
write slowdown during TXG sync times.
Solaris 10 (and OpenSolaris up to November 2009) use the following scheme:
- pool has more than 30% free space: use first fit method [1]
- pool has less than 30% free space: use best fit method [1]
This causes a major slowdown of writes if we go below 30% free space.
On large pools, 30% may be terabytes of free space.
OpenSolaris has changed this in November 2009 and the Oracle Storage
Appliances also included the new code in Q1/2010 [1].
The source [1] states that with this change they achieved a speedup
of: "50% Improved OLTP Performance, 70% Reduced Variability, 200%
Improvement on MS Exchange"
http://people.freebsd.org/~mm/patches/zfs/zfs_metaslab.patch
http://people.freebsd.org/~mm/patches/zfs/v15/stable-8-v15.patch
This one stopped applying cleanly a few minutes ago:
# svn log -l 1 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
------------------------------------------------------------------------
r211599 | avg | 2010-08-22 10:18:32 +0200 (Sun 22 Aug 2010) | 7 lines

Fix a mismerge in r211581, MFC of r210427

This is a direct commit.

Reported by: many
Pointyhat to: avg

------------------------------------------------------------------------

But it does not seem hard to correct. Do you want me to submit an
updated patch for 8-stable?
Post by Martin Matuska
10921 (partial), 11146, 11728, 12047
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently
6917066 zfs block picking can be improved
6918420 zdb -m has issues printing metaslab statistics
[1] http://blogs.sun.com/roch/entry/doubling_exchange_performance
[2] http://forums.freebsd.org/showthread.php?t=8270
[3]
http://blogs.everycity.co.uk/alasdair/2010/07/zfs-runs-really-slowly-when-free-disk-usage-goes-above-80/
[4] http://blogs.sun.com/bonwick/entry/zfs_block_allocation
[5] http://blogs.sun.com/bonwick/entry/space_maps
--
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: ***@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "Il y a seulement 10 sortes de gens dans le monde :
  ceux qui comprennent le binaire,
  et ceux qui ne le comprennent pas."
Martin Matuska
2010-08-22 16:26:40 UTC
Thank you, I have updated the v15 patch for 8-STABLE.
Post by Olivier Smedts
Post by Martin Matuska
Dear FreeBSD community,
many of our [2] (and Solaris [3]) users today are complaining about slow
ZFS writes. One of the causes of these slow writes is the selection of
the proper allocation method for new blocks [3] [4]. Another issue is a
write slowdown during TXG sync times.
Solaris 10 (and OpenSolaris up to November 2009) use the following scheme:
- pool has more than 30% free space: use first fit method [1]
- pool has less than 30% free space: use best fit method [1]
This causes a major slowdown of writes if we go below 30% free space.
On large pools, 30% may be terabytes of free space.
OpenSolaris has changed this in November 2009 and the Oracle Storage
Appliances also included the new code in Q1/2010 [1].
The source [1] states that with this change they achieved a speedup
of: "50% Improved OLTP Performance, 70% Reduced Variability, 200%
Improvement on MS Exchange"
http://people.freebsd.org/~mm/patches/zfs/zfs_metaslab.patch
http://people.freebsd.org/~mm/patches/zfs/v15/stable-8-v15.patch
# svn log -l 1 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
------------------------------------------------------------------------
r211599 | avg | 2010-08-22 10:18:32 +0200 (Sun 22 Aug 2010) | 7 lines
Fix a mismerge in r211581, MFC of r210427
This is a direct commit.
Reported by: many
Pointyhat to: avg
------------------------------------------------------------------------
But it does not seem hard to correct. Do you want me to submit an
updated patch for 8-stable?
Post by Martin Matuska
10921 (partial), 11146, 11728, 12047
6826241 Sync write IOPS drops dramatically during TXG sync
6869229 zfs should switch to shiny new metaslabs more frequently
6917066 zfs block picking can be improved
6918420 zdb -m has issues printing metaslab statistics
[1] http://blogs.sun.com/roch/entry/doubling_exchange_performance
[2] http://forums.freebsd.org/showthread.php?t=8270
[3]
http://blogs.everycity.co.uk/alasdair/2010/07/zfs-runs-really-slowly-when-free-disk-usage-goes-above-80/
[4] http://blogs.sun.com/bonwick/entry/zfs_block_allocation
[5] http://blogs.sun.com/bonwick/entry/space_maps
Scott Ullrich
2010-08-27 20:05:00 UTC
 Thank you, I have updated the v15 patch for 8-STABLE.
I have been running your patch for a couple days now and no issues.

Nice work!

Scott
Norikatsu Shigemura
2010-08-27 23:19:17 UTC
Hi mm.

On Fri, 27 Aug 2010 16:05:00 -0400
Post by Scott Ullrich
 Thank you, I have updated the v15 patch for 8-STABLE.
I have been running your patch for a couple days now and no issues.
Nice work!
Yes, me too. I'll try a zpool/zfs upgrade!
I'm waiting for your updated v15, metaslab and abe_stat_rrwlock patches :-).
--
Norikatsu Shigemura <***@FreeBSD.org>
Artem Belevich
2010-08-27 23:50:34 UTC
Another "me too" here.

8-stable/amd64 + v15 (zpool still uses v14) + metaslab +
abe_stat_rrwlock + A.Gapon's vm_paging_needed() + uma defrag patches.

The box survived a few days of pounding without any signs of trouble.

--Artem
Post by Norikatsu Shigemura
Hi mm.
On Fri, 27 Aug 2010 16:05:00 -0400
Post by Scott Ullrich
 Thank you, I have updated the v15 patch for 8-STABLE.
I have been running your patch for a couple days now and no issues.
Nice work!
       Yes, me too.  I'll try a zpool/zfs upgrade!
       I'm waiting for your updated v15, metaslab and abe_stat_rrwlock patches :-).
--
jhell
2010-08-28 01:24:44 UTC
Post by Artem Belevich
Another "me too" here.
8-stable/amd64 + v15 (zpool still uses v14) + metaslab +
abe_stat_rrwlock + A.Gapon's vm_paging_needed() + uma defrag patches.
The box survived a few days of pounding without any signs of trouble.
I must have missed the uma defrag patches, but according to the code
those patches should not have any effect on your implementation of ZFS
on your system, because vfs.zfs.zio.use_uma defaults to off unless you
have manually turned it on or the patch reverts that facility back to
its original form.


Running on a full ZFSv15 system with the metaslab & rrwlock patches and
a slightly modified patch from avg@ for vm_paging_needed(), I was able
to achieve the read and write op results I was looking for.

The modified portion of avg@'s patch is:

#ifdef _KERNEL
        /*
         * Clear needfree and wake anything sleeping on it (see
         * arc_lowmem) whenever the ARC thinks reclaim is needed,
         * not only when needfree itself is set.
         */
        if (arc_reclaim_needed()) {
                needfree = 0;
                wakeup(&needfree);
        }
#endif

I still moved that down to below _KERNEL for the obvious reasons. But
when I was using the original patch with if (needfree), I noticed a
performance degradation after ~12 hours of use, both with and without
UMA turned on. So far, with ~48 hours of testing (the latter half of it
with the above change), I have not seen more degradation of performance
after that ~12 hour mark.

After another 12 hours of testing with UMA turned off, I'll be turning
UMA back on and testing for another 24 hours. Before that third patch
from avg@ came along I had turned UMA on and had no performance loss for
~7 hours. Obviously I had to reboot after applying avg@'s patch and
decided to test strictly without UMA at that point.

There seems to be a problem in the logic behind the use of needfree
and/or the arc_reclaim_needed() area that should be worked out, but at
least this i386 8.1-STABLE system, where my code is at right now, "Is
STABLE!".


=======================================================================
For reference I have also adjusted these: (arc.c)

- /* Start out with 1/8 of all memory */
- arc_c = kmem_size() / 8;
+ /* Start out with 1/4 of all memory */
+ arc_c = kmem_size() / 4;

And these: (arc.c)

- arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
+ arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 4);

There currently seems to be no way to adjust these values relative to
the amount of memory in the system; a blind default of 1/8 is used. On a
system with 2GB that is ~256MB, but since arc_c is derived from
kmem_size as shown above, if you set KVA_PAGES to 512 as suggested you
end up with an arc_c of only 64MB. So unless you adjust your kmem_size
accordingly on some systems to make up for the 1/8th problem, your ZFS
install is going to suffer. This is more of a problem for systems below
the 2GB memory range. Systems with quite a lot of memory, 8GB for
example, are really only using 1GB, and short of adjusting the source it
is fairly hard to make them use more RAM without affecting something
else in the system by bumping vm.kmem_size*.
=======================================================================
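
To make the arithmetic above concrete, a toy userland calculation (the
512MB kmem figure is my assumption for an i386 setup with KVA_PAGES=512,
as described above):

#include <stdio.h>

int
main(void)
{
        unsigned long long mb = 1ULL << 20;
        unsigned long long kmem_2g = 2048 * mb;    /* ~2GB of kmem */
        unsigned long long kmem_small = 512 * mb;  /* assumed kmem w/ KVA_PAGES=512 */

        /* stock default: arc_c = kmem_size() / 8 */
        printf("arc_c (2GB kmem, /8):   %lluMB\n", kmem_2g / 8 / mb);     /* 256 */
        printf("arc_c (512MB kmem, /8): %lluMB\n", kmem_small / 8 / mb);  /*  64 */
        /* with the 1/4 adjustment above */
        printf("arc_c (512MB kmem, /4): %lluMB\n", kmem_small / 4 / mb);  /* 128 */
        return (0);
}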

1GB RAM on ZFSv15 with the patches mentioned (loader.conf); adjust
according to your own system's environment.
kern.maxdsiz="640M"
kern.maxusers="512" # Overcome the max calculated 384 for >1G of MEM.
# See: /sys/kern/subr_param.c for details. ???
vfs.zfs.arc_min="62M"
vfs.zfs.arc_max="496M"
vfs.zfs.prefetch_disable=0
vm.kmem_size="512M"
vm.kmem_size_max="768M"
vm.kmem_size_min="128M"
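
(After boot, the values that actually took effect can be double-checked
with sysctl, e.g. "sysctl vfs.zfs.arc_max vm.kmem_size"; sysctl names as
seen on 8-STABLE, so verify them on your branch.)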


Regards,
--
jhell,v
Artem Belevich
2010-08-28 03:34:18 UTC
Post by Artem Belevich
Another "me too" here.
8-stable/amd64 + v15 (zpool still uses v14) + metaslab +
abe_stat_rrwlock + A.Gapon's vm_paging_needed() + uma defrag patches.
The box survived a few days of pounding without any signs of trouble.
       I must have missed the uma defrag patches, but according to the code
Here is the UMA patch I was talking about:
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/hackers/2010-08/msg00188.html
those patches should not have any effect on your implementation of ZFS
on your system, because vfs.zfs.zio.use_uma defaults to off unless you
have manually turned it on or the patch reverts that facility back to
its original form.
Hmm. Indeed, kmem_malloc() carves memory allocations directly from kmem.
Yet the difference in max ARC size with the patch applied is there.

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/hackers/2010-08/msg00257.html

Perhaps reduced UMA fragmentation helps those subsystems that do use
UMA (including ZFS, which always uses UMA for various housekeeping
data).

--Artem
Alexander Leidinger
2010-08-30 03:25:29 UTC
Post by Artem Belevich
Perhaps reduced UMA fragmentation helps those subsystems that do use
UMA (including ZFS, which always uses UMA for various housekeeping
data).
PJD told me once that ZFS is always using UMA, it is just not using it
for everything (except when the sysctl is switched to use it for
everything).

FYI: I have a 9-current system which panics (without a backtrace/dump)
after 1-2 days of uptime when the ZFS UMA sysctl is activated. When it
is not activated it survives several weeks (let's say about a month).
So any work on the UMA fragmentation issue is time well spent.

No, I haven't tested any of the patches on this machine.

Bye,
Alexander.

Andriy Gapon
2010-08-28 08:13:55 UTC
Post by jhell
I must have missed the uma defrag patches, but according to the code
those patches should not have any effect on your implementation of ZFS
on your system, because vfs.zfs.zio.use_uma defaults to off unless you
have manually turned it on or the patch reverts that facility back to
its original form.
ZFS uses UMA even for things other than the ARC, and those are not
controlled by vfs.zfs.zio.use_uma. Those zones also happen to be the
most fragmented ones for me. To name a few: dnode_t, dmu_buf_impl_t,
arc_buf_hdr_t.
So the patch should help there.
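
(For reference, per-zone usage can be inspected with vmstat -z, e.g.
"vmstat -z | egrep 'dnode_t|dmu_buf_impl_t|arc_buf_hdr_t'", assuming the
zones show up under those names on your branch.)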
--
Andriy Gapon
Andriy Gapon
2010-08-28 08:20:19 UTC
Post by jhell
#ifdef _KERNEL
if (arc_reclaim_needed()) {
needfree = 0;
wakeup(&needfree);
}
#endif
I still moved that down to below _KERNEL for the obvious reasons. But
when I was using the original patch with if (needfree) I noticed a
performance degradation after ~12 hours of use with and without UMA
turned on. So far with ~48 hours of testing with the top half of that
being with the above change, I have not seen more degradation of
This is quite unexpected.
needfree should be checked as the very first thing in arc_reclaim_needed()
[unless you have patched it locally]. So if needfree is 1 then
arc_reclaim_needed() should also return 1. But the converse is not true,
arc_reclaim_needed() may return 1 even if needfree is zero.
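
As a self-contained model of that ordering (an illustrative sketch, not
the actual arc.c code):

#include <stdio.h>

static int needfree;            /* set by the pagedaemon low-memory hook */
static int other_pressure;      /* stands in for the remaining heuristics */

static int
arc_reclaim_needed(void)
{
        if (needfree)           /* checked first: needfree != 0 implies 1 */
                return (1);
        if (other_pressure)     /* may return 1 even with needfree == 0 */
                return (1);
        return (0);
}

int
main(void)
{
        other_pressure = 1;
        /* prints "reclaim=1 needfree=0": the converse does not hold */
        printf("reclaim=%d needfree=%d\n", arc_reclaim_needed(), needfree);
        return (0);
}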

So if your testing results are conclusive, then it must mean that some extra
wakeups on needfree are needed. I.e. needfree is zero, so there shouldn't be
anything waiting on it (see arc_lowmem) and no notification should be needed,
but issuing one somehow does make a difference.
Hmm...
--
Andriy Gapon
jhell
2010-08-28 09:03:42 UTC
Post by Andriy Gapon
Post by jhell
#ifdef _KERNEL
if (arc_reclaim_needed()) {
needfree = 0;
wakeup(&needfree);
}
#endif
I still moved that down to below _KERNEL for the obvious reasons. But
when I was using the original patch with if (needfree) I noticed a
performance degradation after ~12 hours of use with and without UMA
turned on. So far with ~48 hours of testing with the top half of that
being with the above change, I have not seen more degradation of
This is quite unexpected.
needfree should be checked as the very first thing in arc_reclaim_needed()
[unless you have patched it locally]. So if needfree is 1 then
arc_reclaim_needed() should also return 1. But the converse is not true,
arc_reclaim_needed() may return 1 even if needfree is zero.
So if your testing results are conclusive then it must mean that some extra
wakeups on needfree are needed. I.e. needfree is zero, so there shouldn't be
anything waiting on it (see arc_lowmem) and no notification should be needed,
but issuing one somehow does make a difference.
Hmm...
I will look further into this and see if I can throw a counter around it
or some printf's so I can at least log what it's doing in both instances.

I thought the very same thing you said above when I saw your patch for
that, and was astounded at the results it returned. So in short testing
I quickly reverted it to see if that was the cause of the problem, and
sure enough everything went back to the way it was before.

Anyway thanks for the reply. I will get back to you if I see anything
cool arise from this.


Regards,
--
jhell,v
Pawel Jakub Dawidek
2010-08-28 09:26:02 UTC
Post by jhell
Post by Andriy Gapon
Post by jhell
#ifdef _KERNEL
if (arc_reclaim_needed()) {
needfree = 0;
wakeup(&needfree);
}
#endif
I still moved that down to below _KERNEL for the obvious reasons. But
when I was using the original patch with if (needfree) I noticed a
performance degradation after ~12 hours of use with and without UMA
turned on. So far with ~48 hours of testing with the top half of that
being with the above change, I have not seen more degradation of
This is quite unexpected.
needfree should be checked as the very first thing in arc_reclaim_needed()
[unless you have patched it locally]. So if needfree is 1 then
arc_reclaim_needed() should also return 1. But the converse is not true,
arc_reclaim_needed() may return 1 even if needfree is zero.
So if your testing results are conclusive then it must mean that some extra
wakeups on needfree are needed. I.e. needfree is zero, so there shouldn't be
anything waiting on it (see arc_lowmem) and no notification should be needed,
but issuing one somehow does make a difference.
Hmm...
I will look further into this and see if I can throw a counter around it
or some printf's so I can at least log what it's doing in both instances.
I thought the very same thing you said above when I saw your patch for
that and was astounded at the results that were returned from it. So in
short testing I reverted it back quickly to see if that was the cause of
the problem and sure enough everything resumed to the way it was before.
Anyway thanks for the reply. I will get back to you if I see anything
cool arise from this.
Could you include the following patch in your testing:

http://people.freebsd.org/~pjd/patches/arc.c.9.patch
--
Pawel Jakub Dawidek http://www.wheelsystems.com
***@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!
jhell
2010-08-29 09:38:05 UTC
Post by Pawel Jakub Dawidek
Post by jhell
Post by Andriy Gapon
Post by jhell
#ifdef _KERNEL
if (arc_reclaim_needed()) {
needfree = 0;
wakeup(&needfree);
}
#endif
I still moved that down to below _KERNEL for the obvious reasons. But
when I was using the original patch with if (needfree) I noticed a
performance degradation after ~12 hours of use with and without UMA
turned on. So far with ~48 hours of testing with the top half of that
being with the above change, I have not seen more degradation of
This is quite unexpected.
needfree should be checked as the very first thing in arc_reclaim_needed()
[unless you have patched it locally]. So if needfree is 1 then
arc_reclaim_needed() should also return 1. But the converse is not true,
arc_reclaim_needed() may return 1 even if needfree is zero.
So if your testing results are conclusive then it must mean that some extra
wakeups on needfree are needed. I.e. needfree is zero, so there shouldn't be
anything waiting on it (see arc_lowmem) and no notification should be needed,
but issuing one somehow does make a difference.
Hmm...
I will look further into this and see if I can throw a counter around it
or some printf's so I can at least log what it's doing in both instances.
I thought the very same thing you said above when I saw your patch for
that and was astounded at the results that were returned from it. So in
short testing I reverted it back quickly to see if that was the cause of
the problem and sure enough everything resumed to the way it was before.
Anyway thanks for the reply. I will get back to you if I see anything
cool arise from this.
http://people.freebsd.org/~pjd/patches/arc.c.9.patch
Sure thing. Adding it now.
--
jhell,v