Stefan Esser
2011-12-19 14:22:06 UTC
Hi ZFS users,
For quite some time I have observed an uneven distribution of load between the drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of a longer log of 10-second averages collected with gstat:
dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 130 106 4134 4.5 23 1033 5.2 48.8| ada0
0 131 111 3784 4.2 19 1007 4.0 47.6| ada1
0 90 66 2219 4.5 24 1031 5.1 31.7| ada2
1 81 58 2007 4.6 22 1023 2.3 28.1| ada3
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 132 104 4036 4.2 27 1129 5.3 45.2| ada0
0 129 103 3679 4.5 26 1115 6.8 47.6| ada1
1 91 61 2133 4.6 30 1129 1.9 29.6| ada2
0 81 56 1985 4.8 24 1102 6.0 29.4| ada3
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 148 108 4084 5.3 39 2511 7.2 55.5| ada0
1 141 104 3693 5.1 36 2505 10.4 54.4| ada1
1 102 62 2112 5.6 39 2508 5.5 35.4| ada2
0 99 60 2064 6.0 39 2483 3.7 36.1| ada3
This goes on for minutes without a change of roles. (I had assumed that other 10-minute samples might show relatively higher load on another subset of the drives, but it is always the first two, which receive some 50% more read requests than the other two.)
The test consisted of minidlna rebuilding its content database for a media collection held on that pool. The unbalanced distribution of requests does not depend on the particular application, and it does not change when the most heavily loaded drives approach 100% busy.
This is a -CURRENT built from yesterday's sources, but the problem has existed for quite some time (and should definitely be reproducible on -STABLE, too).
The pool consists of a 4-drive raidz1 on an ICH10 (H67) without cache or log devices and without much ZFS tuning (only the max. ARC size, which should not be relevant in this context at all):
zpool status -v
  pool: raid1
 state: ONLINE
  scan: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        raid1         ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    ONLINE       0     0     0
            ada2p2    ONLINE       0     0     0
            ada3p2    ONLINE       0     0     0

errors: No known data errors
Cached configuration:
        version: 28
        name: 'raid1'
        state: 0
        txg: 153899
        pool_guid: 10507751750437208608
        hostid: 3558706393
        hostname: 'se.local'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 10507751750437208608
            children[0]:
                type: 'raidz'
                id: 0
                guid: 7821125965293497372
                nparity: 1
                metaslab_array: 30
                metaslab_shift: 36
                ashift: 12
                asize: 7301425528832
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7487684108701568404
                    path: '/dev/ada0p2'
                    phys_path: '/dev/ada0p2'
                    whole_disk: 1
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12000329414109214882
                    path: '/dev/ada1p2'
                    phys_path: '/dev/ada1p2'
                    whole_disk: 1
                    create_txg: 4
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 2926246868795008014
                    path: '/dev/ada2p2'
                    phys_path: '/dev/ada2p2'
                    whole_disk: 1
                    create_txg: 4
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 5226543136138409733
                    path: '/dev/ada3p2'
                    phys_path: '/dev/ada3p2'
                    whole_disk: 1
                    create_txg: 4
I'd be interested to know whether this behavior can be reproduced on other systems with raidz1 pools consisting of 4 or more drives. All it takes is generating some disk load and running the command:
gstat -I 10000000 -f '^a?da?.$'
to obtain 10-second averages.
I have not even tried to look at the scheduling of requests in ZFS, but I'm surprised to see higher-than-average load on just 2 of the 4 drives, since RAID parity should be evenly spread over all drives, and for each file system block a different subset of 3 out of 4 drives should be able to deliver the data without the need to reconstruct it from parity (which would lead to an even distribution of load).
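To make that expectation concrete, here is a small idealized model (my own sketch, not ZFS's actual allocator): if the parity portion of each block simply rotated round-robin across the 4 disks, the 3 data reads per block would land evenly on all drives.

```python
# Idealized 4-disk raidz1 read model (assumption: parity placement
# rotates round-robin with the block index; real ZFS allocation is
# more complicated, this only illustrates the expected even spread).
from collections import Counter

DISKS = 4
BLOCKS = 10000
reads = Counter()

for block in range(BLOCKS):
    parity_disk = block % DISKS       # assumed round-robin parity placement
    for d in range(DISKS):
        if d != parity_disk:          # data is read from the other 3 disks
            reads[d] += 1

# In this model every disk serves exactly 3/4 of all block reads.
for d in range(DISKS):
    print(f"ada{d}: {reads[d] / BLOCKS:.2f} of blocks")
```

Under this model all four drives would show identical r/s figures, which is clearly not what gstat reports above.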
I've got two theories about what might cause the observed behavior:
1) There is some metadata that is kept only on the first two drives. Data is evenly spread, but metadata accesses lead to additional reads.
2) The read requests are distributed in such a way that 1/3 goes to ada0, another 1/3 to ada1, and the remaining 1/3 is evenly split between ada2 and ada3.
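For what it's worth, a quick back-of-the-envelope check (my own arithmetic, not from any tool) of theory 2 against the r/s columns of the three gstat samples above:

```python
# Compare the observed per-disk read fractions with the split that
# theory 2 predicts. Values are the r/s columns of the three
# 10-second gstat samples quoted earlier in this mail.
observed = {
    "ada0": [106, 104, 108],
    "ada1": [111, 103, 104],
    "ada2": [66, 61, 62],
    "ada3": [58, 56, 60],
}

total = sum(sum(v) for v in observed.values())
fractions = {d: sum(v) / total for d, v in observed.items()}

# Theory 2: 1/3 each to ada0 and ada1, 1/6 each to ada2 and ada3.
predicted = {"ada0": 1 / 3, "ada1": 1 / 3, "ada2": 1 / 6, "ada3": 1 / 6}

for d in sorted(observed):
    print(f"{d}: observed {fractions[d]:.3f}  predicted {predicted[d]:.3f}")
```

The observed fractions come out within about two percentage points of the predicted 1/3, 1/3, 1/6, 1/6 split, so theory 2 fits the numbers rather well.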
So: can anybody reproduce this distribution of requests?
Any idea why this is happening, and whether something should be changed in ZFS to distribute the load better (leading to higher file system performance)?
Best regards, Stefan