Uneven load on drives in ZFS RAIDZ1
Stefan Esser
2011-12-19 14:22:06 UTC
Hi ZFS users,

for quite some time I have observed an uneven distribution of load
between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of
a longer log of 10 second averages logged with gstat:

dT: 10.001s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
    0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
    0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
    1     81     58   2007    4.6     22   1023    2.3   28.1| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    132    104   4036    4.2     27   1129    5.3   45.2| ada0
    0    129    103   3679    4.5     26   1115    6.8   47.6| ada1
    1     91     61   2133    4.6     30   1129    1.9   29.6| ada2
    0     81     56   1985    4.8     24   1102    6.0   29.4| ada3

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    148    108   4084    5.3     39   2511    7.2   55.5| ada0
    1    141    104   3693    5.1     36   2505   10.4   54.4| ada1
    1    102     62   2112    5.6     39   2508    5.5   35.4| ada2
    0     99     60   2064    6.0     39   2483    3.7   36.1| ada3

This goes on for minutes, without a change of roles. (I had assumed that
other 10 minute samples might show relatively higher load on another
subset of the drives, but it is always the first two, which receive some
50% more read requests than the other two.)
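
A minimal sketch of how the read share per drive can be worked out from the three samples above. The r/s values are copied from the gstat excerpt; the function name is only illustrative.

```python
# r/s column for ada0..ada3 in the three 10 second gstat samples quoted above.
READS_PER_SEC = {
    "ada0": [106, 104, 108],
    "ada1": [111, 103, 104],
    "ada2": [66, 61, 62],
    "ada3": [58, 56, 60],
}


def read_share(samples):
    """Return each drive's fraction of all read requests over the samples."""
    totals = {name: sum(values) for name, values in samples.items()}
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


if __name__ == "__main__":
    for name, share in read_share(READS_PER_SEC).items():
        print(f"{name}: {share:.1%} of reads")
```

With these numbers ada0 and ada1 each take roughly 32% of the reads, ada2 about 19% and ada3 about 17%.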

The test consisted of minidlna rebuilding its content database for a
media collection held on that pool. The unbalanced distribution of
requests does not depend on the particular application and the
distribution of requests does not change when the drives with highest
load approach 100% busy.

This is a -CURRENT built from yesterday's sources, but the problem has
existed for quite some time (and should definitely be reproducible on
-STABLE, too).

The pool consists of a 4 drive raidz1 on an ICH10 (H67) without cache or
log devices and without much ZFS tuning (only the maximum ARC size is
set, which should not be relevant in this context):

zpool status -v
  pool: raid1
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid1       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p2  ONLINE       0     0     0
            ada1p2  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada3p2  ONLINE       0     0     0

errors: No known data errors

Cached configuration:
        version: 28
        name: 'raid1'
        state: 0
        txg: 153899
        pool_guid: 10507751750437208608
        hostid: 3558706393
        hostname: 'se.local'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 10507751750437208608
            children[0]:
                type: 'raidz'
                id: 0
                guid: 7821125965293497372
                nparity: 1
                metaslab_array: 30
                metaslab_shift: 36
                ashift: 12
                asize: 7301425528832
                is_log: 0
                create_txg: 4
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7487684108701568404
                    path: '/dev/ada0p2'
                    phys_path: '/dev/ada0p2'
                    whole_disk: 1
                    create_txg: 4
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12000329414109214882
                    path: '/dev/ada1p2'
                    phys_path: '/dev/ada1p2'
                    whole_disk: 1
                    create_txg: 4
                children[2]:
                    type: 'disk'
                    id: 2
                    guid: 2926246868795008014
                    path: '/dev/ada2p2'
                    phys_path: '/dev/ada2p2'
                    whole_disk: 1
                    create_txg: 4
                children[3]:
                    type: 'disk'
                    id: 3
                    guid: 5226543136138409733
                    path: '/dev/ada3p2'
                    phys_path: '/dev/ada3p2'
                    whole_disk: 1
                    create_txg: 4

I'd be interested to know whether this behavior can be reproduced on
other systems with raidz1 pools consisting of 4 or more drives. All it
takes is generating some disk load and running the command:

gstat -I 10000000 -f '^a?da?.$'

to obtain 10 second averages.
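
For anyone who wants to compare numbers, here is a rough sketch of how a saved gstat log could be averaged per device. It assumes the default column layout shown in the excerpt above; how the log is captured is left open, and the names `parse_gstat` and `mean` are made up for this example.

```python
import re
import sys
from collections import defaultdict

# Column order of the gstat output quoted above (assumed to be the default layout).
COLUMNS = ("lq", "ops", "rps", "r_kbps", "ms_r", "wps", "w_kbps", "ms_w", "busy")

# Nine numeric fields, a '|' after %busy, then the device name.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\d+)\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\|\s*(\S+)\s*$"
)


def parse_gstat(text):
    """Collect per-device rows from gstat text; header lines are skipped."""
    rows = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values = [float(v) for v in match.groups()[:-1]]
            rows[match.group(10)].append(dict(zip(COLUMNS, values)))
    return rows


def mean(rows, column):
    """Average one column per device over all parsed samples."""
    return {
        dev: sum(sample[column] for sample in samples) / len(samples)
        for dev, samples in rows.items()
    }


if __name__ == "__main__":
    parsed = parse_gstat(sys.stdin.read())
    for dev, value in sorted(mean(parsed, "rps").items()):
        print(f"{dev}: {value:.1f} r/s on average")
```

Feeding the saved log to the script on standard input prints the mean r/s per device over all samples in the log.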

I have not even tried to look at the scheduling of requests in ZFS, but
I'm surprised to see higher than average load on just 2 of the 4 drives,
since RAID parity should be evenly spread over all drives and for each
file system block a different subset of 3 out of 4 drives should be able
to deliver the data without need to reconstruct it from parity (that
would lead to an even distribution of load).
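
The expectation in the paragraph above can be put into a toy model. This is only an idealised sketch of "parity rotates, data is read from the other three drives"; it is an assumption made for illustration and not a description of how the RAIDZ allocator actually places blocks.

```python
from collections import Counter


def reads_per_drive(num_blocks, num_drives=4):
    """Count data reads per drive when the parity drive rotates per block.

    Idealised model only: block i keeps its parity on drive i % num_drives
    and is read from all remaining drives.
    """
    counts = Counter()
    for block in range(num_blocks):
        parity_drive = block % num_drives
        for drive in range(num_drives):
            if drive != parity_drive:
                counts[f"ada{drive}"] += 1
    return counts


if __name__ == "__main__":
    print(reads_per_drive(1000))
```

Under this assumption every drive ends up with the same number of reads, which is the even distribution the gstat numbers do not show.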

I've got two theories about what might cause the observed behavior:

1) There is some metadata that is only kept on the first two drives.
Data is evenly spread, but metadata accesses lead to additional reads.

2) The read requests are distributed in such a way that 1/3 goes to
ada0, another 1/3 to ada1, while the remaining 1/3 is evenly distributed
to ada2 and ada3.
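
As a quick plausibility check, the first gstat sample can be compared against a uniform split and against the split proposed in theory 2. This is a sketch only: theory 1 makes no numeric prediction and is not covered, and a closer fit says nothing about the mechanism behind it.

```python
# Observed r/s per drive, taken from the first gstat sample above.
OBSERVED = {"ada0": 106, "ada1": 111, "ada2": 66, "ada3": 58}

# Candidate distributions, as fractions of all reads per drive.
UNIFORM = {"ada0": 1 / 4, "ada1": 1 / 4, "ada2": 1 / 4, "ada3": 1 / 4}
THEORY_2 = {"ada0": 1 / 3, "ada1": 1 / 3, "ada2": 1 / 6, "ada3": 1 / 6}


def shares(counts):
    """Convert absolute request counts into fractions of the total."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


def max_deviation(observed, model):
    """Largest absolute gap between the observed and the predicted share."""
    observed_shares = shares(observed)
    return max(abs(observed_shares[name] - model[name]) for name in model)


if __name__ == "__main__":
    print(f"uniform:  {max_deviation(OBSERVED, UNIFORM):.3f}")
    print(f"theory 2: {max_deviation(OBSERVED, THEORY_2):.3f}")
```

For this sample the 1/3, 1/3, 1/6, 1/6 split is off by at most about 3 percentage points, while the uniform split is off by about 8.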


So: Can anybody reproduce this distribution of requests?

Any idea why this is happening and whether something should be changed
in ZFS to better distribute the load (leading to higher file system
performance)?

Best regards, STefan
Olivier Smedts
2011-12-19 14:36:29 UTC
Post by Stefan Esser
So: Can anybody reproduce this distribution requests?
Hello,

Stupid question, but are your drives all exactly the same? I noticed
"ashift: 12", so I think you should have at least one 4k-sector drive.
Are you sure they're not mixed with 512B-per-sector drives?
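
For reference, ashift is the base-2 logarithm of the sector size the vdev uses, so the relation to the 4k and 512B sizes mentioned above can be sketched as follows (the helper names are made up for this example):

```python
def sector_size(ashift):
    """Sector size in bytes for a given ashift (ashift = log2 of the size)."""
    return 1 << ashift


def ashift_for(sector_bytes):
    """Inverse mapping; only defined for power-of-two sector sizes."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1


if __name__ == "__main__":
    print(sector_size(12))   # 4096
    print(ashift_for(512))   # 9
```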
Post by Stefan Esser
Any idea, why this is happening and whether something should be changed
in ZFS to better distribute the load (leading to higher file system
performance)?
Best regards, STefan
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-current
--
Olivier Smedts                                                 _
                                        ASCII ribbon campaign ( )
e-mail: ***@gid0.org        - against HTML email & vCards  X
www: http://www.gid0.org    - against proprietary attachments / \

  "There are only 10 kinds of people in the world:
  those who understand binary,
  and those who don't."
Stefan Esser
2011-12-19 18:17:36 UTC
Post by Olivier Smedts
Post by Stefan Esser
So: Can anybody reproduce this distribution requests?
Hello,
Stupid question, but are your drives all exactly the same? I noticed
"ashift: 12", so I think you should have at least one 4k-sector drive.
Are you sure they're not mixed with 512B-per-sector drives?
All drives are identical:

<SAMSUNG HD204UI 1AQ10001> at scbus3 target 0 lun 0 (ada0,pass2)
<SAMSUNG HD204UI 1AQ10001> at scbus4 target 0 lun 0 (ada1,pass3)
<SAMSUNG HD204UI 1AQ10001> at scbus5 target 0 lun 0 (ada2,pass4)
<SAMSUNG HD204UI 1AQ10001> at scbus6 target 0 lun 0 (ada3,pass5)

These are 4KB sector drives. Everything is correctly aligned and all
drives have identical partitions (created by a script that was run once
for each drive, so there is no risk of typos leading to differences).

Regards, STefan
Peter Maloney
2011-12-19 15:42:20 UTC
Post by Stefan Esser
So: Can anybody reproduce this distribution requests?
I don't have a raidz1 machine, and no time to make you a special raidz1
pool out of spare disks, but on my raidz2 I can only ever see unevenness
when a disk is bad, or between different vdevs. But you only have one vdev.

Check that your disks are identical (are they? we can only assume so
since you didn't say so).
Show us output from:
smartctl -i /dev/ada0
smartctl -i /dev/ada1
smartctl -i /dev/ada2
smartctl -i /dev/ada3

Since your tests show read ms/r to be pretty even, I guess your disks
are not broken. But the ms/w is slightly different. So I think it seems
that the first 2 disks are slower for writing (someone once said that
refurbished disks are like this, even if identical), or the hard disk
controller ports they use are slower. For example, maybe your
motherboard has 6 ports, and you plugged disks 1,2,3 into port 1,2,3 and
disk 4 into port 5. Disk 3 and 4 would have their own channel, but disk
1 and 2 share one.

So if the disks are identical, I would guess your hard disk controller
is to blame. To test this, first back it up. Then *fix your setup by
using labels*. ie. use gpt/somelabel0 or gptid/....... rather than
ada0p2. Check "ls /dev/gpt*" output for options on what labels you have
already. Then try swapping disks around to see if the load changes. Make
sure to back up...

Swapping disks (or even removing one depending on controller, etc. when
it fails) without labels can be bad.
eg.
You have ada1 ada2 ada3 ada4.
Someone spills coffee on ada2; it fries and cannot be detected anymore,
and you reboot.
Now you have ada1 ada2 ada3.
Then things are usually still fine (even though ada3 is now ada2 and
ada4 is now ada3, because there is some zfs superblock stuff to keep
track of things), but if you also had an ada5 that was not part of the
pool, or was a spare or a log or something other than another disk in
the same vdev as ada1, etc., bad things happen when it becomes ada4.
Unfortunately, I don't know exactly what people do to cause the "bad
things" that happen. When this happened to me, it just said my pool was
faulted or degraded or something, and set a disk or two to UNAVAIL or
FAULTED. I don't remember it automatically resilvering them, but when I
read about these problems, I think it seems like some disks were
resilvered afterwards.
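
The renumbering in the example above can be illustrated with a toy model. The identifiers below are hypothetical placeholders, and the detection-order naming is an assumption made for illustration; this is not how ZFS or the kernel is implemented.

```python
def enumerate_disks(present_ids):
    """Assign adaN names in detection order, starting at ada1 as in the example."""
    return {f"ada{i}": disk_id for i, disk_id in enumerate(present_ids, start=1)}


def renamed(before, after):
    """Map each surviving disk id to (old name, new name) if its name changed."""
    old = {disk_id: name for name, disk_id in before.items()}
    new = {disk_id: name for name, disk_id in after.items()}
    return {d: (old[d], new[d]) for d in new if old[d] != new[d]}


# Hypothetical ids; "id-e" stands for the fifth disk that is not in the vdev.
BEFORE = enumerate_disks(["id-a", "id-b", "id-c", "id-d", "id-e"])
AFTER = enumerate_disks(["id-a", "id-c", "id-d", "id-e"])  # "id-b" has died

if __name__ == "__main__":
    for disk_id, (old_name, new_name) in renamed(BEFORE, AFTER).items():
        print(f"{disk_id}: {old_name} -> {new_name}")
```

In this model the name ada4 ends up pointing at the disk that was never part of the vdev, while a label or id attached to the disk itself still refers to the same disk.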


And last thing I can think of is to make sure your partitions are
aligned, and identical. Show us output from:
gpart show
Post by Stefan Esser
Any idea, why this is happening and whether something should be changed
in ZFS to better distribute the load (leading to higher file system
performance)?
Best regards, STefan
--
--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: ***@brockmann-consult.de
Internet: http://www.brockmann-consult.de
--------------------------------------------
Michael Reifenberger
2011-12-19 16:48:56 UTC
Post by Peter Maloney
Swapping disks (or even removing one depending on controller, etc. when
it fails) without labels can be bad.
eg.
Since ZFS uses (and searches for) its own UUID partition signatures,
disk swapping shouldn't matter as long as enough disks are found.

Set vfs.zfs.debug=1 during boot to watch what is searched for.

Bye/2
---
Michael Reifenberger
***@Reifenberger.com
http://www.Reifenberger.com
Peter Maloney
2011-12-19 20:40:13 UTC
Post by Michael Reifenberger
Post by Peter Maloney
Swapping disks (or even removing one depending on controller, etc. when
it fails) without labels can be bad.
eg.
Since ZFS uses (and searches for) its own UUID partition signatures,
disk swapping shouldn't matter as long as enough disks are found.
Set vfs.zfs.debug=1 during boot to watch what is searched for.
Thanks for the info. But I am confused by it, because when my disks
moved around randomly on reboot, it really did mess things up. The first
few times it happened, there was no issue, but when a spare took the
place of a pool disk, it messed things up. I can see the UUIDs when I
look at zdb output, so I really have no idea why it messed things up.
... but it did, so I will always caution people anyway. I can't point
you to any relevant lines of code that cause the problem, but I know it
can happen... and it will when you least expect it. ;)

And I also see the opposite... people talking about their very old
pools, with many disks exchanged, and wonder why mine was so easily
messed up and theirs survived so long without labels. I just assumed it
was the way the controller arranged the disks. (and by the way, mine now
orders the disks perfectly consistently now that it is in IT mode, not
mostly random like before... could be a factor)

I am always very busy, but when I get the chance, it shouldn't take too
long, so I will try to recreate it on a virtual machine and try
vfs.zfs.debug=1. Thanks for the suggestion.
Michael Reifenberger
2011-12-20 16:58:16 UTC
On Mon, 19 Dec 2011, Peter Maloney wrote:
...
Post by Peter Maloney
Thanks for the info. But I am confused by it, because when my disks
moved around randomly on reboot, it really did mess things up. The first
few times it happened, there was no issue, but when a spare took the
place of a pool disk, it messed things up. I can see the UUIDs when I
look at zdb output, so I really have no idea why it messed things up.
... but it did, so I will always caution people anyway. I can't point
you to any relevant lines of code that cause the problem, but I know it
can happen... and it will when you least expect it. ;)
Normally spare disks are not part of the pool, so a spare can't replace
a pool's disk automatically (unless the property 'autoreplace' is set to 'on').

But as always, no software is error-free, and your issue could be an uncaught
edge case of something.

Theoretically it could also be a timing issue during boot, due to the async
nature of GEOM/CAM...

Bye/2

Stefan Esser
2011-12-19 18:56:03 UTC
Post by Peter Maloney
Post by Stefan Esser
So: Can anybody reproduce this distribution requests?
I don't have a raidz1 machine, and no time to make you a special raidz1
pool out of spare disks, but on my raidz2 I can only ever see unevenness
when a disk is bad, or between different vdevs. But you only have one vdev.
Thanks for replying.

In my previous raidz1 pool consisting of 3*1TB, one of the drives had to
be replaced because it showed lots of recoverable errors when I
initially created the pool. The effects were much more drastic than
what I see now: Given identical request rates, the failed drive was 100%
busy when the other drives had busy percentages in the single-digit range.

But the observed differences seem to be caused by a different rate of
read requests issued towards the drives (the first two receive 30% of
the reads, each, while the last two receive 20% each). And this ratio
has been stable over months (I had already noticed this in summer, but
did not have time to start a thread at that time).
Post by Peter Maloney
Check that your disks are identical (are they? we can only assume so
since you didn't say so).
Yes, all 4 are identical.
Post by Peter Maloney
smartctl -i /dev/ada0
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116957
LU WWN Device Id: 5 0024e9 0049bee63
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:23:36 2011 CET

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10127
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       254
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2300
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       621067
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       28 (Min/Max 15/48)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       264
Post by Peter Maloney
smartctl -i /dev/ada1
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116947
LU WWN Device Id: 5 0024e9 0049bee49
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:23:22 2011 CET

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   067   025    Pre-fail  Always       -       10096
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       255
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2316
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       231
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       2175909
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       26 (Min/Max 16/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       1
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       1
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       264
Post by Peter Maloney
smartctl -i /dev/ada2
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116956
LU WWN Device Id: 5 0024e9 0049bee60
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:24:24 2011 CET

  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   067   066   025    Pre-fail  Always       -       10254
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       246
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2300
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       227
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       105259
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   056   000    Old_age   Always       -       28 (Min/Max 16/45)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       0
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       256
Post by Peter Maloney
smartctl -i /dev/ada3
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7JD1B116946
LU WWN Device Id: 5 0024e9 0049bee47
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Mon Dec 19 19:24:55 2011 CET

  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   066   066   025    Pre-fail  Always       -       10472
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       250
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2302
 10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       227
181 Program_Fail_Cnt_Total  0x0022   100   100   000    Old_age   Always       -       239254
191 G-Sense_Error_Rate      0x0022   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0002   064   055   000    Old_age   Always       -       27 (Min/Max 16/47)
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x002a   100   100   000    Old_age   Always       -       2
223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       259
Post by Peter Maloney
Since your tests show read ms/r to be pretty even, I guess your disks
are not broken. But the ms/w is slightly different. So I think it seems
that the first 2 disks are slower for writing (someone once said that
My interpretation is that the first two have higher write latencies
since they receive more read requests.
Post by Peter Maloney
refurbished disks are like this, even if identical), or the hard disk
controller ports they use are slower. For example, maybe your
motherboard has 6 ports, and you plugged disks 1,2,3 into port 1,2,3 and
disk 4 into port 5. Disk 3 and 4 would have their own channel, but disk
1 and 2 share one.
This is an ICH10 and the drives are connected to SATA II channels (the
SATA III channels are reserved for a planned SSD cache).
Post by Peter Maloney
So if the disks are identical, I would guess your hard disk controller
is to blame. To test this, first back it up. Then *fix your setup by
using labels*. ie. use gpt/somelabel0 or gptid/....... rather than
ada0p2. Check "ls /dev/gpt*" output for options on what labels you have
already. Then try swapping disks around to see if the load changes. Make
sure to back up...
The drives are already labelled and I can easily modify the pool to
refer to GPT labels. But swapping drives should not cause any harm in
ZFS, whether labels or device names are used (the drives in the pool
are identified by their GUID).
Post by Peter Maloney
Swapping disks (or even removing one depending on controller, etc. when
it fails) without labels can be bad.
Yes, I know (having seen my first Unix system more than 30 years ago).
I'll re-import the drives with "zpool import -d /dev/gpt ..." but need
to boot from an alternate boot device first.
Post by Peter Maloney
eg.
You have ada1 ada2 ada3 ada4.
Someone spills coffee on ada2; it fries and cannot be detected anymore,
and you reboot.
Now you have ada1 ada2 ada3.
Then things are usually still fine (even though ada3 is now ada2 and
ada4 is now ada3, because there is some zfs superblock stuff to keep
track of things), but if you also had an ada5 that was not part of the
pool, or was a spare or a log or something other than another disk in
the same vdev as ada1, etc., bad things happen when it becomes ada4.
Unfortunately, I don't know exactly what people do to cause the "bad
things" that happen. When this happened to me, it just said my pool was
faulted or degraded or something, and set a disk or two to UNAVAIL or
FAULTED. I don't remember it automatically resilvering them, but when I
read about these problems, I think it seems like some disks were
resilvered afterwards.
The recovery from partial pool failures and the collection of drives to
form a pool has been modified several times in the last two years and
should be quite robust by now. One thing to look out for is to not copy
a pool to new disk drives (I used to have 3*1TB, copied to 4*2TB) and
later connect a drive from the original pool with its ZFS metadata
intact at the end of the drive (I had cleared the first 1MB, but not the
last 1MB). This causes confusion, if the name of the pool has not
changed. But other than that, I do not see much risk in ZFS pools built
from /dev nodes.
Post by Peter Maloney
And last thing I can think of is to make sure your partitions are
gpart show
They have all been created by a script that takes the device node name
as parameter and thus are identical.

=>        34  3907029101  ada0  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada1  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada2  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656   341870479     3  freebsd  (163G)

=>        34  3907029101  ada3  GPT  (1.8T)
          34          30        - free -  (15k)
          64         192     1  freebsd-boot  (96k)
         256  3565158400     2  freebsd-zfs  (1.7T)
  3565158656        1792        - free -  (896k)
  3565160448   341868544     3  freebsd-swap  (163G)
  3907028992         143        - free -  (71k)


There is an unused 10% at the end of each device, and I have recently
made ada3p3 a swap device, just to be able to collect kernel dumps (no
swap is actually used; this is an 8GB RAM machine with 6GB assigned to
ARC and mostly low load).
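
A small sketch of the alignment check, using the start sectors from the gpart listing above. It assumes that gpart counts in 512-byte sectors and that the drives use 4KB physical sectors, as stated earlier in the thread; the helper names are made up.

```python
SECTOR_BYTES = 512      # unit of the gpart listing (assumption: 512-byte sectors)
PHYSICAL_BYTES = 4096   # physical sector size of the drives, as stated earlier

# (start sector, size in sectors, type), copied from the listing for ada0 and ada3.
ADA0 = [
    (64, 192, "freebsd-boot"),
    (256, 3565158400, "freebsd-zfs"),
    (3565158656, 341870479, "freebsd"),
]
ADA3 = [
    (64, 192, "freebsd-boot"),
    (256, 3565158400, "freebsd-zfs"),
    (3565160448, 341868544, "freebsd-swap"),
]


def is_aligned(start_sector):
    """True if the partition starts on a physical-sector boundary."""
    return (start_sector * SECTOR_BYTES) % PHYSICAL_BYTES == 0


def misaligned(partitions):
    """Return all partitions whose start is not 4KB aligned."""
    return [part for part in partitions if not is_aligned(part[0])]


if __name__ == "__main__":
    for name, table in (("ada0", ADA0), ("ada3", ADA3)):
        print(name, "misaligned:", misaligned(table))
```

All start sectors in the listing are multiples of 8, so under these assumptions no partition is misaligned, and the freebsd-zfs partitions of ada0 and ada3 have the same start and size.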

Best regards, STefan
Michael Reifenberger
2011-12-19 16:36:42 UTC
Hi,
a quick test using `dd if=/dev/zero of=/test ...` shows:

dT: 10.004s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    378      0      0   12.5    376  36414   11.9   60.6| ada0
    0    380      0      0   12.2    378  36501   11.8   60.0| ada1
    0    382      0      0    7.7    380  36847   11.6   59.2| ada2
    0    375      0      0    7.4    374  36164    9.6   51.3| ada3
    0    377      0      1   10.2    375  36325   10.1   53.3| ada4
   10    391      0      0   39.3    389  38064   15.7   80.2| ada5

Seems to be sufficiently equally distributed for a live system...

zpool status shows:
...
NAME STATE READ WRITE CKSUM
boot ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
ada3p3 ONLINE 0 0 0
ada4p3 ONLINE 0 0 0
ada5p3 ONLINE 0 0 0
...

The only case where I've seen (and expected to see) unequal load distribution on ZFS
was after extending a nearly full four disk mirror pool by an additional two disks.


Bye/2
---
Michael Reifenberger
***@Reifenberger.com
http://www.Reifenberger.com
Stefan Esser
2011-12-19 20:42:53 UTC
Permalink
Post by Michael Reifenberger
Hi,
dT: 10.004s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 378 0 0 12.5 376 36414 11.9 60.6| ada0
0 380 0 0 12.2 378 36501 11.8 60.0| ada1
0 382 0 0 7.7 380 36847 11.6 59.2| ada2
0 375 0 0 7.4 374 36164 9.6 51.3| ada3
0 377 0 1 10.2 375 36325 10.1 53.3| ada4
10 391 0 0 39.3 389 38064 15.7 80.2| ada5
Thanks! There are surprising differences (ada5 has a queue length of 10
and much higher latency than the other drives).
Post by Michael Reifenberger
Seems to be sufficiently equally distributed for a live system...
Hmmm, 50%-55% busy on ada3 and ada4 contrasts with 80% busy on ada5.
Post by Michael Reifenberger
...
NAME STATE READ WRITE CKSUM
boot ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ada0p3 ONLINE 0 0 0
ada1p3 ONLINE 0 0 0
ada2p3 ONLINE 0 0 0
ada3p3 ONLINE 0 0 0
ada4p3 ONLINE 0 0 0
ada5p3 ONLINE 0 0 0
...
The only case where I've seen (and expected to see) unequal load
distribution on ZFS was after extending a nearly full four disk mirror
pool by an additional two disks.
In my case the pool was created from disk drives with nearly identical
serial numbers in its current configuration. Some of the drives have a
few more power-on hours, since I performed some tests with them, before
moving all data from the old pool to the new one, but else everything
should be symmetric.

Best regards, STefan
Stefan Esser
2011-12-20 10:07:08 UTC
Permalink
Post by Michael Reifenberger
Hi,
dT: 10.004s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 378 0 0 12.5 376 36414 11.9 60.6| ada0
0 380 0 0 12.2 378 36501 11.8 60.0| ada1
0 382 0 0 7.7 380 36847 11.6 59.2| ada2
0 375 0 0 7.4 374 36164 9.6 51.3| ada3
0 377 0 1 10.2 375 36325 10.1 53.3| ada4
10 391 0 0 39.3 389 38064 15.7 80.2| ada5
Seems to be sufficiently equally distributed for a live system...
Hi Michael,

in an earlier reply I mentioned the suspicious queue length and %busy of
ada5, which may be the result of other load (not caused by the dd
command) or of a hardware problem (I'd check drive health ...).

(Hmmm, the numbers look strange: ops/s is not the sum of r/s and w/s,
but misses that value by 2. I could understand a rounding difference of
1, but not 2 counts per second. But this is a different issue ...)
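
The gap between ops/s and r/s + w/s can be tabulated directly from the quoted sample. The following is a minimal Python sketch; the rows are copied from the gstat output above, and the helper name `unaccounted_ops` is made up for this illustration. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that ops/s also counts operations that are neither reads nor writes (for example cache flushes).

```python
# Rows copied from the gstat sample quoted above: (name, ops/s, r/s, w/s).
ROWS = [
    ("ada0", 378, 0, 376),
    ("ada1", 380, 0, 378),
    ("ada2", 382, 0, 380),
    ("ada3", 375, 0, 374),
    ("ada4", 377, 0, 375),
    ("ada5", 391, 0, 389),
]


def unaccounted_ops(rows):
    """Return {name: ops/s - (r/s + w/s)} for each drive."""
    return {name: ops - (reads + writes) for name, ops, reads, writes in rows}


if __name__ == "__main__":
    for name, gap in unaccounted_ops(ROWS).items():
        print(f"{name}: {gap} ops/s not explained by reads + writes")
```

For this sample the gap is 2 on five drives and 1 on ada3, so it is consistent across drives rather than random rounding noise.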

Anyway: The imbalance that I observe on my system is specific to reads,
not writes. Could you please check, whether sending a large (multi-GB)
file to /dev/null shows identical read load over all drives?

I suspect that 2 of the drives will see slightly (some 20%, perhaps)
fewer read requests than the rest.

Regards, STefan
Dan Nelson
2011-12-19 16:22:20 UTC
Permalink
for quite some time I have observed an uneven distribution of load between
drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of a longer
dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 130 106 4134 4.5 23 1033 5.2 48.8| ada0
0 131 111 3784 4.2 19 1007 4.0 47.6| ada1
0 90 66 2219 4.5 24 1031 5.1 31.7| ada2
1 81 58 2007 4.6 22 1023 2.3 28.1| ada3
[...]
zpool status -v
pool: raid1
state: ONLINE
scan: none requested
NAME STATE READ WRITE CKSUM
raid1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ada0p2 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada2p2 ONLINE 0 0 0
ada3p2 ONLINE 0 0 0
Any read from your raidz device will hit three disks (the checksum is
applied across the stripe, not on each block, so a full stripe is always
read) so I think your extra IOs are coming from somewhere else.

What's on p1 on these disks? Could that be the cause of your extra I/Os?
Does "zpool iostat -v 10" give you even numbers across all disks?
--
Dan Nelson
***@allantgroup.com
Stefan Esser
2011-12-19 20:36:46 UTC
Permalink
Post by Dan Nelson
for quite some time I have observed an uneven distribution of load between
drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of a longer
dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 130 106 4134 4.5 23 1033 5.2 48.8| ada0
0 131 111 3784 4.2 19 1007 4.0 47.6| ada1
0 90 66 2219 4.5 24 1031 5.1 31.7| ada2
1 81 58 2007 4.6 22 1023 2.3 28.1| ada3
[...]
zpool status -v
pool: raid1
state: ONLINE
scan: none requested
NAME STATE READ WRITE CKSUM
raid1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
ada0p2 ONLINE 0 0 0
ada1p2 ONLINE 0 0 0
ada2p2 ONLINE 0 0 0
ada3p2 ONLINE 0 0 0
Any read from your raidz device will hit three disks (the checksum is
applied across the stripe, not on each block, so a full stripe is always
read) so I think your extra IOs are coming from somewhere else.
What's on p1 on these disks? Could that be the cause of your extra I/Os?
Does "zpool iostat -v 10" give you even numbers across all disks?
This is a ZFS only system. The first partition on each drive holds just
the gptzfsloader.

pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
raid1 4.41T 2.21T 139 72 12.3M 818K
raidz1 4.41T 2.21T 139 72 12.3M 818K
ada0p2 - - 114 17 4.24M 332K
ada1p2 - - 106 15 3.82M 305K
ada2p2 - - 65 20 2.09M 337K
ada3p2 - - 58 18 2.18M 329K

capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
raid1 4.41T 2.21T 150 45 12.8M 751K
raidz1 4.41T 2.21T 150 45 12.8M 751K
ada0p2 - - 113 14 4.34M 294K
ada1p2 - - 111 14 3.94M 277K
ada2p2 - - 62 16 2.23M 294K
ada3p2 - - 68 14 2.32M 277K
---------- ----- ----- ----- ----- ----- -----

capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
raid1 4.41T 2.21T 157 86 12.3M 6.41M
raidz1 4.41T 2.21T 157 86 12.3M 6.41M
ada0p2 - - 119 39 4.21M 2.24M
ada1p2 - - 106 31 3.78M 2.21M
ada2p2 - - 81 59 2.23M 2.23M
ada3p2 - - 57 39 2.06M 2.22M
---------- ----- ----- ----- ----- ----- -----

capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
raid1 4.41T 2.21T 187 45 14.2M 1.04M
raidz1 4.41T 2.21T 187 45 14.2M 1.04M
ada0p2 - - 117 13 4.27M 398K
ada1p2 - - 120 12 4.01M 384K
ada2p2 - - 89 12 2.97M 403K
ada3p2 - - 85 13 2.91M 386K
---------- ----- ----- ----- ----- ----- -----

The same difference of read operations per second as shown by gstat ...

Regards, STefan
Dan Nelson
2011-12-19 21:53:17 UTC
Permalink
Post by Stefan Esser
Post by Stefan Esser
for quite some time I have observed an uneven distribution of load
between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt
dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 130 106 4134 4.5 23 1033 5.2 48.8| ada0
0 131 111 3784 4.2 19 1007 4.0 47.6| ada1
0 90 66 2219 4.5 24 1031 5.1 31.7| ada2
1 81 58 2007 4.6 22 1023 2.3 28.1| ada3
[...]
This is a ZFS only system. The first partition on each drive holds just
the gptzfsloader.
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
raid1 4.41T 2.21T 139 72 12.3M 818K
raidz1 4.41T 2.21T 139 72 12.3M 818K
ada0p2 - - 114 17 4.24M 332K
ada1p2 - - 106 15 3.82M 305K
ada2p2 - - 65 20 2.09M 337K
ada3p2 - - 58 18 2.18M 329K
The same difference of read operations per second as shown by gstat ...
I was under the impression that the parity blocks were scattered evenly
across all disks, but from reading vdev_raidz.c, it looks like that isn't
always the case. See the comment at the bottom of the
vdev_raidz_map_alloc() function; it looks like it will toggle parity between
the first two disks in a stripe every 1MB. It's not necessarily the first
two disks assigned to the zvol, since stripes don't have to span all disks
as long as there's one parity block (a small sync write may just hit two
disks, essentially being written mirrored). The imbalance is only visible
if you're writing full-width stripes in sequence, so if you write a 1TB file
in one long stream, chances are that that file's parity blocks will be
concentrated on just two disks, so those two disks will get less I/O on
later reads. I don't know why the code toggles parity between just the
first two columns; rotating it between all columns would give you an even
balance.

Is it always the last two disks that have less load, or does it slowly
rotate to different disks depending on the data that you are reading? An
interesting test would be to idle the system, run a "tar cvf /dev/null
/raidz1" in one window, and watch iostat output on another window. If the
load moves from disk to disk as tar reads different files, then my parity
guess is probably right. If ada0 and ada1 are always busier, then you can
ignore me :)

Since it looks like the algorithm ends up creating two half-cold parity
disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
worse balancing, and a 5-disk set would be more even.
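
The balancing argument can be put into numbers with a toy model. The sketch below is illustrative only and is not taken from vdev_raidz.c. It assumes that every stripe spans all disks, that parity alternates evenly between just two of them, and that normal reads touch only data columns. The function name `read_shares` and its parameters are invented for this example.

```python
from fractions import Fraction


def read_shares(ndisks, parity_disks=2):
    """Expected share of data reads per disk in a toy RAIDZ1 model.

    Assumptions (illustrative, not taken from vdev_raidz.c):
    - every stripe has one parity column and ndisks - 1 data columns;
    - parity alternates evenly between the first `parity_disks` disks;
    - normal reads touch only the data columns.
    """
    if ndisks < 2 or not 1 <= parity_disks <= ndisks:
        raise ValueError("need ndisks >= 2 and 1 <= parity_disks <= ndisks")
    # A disk holding parity for 1/parity_disks of the stripes is read
    # for the remaining stripes only; all other disks are always read.
    cold = 1 - Fraction(1, parity_disks)
    per_disk = [cold] * parity_disks + [Fraction(1)] * (ndisks - parity_disks)
    total = sum(per_disk)
    return [share / total for share in per_disk]


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, [str(s) for s in read_shares(n)])
```

Under these assumptions a 3-disk set puts half of all reads on a single disk (1/4, 1/4, 1/2), a 4-disk set gives 1/6, 1/6, 1/3, 1/3, and a 5-disk set gives 1/8, 1/8, 1/4, 1/4, 1/4. Which physical disks end up as the "cold" pair depends on where the stripes start, which this model does not capture.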
--
Dan Nelson
***@allantgroup.com
Daniel Kalchev
2011-12-19 22:31:53 UTC
Permalink
Post by Dan Nelson
Since it looks like the algorithm ends up creating two half-cold parity
disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
worse balancing, and a 5-disk set would be more even.
There were some experiments a year or two ago with different numbers of disks in raidz, and the results suggested that certain numbers of disks performed better, contrary to the theory that writes should be evenly distributed. Worse, this is in the official theory of how raidz operates…

Perhaps the code can be fixed to spread the writes to all devices in raidz, but compatibility issues need to be considered.

Perhaps DDT is stored in the 'worst case' write size, because it clearly exhibits such poor distribution.

Daniel
_______________________________________________
freebsd-***@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-***@freebsd.org"
Stefan Esser
2011-12-20 11:45:48 UTC
Permalink
Post by Dan Nelson
Post by Stefan Esser
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
raid1 4.41T 2.21T 139 72 12.3M 818K
raidz1 4.41T 2.21T 139 72 12.3M 818K
ada0p2 - - 114 17 4.24M 332K
ada1p2 - - 106 15 3.82M 305K
ada2p2 - - 65 20 2.09M 337K
ada3p2 - - 58 18 2.18M 329K
The same difference of read operations per second as shown by gstat ...
I was under the impression that the parity blocks were scattered evenly
across all disks, but from reading vdev_raidz.c, it looks like that isn't
always the case. See the comment at the bottom of the
vdev_raidz_map_alloc() function; it looks like it will toggle parity between
the first two disks in a stripe every 1MB. It's not necessarily the first
Thanks, this is very interesting information, indeed. I observed the
problem when minidlna rebuilt its index database, which scans all media
files, many of them GBytes in length and sequentially written. This is a
typical scenario that should trigger the code you point at.

The comment explains that an attempt has been made to spread the (read)
load more evenly, if large files are sequentially written:

* If all data stored spans all columns, there's a danger that parity
* will always be on the same device and, since parity isn't read
* during normal operation, that that device's I/O bandwidth won't be
* used effectively. We therefore switch the parity every 1MB.

But they later found that they had failed to implement a good solution:

* ... at least that was, ostensibly, the theory. As a practical
* matter unless we juggle the parity between all devices evenly, we
* won't see any benefit. Further, occasional writes that aren't a
* multiple of the LCM of the number of children and the minimum
* stripe width are sufficient to avoid pessimal behavior.

But I do not understand the reasoning behind:

* Unfortunately, this decision created an implicit on-disk format
* requirement that we need to support for all eternity, but only
* for single-parity RAID-Z.

I see how the devidx and offset are swapped between col[0] and col[1],
and it appears that this swapping is not explicitly reflected in the
meta data. But there is no reason, that the algorithm could not be
modified to cover all drives, if some flag is set (which effectively
would lead to a 2nd generation raidz1 with incompatible block layout).
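
The swap of col[0] and col[1] described here can be modelled in a few lines. This is a simplified Python sketch based only on the description in this thread, not a port of vdev_raidz_map_alloc(). It assumes that parity normally sits in the stripe's first column, that the swap is keyed on bit 20 of the I/O offset (hence "every 1MB"), and that every block starts at the same column. The names `parity_device` and `parity_histogram` are hypothetical.

```python
def parity_device(io_offset, first_col, ndisks):
    """Device index holding parity in a simplified single-parity model.

    Parity normally sits in the stripe's first column. When bit 20 of the
    offset is set (every other 1 MiB region), columns 0 and 1 are swapped,
    so parity moves to the next device.
    """
    if io_offset & (1 << 20):
        return (first_col + 1) % ndisks
    return first_col


def parity_histogram(ndisks, block_size, nblocks, first_col=0):
    """Count how often each device holds parity for sequential blocks.

    Assumes (for illustration) that block i is written at offset
    i * block_size and always starts at column `first_col`.
    """
    counts = [0] * ndisks
    for i in range(nblocks):
        counts[parity_device(i * block_size, first_col, ndisks)] += 1
    return counts


if __name__ == "__main__":
    # 64 sequential 128 KiB blocks (8 MiB) on a 4-disk set.
    print(parity_histogram(ndisks=4, block_size=128 * 1024, nblocks=64))
```

In this model the parity ends up split evenly between two devices while the other two never hold parity, which is the pattern a toggle between only two columns would produce for long sequential writes.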

Anyway, I do not think that the current behavior is so bad, that it
needs immediate fixing.
Post by Dan Nelson
two disks assigned to the zvol, since stripes don't have to span all disks
as long as there's one parity block (a small sync write may just hit two
disks, essentially being written mirrored). The imbalance is only visible
if you're writing full-width stripes in sequence, so if you write a 1TB file
in one long stream, chances are that that file's parity blocks will be
concentrated on just two disks, so those two disks will get less I/O on
later reads. I don't know why the code toggles parity between just the
first two columns; rotating it between all columns would give you an even
balance.
Yes, but as the comment indicates, this would require introduction of a
different raidz1 (a higher ZFS revision or a flag could trigger that).
Post by Dan Nelson
Is it always the last two disks that have less load, or does it slowly
rotate to different disks depending on the data that you are reading? An
interesting test would be to idle the system, run a "tar cvf /dev/null
/raidz1" in one window, and watch iostat output on another window. If the
load moves from disk to disk as tar reads different files, then my parity
guess is probably right. If ada0 and ada1 are always busier, then you can
ignore me :)
Yes, you are perfectly right! I tested the tar on a spool directory
holding DVB-C recordings (typical file length 2GB to 8GB). The

dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 935 921 40216 0.4 13 139 0.5 32.8| ada0
0 927 913 36530 0.3 13 108 1.5 31.8| ada1
0 474 460 20110 0.7 14 141 0.9 32.4| ada2
0 474 461 20102 0.7 13 141 0.7 31.6| ada3

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 1046 1041 45503 0.3 5 35 0.9 31.5| ada0
0 1039 1035 41353 0.3 4 23 0.4 31.6| ada1
0 531 526 22827 0.6 5 38 0.4 33.4| ada2
1 523 518 22772 0.6 5 38 0.6 30.8| ada3

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 384 377 16414 0.8 7 46 3.3 30.2| ada0
0 380 373 15857 0.8 6 42 0.4 30.5| ada1
0 553 547 23937 0.5 6 47 1.7 28.0| ada2
1 551 545 22004 0.6 6 38 0.7 32.2| ada3

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 667 656 28633 0.4 11 123 0.6 29.6| ada0
1 660 650 26010 0.5 10 109 0.6 33.4| ada1
0 338 327 14328 0.8 11 126 0.9 25.7| ada2
0 339 328 14303 1.0 11 120 1.0 32.7| ada3

$ iostat -d -n4 3
ada0 ada1 ada2 ada3
KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s
44.0 860 36.94 40.0 860 33.60 44.0 429 18.44 44.0 431 18.50
43.9 814 34.86 39.9 813 31.67 43.8 408 17.45 43.7 408 17.44
43.4 900 38.10 39.4 899 34.64 42.5 463 19.18 42.7 459 19.14
44.0 904 38.86 40.0 904 35.33 44.0 453 19.44 44.0 452 19.42
! 43.1 571 24.01 41.5 571 23.17 43.4 799 33.85 40.0 801 31.27
! 44.0 461 19.79 44.0 460 19.74 44.0 920 39.52 40.0 920 35.93
! 43.9 435 18.65 43.9 435 18.68 44.0 868 37.29 40.0 868 33.91
! 42.8 390 16.29 42.8 390 16.28 43.4 765 32.42 39.4 767 29.48
! 44.0 331 14.22 44.0 329 14.12 44.0 659 28.32 40.0 659 25.75
! 41.8 332 13.55 42.1 326 13.38 42.9 640 26.84 39.0 640 24.38
44.0 452 19.40 42.2 451 18.58 44.0 597 25.66 40.7 595 23.65
= 42.3 589 24.33 39.8 585 22.75 42.1 562 23.14 39.7 561 21.77
= 43.0 569 23.93 40.8 570 22.72 43.0 641 26.95 40.1 642 25.14
44.0 709 30.48 40.9 710 28.41 44.0 607 26.10 41.8 606 24.73
44.0 785 33.73 40.6 784 31.07 44.0 567 24.36 42.4 568 23.50
44.0 899 38.62 40.0 899 35.11 44.0 449 19.30 44.0 450 19.32
44.0 881 37.87 40.0 881 34.43 44.0 441 18.94 44.0 441 18.93
43.4 841 35.61 39.4 841 32.37 42.7 428 17.87 42.7 428 17.84

Hmmm, looking back through hundreds of lines of iostat output I see that
ada0 and ada1 see similar request rates, as do ada2 and ada3.
But I know that I also observed other combinations on earlier tests
(with different data?).
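
The read shares in the tar test can be compared with a simple expectation. If parity for this data sat on only two of the four disks, alternating evenly, each of those two would be read for half of the stripes and the other two for all of them, giving shares of 1/6, 1/6, 1/3 and 1/3. The Python sketch below uses the r/s values from the first gstat sample of the tar test; the helper name `read_share` is made up for this illustration.

```python
# r/s values for ada0..ada3 from the first gstat sample of the tar test.
READS_PER_SEC = {"ada0": 921, "ada1": 913, "ada2": 460, "ada3": 461}


def read_share(reads):
    """Return each drive's fraction of the total read requests."""
    total = sum(reads.values())
    return {name: value / total for name, value in reads.items()}


if __name__ == "__main__":
    for name, share in read_share(READS_PER_SEC).items():
        print(f"{name}: {share:.1%}")
```

The measured shares come out within about half a percentage point of 1/3 for ada0 and ada1 and of 1/6 for ada2 and ada3, which fits the two-disk parity explanation for this particular sample.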
Post by Dan Nelson
Since it looks like the algorithm ends up creating two half-cold parity
disks instead of one cold disk, I bet a 3-disk RAIDZ would exhibit even
worse balancing, and a 5-disk set would be more even.
Yes, this sounds very reasonable. Some iostat results were posted for
a 6 disk raidz1, but they were for writes, not reads. I've kept the
3*1TB drives that formed the pool before I replaced them by 4*2TB.
I can create a 3 drive raidz1 on them and perform some tests ...


BTW: Read throughput in the tar test was far lower than I had expected.
The CPU load was 3% user and some 0.2% system time (on an i2600K) and the
effective transfer speed of the RAID was only some 115MB/s.
The pool has 1/3 empty space and the test files were written in one go
and should have been laid out in an optimal way.

A dd of a large file (~10GB) gives similar results, independently of the
block size (128k vs. 1m).

Transfer sizes were only 43KB on average, which matches MAXPHYS=128KB
distributed over 3 drives (plus parity in case of writes). This
indicates, that in order to be able to read MAXPHYS bytes from each
drive, the original request size should have covered 3*MAXPHYS.
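
The per-drive transfer size can be worked out by splitting one block into whole sectors across the data columns. This is a minimal Python sketch, assuming a 128 KiB block and three data columns as described above. The sector size is a parameter because the thread does not state it, and the function name `column_sizes` is invented for this example.

```python
def column_sizes(block_bytes, sector_bytes, data_cols):
    """Split one block into per-column byte counts, in whole sectors.

    Sectors are dealt out as evenly as possible; the first `remainder`
    columns receive one extra sector.
    """
    sectors, leftover = divmod(block_bytes, sector_bytes)
    if leftover:
        sectors += 1  # a partial sector still occupies a whole one
    base, remainder = divmod(sectors, data_cols)
    return [(base + (1 if c < remainder else 0)) * sector_bytes
            for c in range(data_cols)]


if __name__ == "__main__":
    for sector in (512, 4096):
        sizes_kib = [size / 1024 for size in column_sizes(128 * 1024, sector, 3)]
        print(f"{sector}-byte sectors: {sizes_kib} KiB per data column")
```

With 512-byte sectors this gives 43.0, 42.5 and 42.5 KiB per column; with 4 KiB sectors it gives 44, 44 and 40 KiB. The KB/t values of 44.0 and 40.0 in the iostat output would be consistent with the 4 KiB case, but that is an inference, not something confirmed in the thread.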

But the small transfer length does not seem to be the cause of the low
transfer rate:

# dd if=/dev/ada2p2 of=/dev/null bs=10k count=10000
10000+0 records in
10000+0 records out
102400000 bytes transferred in 0.853374 secs (119994281 bytes/sec)

# dd if=/dev/ada1p2 of=/dev/null bs=2k count=50000
50000+0 records in
50000+0 records out
102400000 bytes transferred in 2.668089 secs (38379531 bytes/sec)

Even a block size of 2KB will result in 35-40MB/s read throughput ...

Any idea, why the read performance is so much lower than possible given
the hardware?

Regards, STefan
Garrett Cooper
2011-12-19 17:05:20 UTC
Permalink
Post by Stefan Esser
Hi ZFS users,
for quite some time I have observed an uneven distribution of load
between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of
dT: 10.001s  w: 10.000s  filter: ^a?da?.$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   0    130    106   4134    4.5     23   1033    5.2   48.8| ada0
   0    131    111   3784    4.2     19   1007    4.0   47.6| ada1
   0     90     66   2219    4.5     24   1031    5.1   31.7| ada2
   1     81     58   2007    4.6     22   1023    2.3   28.1| ada3
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   1    132    104   4036    4.2     27   1129    5.3   45.2| ada0
   0    129    103   3679    4.5     26   1115    6.8   47.6| ada1
   1     91     61   2133    4.6     30   1129    1.9   29.6| ada2
   0     81     56   1985    4.8     24   1102    6.0   29.4| ada3
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
   1    148    108   4084    5.3     39   2511    7.2   55.5| ada0
   1    141    104   3693    5.1     36   2505   10.4   54.4| ada1
   1    102     62   2112    5.6     39   2508    5.5   35.4| ada2
   0     99     60   2064    6.0     39   2483    3.7   36.1| ada3
This suggests (note that I said suggests) that there might be a slight
difference in the data path speeds or physical media as someone else
suggested; look at zpool iostat -v <interval> though before making a
firm statement as to whether or not a drive is truly not performing to
your assumed spec. gstat and zpool iostat -v suggest performance
though -- they aren't the end-all-be-all for determining drive
performance.

If the latency numbers were high enough, I would suggest dd'ing out to
the individual drives (i.e. remove the drive from the RAIDZ) to see if
there's a noticeable discrepancy, as this can indicate a bad cable,
backplane, or drive; from there I would start doing the physical swap
routine and see if the issue moves with the drive or stays static with
the controller channel and/or chassis slot.

Cheers,
-Garrett
Stefan Esser
2011-12-19 20:54:10 UTC
Permalink
Post by Garrett Cooper
Post by Stefan Esser
Hi ZFS users,
for quite some time I have observed an uneven distribution of load
between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of
dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 130 106 4134 4.5 23 1033 5.2 48.8| ada0
0 131 111 3784 4.2 19 1007 4.0 47.6| ada1
0 90 66 2219 4.5 24 1031 5.1 31.7| ada2
1 81 58 2007 4.6 22 1023 2.3 28.1| ada3
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 132 104 4036 4.2 27 1129 5.3 45.2| ada0
0 129 103 3679 4.5 26 1115 6.8 47.6| ada1
1 91 61 2133 4.6 30 1129 1.9 29.6| ada2
0 81 56 1985 4.8 24 1102 6.0 29.4| ada3
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 148 108 4084 5.3 39 2511 7.2 55.5| ada0
1 141 104 3693 5.1 36 2505 10.4 54.4| ada1
1 102 62 2112 5.6 39 2508 5.5 35.4| ada2
0 99 60 2064 6.0 39 2483 3.7 36.1| ada3
This suggests (note that I said suggests) that there might be a slight
difference in the data path speeds or physical media as someone else
suggested; look at zpool iostat -v <interval> though before making a
firm statement as to whether or not a drive is truly not performing to
your assumed spec. gstat and zpool iostat -v suggest performance
though -- they aren't the end-all-be-all for determining drive
performance.
I doubt there is a difference in the data path speeds, since all drives
are connected to the SATA II ports of an Intel H67 chip.

The drives seem to perform equally well, just with a ratio of read
requests of 30% / 30% / 20% / 20% for ada0 .. ada3. But neither queue
length nor command latencies indicate a problem or differences in the
drives. It seems that a different number of commands is scheduled for 2
of the 4 drives, compared to the other 2, and that scheduling should be
part of the ZFS code. I'm quite convinced, that neither the drives nor
the other hardware plays a role, but I'll follow the suggestion to swap
drives between controller ports and to observe whether the increased
read load moves with the drives (indicating something on disk causes the
anomaly) or stays with the SATA ports (indicating that lower numbered
ports see higher load).
Post by Garrett Cooper
If the latency numbers were high enough, I would suggest dd'ing out to
the individual drives (i.e. remove the drive from the RAIDZ) to see if
there's a noticeable discrepancy, as this can indicate a bad cable,
backplane, or drive; from there I would start doing the physical swap
routine and see if the issue moves with the drive or stays static with
the controller channel and/or chassis slot.
I do not expect a hardware problem, since command latencies are very
similar over all drives, despite the higher read load on some of them.
These are busier by exactly the factor to be expected from the higher
command rate alone.

But it seems that others do not observe the asymmetric distribution of
requests, which makes me wonder whether I happen to have meta data
arranged in such a way that it is always read from ada0 or ada1, but not
(or rarely) from ada2 or ada3. That could explain it, including the fact
that raidz1 over other numbers of drives (e.g. 3 or 6) apparently show a
much more symmetric distribution of read requests.

Regards, STefan
Garrett Cooper
2011-12-19 21:00:18 UTC
Permalink
Post by Stefan Esser
Post by Garrett Cooper
Post by Stefan Esser
Hi ZFS users,
for quite some time I have observed an uneven distribution of load
between drives in a 4 * 2TB RAIDZ1 pool. The following is an excerpt of
dT: 10.001s w: 10.000s filter: ^a?da?.$
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 130 106 4134 4.5 23 1033 5.2 48.8| ada0
0 131 111 3784 4.2 19 1007 4.0 47.6| ada1
0 90 66 2219 4.5 24 1031 5.1 31.7| ada2
1 81 58 2007 4.6 22 1023 2.3 28.1| ada3
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 132 104 4036 4.2 27 1129 5.3 45.2| ada0
0 129 103 3679 4.5 26 1115 6.8 47.6| ada1
1 91 61 2133 4.6 30 1129 1.9 29.6| ada2
0 81 56 1985 4.8 24 1102 6.0 29.4| ada3
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 148 108 4084 5.3 39 2511 7.2 55.5| ada0
1 141 104 3693 5.1 36 2505 10.4 54.4| ada1
1 102 62 2112 5.6 39 2508 5.5 35.4| ada2
0 99 60 2064 6.0 39 2483 3.7 36.1| ada3
This suggests (note that I said suggests) that there might be a slight
difference in the data path speeds or physical media as someone else
suggested; look at zpool iostat -v <interval> though before making a
firm statement as to whether or not a drive is truly not performing to
your assumed spec. gstat and zpool iostat -v suggest performance
though -- they aren't the end-all-be-all for determining drive
performance.
I doubt there is a difference in the data path speeds, since all drives
are connected to the SATA II ports of an Intel H67 chip.
The drives seem to perform equally well, just with a ratio of read
requests of 30% / 30% / 20% / 20% for ada0 .. ada3. But neither queue
length nor command latencies indicate a problem or differences in the
drives. It seems that a different number of commands is scheduled for 2
of the 4 drives, compared to the other 2, and that scheduling should be
part of the ZFS code. I'm quite convinced, that neither the drives nor
the other hardware plays a role, but I'll follow the suggestion to swap
drives between controller ports and to observe whether the increased
read load moves with the drives (indicating something on disk causes the
anomaly) or stays with the SATA ports (indicating that lower numbered
ports see higher load).
Post by Garrett Cooper
If the latency numbers were high enough, I would suggest dd'ing out to
the individual drives (i.e. remove the drive from the RAIDZ) to see if
there's a noticeable discrepancy, as this can indicate a bad cable,
backplane, or drive; from there I would start doing the physical swap
routine and see if the issue moves with the drive or stays static with
the controller channel and/or chassis slot.
I do not expect a hardware problem, since command latencies are very
similar over all drives, despite the higher read load on some of them.
These are busier by exactly the factor to be expected from the higher
command rate alone.
But it seems that others do not observe the asymmetric distribution of
requests, which makes me wonder whether I happen to have meta data
arranged in such a way that it is always read from ada0 or ada1, but not
(or rarely) from ada2 or ada3. That could explain it, including the fact
that raidz1 over other numbers of drives (e.g. 3 or 6) apparently show a
much more symmetric distribution of read requests.
Basic question: does one set of drives vibrate differently than the other set?
-Garrett
Stefan Esser
2011-12-19 21:26:18 UTC
Permalink
Post by Garrett Cooper
Post by Stefan Esser
But it seems that others do not observe the asymmetric distribution of
requests, which makes me wonder whether I happen to have meta data
arranged in such a way that it is always read from ada0 or ada1, but not
(or rarely) from ada2 or ada3. That could explain it, including the fact
that raidz1 over other numbers of drives (e.g. 3 or 6) apparently show a
much more symmetric distribution of read requests.
Basic question: does one set of drives vibrate differently than the other set?
No: All drives are mounted in similar cages in a midi tower case (and
since I did not like the temperature rising to 45C, last summer, I added
case fans to keep the temperature of all drives equally low, too).

But I'll try swapping drives (or rather SATA ports) tomorrow. If the
drives are different (hardware or data on the drives), then the higher
load will move, but if it's in the ZFS code, then I expect the higher
request rate to stay on the first two drives. I'll report the outcome.

(And repeating what I wrote before: The drives seem to behave perfectly
well, they do just receive different numbers of read requests although
the pool appears to be symmetric with regard to all factors that could
have an impact. I really doubt this is caused by hardware, else there
would be observable differences in latency or queue length.)

Regards, STefan
Daniel Kalchev
2011-12-19 18:03:49 UTC
Permalink
I have observed similar behavior, even more extreme, on a zpool with dedup enabled. Is dedup enabled on this zpool?

Might be that the DDT tables somehow end up unevenly distributed to disks. My observation was on a 6 disk raidz2.

Daniel
Stefan Esser
2011-12-19 21:00:20 UTC
Permalink
Post by Daniel Kalchev
I have observed similar behavior, even more extreme, on a zpool with dedup enabled. Is dedup enabled on this zpool?
Thank you for the report!

Well, I had dedup enabled for a few short tests. But since I have got
"only" 8GB of RAM and dedup seems to require an order of magnitude more
to be working well, I switched dedup off again after a few hours.
Post by Daniel Kalchev
Might be that the DDT tables somehow end up unevenly distributed to disks. My observation was on a 6 disk raidz2.
Hmmm, there was another report of even distribution of load on a 6 disk
raidz1 (but in fact, in that case the first half seems to have got some
10% to 15% higher load than the second half; the sixth drive showed quite
different queue length and latencies and I think these might be caused
either by a defect (soft-errors) or another partition being actively
used only on that drive).

Regards, STefan
Daniel Kalchev
2011-12-19 21:07:36 UTC
Permalink
Post by Stefan Esser
Post by Daniel Kalchev
I have observed similar behavior, even more extreme, on a zpool with dedup enabled. Is dedup enabled on this zpool?
Thank you for the report!
Well, I had dedup enabled for a few short tests. But since I have got
"only" 8GB of RAM and dedup seems to require an order of magnitude more
to be working well, I switched dedup off again after a few hours.
You will need to get rid of the DDT, as it is still read even with dedup (already) disabled. The tables refer to already deduped data.

In my case, I had about 2-3TB of deduped data, with 24GB RAM. There was no shortage of RAM and I could not confirm that the ARC was full, but somehow the pool was placing heavy read load on one or two disks only (all others nearly idle) -- apparently many small reads.

I resolved my issue by copying the data to a newly created filesystem in the same pool -- luckily there was enough space available, then removing the 'deduped' filesystems.

That last operation was particularly slow and at one point I had a spontaneous reboot -- the pool was 'impossible to mount', and as weird as it sounds, I had 'out of swap space' killing the 'zpool list' process.
I let it sit for a few hours, until it had cleared itself.

I/O in that pool is back to normal now.

There is something terribly wrong with the dedup code.

Well, if your test data is not valuable, you can just delete it. :)

Daniel
Garrett Cooper
2011-12-19 21:14:50 UTC
Permalink
Post by Daniel Kalchev
Post by Stefan Esser
Post by Daniel Kalchev
I have observed similar behavior, even more extreme, on a zpool with dedup enabled. Is dedup enabled on this zpool?
Thank you for the report!
Well, I had dedup enabled for a few short tests. But since I have got
"only" 8GB of RAM and dedup seems to require an order of magnitude more
to be working well, I switched dedup off again after a few hours.
You will need to get rid of the DDT (dedup tables), as they are still read even after dedup has been disabled. The tables refer to data that has already been deduped.
In my case, I had about 2-3TB of deduped data, with 24GB RAM. There was no shortage of RAM and I could not confirm that the ARC was full, but somehow the pool was placing a heavy read load on only one or two disks (all others nearly idle) -- apparently many small reads.
I resolved my issue by copying the data to a newly created filesystem in the same pool (luckily there was enough space available) and then removing the 'deduped' filesystems.
That last operation was particularly slow, and at one point I had a spontaneous reboot -- the pool was 'impossible to mount' and, as weird as it sounds, an 'out of swap space' condition killed the 'zpool list' process.
I let it sit for a few hours until it had cleared itself.
I/O in that pool is back to normal now.
There is something terribly wrong with the dedup code.
The ZFS manual claims that dedup needs 2GB of memory per TB of
data, but in reality it's closer to 5GB of memory per TB of data on
average. So if you turn it on for large datasets or pools and don't
limit the ARC, it ties your box in knots after it wires down all of
the physical memory (even when you're doing a reimport while it's
replaying the ZIL -- either on the array or on your dedicated ZIL
device). This of course causes your machine to dig into swap
and slow to a crawl, and/or blows away your userland (and now you're
pretty much SoL).
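
As a rough illustration of those two figures, here is a small sketch that assumes the per-TB ratios quoted above scale linearly with the amount of deduped data. That linearity is an assumption, the ratios are rules of thumb from this thread rather than measurements, and the function name is made up for the example.

```python
# Rule-of-thumb ratios quoted in this thread, not measured values.
DOCUMENTED_GB_PER_TB = 2.0  # figure attributed to the ZFS manual
OBSERVED_GB_PER_TB = 5.0    # "closer to 5GB of memory per TB" on average


def ddt_ram_estimate_gb(deduped_tb, gb_per_tb):
    """Return a linear estimate of the RAM (in GB) needed for the dedup table."""
    if deduped_tb < 0 or gb_per_tb < 0:
        raise ValueError("sizes must be non-negative")
    return deduped_tb * gb_per_tb


if __name__ == "__main__":
    # 2 TB and 3 TB bracket the "2-3TB" of deduped data mentioned earlier.
    for tb in (2.0, 3.0):
        low = ddt_ram_estimate_gb(tb, DOCUMENTED_GB_PER_TB)
        high = ddt_ram_estimate_gb(tb, OBSERVED_GB_PER_TB)
        print(f"{tb:.0f} TB deduped: {low:.0f} GB (manual) to {high:.0f} GB (observed)")
```

Under these assumptions, 3TB of deduped data needs about 15GB at the 5GB/TB ratio, which fits in 24GB of RAM but not in 8GB.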

Bottom line is that dedup is a poorly articulated feature and causes
lots of issues if enabled. Compression is a much better feature to
enable.
Post by Daniel Kalchev
Well, if your test data is not valuable, you can just delete it. :)
+1, but I suggest limiting the ARC first.
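
For reference, a sketch of how such a limit could be written down, assuming the FreeBSD loader tunable vfs.zfs.arc_max set in /boot/loader.conf. The 4 GiB value below is only an illustrative choice, not a recommendation, and the helper name is hypothetical.

```python
def arc_max_loader_line(limit_gib):
    """Format a /boot/loader.conf line capping the ARC at limit_gib GiB.

    The value is emitted in bytes; the size itself is the caller's choice.
    """
    if limit_gib <= 0:
        raise ValueError("limit must be positive")
    return f'vfs.zfs.arc_max="{int(limit_gib * 1024 ** 3)}"'


if __name__ == "__main__":
    # Illustrative only: cap the ARC at 4 GiB.
    print(arc_max_loader_line(4))
```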

Cheers,
-Garrett
Stefan Esser
2011-12-19 21:34:35 UTC
Permalink
Post by Daniel Kalchev
Post by Stefan Esser
Well, I had dedup enabled for a few short tests. But since I have got
"only" 8GB of RAM and dedup seems to require an order of magnitude more
to be working well, I switched dedup off again after a few hours.
You will need to get rid of the DDT (dedup tables), as they are still
read even after dedup has been disabled. The tables refer to data that
has already been deduped.
Thanks for the hint!

Is there an easy way to identify the file systems that ever had dedup
enabled? (I don't mind extracting the information from zdb output, in
case that is the UI of choice.)

I seem to remember that I tried it with my /usr/svn (which obviously had
lots of duplicated files), but I do not remember on which other file
systems I tried it ... (I've created some 20-25 filesystems on this pool.)
Post by Daniel Kalchev
In my case, I had about 2-3TB of deduped data, with 24GB RAM. There
was no shortage of RAM and I could not confirm that the ARC was full,
but somehow the pool was placing a heavy read load on only one or two
disks (all others nearly idle) -- apparently many small reads.
I resolved my issue by copying the data to a newly created filesystem
in the same pool (luckily there was enough space available) and then
removing the 'deduped' filesystems.
This should be easy in the case of /usr/svn, thanks for the suggestion!
Post by Daniel Kalchev
That last operation was particularly slow, and at one point I had a
spontaneous reboot -- the pool was 'impossible to mount' and, as
weird as it sounds, an 'out of swap space' condition killed the
'zpool list' process.
I let it sit for a few hours until it had cleared itself.
I/O in that pool is back to normal now.
There is something terribly wrong with the dedup code.
Well, if your test data is not valuable, you can just delete it. :)
I could also start over with a clean SVN check-out, but since I've got
the free disk space to copy the data over, I'll try that first.

Thanks again and best regards, STefan