Discussion:
ZFS buggy in CURRENT? Stuck in [zio->io_cv] forever!
O. Hartmann
2013-10-27 12:40:39 UTC
Permalink
I have set up a RAIDZ pool comprising four 3 TB HDDs. To maintain 4k
block alignment, I followed the instructions given on several sites and
I'll sketch them here for the record.

The operating system is 11.0-CURRENT and 10.0-BETA2.

Create a GPT scheme on each drive and add one partition covering the
whole disk via

gpart add -t freebsd-zfs -b 1M -l disk0[0-3] ada[3-6]

gnop create -S4096 gpt/disk0[0-3]

Because I added a disk to an existing RAIDZ, I exported the former
ZFS pool, then deleted the partition on each disk and destroyed the
GPT scheme. The former pool had a ZIL and CACHE residing on the same
SSD, partitioned. I didn't kill or destroy the partitions on that SSD.
To align to 4k blocks, I also created NOP overlays on the existing
gpt/log00 and gpt/cache00 via

gnop create -S4096 gpt/log00|gpt/cache00

After that, I created a new pool via

zpool create POOL gpt/disk0[0-3].nop log gpt/log00.nop cache gpt/cache00.nop
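
For reference, a minimal sketch of the whole procedure described above,
assuming the device names and labels used in this post (ada3-ada6,
gpt/disk00-disk03, gpt/log00, gpt/cache00); the "raidz" keyword reflects
the correction in the follow-up message:

# repeat the gpart/gnop steps for ada4-ada6 / disk01-disk03
gpart create -s gpt ada3
gpart add -t freebsd-zfs -b 1M -l disk00 ada3
gnop create -S4096 gpt/disk00 gpt/log00 gpt/cache00

zpool create POOL raidz gpt/disk00.nop gpt/disk01.nop gpt/disk02.nop gpt/disk03.nop \
    log gpt/log00.nop cache gpt/cache00.nop

# later, to drop the .nop overlays again (the pool keeps its ashift):
zpool export POOL
gnop destroy gpt/disk00.nop gpt/disk01.nop gpt/disk02.nop gpt/disk03.nop \
    gpt/log00.nop gpt/cache00.nop
zpool import POOL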

I then "received" a snapshot that had earlier been taken and sent to
another storage array, once the newly created pool didn't show any
signs of illness or corruption.

After ~10 hours of receiving the backup, I exported that pool along
with the backup pool, destroyed the appropriate .nop device entries via

gnop destroy gpt/disk0[0-3].nop

and the same for cache and log, and tried to check via

zpool import

whether my pool (as well as the backup pool) shows up. And here the
nasty mess starts!

The "zpool import" command issued on the console has now been stuck for
hours and cannot be interrupted via Ctrl-C! No pool shows up! Hitting
Ctrl-T shows a state like

... cmd: zpool 4317 [zio->io_cv]: 7345.34r 0.00 [...]
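
Not part of the original report, but for debugging a hang like this one
could grab the kernel stack and wait channel of the stuck process (PID
4317 as shown in the Ctrl-T output above):

procstat -kk 4317    # kernel stack trace of the stuck zpool process
ps -l -p 4317        # MWCHAN column shows the wait channel (here zio->io_cv)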

Looking with

systat -vm 1

at the throughput of the CAM devices, I realise that two of the four
drives comprising the RAIDZ show activity, at 7000-8000 tps and ~30 MB/s
bandwidth, while the other two show zero!
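
gstat(8) or iostat(8) give a similar per-device view; a small sketch,
not commands taken from the original post:

gstat -p          # per-provider (physical disk) I/O statistics
iostat -x -w 1    # extended device statistics, refreshed every second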

And the pool is still inactive, the console is stuck.

Well, this made my day! At this point I am trying to understand what is
going wrong and to recall what I did differently the last time, when the
same procedure with three disks on the same hardware worked for me.

Now, after a 10-hour copy orgy and with the need for a working array, I
am starting to believe that ZFS on FreeBSD is still peppered with too
many development-like flaws, making it risky. The colleagues working
with ZFS on Solaris whom I consulted have never seen the kind of stuck
behaviour I am seeing at this moment.

I do not want to repeat the procedure again. There must be a way to
import the pool - even the backup pool, which is working and untouched
by all of this, should be importable - but it isn't. While this "zpool
import" command is still blocking the console, unwilling to die even
with "killall -9 zpool", I cannot import the backup pool via "zpool
import BACKUP00". The console gets stuck immediately, and for eternity,
without any notice. Hitting Ctrl-T says something like

load: 3.59 cmd: zpool 46199 [spa_namespace_lock] 839.18r 0.00u 0.00s
0% 3036k

which means I cannot even import the backup facility, and that is
really no fun.
O. Hartmann
2013-10-27 15:10:26 UTC
Permalink
On Sun, 27 Oct 2013 13:40:39 +0100
Post by O. Hartmann
After that, I created a new pool via
zpool create POOL gpt/disk0[0-3].nop log gpt/log00.nop cache gpt/cache00.nop
It is, of course, a "zpool create POOL raidz ..."
Steven Hartland
2013-10-27 16:32:13 UTC
Permalink
----- Original Message -----
Post by O. Hartmann
I have set up a RAIDZ pool comprising four 3 TB HDDs. To maintain 4k
block alignment, I followed the instructions given on several sites and
I'll sketch them here for the record.
The operating system is 11.0-CURRENT and 10.0-BETA2.
Create a GPT scheme on each drive and add one partition covering the
whole disk via
gpart add -t freebsd-zfs -b 1M -l disk0[0-3] ada[3-6]
gnop create -S4096 gpt/disk0[0-3]
Because I added a disk to an existing RAIDZ, I exported the former
ZFS pool, then deleted the partition on each disk and destroyed the
GPT scheme. The former pool had a ZIL and CACHE residing on the same
SSD, partitioned. I didn't kill or destroy the partitions on that SSD.
To align to 4k blocks, I also created NOP overlays on the existing
gpt/log00 and gpt/cache00 via
gnop create -S4096 gpt/log00|gpt/cache00
After that, I created a new pool via
zpool create POOL gpt/disk0[0-3].nop log gpt/log00.nop cache gpt/cache00.nop
You don't need any of the nop hacks in 10 or 11 any more, as it has
proper sector size detection. The caveat is when you have a disk which
advertises 512b sectors but is really 4k and we don't have a 4k quirk
in the kernel for it yet.

If anyone comes across a case of this, feel free to drop me the
details from camcontrol <identify|inquiry> <device>
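
A sketch of how one might collect that, using ada3 from the reply
further down; camcontrol shows what the drive advertises, diskinfo what
GEOM has derived from it:

camcontrol identify ada3 | grep 'sector size'   # logical vs. physical sector size
diskinfo -v /dev/ada3                           # check the "stripesize" line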

If due to this you still need to use the gnop hack, then you only need
to apply it to one device, as zpool create uses the largest ashift
from the disks.
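
A sketch of that single-.nop variant, using the labels from the first
post; zdb can then confirm the resulting alignment:

gnop create -S4096 gpt/disk00
zpool create POOL raidz gpt/disk00.nop gpt/disk01 gpt/disk02 gpt/disk03
zdb -C POOL | grep ashift    # 12 means 4k-aligned, 9 means 512b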

I would then, as the very first step, export and import, as at this
point there is much less data on the devices to scan through - not that
this should be needed, but...
Post by O. Hartmann
I then "received" a snapshot that had earlier been taken and sent to
another storage array, once the newly created pool didn't show any
signs of illness or corruption.
After ~10 hours of receiving the backup, I exported that pool along
with the backup pool, destroyed the appropriate .nop device entries via
gnop destroy gpt/disk0[0-3].nop
and the same for cache and log, and tried to check via
zpool import
whether my pool (as well as the backup pool) shows up. And here the
nasty mess starts!
The "zpool import" command issued on the console has now been stuck for
hours and cannot be interrupted via Ctrl-C! No pool shows up! Hitting
Ctrl-T shows a state like
... cmd: zpool 4317 [zio->io_cv]: 7345.34r 0.00 [...]
Looking with
systat -vm 1
at the throughput of the CAM devices, I realise that two of the four
drives comprising the RAIDZ show activity, at 7000-8000 tps and ~30 MB/s
bandwidth, while the other two show zero!
And the pool is still inactive, the console is stuck.
Well, this made my day! At this point I am trying to understand what is
going wrong and to recall what I did differently the last time, when the
same procedure with three disks on the same hardware worked for me.
Now, after a 10-hour copy orgy and with the need for a working array, I
am starting to believe that ZFS on FreeBSD is still peppered with too
many development-like flaws, making it risky. The colleagues working
with ZFS on Solaris whom I consulted have never seen the kind of stuck
behaviour I am seeing at this moment.
While we only run 8.3-RELEASE currently, as we've decided to skip 9.X
and move straight to 10 once we've tested it, we've found ZFS to be not
only very stable but also critical to the way we run things.
Post by O. Hartmann
I do not want to repeat the procedure again. There must be a way to
import the pool - even the backup pool, which is working and untouched
by all of this, should be importable - but it isn't. While this "zpool
import" command is still blocking the console, unwilling to die even
with "killall -9 zpool", I cannot import the backup pool via "zpool
import BACKUP00". The console gets stuck immediately, and for eternity,
without any notice. Hitting Ctrl-T says something like
load: 3.59 cmd: zpool 46199 [spa_namespace_lock] 839.18r 0.00u 0.00s
0% 3036k
which means I cannot even import the backup facility, and that is
really no fun.
I'm not sure there's enough information here to determine where any
issue may lie, but as a guess it could be that ZFS is having trouble
locating the one changed device and is scanning the entire disk to try
to determine that. This would explain the I/O on the one device but not
the others.

Did you perchance have one of the disks in use for something else, so
that it may have old label information on it that wasn't cleaned down?
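
Not from the original mail, but a sketch of how stale labels can be
wiped before a provider is reused (destructive, so only on disks whose
contents are disposable; the provider name is just an example):

zpool labelclear -f /dev/gpt/disk03                # remove any old ZFS label
dd if=/dev/zero of=/dev/gpt/disk03 bs=1m count=1   # belt and braces: zero the start as well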

Regards
Steve

O. Hartmann
2013-10-27 23:30:30 UTC
Permalink
On Sun, 27 Oct 2013 16:32:13 -0000
Post by Steven Hartland
----- Original Message -----
Post by O. Hartmann
After that, I created a new pool via
zpool create POOL gpt/disk0[0-3].nop log gpt/log00.nop cache gpt/cache00.nop
You don't need any of the nop hacks in 10 or 11 any more, as it has
proper sector size detection. The caveat is when you have a disk which
advertises 512b sectors but is really 4k and we don't have a 4k quirk
in the kernel for it yet.
Well, this is news to me.
Post by Steven Hartland
If anyone comes across a case of this, feel free to drop me the
details from camcontrol <identify|inquiry> <device>
camcontrol identify says this (serial numbers skipped):

ada3:
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 5860533168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM 5400

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload yes yes
free-fall no no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA) yes no 5860533168/5860533168
HPA - Security no


ada4/ada5/ada6:
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 5860533168 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby yes no
write-read-verify no no
unload no no
free-fall no no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA) yes no 5860533168/5860533168
HPA - Security no
Post by Steven Hartland
If due to this you still need to use the gnop hack, then you only need
to apply it to one device, as zpool create uses the largest ashift
from the disks.
This is also new to me and not obviously reported or well documented.
Post by Steven Hartland
I would then, as the very first step, export and import, as at this
point there is much less data on the devices to scan through - not that
this should be needed, but...
I performed the whole task again. The old pool was not destroyed, as I
could not import it anymore. So I deleted all the partitions, destroyed
the GPT scheme, and recreated the scheme as well as the partitions.

After the "receive" finished after 10 hours, I exported the backup pool
and the newly created pool. Now I'm trying to import the new pool again
- and I'm stuck again.

This crap is stuck, really stuck. I cannot

- kill the process
- shut down the server(!)

Yes, shutting down the server is stuck forever. The box doesn't go down,
and I could now wait days for it to go down because of ZFS.
Post by Steven Hartland
While we only run 8.3-RELEASE currently, as we've decided to skip 9.X
and move straight to 10 once we've tested it, we've found ZFS to be not
only very stable but also critical to the way we run things.
I have had no problems with the pool on 10.0-CURRENT, apart from it
reporting some issues with the block sizes and warnings about degraded
performance.

After I followed the steps for 4k block alignment, the RAIDZ worked as
well. It stopped working when the fourth HDD was introduced. The system
doesn't report any issues with the hard drive, and the disk itself is
healthy.
Post by Steven Hartland
I'm not sure there's enough information here to determine where any
issue may lie, but as a guess it could be that ZFS is having trouble
locating the one changed device and is scanning the entire disk to try
to determine that. This would explain the I/O on the one device but not
the others.
If there are issues, what kind of issues? I would expect this if there
were a possibility to add a drive "on the fly".

I was told I have to destroy and re-create the pool (RAIDZ) when I add
another drive. I did so. The pool was newly created and received a
former snapshot from a backup device via

zfs receive -vdF POOL00 < /path/to/backup.zfs
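
For completeness, a sketch of the full send/receive round trip implied
here; the snapshot and dataset names are made up, only the receive line
matches the post:

zfs snapshot -r BACKUP00/data@migrate                    # on the backup pool
zfs send -R BACKUP00/data@migrate > /path/to/backup.zfs
zfs receive -vdF POOL00 < /path/to/backup.zfs            # as above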
Post by Steven Hartland
Did you perchance have one of the disks in use for something else, so
that it may have old label information on it that wasn't cleaned down?
No. The disks are entirely ZFS only. One partition per disk.

If old labels are still there, then this must be considered a bug or
design flaw. I deleted everything with the command line tools I have (I
do not like this hoodoo-voodoo crap with dd ...). I cannot destroy a
pool that isn't imported.

What I see now, for the third time in a row, is that I can neither
reboot the box (shutdown gets stuck forever) nor kill the "zpool import"
job (issued to see which pools are available after I exported the newly
created and "received back" pool).

What I see is that two drives are busy, as reported earlier - and both
of those drives belong to the former pool. The new drive isn't involved:

[...]
Disks  ada0  ada1  ada2  ada3  ada4   ada5   ada6
KB/t   0.00  0.00 32.00  0.00  0.00   4.00   4.00
tps       0     0     6     0     0   7286   7287
MB/s   0.00  0.00  0.19  0.00  0.00  28.46  28.47
%busy     0     0     0     0     0     42     42

ada3-ada6 are the HDDs comprising the RAIDZ; ada0 is the cache/ZIL;
ada1 and ada2 are backup/system.

<SAMSUNG SSD 830 Series CXM03B1Q> at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD40EZRX-00SPEB0 80.00A80> at scbus1 target 0 lun 0 (ada1,pass1)
<SAMSUNG SSD 830 Series CXM03B1Q> at scbus2 target 0 lun 0 (ada2,pass2)
<WDC WD30EFRX-68EUZN0 80.00A80> at scbus3 target 0 lun 0 (ada3,pass3)
<WDC WD30EZRX-00DC0B0 80.00A80> at scbus4 target 0 lun 0 (ada4,pass4)
<WDC WD30EZRX-00DC0B0 80.00A80> at scbus5 target 0 lun 0 (ada5,pass5)
<WDC WD30EZRX-00DC0B0 80.00A80> at scbus6 target 0 lun 0 (ada6,pass6)

***@gate [src] zpool status
load: 0.20 cmd: zpool 4944 [spa_namespace_lock] 4.29r 0.00u 0.00s 0%
2572k

I cannot even interrupt this command. Everything that touches ZFS right
now locks up the system/console completely! How do I successfully kill
this command? I need control back. This is weird and unacceptable
behaviour.

The most frustrating thing is this total blockade of everything
regarding ZFS. This one(!) pool prevents, in a single-thread-giant-lock
manner, importing, checking on or even viewing other pools. This is an
absolute no-go for me on a server system. As I reported, not even a
shutdown is possible.
Post by Steven Hartland
Regards
Steve
Thanks anyway,

oh
O. Hartmann
2013-10-28 19:00:24 UTC
Permalink
On Sun, 27 Oct 2013 16:32:13 -0000
"Steven Hartland" <***@multiplay.co.uk> wrote:


Hello all,

after a third attempt, I realised that some remnant labels seem to have
caused the problem.

Those labels didn't go away with "zpool create -f" or "zpool labelclear
provider"; I had to issue "zpool labelclear -f provider" to ensure that
everything was cleared out.
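
A quick way to verify that a provider really is clean, assuming one of
the labels from earlier in the thread: zdb -l prints the (up to four)
ZFS labels on a device, or a failure message per label slot once they
are gone.

zdb -l /dev/gpt/disk00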

After the last unsuccessful attempt, I waited 14 hours for the "busy
drives" reported earlier; they still hadn't stopped doing something
after that time, so I rebooted the box.

Besides the confusion about how to use ZFS properly (I miss
documentation aimed at a normal user rather than at a core developer,
and several blogs carry outdated information), there is still this issue
of the whole system blocking nastily, solvable only by a hard reset.

After the pool had been successfully created and a snapshot had been
received via the -vdF option, re-importing the pool wasn't possible, as
described in my previous mails, and any attempt to list the pools
available for import (zpool import) ended up in a stuck console,
uninterruptible by kill or Ctrl-C. The damaged pool's drives showed some
activity, but even the pools considered unharmed didn't show up.

This total blockade also prevented the system from rebooting properly -
a "shutdown -r" or "reboot" ended up waiting for eternity after the last
buffers had been synced - only a power-off or full reset could bring the
box back to life. I think this is not intended and can be considered a
bug?

Thanks for the patience.

oh
Freddie Cash
2013-10-27 17:38:17 UTC
Permalink
Forgot to include the list in reply.
---------- Forwarded message ----------
From: "Freddie Cash" <***@gmail.com>
Date: Oct 27, 2013 10:36 AM
Subject: Re: ZFS buggy in CURRENT? Stuck in [zio->io_cv] forever!
To: "O. Hartmann" <***@zedat.fu-berlin.de>
Cc:

Did your recv complete before you exported the pool?

If not, then the import will "hang" until it has deleted the hidden
clone dataset for the aborted receive. Once all the blocks are freed
successfully, the import will complete and the pool will be usable
again.
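
Once such an import completes, the remaining background freeing can be
watched on pools that have the async_destroy feature enabled; a sketch
using the pool name from earlier in the thread:

zpool get freeing POOL00    # bytes still to be freed from destroyed/aborted datasets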