Discussion:
ZFS hangs/freezes after disk failure, resumes when disk is replaced
Todd H. Poole
2008-08-24 04:06:54 UTC
Permalink
Howdy y'all,

Earlier this month I downloaded and installed the latest copy of OpenSolaris (2008.05) so that I could test out some of the newer features I've heard so much about, primarily ZFS.

My goal was to replace our aging Linux-based (SuSE 10.1) file and media server with a new machine running Sun's OpenSolaris and ZFS. Our old server ran your typical RAID5 setup with 4 500GB disks (3 data, 1 parity), used LVM, mdadm, and XFS to help keep things in order, and relied on NFS to export users' shares. It was solid, stable, and worked wonderfully well.

I would like to replicate this experience using the tools OpenSolaris has to offer, taking advantage of ZFS. However, there are enough differences between the two OSes - especially with respect to the filesystems and (for lack of a better phrase) "RAID managers" - to cause me to consult (on numerous occasions) the likes of Google, these forums, and other places for help.

I've been successful in troubleshooting all problems up until now.

On our old media server (the SuSE 10.1 one), when a disk failed, the machine would send out an e-mail detailing the type of failure, and gracefully fall into a degraded state, but would otherwise continue to operate using the remaining 3 disks in the system. After the faulty disk was replaced, all of the data from the old disk would be replicated onto the new one (I think the term is "resilvered" around here?), and after a few hours, the RAID5 array would be seamlessly promoted from "degraded" back up to a healthy "clean" (or "online") state.

Throughout the entire process, there would be no interruptions to the end user: all NFS shares still remained mounted, there were no noticeable drops in I/O, files, directories, and any other user-created data still remained available, and if everything went smoothly, no one would notice a failure had even occurred.

I've tried my best to recreate something similar in OpenSolaris, but I'm stuck on making it all happen seamlessly.

For example, I have a standard beige box machine running OS 2008.05 with a zpool that contains 4 disks, similar to what the old SuSE 10.1 server had. However, whenever I unplug the SATA cable from one of the drives (to simulate a catastrophic drive failure) while doing moderate reading from the zpool (such as streaming HD video), not only does the video hang on the remote machine (which is accessing the zpool via NFS), but the server running OpenSolaris seems to either hang, or become incredibly unresponsive.

And when I write unresponsive, I mean that when I type the command "zpool status" to see what's going on, the command hangs, followed by a frozen Terminal a few seconds later. After just a few more seconds, the entire GUI - mouse included - locks up or freezes, and all NFS shares become unavailable from the perspective of the remote machines. The whole machine locks up hard.

The machine then stays in this frozen state until I plug the hard disk back in, at which point everything, quite literally, pops back into existence all at once: the output of the "zpool status" command flies by (with all disks listed as "ONLINE" and all "READ," "WRITE," and "CKSUM," fields listed as "0"), the mouse jumps to a different part of the screen, the NFS share becomes available again, and the movie resumes right where it had left off.

While such a quick resume is encouraging, I'd like to avoid the freeze in the first place.

How can I keep any hardware failures like the above transparent to my users?

-Todd

PS: I've done some research, and while my problem is similar to the ones described in the following threads:

http://opensolaris.org/jive/thread.jspa?messageID=151719&#151719
http://opensolaris.org/jive/thread.jspa?messageID=240481&#240481

most of these posts are quite old, and do not offer any solutions.

PPS: I know I haven't provided any details on hardware, but I feel like this is more likely a higher-level issue (like some sort of configuration file or setting is needed) than a lower-level one (like faulty hardware). However, if someone were to give me a command to run, I'd gladly do it... I'm just not sure which ones would be helpful, or if I even know which ones to run. It took me half an hour of searching just to find out how to list the disks installed in this system (it's "format") so that I could build my zpool in the first place. It's not quite as simple as writing out /dev/hda, /dev/hdb, /dev/hdc, /dev/hdd. ;)
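
In case it helps, these are the commands I could run and post output from - I'm only guessing at what's useful here, and "tank" below is just a stand-in for whatever I end up naming the pool:

  zpool status -v tank    # pool layout, device names, and READ/WRITE/CKSUM counters
  format < /dev/null      # list every disk the OS can see, then exit
  iostat -En              # per-device error counters and model/serial info
  cfgadm -al              # attachment-point status for the controllers

Just say the word and I'll post whatever any of these spit out.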


Tim
2008-08-24 04:13:40 UTC
Permalink
Post by Todd H. Poole
[...]
It's a lower level one. What hardware are you running?
Todd H. Poole
2008-08-24 04:41:38 UTC
Permalink
Hmm... I'm leaning away a bit from the hardware, but just in case you've got an idea, the machine is as follows:

CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model ADH4850DOBOX (http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)

Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid Capacitor AMD Motherboard (http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)

RAM: G.SKILL 4GB (2 x 2GB) 240-Pin DDR2 SDRAM DDR2 800 (PC2 6400) Dual Channel Kit Desktop Memory Model F2-6400CL5D-4GBPQ (http://www.newegg.com/Product/Product.aspx?Item=N82E16820231122)

HDD (x4): Western Digital Caviar GP WD10EACS 1TB 5400 to 7200 RPM SATA 3.0Gb/s Hard Drive (http://www.newegg.com/Product/Product.aspx?Item=N82E16822136151)

The reason I don't think it's a hardware issue is that, before I got OpenSolaris up and running, I had a fully functional install of openSUSE 11.0 on this machine (with everything configured similarly to the original server) to make sure none of the components were damaged during shipping from Newegg. Everything worked as expected.

Furthermore, before making my purchases, I made sure to check the HCL, and my processor and motherboard combination is supported: http://www.sun.com/bigadmin/hcl/data/systems/details/3079.html

But, like I said earlier, I'm new here, so you might be on to something that never occurred to me.

Any ideas?


Tim
2008-08-24 04:55:55 UTC
Permalink
Post by Todd H. Poole
[...]
What are you using to connect the HDs to the system? The onboard ports?
What driver is being used? AHCI, or IDE compatibility mode?

I'm not saying the hardware is bad, I'm saying the hardware is most likely
the cause by way of the driver. There really isn't any *setting* in Solaris I'm
aware of that says "hey, freeze my system when a drive dies". That just
sounds like hot-swap isn't working as it should be.

--Tim
Todd H. Poole
2008-08-24 08:27:46 UTC
Permalink
Ah, yes - all four hard drives are connected to the motherboard's onboard SATA II ports. There is one additional drive I have neglected to mention thus far (the boot drive) but that is connected via the motherboard's IDE channel, and has remained untouched since the install... I don't really consider it part of the problem, but I thought I should mention it just in case... you never know...

As for the drivers... well, I'm not sure of the command to determine that directly, but going under System > Administration > Device Driver Utility yields the following information under the "Storage" entry:

Components: "ATI Technologies Inc. SB600 IDE"
Driver: pci-ide
--Driver Information--
Driver: pci-ide
Instance: 1
Attach Status: Attached
--Hardware Information--
Vendor ID: 0x1002
Device ID: 0x438c
Class Code: 0001018a
DevPath: /***@0,0/pci-***@14,1

and

Components: "ATI Technologies Inc. SB600 Non-Raid-5 SATA"
Driver: pci-ide
--Driver Information--
Driver: pci-ide
Instance: 0
Attach Status: Attached
--Hardware Information--
Vendor ID: 0x1002
Device ID: 0x4380
Class Code: 0001018f
DevPath: /***@0,0/pci-***@12
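
If there's a more direct way to pull that information from a terminal, let me know - I'm guessing something along these lines would show the same driver bindings as the GUI tool above, but I haven't confirmed it:

  prtconf -D | egrep -i "ide|ahci"   # driver name attached to each device node
  grep ahci /etc/driver_aliases      # PCI IDs / class codes the ahci driver claims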

Furthermore, there is one Driver Problem detected but the error is under the "USB" entry. There are seven items listed:

Components: ATI Technologies Inc. SB600 USB Controller (EHCI)
Driver: ehci

Components: ATI Technologies Inc. SB600 USB (OHCI4)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI3)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI2)
Driver: ohci

Components: ATI Technologies Inc. SB600 USB (OHCI1)
Driver: ohci (Driver Misconfigured)

Components: ATI Technologies Inc. SB600 USB (OHCI0)
Driver: ohci

Components: Microsoft Corp. Wheel Mouse Optical
Driver: hid

As you can tell, the OHCI1 device isn't properly configured, but I don't know how to configure it (there are only "Help," "Submit...," and "Close" buttons to click, no "Install Driver"). And, to tell you the truth, I'm not even sure it's worth mentioning because I don't have anything but my mouse plugged into USB, and even so... it's a mouse... plugged into USB... hardly something that is going to bring my machine to a grinding halt every time a SATA II disk gets yanked from a RAID-Z array (at least, I should hope the two don't have anything in common!).

And... wait... you mean to tell me that I can't just untick the checkbox that says "Hey, freeze my system when a drive dies" to solve this problem? Ugh. And here I was hoping for a quick fix... ;)

Anyway, how does the above sound? What else can I give you?

-Todd

PS: Thanks, by the way, for the support - I'm not sure where else to turn to for this kind of stuff!


Tim
2008-08-24 15:30:34 UTC
Permalink
I'm pretty sure pci-ide doesn't support hot-swap. I believe you need ahci.
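
If I remember right, the ahci driver binds by PCI class code rather than by
specific vendor/device IDs, so flipping the BIOS SATA mode from IDE/compatibility
to AHCI should bring the controller up under ahci instead of pci-ide on the next
boot. Something like this should confirm it afterwards (untested, and the class
code below is from memory):

  prtconf -D | grep ahci                       # ahci should now be attached to the SATA controller
  grep "pciclass,010601" /etc/driver_aliases   # the generic AHCI class code the driver binds to
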
Post by Todd H. Poole
[...]
Todd H. Poole
2008-08-24 20:23:52 UTC
Permalink
Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean, like I said in my response to myxiplx, if I have to bring down the machine in order to replace a faulty drive, that's perfectly acceptable - I can do that whenever it's most convenient for me.

What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_) is if the machine hangs/freezes/locks up or is otherwise brought down by an isolated failure in a supposedly redundant array... Yanking the drive is just how I chose to simulate that failure. I could just as easily have decided to take a sledgehammer or power drill to it,

http://youtu.be/CN6iDzesEs0 (fast-forward to the 2:30 part)
http://youtu.be/naKd9nARAes

and the machine shouldn't have skipped a beat. After all, that's the whole point behind the "redundant" part of RAID, no?

And besides, RAID's been around for almost 20 years now... It's nothing new. I've seen (countless times, mind you) plenty of regular old IDE drives fail in a simple software RAID5 array and not bring the machine down at all. Granted, you still had to power down to re-insert a new one (unless you were using some fancy controller card), but the point remains: the machine would still work perfectly with only 3 out of 4 drives present... So I know for a fact this type of stability can be achieved with IDE.

What I'm getting at is this: I don't think the method by which the drives are connected - or whether or not that method supports hot-swap - should matter. A machine _should_not_ crash when a single drive (out of a 4 drive ZFS RAID-Z array) is ungracefully removed, regardless of how abruptly that drive is excised (be it by a slow failure of the drive motor's spindle, by yanking the drive's power cable, by yanking the drive's SATA connector, by smashing it to bits with a sledgehammer, or by drilling into it with a power drill).

So we've established that one potential workaround is to use the ahci driver instead of the pci-ide driver. Good! I like this kind of problem solving! But that's still side-stepping the problem... While this machine is entirely SATA II, what about those who have a mix between SATA and IDE? Or even much larger entities whose vast majority of hardware is only a couple of years old, and still entirely IDE?

I'm grateful for your help, but is there another way that you can think of to get this to work?


James C. McPherson
2008-08-24 21:28:33 UTC
Permalink
Post by Todd H. Poole
Hmmm. Alright, but supporting hot-swap isn't the issue, is it? I mean,
like I said in my response to myxiplx, if I have to bring down the
machine in order to replace a faulty drive, that's perfectly acceptable -
I can do that whenever it's most convenient for me.
What is _not_ perfectly acceptable (indeed, what is quite _unacceptable_)
is if the machine hangs/freezes/locks up or is otherwise brought down by
an isolated failure in a supposedly redundant array... Yanking the drive
is just how I chose to simulate that failure. I could just as easily have
decided to take a sledgehammer or power drill to it,
But you're not attempting hotswap, you're doing hot plug....
and unless you're using the onboard bios' concept of an actual
RAID array, you don't have an array, you've got a JBOD and
it's not a real JBOD - it's a PC motherboard which does _not_
have the same electronic and electrical protections that a
JBOD has *by design*.
Post by Todd H. Poole
http://youtu.be/CN6iDzesEs0 (fast-forward to the 2:30
part) http://youtu.be/naKd9nARAes
and the machine shouldn't have skipped a beat. After all, that's the
whole point behind the "redundant" part of RAID, no?
Sigh.
Post by Todd H. Poole
And besides, RAID's been around for almost 20 years now... It's nothing
new. I've seen (countless times, mind you) plenty of regular old IDE
drives fail in a simple software RAID5 array and not bring the machine
down at all. Granted, you still had to power down to re-insert a new one
(unless you were using some fancy controller card), but the point
remains: the machine would still work perfectly with only 3 out of 4
drives present... So I know for a fact this type of stability can be
achieved with IDE.
And you're right, it can. But what you've been doing is outside
the bounds of what IDE hardware on a PC motherboard is designed
to cope with.
Post by Todd H. Poole
What I'm getting at is this: I don't think the method by which the drives
are connected - or whether or not that method supports hot-swap - should
matter.
Well sorry, it does. Welcome to an OS which does care.
Post by Todd H. Poole
A machine _should_not_ crash when a single drive (out of a 4
drive ZFS RAID-Z array) is ungracefully removed, regardless of how
abruptly that drive is excised (be it by a slow failure of the drive
motor's spindle, by yanking the drive's power cable, by yanking the
drive's SATA connector, by smashing it to bits with a sledgehammer, or by
drilling into it with a power drill).
If the controlling electronics for your disk can't handle
it, then you're hosed. That's why FC, SATA (in SATA mode)
and SAS are much more likely to handle this out of the box.
Parallel SCSI requires funky hardware, which is why those
old 6- or 12-disk multipacks are so useful to have.

Of the failure modes that you suggest above, only one is
going to give you anything other than catastrophic failure
(drive motor degradation) - and that is because the drive's
electronics will realise this, and send warnings to the
host.... which should have its drivers written so that these
messages are logged for the sysadmin to act upon.

The other failure modes are what we call catastrophic. And
where your hardware isn't designed with certain protections
around drive connections, you're hosed. No two ways about it.
If your system suffers that sort of failure, would you seriously
expect that non-hardened hardware would survive it?
Post by Todd H. Poole
So we've established that one potential work around is to use the ahci
instead of the pci-ide driver. Good! I like this kind of problem solving!
But that's still side-stepping the problem... While this machine is
entirely SATA II, what about those who have a mix between SATA and IDE?
Or even much larger entities whose vast majority of hardware is only a
couple of years old, and still entirely IDE?
If you've got newer hardware, which can support SATA in
native SATA mode, USE IT.

Don't _ever_ try that sort of thing with IDE. As I mentioned
above, IDE is not designed to be able to cope with what
you've been inflicting on this machine.
Post by Todd H. Poole
I'm grateful for your help, but is there another way that you can think
of to get this to work?
You could start by taking us seriously when we tell you
that what you've been doing is not a good idea, and find
other ways to simulate drive failures.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Todd H. Poole
2008-08-25 02:36:11 UTC
Permalink
Post by James C. McPherson
But you're not attempting hotswap, you're doing hot plug....
Do you mean hot UNplug? Because I'm not trying to get this thing to recognize any new disks without a restart... Honest. I'm just trying to prevent the machine from freezing up when a drive fails. I have no problem restarting the machine with a new drive in it later so that it recognizes the new disk.
Post by James C. McPherson
and unless you're using the onboard bios' concept of an actual
RAID array, you don't have an array, you've got a JBOD and
it's not a real JBOD - it's a PC motherboard which does _not_
have the same electronic and electrical protections that a
JBOD has *by design*.
I'm confused by what your definition of a RAID array is, and for that matter, what a JBOD is... I've got plenty of experience with both, but just to make sure I wasn't off my rocker, I consulted the demigod:

http://en.wikipedia.org/wiki/RAID
http://en.wikipedia.org/wiki/JBOD

and I think what I'm doing is indeed RAID... I'm not using some sort of controller card, or any specialized hardware, so it's certainly not Hardware RAID (and thus doesn't contain any of the fancy electronic or electrical protections you mentioned), but lacking said protections doesn't preclude the machine from being considered a RAID. All the disks are the same capacity, the OS still sees the zpool I've created as one large volume, and since I'm using RAID-Z (RAID5), it should be redundant... What other qualifiers out there are necessary before a system can be called RAID compliant?

If it's hot-swappable technology, or a controller hiding the details from the OS and instead presenting a single volume, then I would argue those things are extra - not a fundamental prerequisite for a system to be called a RAID.

Furthermore, while I'm not sure what the difference between a "real JBOD" and a plain old JBOD is, this set-up certainly wouldn't qualify for either. I mean, there is no concatenation going on, redundancy should be present (but due to this issue, I haven't been able to verify that yet), and all the drives are the same size... Am I missing something in the definition of a JBOD?

I don't think so...
Post by James C. McPherson
And you're right, it can. But what you've been doing is outside
the bounds of what IDE hardware on a PC motherboard is designed
to cope with.
Well, yes, you're right, but it's not like I'm making some sort of radical departure outside of the bounds of the hardware... It really shouldn't be a problem so long as it's not an unreasonable departure because that's where software comes in. When the hardware can't cut it, that's where software picks up the slack.

Now, obviously, I'm not saying software can do anything with any piece of hardware you give it - no matter how many lines of code you write, your keyboard isn't going to turn into a speaker - but when it comes to reasonable stuff like ensuring a machine doesn't crash because a user did something with the hardware that he or she wasn't supposed to do? Prime target for software.

And that's the way it's always been... The whole push behind that ZFS Promise thing (or if you want to make it less specific, the attractiveness of RAID in general) was that "RAID-Z [wouldn't] require any special hardware. It doesn't need NVRAM for correctness, and it doesn't need write buffering for good performance. With RAID-Z, ZFS makes good on the original RAID promise: it provides fast, reliable storage using cheap, commodity disks." (http://blogs.sun.com/bonwick/entry/raid_z)
Post by James C. McPherson
Well sorry, it does. Welcome to an OS which does care.
The half-hearted apology wasn't necessary... I understand that OpenSolaris cares about the method those disks use to plug into the motherboard, but what I don't understand is why that limitation exists in the first place. It would seem much better to me to have an OS that doesn't care (but developers that do) and just finds a way to work, versus one that does care (but developers that don't) and instead isn't as flexible and gets picky... I'm not saying OpenSolaris is the latter, but I'm not getting the impression it's the former either...
Post by James C. McPherson
If the controlling electronics for your disk can't
handle it, then you're hosed. That's why FC, SATA (in SATA
mode) and SAS are much more likely to handle this out of
the box. Parallel SCSI requires funky hardware, which is why
those old 6- or 12-disk multipacks are so useful to have.
Of the failure modes that you suggest above, only one
is going to give you anything other than catastrophic
failure (drive motor degradation) - and that is because the
drive's electronics will realise this, and send warnings to
the host.... which should have its drivers written so
that these messages are logged for the sysadmin to act upon.
The other failure modes are what we call catastrophic. And
where your hardware isn't designed with certain protections
around drive connections, you're hosed. No two ways
about it. If your system suffers that sort of failure, would
you seriously expect that non-hardened hardware would survive it?
Yes, I would. At the risk of sounding repetitive, I'll summarize what I've been getting at in my previous responses: I certainly _do_ think it's reasonable to expect non-hardened hardware to survive this type of failure. In fact, I think it's unreasonable _not_ to expect it to. The Linux kernel, the BSD kernels, and the NT kernel (or whatever chunk of code runs Windows) all provide this type of functionality, and have done so for some time. Granted, they may all do it in different ways, but at the end of the day, unplugging an IDE hard drive from a software RAID5 array in OpenSuSE or RedHat, FreeBSD, or Windows XP Professional will not bring the machine down. And it shouldn't in OpenSolaris either. There might be some sort of noticeable bump (Windows, for example, pauses for a few seconds while it tries to figure out what the hell just happened to one of its disks), but there isn't anything show-stopping...
Post by James C. McPherson
If you've got newer hardware, which can support SATA
in native SATA mode, USE IT.
I'll see what I can do - this might be some sort of BIOS setting that can be configured.
Post by James C. McPherson
Post by Todd H. Poole
I'm grateful for your help, but is there another way that you can think
of to get this to work?
You could start by taking us seriously when we tell
you that what you've been doing is not a good idea, and
find other ways to simulate drive failures.
Let's drop the confrontational attitude - I'm not trying to dick around with you here. I've done my due diligence in researching this issue on Google, these forums, and Sun's documentation before making a post, I've provided any clarifying information that has been requested by those kind enough to post a response, and I've yet to resort to any witty or curt remarks in my correspondence with you, tcook, or myxiplx. Whatever is causing you to think I'm not taking anyone seriously, let me reassure you, I am.

The only thing I'm doing is testing a system by applying the worst case scenario of survivable torture to it and seeing how it recovers. If that's not a good idea, then I guess we disagree. But that's ok - you're James C. McPherson, Senior Kernel Software Engineer, Solaris, and I'm just some user who's trying to find a solution to his problem. My bad for expecting the same level of respect I've given two other members of this community to be returned in kind by one of its leaders.

So aside from telling me to "[never] try this sort of thing with IDE" does anyone else have any other ideas on how to prevent OpenSolaris from locking up whenever an IDE drive is abruptly disconnected from a ZFS RAID-Z array?

-Todd


Matt Harrison
2008-08-25 03:06:13 UTC
Permalink
Post by Todd H. Poole
[...]
I'm far from being an expert on this subject, but this is what I understand:

Unplugging a drive (actually pulling the cable out) does not simulate a
drive failure; it simulates a drive getting unplugged, which is
something the hardware is not capable of dealing with.

If your drive were to suffer something more realistic, along the lines
of how you would normally expect a drive to die, then the system should
cope with it a whole lot better.

Unfortunately, hard drives don't come with a big button saying "simulate
head crash now" or "make me some bad sectors" so it's going to be
difficult to simulate those failures.
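
About the closest software-only stand-in I can think of is taking a device
out of the pool administratively and then putting it back, which at least
exercises the degraded/resilver path without surprising the controller. Pool
and device names here are just examples:

  zpool offline tank c1t1d0   # pool drops to DEGRADED but stays up
  zpool status tank           # watch the state change
  zpool online tank c1t1d0    # device rejoins and resilvers

It's obviously a far gentler event than a real failure, but it does show what
ZFS does when it knows a device has gone away.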

All I can say is that unplugging a drive yourself will not simulate a
failure, it merely causes the disk to disappear. Dying or dead disks
will still normally be able to communicate with the driver to some
extent, so they are still "there".

If you were using dedicated hot-swappable hardware, then I wouldn't
expect to see the problem, but AFAIK off-the-shelf SATA hardware doesn't
support this fully, so unexpected results will occur.

I hope this has been of some small help, even just to explain why the
system didn't cope as you expected.

Matt
Justin
2008-08-25 06:53:12 UTC
Permalink
Howdy Matt. Just to make it absolutely clear, I appreciate your response. I would be quite lost if it weren't for all of the input.
Post by Matt Harrison
Unplugging a drive (actually pulling the cable out) does not simulate a
drive failure, it simulates a drive getting unplugged, which is
something the hardware is not capable of dealing with.
If your drive were to suffer something more realistic, along the lines
of how you would normally expect a drive to die, then the system should
cope with it a whole lot better.
Hmmm... I see what you're saying. But, ok, let me play devil's advocate. What about the times when a drive fails in a way the system didn't expect? What you said was right - most of the time, when a hard drive goes bad, SMART will pick up on its impending doom long before it's too late - but what about the times when the cause of the problem is larger or more abrupt than that (like tin whiskers causing shorts, or a server room technician yanking the wrong drive)?

To imply that OpenSolaris with a RAID-Z array of IDE drives will _only_ protect me from data loss during _specific_ kinds of failures (the ones which OpenSolaris considers "normal") is a pretty big implication... and is certainly a show-stopping one at that. Nobody is going to want to rely on an OS/RAID solution that can only survive certain types of drive failures, while there are others out there that can survive the same and more...

But then again, I'm not sure if that's what you meant... is that what you were getting at, or did I misunderstand?
Post by Matt Harrison
Unfortunately, hard drives don't come with a big button saying "simulate
head crash now" or "make me some bad sectors" so it's going to be
difficult to simulate those failures.
lol, if only they did - just having a button to push would make testing these types of things a lot easier. ;)
Post by Matt Harrison
All I can say is that unplugging a drive yourself will not simulate a
failure, it merely causes the disk to disappear.
But isn't that a perfect example of a failure!? One in which the drive just seems to pop out of existence? lol, forgive me if I'm sounding pedantic, but why is there even a distinction between the two? This is starting to sound more and more like a bug...
Post by Matt Harrison
I hope this has been of some small help, even just to
explain why the system didn't cope as you expected.
It has, thank you - I appreciate the response.


Heikki Suonsivu on list forwarder
2008-08-25 14:36:34 UTC
Permalink
Post by Justin
[...]
Hmmm... I see what you're saying. But, ok, let me play devil's
advocate. What about the times when a drive fails in a way the system
didn't expect? What you said was right - most of the time, when a
hard drive goes bad, SMART will pick up on its impending doom long
before it's too late - but what about the times when the cause of the
problem is larger or more abrupt than that (like tin whiskers causing
shorts, or a server room technician yanking the wrong drive)?
I read a research paper by Google about this a while ago. Their
conclusion was that SMART is a poor predictor of disk failure, even
though they did find some useful indications. Google for "google disk
failure"; it came out as the second link a moment ago, title "Failure
Trends in a Large Disk Drive Population".

The problem is that trying to predict disk failures with SMART
parameters only catches a certain percentage of failing disks, and that
percentage is not all that great. Many disks will still decide to fail
catastrophically, most often in the early morning of December 25th, in
particular if there is a huge snowstorm going on :)

Heikki
Ralf Ramge
2008-08-25 10:15:41 UTC
Permalink
Post by Justin
[...]
I think there's a misunderstanding concerning the underlying concepts.
I'll try to explain my thoughts; please excuse me if this becomes a bit
lengthy. Oh, and I am not a Sun employee or ZFS fan, I'm just a customer
who loves and hates ZFS at the same time ;-)

You know, ZFS is designed for high *reliability*. This means that ZFS
tries to keep your data as safe as possible. This includes faulty
hardware, missing hardware (like in your testing scenario) and, to a
certain degree, even human mistakes.
But there are limits. For instance, ZFS does not make a backup
unnecessary. If there's a fire and your drives melt, then ZFS can't do
anything. Or if the hardware is lying about the drive geometry. ZFS is
part of the operating environment and, as a consequence, relies on the
hardware.
So ZFS can't make unreliable hardware reliable. All it can do is try
to protect the data you saved on it. But it cannot guarantee this to you
if the hardware becomes its enemy.
A real-world example: I have a 32-core Opteron server here, with 4
FibreChannel controllers and 4 JBODs with a total of 64 FC drives connected
to it, running a RAID 10 using ZFS mirrors. Sounds a lot like high-end
hardware compared to your NFS server, right? But ... I have exactly the
same symptom. If one drive fails, an entire JBOD with all 16 included
drives hangs, and all zpool access freezes. The reason for this is the
miserable JBOD hardware. There's only one FC loop inside of it, the
drives are connected serially to each other, and if one drive dies, the
drives behind it go downhill, too. ZFS immediately starts caring about
the data, the zpool command hangs (but I still have traffic on the other
half of the ZFS mirror!), and it does the right thing by doing so:
whatever happens, my data must not be damaged.
A "bad" filesystem like Linux ext2 or ext3 with LVM would just continue,
even if the Volume Manager noticed the missing drive or not. That's what
you experienced. But you run in the real danger of having to use fsck at
some point. Or, in my case, fsck'ing 5 TB of data on 64 drives. That's
not much fun and results in a lot more downtime than replacing the
faulty drive.

What can you expect from ZFS in your case? You can expect it to detect
that a drive is missing and to make sure that your _data integrity_
isn't compromised. By any means necessary. This may even require making
the system completely unresponsive until a timeout has passed.
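
A side note: newer builds have a pool-level "failmode" property that controls
what ZFS does when a device becomes unreachable (wait, continue or panic). I
don't remember whether 2008.05 is new enough to have it, and it will not help
when the hang sits in the IDE driver underneath ZFS rather than in ZFS itself,
but it is worth a look - "tank" is again just a placeholder:

  zpool get failmode tank            # check whether the property exists and how it is set
  zpool set failmode=continue tank   # return errors instead of blocking, where possible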




But what you described is not a case of reliability. You want something
completely different. You expect it to deliver *availability*.

And availability is something ZFS doesn't promise. It simply can't
deliver this. You have the impression that NTFS and various other
filesystems do so, but that's an illusion. The next reboot followed by
an fsck run will show you why. Availability requires full reliability of
every included component of your server as a minimum, and you can't
expect ZFS or any other filesystem to deliver this with cheap IDE
hardware.

Usually people want to save money when buying hardware, and ZFS is a
good choice to deliver the *reliability* then. But the conceptual
stalemate between reliability and availability of such cheap hardware
still exists - the hardware is cheap, the file system and services may
be reliable, but as soon as you want *availability*, it's getting
expensive again, because you have to buy every hardware component at
least twice.


So, you have the choice:

a) If you want *availability*, stay with your old solution. But you have
no guarantee that your data is always intact. You'll always be able to
stream your video, but you have no guarantee that the client will
receive a stream without dropouts forever.

b) If you want *data integrity*, ZFS is your best friend. But you may
have slight availability issues when it comes to hardware defects. You
may reduce the percentage of pain during a disaster by spending more
money, e.g. by making the SATA controllers redundant and creating a
mirror (then controller 1 may hang, but controller 2 will continue
working), but you must not forget that your PCI bridges, fans, power
supplies, etc. remain single points of failure which can take the entire
service down, like your pulling of the non-hotpluggable drive did.

c) If you want both, you should buy a second server and create an NFS
cluster.

Hope I could help you a bit,

Ralf
--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

Tel. +49-721-91374-3963
***@webde.de - http://web.de/

1&1 Internet AG
Brauerstraße 48
76135 Karlsruhe

Amtsgericht Montabaur HRB 6484

Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Thomas Gottschlich, Matthias Greve, Robert Hoffmann, Markus Huhn, Oliver Mauss, Achim Weiss
Aufsichtsratsvorsitzender: Michael Scheeren
Ralf Ramge
2008-08-25 10:24:54 UTC
Permalink
Ralf Ramge wrote:
[...]

Oh, and please excuse the grammar mistakes and typos. I'm in a hurry,
not a retard ;-) At least I think so.
--
Ralf Ramge
Senior Solaris Administrator, SCNA, SCSA

John Sonnenschein
2008-08-25 03:19:07 UTC
Permalink
James isn't being a jerk because he hates you or anything...

Look, yanking the drives like that can seriously damage the drives or your motherboard. Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in Linux than anything else.


Peter Bortas
2008-08-25 06:32:51 UTC
Permalink
Post by John Sonnenschein
James isn't being a jerk because he hates you or anything...
Look, yanking the drives like that can seriously damage the drives or your motherboard.
It can, but it's not very likely to.
Post by John Sonnenschein
Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.
That, if anything, sounds more like defensiveness. Pulling out the
cable isn't advisable, but it simulates the controller card on the
disk going belly up pretty well. Unless he pulls the power at the same
time, because that would also simulate a power failure.

If a piece of hardware stops responding you might do well to stop
talking to it, but there is nothing admirable about locking up the OS
if there is enough redundancy to continue without that particular
chunk of metal.
--
Peter Bortas
Justin
2008-08-25 07:34:34 UTC
Permalink
Howdy Matt, thanks for the response.

But I dunno man... I think I disagree... I'm kind of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along?

Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.

Going catatonic is exactly what Linux, BSD, and even Windows _don't_ do, which is why their continued operation even under such failures wouldn't be considered a bug.

When I yank a drive in a RAID5 array - any drive, be it IDE, SATA, USB, or Firewire - in OpenSuSE or RedHat, the kernel will immediately notice its absence and inform lvm and mdadm (the software responsible for keeping the RAID array together). mdadm will then degrade the array and consult whatever instructions root gave it when the array was configured. If the sysadmin wanted the array to "stay up as long as it could," then it will continue to do that. If root wanted the array to be "brought down after any sort of drive failure," then the array will be unmounted. If root wanted to "power the machine down," then the machine will dutifully turn off.

Shouldn't OpenSolaris do the same thing?

And as for James not being a jerk because he hates me, does that mean he's just always like that? lol, it's alright: let's not try to explain or excuse trollish behavior, and instead just call it out and expose it for what it is, and then be done with it.

I certainly am.

Anyways, thanks for the input Matt.


This message posted from opensolaris.org
Todd H. Poole
2008-08-25 07:41:34 UTC
Permalink
Howdy 404, thanks for the response.

But I dunno man... I think I disagree... I'm kind of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?

I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from its memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along?

Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.

Going catatonic is exactly what Linux, BSD, and even Windows _don't_ do, which is why their continued operation even under such failures wouldn't be considered a bug.

When I yank a drive in a RAID5 array - any drive, be it IDE, SATA, USB, or Firewire - in OpenSuSE or RedHat, the kernel will immediately notice its absence and inform lvm and mdadm (the software responsible for keeping the RAID array together). mdadm will then degrade the array and consult whatever instructions root gave it when the array was configured. If the sysadmin wanted the array to "stay up as long as it could," then it will continue to do that. If root wanted the array to be "brought down after any sort of drive failure," then the array will be unmounted. If root wanted to "power the machine down," then the machine will dutifully turn off.
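
For what it's worth, that policy doesn't take much to set up. A rough sketch of what I mean (the e-mail address, script path, and flags below are just placeholders for illustration, not my actual config):

# /etc/mdadm.conf (sketch)
MAILADDR sysadmin@example.com          # e-mail this address on any array event
PROGRAM  /usr/local/sbin/raid-policy   # optional: run this script on each event
                                       # (it can unmount the array, shut the box
                                       # down, or just log - root's choice)

# and the daemon that watches the arrays and acts on the above:
mdadm --monitor --scan --daemonise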

Shouldn't OpenSolaris do the same thing?

And as for James not being a jerk because he hates me, does that mean he's just always like that? lol, it's alright: let's not try to explain or excuse trollish behavior, and instead just call it out and expose it for what it is, and then be done with it.

I certainly am.

As always, thanks for the input.


This message posted from opensolaris.org
Richard Elling
2008-08-25 14:39:32 UTC
Permalink
Post by Todd H. Poole
Howdy 404, thanks for the response.
But I dunno man... I think I disagree... I'm kinda of the opinion that regardless of what happens to hardware, an OS should be able to work around it, if it's possible. If a sysadmin wants to yank a hard drive out of a motherboard (despite the risk of damage to the drive and board), then no OS in the world is going to stop him, so instead of the sysadmin trying to work around the OS, shouldn't the OS instead try to work around the sysadmin?
The behavior of ZFS in response to an error reported by an underlying device
driver is tunable via the zpool failmode property. By default, it is
set to "wait." For root pools, the installer may change this
to "continue." The key here is that you can argue with the choice
of default behavior, but don't argue with the option to change it.
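
For example (the pool name 'tank' below is just a placeholder; the accepted values are wait, continue, and panic):

zpool get failmode tank
zpool set failmode=continue tank
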
Post by Todd H. Poole
I mean, as great of an OS as it is, Solaris can't possibly hope to stop me from doing anything I want to do... so when it assumes that something's gone seriously wrong (which yanking a disk drive would hopefully cause it to assume), instead of just freezing up and becoming totally useless, why not do something useful like eject the disk from it's memory, degrade the array, send out an e-mail to a designated sysadmin, and then keep on chugging along?
If this does not occur, then please file a bug against the appropriate
device driver (you're not operating in ZFS code here).
Post by Todd H. Poole
Or, for a greater level of control, why not just read from some configuration set by the sysadmin, and then decide to either do the above or shut down entirely, as per the wishes of the sysadmin? Anything would be better than just going into a catatonic state in less than five seconds.
qv. zpool failmode property, at least when you are operating in the
zfs code. I think the concerns here are that hangs can, and do, occur
at other places in the software stack. Please report these in the
appropriate forums and bug categories.
-- richard
Todd H. Poole
2008-08-26 19:30:03 UTC
Permalink
Post by Richard Elling
The behavior of ZFS to an error reported by an underlying device
driver is tunable by the zpool failmode property. By default, it is
set to "wait." For root pools, the installer may change this
to "continue." The key here is that you can argue with the choice
of default behavior, but don't argue with the option to change.
I didn't want to argue with the option to change... trust me. Being able to change those types of options and having that type of flexibility in the first place is what makes a very large part of my day possible.
Post by Richard Elling
qv. zpool failmode property, at least when you are operating in the
zfs code. I think the concerns here are that hangs can, and do, occur
at other places in the software stack. Please report these in the
appropriate forums and bug categories.
-- richard
Now _that's_ a great constructive suggestion! Very good - I'll research this in a few hours, and report back on what I find.

Thanks for the pointer!

-Todd


This message posted from opensolaris.org
Todd H. Poole
2008-08-27 15:01:10 UTC
Permalink
I plan on fiddling around with this failmode property in a few hours. I'll be using http://docs.sun.com/app/docs/doc/817-2271/gftgp?l=en&a=view as a reference.

I'll let you know what I find out.

-Todd


This message posted from opensolaris.org
Ian Collins
2008-08-25 08:17:55 UTC
Permalink
Post by John Sonnenschein
James isn't being a jerk because he hates your or anything...
Look, yanking the drives like that can seriously damage the drives or your motherboard. Solaris doesn't let you do it and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.
One point that's been overlooked in all the chest thumping - PCs vibrate
and cables fall out. I had this happen with an SCSI connector. Luckily
for me, it fell in a fan and made a lot of noise!

So pulling a drive is a possible, if rare, failure mode.

Ian
Jens Elkner
2008-08-25 18:25:15 UTC
Permalink
Post by Ian Collins
Post by John Sonnenschein
Look, yanking the drives like that can seriously damage the drives
or your motherboard. Solaris doesn't let you do it ...
Haven't seen an android/"universal soldier" shipping with Solaris ... ;-)
Post by Ian Collins
Post by John Sonnenschein
and assumes that something's gone seriously wrong if you try it. That Linux ignores the behavior and lets you do it sounds more like a bug in linux than anything else.
Not sure whether everything that can't be understood is "likely a bug"
- maybe Linux is just "more forgiving" and tries its best to solve the problem
without taking you out of business (see below), even if that requires some
hacks not in line with specifications ...
Post by Ian Collins
One point that's been overlooked in all the chest thumping - PCs vibrate
and cables fall out. I had this happen with an SCSI connector. Luckily
Yes - and a colleague told me that he'd had the same problem once.
He also managed a Fujitsu Siemens server where the SCSI controller card
had a tiny hairline crack: very odd behavior, usually not reproducible;
IIRC, the 4th service engineer finally replaced the card ...
Post by Ian Collins
So pulling a drive is a possible, if rare, failure mode.
Definitely!

And being prepared for strange controller (or, in general, hardware) behavior is
possibly a big + for an OS which targets SMEs and "home users" as well
(everybody knows about far-east and other cheap HW producers which
sometimes seem to say: let's ship it now, and later we'll build a special
driver for MS Windows which works around the bug/problem ...).

"Similar" story: ~ 2000+ we had a WG server with 4 IDE channels PATA,
one HDD on each. HDD0 on CH0 mirrored to HDD2 on CH2, HDD1 on CH1 mirrored
to HDD3 on CH3, using Linux Softraid driver. We found out, that when
HDD1 on CH1 got on the blink, for some reason the controller got on the
blink as well, i.e. took CH0 and vice versa down too. After reboot, we
were able to force the md raid to re-take the bad marked drives and even
found out, that the problem starts, when a certain part of a partition
was accessed (which made the ops on that raid really slow for some
minutes - but after the driver marked the drive(s) as bad, performance
was back). Thus disabling the partition gave us the time to get a new
drive... During all these ops nobody (except sysadmins) realized, that we
had a problem - thanx to the md raid1 (with xfs btw.). And also we did not
have any data corruption (at least, nobody has complained about it ;-)).

With respect to what I've experienced and read on the zfs-discuss and other lists, I have the
__feeling__ that we would have gotten into real trouble using Solaris
(even the most recent one) on that system ...
So if someone asks me whether to run Solaris+ZFS on a production system, I
usually say: definitely, but only if it is a Sun server ...

My 2¢ ;-)

Regards,
jel.

PS: And yes, all the vendor-specific workarounds/hacks are a problem for the
Linux kernel folks as well - at least on Torvalds' side they are
discouraged, IIRC ...
--
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 12768
Todd H. Poole
2008-08-30 05:32:16 UTC
Permalink
Post by Jens Elkner
Wrt. what I've experienced and read in ZFS-discussion etc. list I've the
__feeling__, that we would have got really into trouble, using Solaris
(even the most recent one) on that system ...
So if one asks me, whether to run Solaris+ZFS on a production system, I
usually say: definitely, but only, if it is a Sun server ...
My 2¢ ;-)
I can't agree with you more. I'm beginning to understand what the phrase "Sun's software is great - as long as you're running it on Sun's hardware" means...

Whether it's deserved or not, I feel like this OS isn't mature yet. And maybe it's not the whole OS, maybe it's some specific subsection (like ZFS), but my general impression of OpenSolaris has been... not stellar.

I don't think it's ready yet for a prime time slot on commodity hardware.

And while I don't intend to fan any flames that might already exist (remember, I've only just joined within the past week, and thus haven't been around long enough to figure out whether any flames even exist), I believe I'm justified in making the above statement. Just off the top of my head, here is a list of red flags I've run into in 7 days' time:

- If I don't wait for at least 2 minutes before logging into my system after I've powered everything up, my machine freezes.
- If I yank a hard drive out of a (supposedly redundant) RAID5 array (or "RAID-Z zpool," as it's called) that has an NFS mount attached to it, not only does that mount point get severed, but _all_ NFS connections to all mount points are dropped, regardless of whether they were on the zpool or not. Oh, and then my machine freezes.
- If I yank a hard drive out of a (supposedly redundant) RAID5 array (or "RAID-Z zpool," as it's called), forgetting about NFS entirely, my machine freezes.
- If I query a zpool for its status, but don't do so under the right circumstances, my machine freezes.

I've had to use the hard reset button on my case more times than I've had the ability to shut down the machine properly from a non-frozen console or GUI.

That shouldn't happen.

I dunno. If this sounds like bitching, that's fine: I'll file bug reports and then move on. It's just that sometimes, software needs to grow a bit more before it's ready for production, and I feel like trying to run OpenSolaris + ZFS on commodity hardware just might be one of those times.

Just two more cents to add to yours.

As Richard said, the only way to fix things is to file bug reports. Hopefully, the most helpful things to come out of this thread will be those forms of constructive criticism.

As for now, it looks like a return to LVM2, XFS, and one of the Linux or BSD kernels might be a more stable decision, but don't worry - I haven't been completely dissuaded, and I definitely plan on checking back in a few releases to see how things are going in the ZFS world. ;)

Thanks everyone for your help, and keep improving! :)

-Todd
--
This message posted from opensolaris.org
Toby Thain
2008-08-30 12:35:31 UTC
Permalink
Post by Todd H. Poole
Post by Jens Elkner
Wrt. what I've experienced and read in ZFS-discussion etc. list I've the
__feeling__, that we would have got really into trouble, using Solaris
(even the most recent one) on that system ...
So if one asks me, whether to run Solaris+ZFS on a production
system, I
usually say: definitely, but only, if it is a Sun server ...
My 2¢ ;-)
I can't agree with you more. I'm beginning to understand what the
phrase "Sun's software is great - as long as you're running it on
Sun's hardware" means...
...
Totally OT, but this is also why Apple doesn't sell OS X for whitebox
junk. :)

--Toby
Post by Todd H. Poole
-Todd
--
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
dick hoogendijk
2008-08-30 15:31:46 UTC
Permalink
On Sat, 30 Aug 2008 09:35:31 -0300
Post by Toby Thain
Post by Todd H. Poole
I can't agree with you more. I'm beginning to understand what the
phrase "Sun's software is great - as long as you're running it on
Sun's hardware" means...
Totally OT, but this is also why Apple doesn't sell OS X for
whitebox junk. :)
There are also a lot of whiteboxes that -do- run Solaris very well.
"Some apples are rotten, others are healthy." That's quite normal.
--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
++ http://nagual.nl/ + SunOS sxce snv95 ++
Joe S
2008-09-03 17:15:00 UTC
Permalink
Post by Todd H. Poole
I can't agree with you more. I'm beginning to understand what the phrase "Sun's software is great - as long as you're running it on Sun's hardware" means...
Whether it's deserved or not, I feel like this OS isn't mature yet. And maybe it's not the whole OS, maybe it's some specific subsection (like ZFS), but my general impression of OpenSolaris has been... not stellar.
I don't think it's ready yet for a prime time slot on commodity hardware.
I agree, but with careful research, you can find the *right* hardware.
In my quest (it took weeks) to find reports of reliable hardware, I found
that the AMD chipsets were way too buggy. I also noticed that of the
workstations that Sun sells, they use nVidia nForce chipsets for the AMD
CPUs and the Intel X38 (the only Intel desktop chipset that supports ECC) for
the Intel CPUs. I read good and bad stories about various hardware and
decided I would stay close to what Sun sells. I've found NO Sun
hardware using the same chipset as yours.

There are a couple of AHCI bugs with the AMD/ATI SB600 chipset. Both
Linux and Solaris were affected. Linux put in a workaround that may
hurt performance slightly. Sun still has the bug open, but for what
it's worth, who's gonna use or care about a buggy desktop chipset in a
storage server?

I have an nVidia nForce 750a chipset (not the same as the Sun
workstations, which use the nForce Pro, but it's not too different) and the
same CPU (45 Watt dual core!) you have. My system works great (so
far). I haven't tried the disconnect-drive issue though. I will try
it tonight.

Carson Gaspar
2008-08-25 09:10:08 UTC
Permalink
Post by John Sonnenschein
Look, yanking the drives like that can seriously damage the drives or
your motherboard. Solaris doesn't let you do it and assumes that
something's gone seriously wrong if you try it. That Linux ignores
the behavior and lets you do it sounds more like a bug in linux than
anything else.
OK, so far we've had a lot of knee jerk defense of Solaris. Sorry, but
that isn't helping. Let's get back to science here, shall we?

What happens when you remove a disk?

A) The driver detects the removal and informs the OS. Solaris appears to
behave reasonably well in this case.

B) The driver does not detect the removal. Commands must time out before
a problem is detected. Due to driver layering, timeouts increase
rapidly, causing the OS to "hang" for unreasonable periods of time.

We really need to fix (B). It seems the "easy" fixes are:

- Configure faster timeouts and fewer retries on redundant devices,
similar to drive manufacturers' RAID edition firmware. This could be via
driver config file, or (better) automatically via ZFS, similar to write
cache behaviour.

- Propagate timeouts quickly between layers (immediate soft fail without
retry) or perhaps just to the fault management system
--
Carson
Bob Friesenhahn
2008-08-25 15:57:57 UTC
Permalink
Post by Carson Gaspar
B) The driver does not detect the removal. Commands must time out before
a problem is detected. Due to driver layering, timeouts increase
rapidly, causig te OS to "hang" for unreasonable periods of time.
- Configure faster timeouts and fewer retries on redundant devices,
I don't think that any of these "easy" fixes are wise. Any fix based
on timeouts is going to cause problems with devices mysteriously
timing out and being resilvered.

Device drivers should know the expected behavior of the device and act
appropriately. For example, if the device is in a powered-down state,
then the device driver can expect that it will take at least 30
seconds for the device to return after being requested to power-up but
that some weak devices might take a minute. As far as device drivers
go, I expect that IDE device drivers are at the very bottom of the
feeding chain in Solaris since Solaris is optimized for enterprise
hardware.

Since OpenSolaris is open source, perhaps some brave soul can
investigate the issues with the IDE device driver and send a patch.

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Todd H. Poole
2008-08-26 19:50:07 UTC
Permalink
Post by Bob Friesenhahn
Since OpenSolaris is open source, perhaps some brave
soul can investigate the issues with the IDE device driver and
send a patch.
Fearing that other Senior Kernel Engineers, Solaris, might exhibit similar responses, or join in and play “antagonize the noob,” I decided that I would try to solve my problem on my own. I tried my best to unravel the source tree that is OpenSolaris with some help from a friend, but I'll be the first to admit - we didn't even know where to begin, much less understand what we were looking at.

To say that he and I were lost would be an understatement.

I’m familiar with some subsections of the Linux kernel, and I can read and write code in a pinch, but there's a reason why most of my work is done for small, personal projects, or just for fun... Some people out there can see things like Neo sees the Matrix… I am not one of them.

I wish I knew how to write and then submit those types of patches. If I did, you can bet I would have been all over that days ago! :)

-Todd


This message posted from opensolaris.org
Todd H. Poole
2008-08-26 20:14:46 UTC
Permalink
PS: I also think it's worthy to note the level of supportive and constructive feedback that many others have provided, and how much I appreciate it. Thanks! Keep it coming!


This message posted from opensolaris.org
MC
2008-08-27 06:00:43 UTC
Permalink
Post by John Sonnenschein
James isn't being a jerk because he hates your or
anything...
Look, yanking the drives like that can seriously
damage the drives or your motherboard. Solaris
doesn't let you do it and assumes that something's
gone seriously wrong if you try it. That Linux
ignores the behavior and lets you do it sounds more
like a bug in linux than anything else.
Solaris crashing is a Linux bug. That's a new one, folks.


This message posted from opensolaris.org
Miles Nordin
2008-08-27 17:48:33 UTC
Permalink
re> not all devices return error codes which indicate
re> unrecoverable reads.

What you mean is, ``devices sometimes return bad data instead of an
error code.''

If you really mean there are devices out there which never return
error codes, and always silently return bad data, please tell us which
one and the story of when you encountered it, because I'm incredulous.
I've never seen or heard of anything like that. Not even 5.25"
floppies do that.

Well...wait, actually I have. I heard some SGI disks had special
firmware which could be ordered to behave this way, and some kind of
ioctl or mount option to turn it on per-file or per-filesystem. But
the drives wouldn't disable error reporting unless ordered to.
Another interesting lesson SGI offers here: they pushed this feature
through their entire stack. The point was, for some video playback,
data which arrives after the playback point has passed is just as
useless as silently corrupt data, so the disk, driver, filesystem, all
need to modify their exception handling to deliver the largest amount
of on-time data possible, rather than the traditional goal of
eventually returning the largest amount of correct data possible and
clear errors instead of silent corruption. This whole-stack approach
is exactly what I thought ``green line'' was promising, and exactly
what's kept out of Solaris by the ``go blame the drivers'' mantra.

Maybe I was thinking of this SGI firmware when I suggested the
customized firmware netapp loads into the drives in their study could
silently return bad data more often than the firmware we're all using,
the standard firmware with 512-byte sectors intended for RAID layers
without block checksums.

re> I would love for you produce data to that effect.

Read the netapp paper you cited earlier

http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf

on page 234 there's a comparison of the relative prevalence of each
kind of error.

Latent sector errors / Unrecoverable reads

nearline disks experiencing latent read errors per year: 9.5%

Netapp calls the UNC errors, where the drive returns an error
instead of data, ``latent sector errors.'' Software RAID systems
other than ZFS *do* handle this error, usually better than ZFS to
my impression. And AIUI when it doesn't freeze and reboot, ZFS
counts this as a READ error. In addition to reporting it, most
consumer drives seem to log the last five of these in non-volatile
storage, and you can read the log with 'smartctl -a' (always under
Linux, or under Solaris only if smartctl is working with your
particular disk driver).


Silent corruption

nearline disks experiencing silent corruption per year: 0.466%

What netapp calls ``silent data corruption'' is bad data silently
returned by drives with no error indication, counted by ZFS as
CKSUM and seems not to cause ZFS to freeze. I think you have been
lumping this in with unrecoverable reads, but using the word
``silent'' makes it clearer because unrecoverable makes it sound to
me like the drive tried to recover, and failed, in which case the
drive probably also reported the error making it a ``latent sector
error''.


filesystem corruption

This is also discovered silently w.r.t. the driver: the corruption
that happens to ZFS systems when SAN targets disappear suddenly or
when you offline a target and then reboot (which is also counted in
the CKSUM column, and which ZFS-level redundancy also helps fix).
I would call this ``ZFS bugs'', ``filesystem corruption,'' or
``manual resilvering''. Obviously it's not included on the Netapp
table. It would be nice if ZFS had two separate CKSUM columns to
distinguish between what netapp calls ``checksum errors'' vs
``identity discrepancies''. For ZFS the ``checksum error'' would
point with high certainty to the storage and silent corruption, and
the ``identity discrepancy'' would be more like filesystem
corruption and flag things like one side of a mirror being
out-of-date when ZFS thinks it shouldn't be. but currently we have
only one CKSUM column for both cases.


so, I would say, yes, the type of read error that other software RAID
systems besides ZFS do still handle is a lot more common: 9.5%/yr vs
0.466%/yr for nearline disks, and the same ~20x factor for enterprise
disks. The rare silent error which other software LVM's miss and only
ZFS/Netapp/EMC/... handles is still common enough to worry about, at
least on the nearline disks in the Netapp drive population.

What this also shows, though, is that about 1 in 10 drives will return
an UNC per year, and possibly cause ZFS to freeze up. It's worth
worrying about availability during an exception as common as that---it
might even be more important for some applications than catching the
silent corruption. Not for my own application, but for some readily
imaginable ones, yes.
Richard Elling
2008-08-27 18:27:52 UTC
Permalink
Post by Miles Nordin
re> not all devices return error codes which indicate
re> unrecoverable reads.
What you mean is, ``devices sometimes return bad data instead of an
error code.''
If you really mean there are devices out there which never return
error codes, and always silently return bad data, please tell us which
one and the story of when you encountered it, because I'm incredulous.
I've never seen or heard of anything like that. Not even 5.25"
floppies do that.
I blogged about one such case.
http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

However, I'm not inclined to publically chastise the vendor or device model.
It is a major vendor and a popular device. 'nuff said.
Post by Miles Nordin
Well...wait, actually I have. I heard some SGI disks had special
firmware which could be ordered to behave this way, and some kind of
ioctl or mount option to turn it on per-file or per-filesystem. But
the drives wouldn't disable error reporting unless ordered to.
Another interesting lesson SGI offers here: they pushed this feature
through their entire stack. The point was, for some video playback,
data which arrives after the playback point has passed is just as
useless as silently corrupt data, so the disk, driver, filesystem, all
need to modify their exception handling to deliver the largest amount
of on-time data possible, rather than the traditional goal of
eventually returning the largest amount of correct data possible and
clear errors instead of silent corruption. This whole-stack approach
is exactly what I thought ``green line'' was promising, and exactly
what's kept out of Solaris by the ``go blame the drivers'' mantra.
Maybe I was thinking of this SGI firmware when I suggested the
customized firmware netapp loads into the drives in their study could
silently return bad data more often than the firmware we're all using,
the standard firmware with 512-byte sectors intended for RAID layers
without block checksums.
re> I would love for you produce data to that effect.
Read the netapp paper you cited earlier
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
on page 234 there's a comparison of the relative prevalence of each
kind of error.
Latent sector errors / Unrecoverable reads
nearline disks experiencing latent read errors per year: 9.5%
This number should scare the *%^ out of you. It basically means
that having no data redundancy is a recipe for disaster. Fortunately, with
ZFS you can have data redundancy without requiring a logical
volume manager to mirror your data. This is especially useful on
single-disk systems like laptops.
Post by Miles Nordin
Netapp calls the UNC errors, where the drive returns an error
instead of data, ``latent sector errors.'' Software RAID systems
other than ZFS *do* handle this error, usually better than ZFS to
my impression. And AIUI when it doesn't freeze and reboot, ZFS
counts this as a READ error. In addition to reporting it, most
consumer drives seem to log the last five of these non-volatilely,
and you can read the log with 'smartctl -a' (if you're using Linux
always, or under Solaris only if smartctl is working with your
particular disk driver).
Silent corruption
nearline disks experiencing silent corruption per year: 0.466%
What netapp calls ``silent data corruption'' is bad data silently
returned by drives with no error indication, counted by ZFS as
CKSUM and seems not to cause ZFS to freeze. I think you have been
lumping this in with unrecoverable reads, but using the word
``silent'' makes it clearer because unrecoverable makes it sound to
me like the drive tried to recover, and failed, in which case the
drive probably also reported the error making it a ``latent sector
error''.
Likewise, this number should scare you. AFAICT, logical volume
managers like SVM will not detect this.

Terminology-wise, silent errors are, by definition, not detected. But
in the literature you might see the term used in studies of failures where the
author intends to differentiate between a system which detects
such errors and one which does not.
Post by Miles Nordin
filesystem corruption
This is also discovered silently w.r.t. the driver: the corruption
that happens to ZFS systems when SAN targets disappear suddenly or
when you offline a target and then reboot (which is also counted in
the CKSUM column, and which ZFS-level redundancy also helps fix).
I would call this ``ZFS bugs'', ``filesystem corruption,'' or
``manual resilvering''. Obviously it's not included on the Netapp
table. It would be nice if ZFS had two separate CKSUM columns to
distinguish between what netapp calls ``checksum errors'' vs
``identity discrepancies''. For ZFS the ``checksum error'' would
point with high certainty to the storage and silent corruption, and
the ``identity discrepancy'' would be more like filesystem
corruption and flag things like one side of a mirror being
out-of-date when ZFS thinks it shouldn't be. but currently we have
only one CKSUM column for both cases.
This differentiation is noted in the FMA e-reports.
Post by Miles Nordin
so, I would say, yes, the type of read error that other software RAID
systems besides ZFS do still handle is a lot more common: 9.5%/yr vs
0.466%/yr for nearline disks, and the same ~20x factor for enterprise
disks. The rare silent error which other software LVM's miss and only
ZFS/Netapp/EMC/... handles is still common enough to worry about, at
least on the nearline disks in the Netapp drive population.
0.466%/yr is a per-disk rate. If you have 10 disks, your exposure
is 4.6% per year. For 100 disks, 46% per year, etc. For systems with
thousands of disks this is a big problem.

But I don't think using a rate-per-unit-time is the best way to look
at this problem because if you never read the data, you don't care.
This is why disk vendors spec UERs as rate-per-bits-read. I have
some field data on bits read over time, but routine activities, like
backups, zfs sends, or scrubs, can change the number of bits read
per unit time by a significant amount.
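
As a rough illustration (using a common consumer-class spec, not a measurement of any particular drive):

# a spec of 1 unrecoverable error per 1e14 bits read works out to
# roughly one expected bad sector per ~12.5 TB read
print(1e14 / 8 / 1e12)   # -> 12.5 (TB read per expected unrecoverable sector)
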
Post by Miles Nordin
What this also shows, though, is that about 1 in 10 drives will return
an UNC per year, and possibly cause ZFS to freeze up. It's worth
worrying about availability during an exception as common as that---it
might even be more important for some applications than catching the
silent corruption. not for my own application, but for some readily
imagineable ones, yes.
UNCs don't cause ZFS to freeze as long as failmode != wait or
ZFS manages the data redundancy.
-- richard
Miles Nordin
2008-08-27 21:51:49 UTC
Permalink
Post by Miles Nordin
If you really mean there are devices out there which never
return error codes, and always silently return bad data, please
tell us which one and the story of when you encountered it,
re> I blogged about one such case.
re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

re> However, I'm not inclined to publically chastise the vendor or
re> device model. It is a major vendor and a popular
re> device. 'nuff said.

It's not really enough for me, but what's more the case doesn't match
what we were looking for: a device which ``never returns error codes,
always returns silently bad data.'' I asked for this because you said
``However, not all devices return error codes which indicate
unrecoverable reads,'' which I think is wrong. Rather, most devices
sometimes don't, not some devices always don't.

Your experience doesn't say anything about this drive's inability to
return UNC errors. It says you suspect it of silently returning bad
data, once, but your experience doesn't even clearly implicate the
device once: It could have been cabling/driver/power-supply/zfs-bugs
when the block was written. I was hoping for a device in your ``bad
stack'' which does it over and over.

Remember, I'm not arguing ZFS checksums are worthless---I think
they're great. I'm arguing with your original statement that ZFS is
the only software RAID which deals with the dominant error you find in
your testing, unrecoverable reads. This is untrue!

re> This number should scare the *%^ out of you. It basically
re> means that no data redundancy is a recipe for disaster.

yeah, but that 9.5% number alone isn't an argument for ZFS over other
software LVM's.

re> 0.466%/yr is a per-disk rate. If you have 10 disks, your
re> exposure is 4.6% per year. For 100 disks, 46% per year, etc.

no, you're doing the statistics wrong, and in a really elementary way.
You're counting multiple times the possible years in which more than
one disk out of the hundred failed. If what you care about for 100
disks is that no disk experiences an error within one year, then you
need to calculate

(1 - 0.00466) ^ 100 = 62.7%

so that's 37% probability of silent corruption. For 10 disks, the
mistake doesn't make much difference and 4.6% is about right.
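
You can sanity-check the arithmetic in a couple of lines (0.00466 is the per-disk annual rate from the paper):

p = 0.00466
for n in (10, 100):
    # probability that at least one of n disks sees silent corruption in a year
    print(n, 1 - (1 - p) ** n)   # ~0.0456 for 10 disks, ~0.373 for 100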

I don't dispute ZFS checksums have value, but the point stands that
the reported-error failure mode is 20x more common in netapp's study
than this one, and other software LVM's do take care of the more
common failure mode.

re> UNCs don't cause ZFS to freeze as long as failmode != wait or
re> ZFS manages the data redundancy.

The time between issuing the read and getting the UNC back can be up
to 30 seconds, and there are often several unrecoverable sectors in a
row as well as lower-level retries multiplying this 30-second value.
so, it ends up being a freeze.

To fix it, ZFS needs to dispatch read requests for redundant data if
the driver doesn't reply quickly. ``Quickly'' can be ambiguous, but
the whole point of FMD was supposed to be that complicated statistics
could be collected at various levels to identify even more subtle
things than READ and CKSUM errors, like drives that are working at
1/10th the speed they should be, yet right now we can't even flag a
drive taking 30 seconds to read a sector. ZFS is still ``patiently
waiting'', and now that FMD is supposedly integrated instead of a
discussion of what knobs and responses there are, you're passing the
buck to the drivers and their haphazard nonuniform exception state
machines. The best answer isn't changing drivers to make the drive
timeout in 15 seconds instead---it's to send the read to other disks
quickly using a very simple state machine, and start actually using
FMD and a complicated state machine to generate suspicion-events for
slow disks that aren't returning errors.
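
To make that ``very simple state machine'' concrete, here is the kind of thing I mean, as a sketch only -- the 0.5-second threshold, the read_from() callback, and everything else below are made-up illustrations, not ZFS internals:

# Sketch: give the primary device a short deadline; if it hasn't answered,
# issue the same read to the redundant copies and return whichever
# completes first. Error handling and FMD-style telemetry are omitted.
import concurrent.futures as cf

def read_with_failover(read_from, devices, block, deadline=0.5):
    pool = cf.ThreadPoolExecutor(max_workers=len(devices))
    futures = [pool.submit(read_from, devices[0], block)]
    done, _ = cf.wait(futures, timeout=deadline)
    if not done:
        # primary is slow: fan the read out to the other copies
        futures += [pool.submit(read_from, d, block) for d in devices[1:]]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)   # don't block on the laggard
    return next(iter(done)).result()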

Also the driver and mid-layer need to work with the hypothetical
ZFS-layer timeouts to be as good as possible about not stalling the
SATA chip, the channel if there's a port multiplier, or freezing the
whole SATA stack including other chips, just because one disk has an
outstanding READ command waiting to get an UNC back.

In some sense the disk drivers and ZFS have different goals. The goal
of drivers should be to keep marginal disk/cabling/... subsystems
online as aggressively as possible, while the goal of ZFS should be to
notice and work around slightly-failing devices as soon as possible.
I thought the point of putting off reasonable exception handling for
two years while waiting for FMD, was to be able to pursue both goals
simultaneously without pressure to compromise one in favor of the
other.

In addition, I'm repeating myself like crazy at this point, but ZFS
tools used for all pools like 'zpool status' need to not freeze when a
single pool, or single device within a pool, is unavailable or slow,
and this expectation is having nothing to do with failmode on the
failing pool. And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted, again
nothing to do with failmode.

Neither is the case now, and it's not a driver fix, but even beyond
fixing these basic problems there's vast room for improvement, to
deliver something better than LVM2 and closer to NetApp, rather than
just catching up.
Ian Collins
2008-08-27 22:21:30 UTC
Permalink
Post by Miles Nordin
In addition, I'm repeating myself like crazy at this point, but ZFS
tools used for all pools like 'zpool status' need to not freeze when a
single pool, or single device within a pool, is unavailable or slow,
and this expectation is having nothing to do with failmode on the
failing pool. And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted, again
nothing to do with failmode.
I agree with the bulk of this post, but I'd like to add to this last point.
I've had a few problems with ZFS tools hanging on recent builds due to
problems with a pool on a USB stick. One tiny $20 component causing a fault
that required a reboot of the host. This really shouldn't happen.

Ian
Toby Thain
2008-08-27 22:39:20 UTC
Permalink
Post by Ian Collins
Post by Miles Nordin
In addition, I'm repeating myself like crazy at this point, but ZFS
tools used for all pools like 'zpool status' need to not freeze when a
single pool, or single device within a pool, is unavailable or slow,
and this expectation is having nothing to do with failmode on the
failing pool. And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted, again
nothing to do with failmode.
I agree with the bulk of this post, but I'd like to add to this last point.
I've had a few problems with ZFS tools hanging on recent builds due to
problems with a pool on a USB stick. One tiny $20 component
causing a fault
that required a reboot of the host. This really shouldn't happen.
Let's not be too quick to assign blame, or to think that perfecting
the behaviour is straightforward or even possible.

Traditionally, systems bearing 'enterprisey' expectations were/are
integrated hardware and software from one vendor (e.g. Sun) which
could be certified as a unit.

Start introducing 'random $20 components' and you begin to dilute the
quality and predictability of the composite system's behaviour.

If hard drive firmware is as cr*ppy as anecdotes indicate, what can
we really expect from a $20 USB pendrive?

--Toby
Post by Ian Collins
Ian
_______________________________________________
zfs-discuss mailing list
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Tim
2008-08-27 22:43:12 UTC
Permalink
Post by Toby Thain
Let's not be too quick to assign blame, or to think that perfecting
the behaviour is straightforward or even possible.
Traditionally, systems bearing 'enterprisey' expectations were/are
integrated hardware and software from one vendor (e.g. Sun) which
could be certified as a unit.
PSSSHHH, Sun should be certifying every piece of hardware that is, or will
ever be released. Community putback shmamunnity putback.
Post by Toby Thain
Start introducing 'random $20 components' and you begin to dilute the
quality and predictability of the composite system's behaviour.
But this NEVER happens on linux *grin*.
Post by Toby Thain
If hard drive firmware is as cr*ppy as anecdotes indicate, what can
we really expect from a $20 USB pendrive?
--Toby
Perfection?

--Tim
Todd H. Poole
2008-08-30 05:05:15 UTC
Permalink
Post by Toby Thain
Let's not be too quick to assign blame, or to think that perfecting
the behaviour is straightforward or even possible.
Start introducing random $20 components and you begin to dilute the
quality and predictability of the composite system's behaviour.
But this NEVER happens on linux *grin*.
Actually, it really doesn't! At least, it hasn't in many years...

I can't tell if you were being sarcastic or not, but honestly... you find a USB drive that can bring down your Linux machine, and I'll show you someone running a kernel from November of 2003. And for all the other "cheaper" components out there? Those are the components we make serious bucks off of. Just because it costs $30 doesn't mean it won't last a _really_ long time under stress! But if it doesn't, even when hardware fails, software's always there to route around it. So no biggie.
Post by Toby Thain
Perfection?
Is Linux perfect?
Not even close. But it's certainly a lot closer when it comes to what this thread seems to be about: not crashing.

Linux may get a small number of things wrong, but it gets a ridiculously large number of them right, and stability/reliability on unstable/unreliable hardware is one of them. ;)

PS: I found this guy's experiment amusing. Talk about adding a bunch of cheap, $20 crappy components to a system, and still seeing it soar. http://linuxgazette.net/151/weiner.html
--
This message posted from opensolaris.org
Ian Collins
2008-08-27 23:04:12 UTC
Permalink
Post by Miles Nordin
In addition, I'm repeating myself like crazy at this point, but ZFS
tools used for all pools like 'zpool status' need to not freeze when a
single pool, or single device within a pool, is unavailable or slow,
and this expectation is having nothing to do with failmode on the
failing pool. And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted, again
nothing to do with failmode.
I agree with the bulk of this post, but I'd like to add to this last
point.
I've had a few problems with ZFS tools hanging on recent builds due to
problems with a pool on a USB stick. One tiny $20 component causing a
fault
that required a reboot of the host. This really shouldn't happen.
Let's not be too quick to assign blame, or to think that perfecting the
behaviour is straightforward or even possible.
I'm not assigning blame, just illustrating a problem.

If you look back a week or so you will see a thread I started with the
subject " ZFS commands hanging in B95". This thread went off list but the
cause was tracked back to a problem with a USB pool.
Traditionally, systems bearing 'enterprisey' expectations were/are
integrated hardware and software from one vendor (e.g. Sun) which could
be certified as a unit.
Start introducing 'random $20 components' and you begin to dilute the
quality and predictability of the composite system's behaviour.
So we shouldn't be using USB sticks to transfer data between home and office
systems? If the stick was a FAT device and it crapped out or was removed
without unmounting, the system would not have hung.
If hard drive firmware is as cr*ppy as anecdotes indicate, what can we
really expect from a $20 USB pendrive?
All the more reason not to lock up if one craps out.

Ian
Bob Friesenhahn
2008-08-27 22:42:59 UTC
Permalink
Post by Miles Nordin
In some sense the disk drivers and ZFS have different goals. The goal
of drivers should be to keep marginal disk/cabling/... subsystems
online as aggressively as possible, while the goal of ZFS should be to
notice and work around slightly-failing devices as soon as possible.
My buffer did overflow from this email, but I still noticed the stated
goal of ZFS, which might differ from the objectives the ZFS authors
have been working toward these past seven years. Could you please
define "slightly-failing device" as well as how ZFS can know when the
device is slightly-failing so it can start to work around it?

Thanks,

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2008-08-27 23:24:59 UTC
Permalink
Post by Miles Nordin
Post by Miles Nordin
If you really mean there are devices out there which never
return error codes, and always silently return bad data, please
tell us which one and the story of when you encountered it,
re> I blogged about one such case.
re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file
re> However, I'm not inclined to publically chastise the vendor or
re> device model. It is a major vendor and a popular
re> device. 'nuff said.
It's not really enough for me, but what's more the case doesn't match
what we were looking for: a device which ``never returns error codes,
always returns silently bad data.'' I asked for this because you said
``However, not all devices return error codes which indicate
unrecoverable reads,'' which I think is wrong. Rather, most devices
sometimes don't, not some devices always don't.
I really don't know how to please you. I've got a bunch of
borken devices of all sorts. If you'd like to stop by some time
and rummage in the boneyard, feel free. Make it quick before
my wife makes me clean up :-) For the device which
I mentioned in my blog, it does return bad data far more often
than I'd like. But that is why I only use it for testing and don't
store my wife's photo album on it. Anyone who has been
around for a while will have similar anecdotes.
Post by Miles Nordin
Your experience doesn't say anything about this drive's inability to
return UNC errors. It says you suspect it of silently returning bad
data, once, but your experience doesn't even clearly implicate the
device once: It could have been cabling/driver/power-supply/zfs-bugs
when the block was written. I was hoping for a device in your ``bad
stack'' which does it over and over.
Remember, I'm not arguing ZFS checksums are worthless---I think
they're great. I'm arguing with your original statement that ZFS is
the only software RAID which deals with the dominant error you find in
your testing, unrecoverable reads. This is untrue!
To be clear, I claim:
1. The dominant failure mode in my field data for magnetic disks is
unrecoverable reads. You need some sort of data protection to get
past this problem.
2. Unrecoverable reads are not always reported by disk drives.
3. You really want a system that performs end-to-end data verification,
and if you don't bother to code that into your applications, then you
might rely on ZFS to do it for you. If you ignore this problem, it will
not go away.
Post by Miles Nordin
re> This number should scare the *%^ out of you. It basically
re> means that no data redundancy is a recipe for disaster.
yeah, but that 9.5% number alone isn't an argument for ZFS over other
software LVM's.
re> 0.466%/yr is a per-disk rate. If you have 10 disks, your
re> exposure is 4.6% per year. For 100 disks, 46% per year, etc.
no, you're doing the statistics wrong, and in a really elementary way.
You're counting multiple times the possible years in which more than
one disk out of the hundred failed. If what you care about for 100
disks is that no disk experiences an error within one year, then you
need to calculate
(1 - 0.00466) ^ 100 = 62.7%
so that's 37% probability of silent corruption. For 10 disks, the
mistake doesn't make much difference and 4.6% is about right.
Indeed. Intuitively, the AFR and population are more easily grokked by
the masses. But if you go into a customer and say "dude, there is only a
62.7% chance that your system won't be affected by a silent data corruption
problem this year with my (insert favorite non-ZFS, non-NetApp solution
here)" then you will have a difficult sale.
Post by Miles Nordin
I don't dispute ZFS checksums have value, but the point stands that
the reported-error failure mode is 20x more common in netapp's study
than this one, and other software LVM's do take care of the more
common failure mode.
I agree.
Post by Miles Nordin
re> UNCs don't cause ZFS to freeze as long as failmode != wait or
re> ZFS manages the data redundancy.
The time between issuing the read and getting the UNC back can be up
to 30 seconds, and there are often several unrecoverable sectors in a
row as well as lower-level retries multiplying this 30-second value.
so, it ends up being a freeze.
Untrue. There are disks which will retry forever. But don't take
my word for it, believe another RAID software vendor:
http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and
[sorry about the redirect, you have to sign up for an Adaptec
webinar before you can get to the list of webinars, so it is hard
to provide the direct URL]

Incidentally, I have one such disk in my boneyard, but it isn't
much fun to work with because it just sits there and spins when
you try to access the bad sector.
Post by Miles Nordin
To fix it, ZFS needs to dispatch read requests for redundant data if
the driver doesn't reply quickly. ``Quickly'' can be ambiguous, but
the whole point of FMD was supposed to be that complicated statistics
could be collected at various levels to identify even more subtle
things than READ and CKSUM errors, like drives that are working at
1/10th the speed they should be, yet right now we can't even flag a
drive taking 30 seconds to read a sector. ZFS is still ``patiently
waiting'', and now that FMD is supposedly integrated instead of a
discussion of what knobs and responses there are, you're passing the
buck to the drivers and their haphazard nonuniform exception state
machines. The best answer isn't changing drivers to make the drive
timeout in 15 seconds instead---it's to send the read to other disks
quickly using a very simple state machine, and start actually using
FMD and a complicated state machine to generate suspicion-events for
slow disks that aren't returning errors.
I think the proposed timeouts here are too short, but the idea has
merit. Note that such a preemptive read will have negative performance
impacts for high-workload systems, so it will not be a given that people
will want this enabled by default. Designing such a proactive system
which remains stable under high workloads may not be trivial.
Please file an RFE at http://bugs.opensolaris.org
Post by Miles Nordin
Also the driver and mid-layer need to work with the hypothetical
ZFS-layer timeouts to be as good as possible about not stalling the
SATA chip, the channel if there's a port multiplier, or freezing the
whole SATA stack including other chips, just because one disk has an
outstanding READ command waiting to get an UNC back.
In some sense the disk drivers and ZFS have different goals. The goal
of drivers should be to keep marginal disk/cabling/... subsystems
online as aggressively as possible, while the goal of ZFS should be to
notice and work around slightly-failing devices as soon as possible.
I thought the point of putting off reasonable exception handling for
two years while waiting for FMD, was to be able to pursue both goals
simultaneously without pressure to compromise one in favor of the
other.
In addition, I'm repeating myself like crazy at this point, but ZFS
tools used for all pools like 'zpool status' need to not freeze when a
single pool, or single device within a pool, is unavailable or slow,
and this expectation is having nothing to do with failmode on the
failing pool. And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted, again
nothing to do with failmode.
You mean something like:
http://bugs.opensolaris.org/view_bug.do?bug_id=6667208
http://bugs.opensolaris.org/view_bug.do?bug_id=6667199

Yes, we all wish these to be fixed soon.
Post by Miles Nordin
Neither is the case now, and it's not a driver fix, but even beyond
fixing these basic problems there's vast room for improvement, to
deliver something better than LVM2 and closer to NetApp, rather than
just catching up.
If you find more issues, then please file bugs. http://bugs.opensolaris.org
-- richard
Ian Collins
2008-08-27 23:41:43 UTC
Permalink
Post by Richard Elling
I think the proposed timeouts here are too short, but the idea has
merit. Note that such a preemptive read will have negative performance
impacts for high-workload systems, so it will not be a given that people
will want this enabled by default. Designing such a proactive system
which remains stable under high workloads may not be trivial.
Isn't this how things already work with mirrors? By this I mean requests
are issued to all devices and if the first returned data is OK, the others
are not required.

Ian
Richard Elling
2008-08-27 23:59:29 UTC
Permalink
Post by Ian Collins
Post by Richard Elling
I think the proposed timeouts here are too short, but the idea has
merit. Note that such a preemptive read will have negative performance
impacts for high-workload systems, so it will not be a given that people
will want this enabled by default. Designing such a proactive system
which remains stable under high workloads may not be trivial.
Isn't this how things already work with mirrors? By this I mean
requests are issued to all devices and if the first returned data is
OK, the others are not required.
No. Yes. Sometimes. The details on the choice of read targets vary by
implementation. I've seen some telco systems which work this way,
but most of the general purpose systems will choose one target for
the read based on some policy: round-robin, location, etc. This way
you could get the read performance of all disks operating concurrently.
-- richard
Ian Collins
2008-08-28 00:38:04 UTC
Permalink
Post by Richard Elling
Post by Ian Collins
Post by Richard Elling
I think the proposed timeouts here are too short, but the idea has
merit. Note that such a preemptive read will have negative performance
impacts for high-workload systems, so it will not be a given that people
will want this enabled by default. Designing such a proactive system
which remains stable under high workloads may not be trivial.
Isn't this how things already work with mirrors? By this I mean requests
are issued to all devices and if the first returned data is OK, the
others are not required.
No. Yes. Sometimes. The details on choice of read targets varies by
implementation. I've seen some telco systems which work this way,
but most of the general purpose systems will choose one target for
the read based on some policy: round-robin, location, etc. This way
you could get the read performance of all disks operating concurrently.
Would it be possible to get ZFS to work the way I described? I was looking
at using an exported iSCSI target from a machine in another building to
mirror a fileserver with a mainly (>95%) read workload. A "first one
back" read implementation would be a good fit for that situation.

Ian
Anton B. Rang
2008-08-28 20:35:43 UTC
Permalink
Many mid-range/high-end RAID controllers work by having a small timeout on individual disk I/O operations. If the disk doesn't respond quickly, they'll issue an I/O to the redundant disk(s) to get the data back to the host in a reasonable time. Often they'll change parameters on the disk to limit how long the disk retries before returning an error for a bad sector (this is standardized for SCSI, I don't recall offhand whether any of this is standardized for ATA).

RAID 3 units, e.g. DataDirect, issue I/O to all disks simultaneously and when enough (N-1 or N-2) disks return data, they'll return the data to the host. At least they do that for full stripes. But this strategy works better for sequential I/O, not so good for random I/O, since you're using up extra bandwidth.

Host-based RAID/mirroring almost never takes this strategy for two reasons. First, the bottleneck is almost always the channel from disk to host, and you don't want to clog it. [Yes, I know there's more bandwidth there than the sum of the disks, but consider latency.] Second, to read from two disks on a mirror, you'd need two memory buffers.
--
This message posted from opensolaris.org
Miles Nordin
2008-08-28 01:27:27 UTC
Permalink
re> I really don't know how to please you.

dd from the raw device instead of through ZFS would be better. If you
could show that you can write data to a sector, and read back
different data, without getting an error, over and over, I'd be
totally stunned.
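
(For concreteness, here is roughly what that test could look like; the
device name, sector offset, and scratch file are made up, and the
second dd overwrites whatever is stored at that sector, so only try it
on a scratch disk:

# dd if=/dev/urandom of=/var/tmp/pattern bs=512 count=1
# dd if=/var/tmp/pattern of=/dev/rdsk/c1t0d0s0 bs=512 count=1 oseek=12345
# dd if=/dev/rdsk/c1t0d0s0 bs=512 count=1 iseek=12345 | cmp - /var/tmp/pattern

Repeat the read a few times: cmp stays quiet if the sector comes back
identical, and complains if it doesn't.)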

The netapp paper was different from your test in many ways that make
their claim that ``all drives silently corrupt data sometimes'' more
convincing than your claim that you have ``one drive which silently
corrupts data always and never returns UNC'':

* not a desktop. The circumstances were more tightly-controlled,
and their drive population installed in a repeated way

* their checksum measurement was better than ZFS's by breaking the
type of error up into three buckets instead of one, and their
filesystem more mature, and their filesystem is not already known
to count CKSUM errors for circumstances other than silent
corruption, which argues the checksums are less likely to come
from software bugs

* they make statistical arguments that at least some of the errors
are really coming from the drives by showing they have spatial
locality w.r.t. the LBA on the drive, and are correlated with
drive age and impending drive failure.

The paper was less convincing in one way:

* their drives are using nonstandard firmware

re> Anyone who has been around for a while will have similar
re> anecdotes.

yeah, you'd think, but my similar anecdote is that (a) I can get UNC's
repeatably on a specific bad sector that persist either forever or
until I write new data to that sector with dd, and do get them on at
least 10% of my drives per year, and (b) I get CKSUM errors from ZFS
all the time with my iSCSI ghetto-SAN and with an IDE/Firewire mirror,
often from things I can specifically trace back to
not-a-drive-failure, but so far never from something I can for certain
trace back to silent corruption by the disk drive.

I don't doubt that it happens, but CKSUM isn't a way to spot it. ZFS
may give me a way to stop it, but it doesn't give me an accurate way
to measure/notice it.

re> Indeed. Intuitively, the AFR and population is more easily
re> grokked by the masses.

It's nothing to do with masses. There's an error in your math. It's
not right under any circumstance.

Your point that a 100 drive population has bad/high odds of having
silent corruption within a year isn't diminished by the correction,
but it would be nice if you would own up to the statistics mistake
since we're taking you at your word on a lot of other statistics.
Post by Miles Nordin
so, it ends up being a freeze.
re> Untrue. There are disks which will retry forever.

I don't understand. ZFS freezes until the disk stops retrying and
returns an error. Because some disks never stop retrying and never
return an error, just lock up until they're power-cycled, it's untrue
that ZFS freezes? I think either you or I have lost the thread of the
argument in our reply chain bantering.

re> please file bugs.

k., I filed the NFS bug, but unfortunately I don't have output to cut
and paste into it. glad to see the 'zpool status' bug is there
already and includes the point that lots of other things are probably
hanging which shouldn't.
Richard Elling
2008-08-28 13:49:28 UTC
Permalink
Post by Miles Nordin
re> Indeed. Intuitively, the AFR and population is more easily
re> grokked by the masses.
It's nothing to do with masses. There's an error in your math. It's
not right under any circumstance.
There is no error in my math. I presented a failure rate for a time
interval, you presented a probability of failure over a time interval. The two are
both correct, but say different things. Mathematically, an AFR > 100%
is quite possible and quite common. A probability of failure > 100% (1.0)
is not. In my experience, failure rates described as annualized failure
rates (AFR) are more intuitive than their mathematically equivalent
counterpart: MTBF.
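
(For reference, assuming a constant failure rate, the annualized rate
is 8760 hours / MTBF, so an MTBF of 1,000,000 hours is roughly 0.9
failures per 100 drive-years (0.9%/yr), and an MTBF of six months is
200%/yr. The corresponding one-year failure *probability* is
1 - exp(-rate * 1 yr), which for those two cases is about 0.9% and
86%, and can never exceed 100%.)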
-- richard
Miles Nordin
2008-08-28 16:54:26 UTC
Permalink
re> There is no error in my math. I presented a failure rate for
re> a time interval,

What is a ``failure rate for a time interval''?

AIUI, the failure rate for a time interval is 0.46% / yr, no matter how
many drives you have.
Jonathan Loran
2008-08-28 18:13:15 UTC
Permalink
Post by Miles Nordin
What is a ``failure rate for a time interval''?
Failure rate => Failures/unit time
Failure rate for a time interval => (Failures/unit time) * time

For example, if we have a failure rate:

Fr = 46% failures/month

Then the expectation value of a failure in one year:

Fe = 46% failures/month * 12 months = 5.52 failures


Jon
--
- _____/ _____/ / - Jonathan Loran - -
- / / / IT Manager -
- _____ / _____ / / Space Sciences Laboratory, UC Berkeley
- / / / (510) 643-5146 ***@ssl.berkeley.edu
- ______/ ______/ ______/ AST:7731^29u18e3
Miles Nordin
2008-08-28 18:42:59 UTC
Permalink
jl> Fe = 46% failures/month * 12 months = 5.52 failures

the original statistic wasn't of this kind. It was ``likelihood a
single drive will experience one or more failures within 12 months''.

so, you could say, ``If I have a thousand drives, about 4.66 of those
drives will silently-corrupt at least once within 12 months.'' It is
0.466% no matter how many drives you have.

And it's 4.66 drives, not 4.66 corruptions. The estimated number of
corruptions is higher because some drives will corrupt twice, or
thousands of times. It's not a BER, so you can't just add it like
Richard did.

If the original statistic in the paper were of the kind you're talking
about, it would be larger than 0.466%. I'm not sure it would capture
the situation well, though. I think you'd want to talk about bits of
recoverable data after one year, not corruption ``events'', and this
is not really measured well by the type of telemetry NetApp has. If
it were, though, it would still be the same size number no matter how
many drives you had.

The 37% I gave was ``one or more within a population of 100 drives
silently corrupts within 12 months.'' The 46% Richard gave has no
meaning, and doesn't mean what you just said. The only statistic
under discussion which (a) gets intimidatingly large as you increase
the number of drives, and (b) is a ratio rather than, say, an absolute
number of bits, is the one I gave.
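
(For reference, the arithmetic behind the two figures: with a
per-drive probability p = 0.466% of at least one silent corruption in
12 months, the chance that at least one drive out of 100 is affected
is 1 - (1 - 0.00466)^100, which is about 37%, whereas
100 * 0.466% = 46.6% is an expected count of affected drives per
hundred rather than a probability.)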
Robert Milkowski
2008-08-28 13:55:10 UTC
Permalink
Hello Miles,

Wednesday, August 27, 2008, 10:51:49 PM, you wrote:

MN> It's not really enough for me, but what's more the case doesn't match
MN> what we were looking for: a device which ``never returns error codes,
MN> always returns silently bad data.'' I asked for this because you said
MN> ``However, not all devices return error codes which indicate
MN> unrecoverable reads,'' which I think is wrong. Rather, most devices
MN> sometimes don't, not some devices always don't.



Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf
--
Best regards,
Robert Milkowski mailto:***@task.gda.pl
http://milek.blogspot.com
Richard Elling
2008-08-28 15:04:37 UTC
Permalink
Post by Robert Milkowski
Hello Miles,
MN> It's not really enough for me, but what's more the case doesn't match
MN> what we were looking for: a device which ``never returns error codes,
MN> always returns silently bad data.'' I asked for this because you said
MN> ``However, not all devices return error codes which indicate
MN> unrecoverable reads,'' which I think is wrong. Rather, most devices
MN> sometimes don't, not some devices always don't.
Please look for slides 23-27 at http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf
You really don't have to look very far to find this sort of thing.
The scar just below my left knee is directly attributed to a bugid
fixed in patch 106129-12. Warning: the following link may
frighten experienced datacenter personnel, fortunately, the affected
device is long since EOL.
http://sunsolve.sun.com/search/document.do?assetkey=1-21-106129-12-1
-- richard
Miles Nordin
2008-08-28 16:55:45 UTC
Permalink
rm> Please look for slides 23-27 at
rm> http://unixdays.pl/i/unixdays-prezentacje/2007/milek.pdf

yeah, ok, ONCE AGAIN, I never said that checksums are worthless.

relling: some drives don't return errors on unrecoverable read events.
carton: I doubt that. Tell me a story about one that doesn't.

Your stories are about storage subsystems again, not drives. Also
most or all of the slides aren't about unrecoverable read events.
Justin
2008-08-25 03:41:26 UTC
Permalink
aye mate, I had the exact same problem, but where i work, we pay some pretty serious dollars for a direct 24/7 line to some of sun's engineers, so i decided to call them up. after spending some time with tech support, i never really got the thing resolved, and i instead ended up going back to debian for all of our simple ide-based file servers.

if you really just want zfs, you can add it to whatever installation you've got now (opensuse?) through something like zfs-fuse, but you might take a 10-15% performance hit. if you don't want that, and you're not too concerned with violating a few licenses, you can just add it to your installation yourself, the source code is out there. you know, roll your own. ;-)

you just might be trying too hard to force a round peg into a square hole.

hey, besides, where you work? i registered because i know a guy with the same name


This message posted from opensolaris.org
Todd H. Poole
2008-08-25 08:25:58 UTC
Permalink
jalex? As in Justin Alex?

If you're who I think you are, don't you have a pretty long list of things you need to get done for Jerry before your little vacation?


This message posted from opensolaris.org
Justin
2008-08-25 08:58:54 UTC
Permalink
alright, alright, but it's your fault. you left your workstation logged on, what was i supposed to do? not chime in?

grotty yank


This message posted from opensolaris.org
Bob Friesenhahn
2008-08-25 15:37:36 UTC
Permalink
Post by Todd H. Poole
So aside from telling me to "[never] try this sort of thing with
IDE" does anyone else have any other ideas on how to prevent
OpenSolaris from locking up whenever an IDE drive is abruptly
disconnected from a ZFS RAID-Z array?
I think that your expectations from ZFS are reasonable. However, it
is useful to determine if pulling the IDE drive locks the entire IDE
channel, which serves the other disks as well. This could happen at a
hardware level, or at a device driver level. If this happens, then
there is nothing that ZFS can do.

Bob
======================================
Bob Friesenhahn
***@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Todd H. Poole
2008-08-26 21:09:12 UTC
Permalink
Post by Bob Friesenhahn
I think that your expectations from ZFS are
reasonable. However, it is useful to determine if pulling the IDE drive locks
the entire IDE channel, which serves the other disks as well. This
could happen at a hardware level, or at a device driver level. If this
happens, then there is nothing that ZFS can do.
Gotcha. But just to let you know, there are 4 SATA ports on the motherboard, with each drive getting its own port... how should I go about testing to see whether pulling one IDE drive (remember, they're really SATA drives, but they're being presented to the OS by the pci-ide driver) locks the entire IDE channel if there's only one drive per channel? Or do you think it's possible that two ports on the motherboard could be on one "logical channel" (for lack of a better phrase) while the other two are on the other, and thus we could test one drive while another on the same "logical channel" is unplugged?

Also, remember that OpenSolaris freezes when this occurs, so I'm only going to have 2-3 seconds to execute a command before Terminal and - after a few more seconds, the rest of the machine - stop responding to input...

I'm all for trying to test this, but I might need some instruction.
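
(I'm guessing one way might be to start some background I/O against a drive I'm NOT going to unplug and log per-device stats before pulling anything, something like "iostat -xn 2 > /var/tmp/iostat.log 2>&1 &" plus "dd if=/dev/rdsk/c3t1d0s0 of=/dev/null bs=1024k &" with the real device name substituted in, then pull the cable on a different drive, wait a minute, plug it back in, and read the log once the box recovers to see whether I/O to the untouched drive also stopped. But I'm not sure those are the right options, so corrections welcome.)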


This message posted from opensolaris.org
MC
2008-08-27 06:18:51 UTC
Permalink
Okay, so your AHCI hardware is not using an AHCI driver in solaris. A crash when pulling a cable is still not great, but it is understandable because that driver is old and bad and doesn't support hot swapping at all.

So there are two things to do here. File a bug about how pulling a sata cable crashes solaris when the device is using the old ide driver. And file another bug about how solaris recognizes your AHCI SATA hardware as old ide hardware.

The two bonus things to do are: come to the forum and bitch about the bugs to give them some attention, and come to the forum asking for help on making solaris recognize your AHCI SATA hardware properly :)

Good luck...
Post by Todd H. Poole
Gotcha. But just to let you know, there are 4 SATA
ports on the motherboard, with each drive getting its
own port... how should I go about testing to see
whether pulling one IDE drive (remember, they're
really SATA drives, but they're being presented to
the OS by the pci-ide driver) locks the entire IDE
channel if there's only one drive per channel? Or do
you think it's possible that two ports on the
motherboard could be on one "logical channel" (for
lack of a better phrase) while the other two are on
the other, and thus we could test one drive while
another on the same "logical channel" is unplugged?
Also, remember that OpenSolaris freezes when this
occurs, so I'm only going to have 2-3 seconds to
execute a command before Terminal and - after a few
more seconds, the rest of the machine - stop
responding to input...
I'm all for trying to test this, but I might need
some instruction.
This message posted from opensolaris.org
Florin Iucha
2008-08-27 12:51:03 UTC
Permalink
Post by MC
The two bonus things to do are: come to the forum and bitch about the bugs to give them some attention, and come to the forum asking for help on making solaris recognize your AHCI SATA hardware properly :)
Been there, done that. No t-shirt, though...

The Solaris kernel might be the best thing since MULTICS, but the lack
of drivers really hampers its spread.

florin
--
Bruce Schneier expects the Spanish Inquisition.
http://geekz.co.uk/schneierfacts/fact/163
Tim
2008-08-27 16:43:38 UTC
Permalink
Post by MC
Okay, so your AHCI hardware is not using an AHCI driver in solaris. A
crash when pulling a cable is still not great, but it is understandable
because that driver is old and bad and doesn't support hot swapping at all.
His AHCI is not using AHCI because he's set it not to. If linux is somehow
ignoring the BIOS configuration, and attempting to load an AHCI driver for
the hardware anyways, that's *BROKEN* behavior. I've yet to see WHAT driver
linux was using because he was too busy having a pissing match to get that
USEFUL information back to the list.

--Tim
Miles Nordin
2008-08-27 17:58:37 UTC
Permalink
m> file another bug about how solaris recognizes your AHCI SATA
m> hardware as old ide hardware.

I don't have that board but AIUI the driver attachment's chooseable in
the BIOS Blue Screen of Setup, by setting the controller to
``Compatibility'' mode (pci-ide) or ``Native'' mode (AHCI). This
particular chip must be run in Compatibility mode because of bug
6665032.
James C. McPherson
2008-08-24 20:52:51 UTC
Permalink
Post by Tim
I'm pretty sure pci-ide doesn't support hot-swap. I believe you need ahci.
You're correct, it doesn't. Furthermore, to the best of
my knowledge, it won't ever support hotswap.


James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
James C. McPherson
2008-08-24 12:30:08 UTC
Permalink
Post by Todd H. Poole
Hmm... I'm leaning away a bit from the hardware, but just in case you've
CPU: AMD Athlon X2 4850e 2.5GHz Socket AM2 45W Dual-Core Processor Model
ADH4850DOBOX
(http://www.newegg.com/Product/Product.aspx?Item=N82E16819103255)
Motherboard: GIGABYTE GA-MA770-DS3 AM2+/AM2 AMD 770 ATX All Solid
Capacitor AMD Motherboard
(http://www.newegg.com/Product/Product.aspx?Item=N82E16813128081)
..
Post by Todd H. Poole
The reason why I don't think there's a hardware issue is because before I
got OpenSolaris up and running, I had a fully functional install of
openSuSE 11.0 running (with everything similar to the original server) to
make sure that none of the components were damaged during shipping from
Newegg. Everything worked as expected.
Yes, but you're running a new operating system, new filesystem...
that's a mountain of difference right in front of you.


A few commands that you could provide the output from include:


(these two show any FMA-related telemetry)
fmadm faulty
fmdump -v

(this shows your storage controllers and what's connected to them)
cfgadm -lav

You'll also find messages in /var/adm/messages which might prove
useful to review.
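
(Option letters from memory, so double-check the man pages, but two
more that can help here are

iostat -En
zpool status -xv

the former for per-device soft/hard/transport error counters, the
latter to show only pools with problems, with verbose detail.)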


Apart from that, your description of what you're doing to simulate
failure is

"however, whenever I unplug the SATA cable from one of the drives (to
simulate a catastrophic drive failure) while doing moderate reading from the
zpool (such as streaming HD video), not only does the video hang on the
remote machine (which is accessing the zpool via NFS), but the server
running OpenSolaris seems to either hang, or become incredibly unresponsive."


First and foremost, for me, this is a stupid thing to do. You've
got common-or-garden PC hardware which almost *definitely* does not
support hot plug of devices. Which is what you're telling us that
you're doing. Would you try this with your pci/pci-e cards in this
system? I think not.


If you absolutely must do something like this, then please use
what's known as "coordinated hotswap" using the cfgadm(1m) command.


Viz:

(detect fault in disk c2t3d0, in some way)

# cfgadm -c unconfigure c2::dsk/c2t3d0
# cfgadm -c disconnect c2::dsk/c2t3d0

(go and swap the drive, plug in new drive with same cable)

# zpool replace -f poolname c2t3d0


What this will do is tell the kernel to do things in the
right order, and - for zpool - tell it to do an in-place
replacement of device c2t3d0 in your pool.
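
(Depending on the controller you may also need to bring the new drive
back online at the cfgadm level before the replace; as a sketch, using
the same ap_id as above:

# cfgadm -c connect c2::dsk/c2t3d0
# cfgadm -c configure c2::dsk/c2t3d0

and then run the zpool replace.)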


There are manpages and admin guides you could have a look
through, too:

http://docs.sun.com/app/docs/coll/40.17 (manpages)
http://docs.sun.com/app/docs/coll/47.23 (system admin collection)
http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide
http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide



James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Miles Nordin
2008-08-25 18:55:38 UTC
Permalink
jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
jcm> above, IDE is not designed to be able to cope with [unplugging
jcm> a cable]

It shouldn't have to be designed for it, if there's controller
redundancy. On Linux, one drive per IDE bus (not using any ``slave''
drives) seems like it should be enough for any electrical issue, but
is not quite good enough in my experience, when there are two PATA
busses per chip. but one hard drive per chip seems to be mostly okay.
In this SATA-based case, not even that much separation was necessary
for Linux to survive on the same hardware, but I agree with you and
haven't found that level with PATA either.

OTOH, if the IDE drivers are written such that a confusing interaction
with one controller chip brings down the whole machine, then I expect
the IDE drivers to do better. If they don't, why advise people to buy
twice as much hardware ``because, you know, controllers can also fail,
so you should have some controller redundancy''---the advice is worse
than a waste of money, it's snake oil---a false sense of security.

jcm> You could start by taking us seriously when we tell you that
jcm> what you've been doing is not a good idea, and find other ways
jcm> to simulate drive failures.

well, you could suggest a method.

except that the whole point of the story is, Linux, without any
blather about ``green-line'' and ``self-healing,'' without any
concerted platform-wide effort toward availability at all, simply
works more reliably.

thp> So aside from telling me to "[never] try this sort of thing
thp> with IDE" does anyone else have any other ideas on how to
thp> prevent OpenSolaris from locking up whenever an IDE drive is
thp> abruptly disconnected from a ZFS RAID-Z array?

yeah, get a Sil3124 card, which will run in native SATA mode and be
more likely to work. Then, redo your test and let us know what
happens.

The not-fully-voiced suggestion to run your ATI SB600 in native/AHCI
mode instead of pci-ide/compatibility mode is probably a bad one
because of bug 6665032: the chip is only reliable in compatibility
mode. You could trade your ATI board for an nVidia board for about
the same price as the Sil3124 add-on card. AIUI from Linux wiki:

http://ata.wiki.kernel.org/index.php/SATA_hardware_features

...says the old nVidia chips use nv_sata driver, and the new ones use
the ahci driver, so both of these are different from pci-ide and more
likely to work. Get an old one (MCP61 or older), and a new one (MCP65
or newer), repeat your test and let us know what happens.

If the Sil3124 doesn't work, and nv_sata doesn't work, and AHCI on
newer-nVidia doesn't work, then hook the drives up to Linux running
IET on basically any old chip, and mount them from Solaris using the
built-in iSCSI initiator.

If you use iSCSI, you will find:

you will get a pause like with NT. Also, if one of the iSCSI targets
is down, 'zpool status' might hang _every time_ you run it, not just
the first time when the failure is detected. The pool itself will
only hang the first time. Also, you cannot boot unless all iSCSI
targets are available, but you can continue running if some go away
after booting.

Overall IMHO it's not as good as LVM2, but it's more robust than
plugging the drives into Solaris. It also gives you the ability to
run smartctl on the drives (by running it natively on Linux) with full
support for all commands, while someone here who I told to run
smartctl reported that on Solaris 'smartctl -a' worked but 'smartctl
-t' did not. I still have performance problems with iSCSI. I'm not
sure yet if they're unresolvable: there are a lot of tweakables with
iSCSI, like disabling Nagle's algorithm, and enabling RED on the
initiator switchport, but first I need to buy faster CPU's for the
targets.
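
For anyone wanting to try the iSCSI route, the Solaris initiator side
is roughly this (target address and device names are placeholders):

# iscsiadm add discovery-address 192.168.0.10:3260
# iscsiadm modify discovery --sendtargets enable
# devfsadm -i iscsi
# zpool create tank mirror c4t1d0 c5t1d0

with the exported LUNs showing up as ordinary cXtYdZ disks, usually
with much longer target names than shown here.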

mh> Dying or dead disks will still normally be able to
mh> communicate with the driver to some extent, so they are still
mh> "there".

The dead disks I have which don't spin also don't respond to
IDENTIFY(0) so they don't really communicate with the driver at all.
now, possibly, *possibly* they are still responsive after they fail,
and become unresponsive after the first time they're
rebooted---because I think they load part of their firmware off the
platters. Also, ATAPI standard says that while ``still
communicating'' drives are allowed to take up to 30sec to answer each
command, which is probably too long to freeze a whole system. and
still, just because ``possibly,'' it doesn't make sense to replace a
tested-working system with a tested-broken system, not even after
someone tells a complicated story trying to convince you the broken
system is actually secretly working, just completely impossible to
test, so you have to accept it based on stardust and fantasy.

js> yanking the drives like that can seriously damage the
js> drives or your motherboard.

no, it can't.

And if I want a software developer's opinion on what will electrically
damage my machine, I'll be sure to let you know first.

jcm> If you absolutely must do something like this, then please use
jcm> what's known as "coordinated hotswap" using the cfgadm(1m)
jcm> command.

jcm> Viz:

jcm> (detect fault in disk c2t3d0, in some way)

jcm> # cfgadm -c unconfigure c2::dsk/c2t3d0 # cfgadm -c disconnect
jcm> c2::dsk/c2t3d0

so....dont dont DONT do it because its STUPID and it might FRY YOUR
DISK AND MOTHERBOARD. but, if you must do it, please warn our
software first?

I shouldn't have to say it, but aside from being absurd this
warning-command completely defeats the purpose of the test.

jcm> Yes, but you're running a new operating system, new
jcm> filesystem... that's a mountain of difference right in front
jcm> of you.

so we do agree that Linux's not freezing in the same scenario
indicates the difference is inside that mountain, which, however
large, is composed entirely of SOFTWARE.

re> The behavior of ZFS to an error reported by an underlying
re> device driver is tunable by the zpool failmode property. By
re> default, it is set to "wait."

I think you like speculation well enough, so long as it's optimistic.

which is the tunable setting that causes other pools, ones not even
including failed devices, to freeze?

Why is the failmode property involved at all in a pool that still has
enough replicas to keep functioning?

cg> We really need to fix (B). It seems the "easy" fixes are:

cg> - Configure faster timeouts and fewer retries on redundant
cg> devices, similar to drive manufacturers' RAID edition
cg> firmware. This could be via driver config file, or (better)
cg> automatically via ZFS, similar to write cache behaviour.

cg> - Propagate timeouts quickly between layers (immediate soft
cg> fail without retry) or perhaps just to the fault management
cg> system
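
(On the driver config file idea: for disks that attach via sd(7d), the
per-command timeout is tunable in /etc/system, something like

set sd:sd_io_time = 10

with the value in seconds; that is only an illustration, not a
recommended number. The OP's pci-ide stack goes through cmdk/ata
rather than sd, though, and I'm not sure an equivalent knob exists
there.)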

It's also important that things unrelated to the failure aren't
frozen. This was how I heard the ``green line'' marketing campaign
when it was pitched to me, and I found it really compelling because I
felt Linux had too little of this virtue. However compelling, I just
don't find it even slightly acquainted with reality.

I can understand ``unrelated'' is a tricky concept when the boot pool
is involved, but for example when it isn't involved: I've had problems
where one exported data pool's becoming FAULTED stops NFS service from
all other pools. The pool that FAULTED contained no Solaris binaries.

and the zpool status hangs people keep discovering.

I think this is a good test in general: configure two
almost-completely independent stacks through the same kernel:


NFS export NFS export

filesystem filesystem
pool pool

ZFS/NFS

driver driver

controller controller

disks disks


Simulate whatever you regard as a ``catastrophic'' or ``unplanned'' or
``really stupid'' failure, and see how big the shared region in the
middle can be without affecting the other stack. Right now, my
experience is even the stack above does not work. Maybe mountd gets
blocked or something, I don't know. Optimistically, we would of
course like this stack below to remain failure-separate:


NFS export NFS export

filesystem filesystem
pool pool

ZFS/NFS

driver

controller

disks disks


The OP is implying that, on Linux, that stack DOES keep failures separate.
However, even if ``hot plug'' (or ``hot unplug'' for demanding Linux
users) is not supported, at least this stack below should still be
failure-independent:


NFS export NFS export

filesystem filesystem
pool pool

ZFS/NFS

driver

controller controller

disks disks


I suspect it isn't because the less-demanding stack I started with
isn't failure-independent. There is probably more than one problem
making these failures spread more widely than they should, but so far
we can't even agree on what we wish were working.

I do think the failures need to be isolated better first, independent
of time. It's not ``a failure of a drive on the left should propagate
up the stack faster so that the stack on the right unfreezes before
anyone gets too upset.'' The stack on the right shouldn't freeze at
all.
Richard Elling
2008-08-26 18:10:53 UTC
Permalink
Post by Miles Nordin
jcm> Don't _ever_ try that sort of thing with IDE. As I mentioned
jcm> above, IDE is not designed to be able to cope with [unplugging
jcm> a cable]
It shouldn't have to be designed for it, if there's controller
redundancy. On Linux, one drive per IDE bus (not using any ``slave''
drives) seems like it should be enough for any electrical issue, but
is not quite good enough in my experience, when there are two PATA
busses per chip. but one hard drive per chip seems to be mostly okay.
In this SATA-based case, not even that much separation was necessary
for Linux to survive on the same hardware, but I agree with you and
haven't found that level with PATA either.
OTOH, if the IDE drivers are written such that a confusing interaction
with one controller chip brings down the whole machine, then I expect
the IDE drivers to do better. If they don't, why advise people to buy
twice as much hardware ``because, you know, controllers can also fail,
so you should have some controller redundancy''---the advice is worse
than a waste of money, it's snake oil---a false sense of security.
No snake oil. Pulling cables only simulates pulling cables. If you
are having difficulty with cables falling out, then this problem cannot
be solved with software. It *must* be solved with hardware.

But the main problem with "simulating disk failures by pulling cables"
is that the code paths executed during that test are different than those
executed when the disk fails in other ways. It is not simply an issue
of the success or failure of the test, but it is an issue of what you are
testing.

Studies have shown that pulled cables are not the dominant failure
mode in disk populations. Bairavasundaram et al. [1] showed that
data checksum errors are much more common. In some internal Sun
studies, we also see unrecoverable read as the dominant disk failure
mode. ZFS will do well for these errors, regardless of the underlying
OS. AFAIK, none of the traditional software logical volume managers
nor the popular open source file systems (other than ZFS :-) address
this problem.

[1]
http://www.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
-- richard
Miles Nordin
2008-08-26 18:38:16 UTC
Permalink
re> unrecoverable read as the dominant disk failure mode. [...]
re> none of the traditional software logical volume managers nor
re> the popular open source file systems (other than ZFS :-)
re> address this problem.

Other LVM's should address unrecoverable read errors as well or better
than ZFS, because that's when the drive returns an error instead of
data. Doing a good job with this error is mostly about not freezing
the whole filesystem for the 30sec it takes the drive to report the
error. Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency. I would expect all the software volume managers
including ZFS fail to do this. It's really hard to test without
somehow getting a drive that returns read errors frequently, but isn't
about to die within the month---maybe ZFS should have an error
injector at driver-level instead of block-level, and a model for
time-based errors. One thing other LVM's seem like they may do better
than ZFS, based on not-quite-the-same-scenario tests, is not freeze
filesystems unrelated to the failing drive during the 30 seconds it's
waiting for the I/O request to return an error.

In terms of FUD about ``silent corruption'', there is none of it when
the drive clearly reports a sector is unreadable. Yes, traditional
non-big-storage-vendor RAID5, and all software LVM's I know of except
ZFS, depend on the drives to report unreadable sectors. And,
generally, drives do. so let's be clear about that and not try to imply
that the ``dominant failure mode'' causes silent corruption for
everyone except ZFS and Netapp users---it doesn't.

The Netapp paper focused on when drives silently return incorrect
data, which is different than returning an error. Both Netapp and ZFS
do checksums to protect from this. However Netapp never claimed this
failure mode was more common than reported unrecoverable read errors,
just that it was more interesting. I expect it's much *less* common.

Further, we know Netapp loaded special firmware into the enterprise
drives in that study because they wanted the larger sector size. They
are likely also loading special firmware into the desktop drives to
make them return errors sooner than 30 seconds. so, it's not
improbable that the Netapp drives are more prone to deliver silently
corrupt data instead of UNC/seek errors compared to off-the-shelf
drives.

Finally, for the Google paper, silent corruption ``didn't even make
the chart.'' so, saying something didn't make your chart and saying
that it doesn't happen are two different things, and your favoured
conclusion has a stake in maintaining that view, too.
Richard Elling
2008-08-26 21:26:34 UTC
Permalink
Post by Miles Nordin
re> unrecoverable read as the dominant disk failure mode. [...]
re> none of the traditional software logical volume managers nor
re> the popular open source file systems (other than ZFS :-)
re> address this problem.
Other LVM's should address unrecoverable read errors as well or better
than ZFS, because that's when the drive returns an error instead of
data.
ZFS handles that case as well.
Post by Miles Nordin
Doing a good job with this error is mostly about not freezing
the whole filesystem for the 30sec it takes the drive to report the
error.
That is not a ZFS problem. Please file bugs in the appropriate category.
Post by Miles Nordin
Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency.
ZFS will handle this case as well.
Post by Miles Nordin
I would expect all the software volume managers
including ZFS fail to do this. It's really hard to test without
somehow getting a drive that returns read errors frequently, but isn't
about to die within the month---maybe ZFS should have an error
injector at driver-level instead of block-level, and a model for
time-based errors.
qv ztest.

Project comstar creates an opportunity for better testing in an open-source
way. However, it will only work for SCSI protocol and therefore does
not provide coverage for IDE devices -- which is not a long-term issue.
Post by Miles Nordin
One thing other LVM's seem like they may do better
than ZFS, based on not-quite-the-same-scenario tests, is not freeze
filesystems unrelated to the failing drive during the 30 seconds it's
waiting for the I/O request to return an error.
This is not operating in ZFS code.
Post by Miles Nordin
In terms of FUD about ``silent corruption'', there is none of it when
the drive clearly reports a sector is unreadable. Yes, traditional
non-big-storage-vendor RAID5, and all software LVM's I know of except
ZFS, depend on the drives to report unreadable sectors. And,
generally, drives do. so let's be clear about that and not try to imply
that the ``dominant failure mode'' causes silent corruption for
everyone except ZFS and Netapp users---it doesn't.
In my field data, the dominant failure mode for disks is unrecoverable
reads. If your software does not handle this case, then you should be
worried. We tend to recommend configuring ZFS to manage data
redundancy for this reason.
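In practice that means giving ZFS a mirror or raidz vdev rather than a
single LUN, e.g. (device names are only an example)
# zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0
so that an unrecoverable read or checksum error on one disk can be
repaired from the other devices.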
Post by Miles Nordin
The Netapp paper focused on when drives silently return incorrect
data, which is different than returning an error. Both Netapp and ZFS
do checksums to protect from this. However Netapp never claimed this
failure mode was more common than reported unrecoverable read errors,
just that it was more interesting. I expect it's much *less* common.
I would love for you produce data to that effect.
Post by Miles Nordin
Further, we know Netapp loaded special firmware into the enterprise
drives in that study because they wanted the larger sector size. They
are likely also loading special firmware into the desktop drives to
make them return errors sooner than 30 seconds. so, it's not
improbable that the Netapp drives are more prone to deliver silently
corrupt data instead of UNC/seek errors compared to off-the-shelf
drives.
I am not sure of the basis of your assertion. Can you explain
in more detail?
Post by Miles Nordin
Finally, for the Google paper, silent corruption ``didn't even make
the chart.'' so, saying something didn't make your chart and saying
that it doesn't happen are two different things, and your favoured
conclusion has a stake in maintaining that view, too.
The google paper[1] didn't deal with silent errors or corruption at all.
Section 2 describes in nice detail how they decided when a drive
was failed -- it was replaced. They also cite disk vendors who test
"failed" drives and many times the drives test clean (what they call
"no problem found"). This is not surprising because it is unlikely that
data corruption is detected in the systems under study.

[1] http://www.cs.cmu.edu/~bianca/fast07.pdf
-- richard
Mattias Pantzare
2008-08-27 01:32:58 UTC
Permalink
Post by Richard Elling
Post by Miles Nordin
Doing a good job with this error is mostly about not freezing
the whole filesystem for the 30sec it takes the drive to report the
error.
That is not a ZFS problem. Please file bugs in the appropriate category.
Whose problem is it? It can't be the device driver, as that has no
knowledge of zfs filesystems or redundancy.
Post by Richard Elling
Post by Miles Nordin
Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency.
ZFS will handle this case as well.
How is ZFS handling this? Is there a timeout in ZFS?
Post by Richard Elling
Post by Miles Nordin
One thing other LVM's seem like they may do better
than ZFS, based on not-quite-the-same-scenario tests, is not freeze
filesystems unrelated to the failing drive during the 30 seconds it's
waiting for the I/O request to return an error.
This is not operating in ZFS code.
In what way is freezing a ZFS filesystem not operating in ZFS code?

Notice that he wrote filesystems unrelated to the failing drive.
Post by Richard Elling
Post by Miles Nordin
In terms of FUD about ``silent corruption'', there is none of it when
the drive clearly reports a sector is unreadable. Yes, traditional
non-big-storage-vendor RAID5, and all software LVM's I know of except
ZFS, depend on the drives to report unreadable sectors. And,
generally, drives do. so let's be clear about that and not try to imply
that the ``dominant failure mode'' causes silent corruption for
everyone except ZFS and Netapp users---it doesn't.
In my field data, the dominant failure mode for disks is unrecoverable
reads. If your software does not handle this case, then you should be
worried. We tend to recommend configuring ZFS to manage data
redundancy for this reason.
He is writing that all software LVM's will handle unrecoverable reads.

What is your definition of unrecoverable reads?
Richard Elling
2008-08-27 04:40:04 UTC
Permalink
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
Doing a good job with this error is mostly about not freezing
the whole filesystem for the 30sec it takes the drive to report the
error.
That is not a ZFS problem. Please file bugs in the appropriate category.
Whose problem is it? It can't be the device driver, as that has no
knowledge of zfs filesystems or redundancy.
In most cases it is the drivers below ZFS. For an IDE disk it
might be cmdk(7d) over ata(7d). For a USB disk it might be sd(7d)
over scsa2usb(7d) over ehci(7d). prtconf -D will show which
device drivers are attached to your system.

If you search the ZFS source code, you will find very little error
handling of devices, by design.
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency.
ZFS will handle this case as well.
How is ZFS handling this? Is there a timeout in ZFS?
Not for this case, but if configured to manage redundancy, ZFS will
"read redundant data" from alternate devices.

A business metric such as reasonable transaction latency would live
at a level above ZFS.
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
One thing other LVM's seem like they may do better
than ZFS, based on not-quite-the-same-scenario tests, is not freeze
filesystems unrelated to the failing drive during the 30 seconds it's
waiting for the I/O request to return an error.
This is not operating in ZFS code.
In what way is freezing a ZFS filesystem not operating in ZFS code?
Notice that he wrote filesystems unrelated to the failing drive.
At the ZFS level, this is dictated by the failmode property.
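For reference, the property is per-pool and can be inspected or
changed on the fly (pool name is just an example):
# zpool get failmode tank
# zpool set failmode=continue tank
with wait, continue, and panic as the possible values.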
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
In terms of FUD about ``silent corruption'', there is none of it when
the drive clearly reports a sector is unreadable. Yes, traditional
non-big-storage-vendor RAID5, and all software LVM's I know of except
ZFS, depend on the drives to report unreadable sectors. And,
generally, drives do. so let's be clear about that and not try to imply
that the ``dominant failure mode'' causes silent corruption for
everyone except ZFS and Netapp users---it doesn't.
In my field data, the dominant failure mode for disks is unrecoverable
reads. If your software does not handle this case, then you should be
worried. We tend to recommend configuring ZFS to manage data
redundancy for this reason.
He is writing that all software LVM's will handle unrecoverable reads.
I agree. And if ZFS is configured to manage redundancy and a disk
read returns EIO or the checksum does not match, then ZFS will
attempt to read from the redundant data. However, not all devices return
error codes which indicate unrecoverable reads. Also, data corrupted
in the data path between media and main memory may not have an
associated error condition reported.

I find comparing unprotected ZFS configurations with LVMs
using protected configurations to be disingenuous.
Post by Mattias Pantzare
What is your definition of unrecoverable reads?
I wrote data, but when I try to read, I don't get back what I wrote.
-- richard
Mattias Pantzare
2008-08-27 10:44:27 UTC
Permalink
Post by Richard Elling
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency.
ZFS will handle this case as well.
How is ZFS handling this? Is there a timeout in ZFS?
Not for this case, but if configured to manage redundancy, ZFS will
"read redundant data" from alternate devices.
No, ZFS will not, ZFS waits for the device driver to report an error,
after that it will read from alternate devices.

ZFS could detect that there is probably a problem with the device and
read from an alternate device much faster while it waits for the
device to answer.

You can't do this at any other level than ZFS.
Post by Richard Elling
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
One thing other LVM's seem like they may do better
than ZFS, based on not-quite-the-same-scenario tests, is not freeze
filesystems unrelated to the failing drive during the 30 seconds it's
waiting for the I/O request to return an error.
This is not operating in ZFS code.
In what way is freezing a ZFS filesystem not operating in ZFS code?
Notice that he wrote filesystems unrelated to the failing drive.
At the ZFS level, this is dictated by the failmode property.
But that is used after ZFS has detected an error?
Post by Richard Elling
I find comparing unprotected ZFS configurations with LVMs
using protected configurations to be disingenuous.
I don't think anyone is doing that.
Post by Richard Elling
Post by Mattias Pantzare
What is your definition of unrecoverable reads?
I wrote data, but when I try to read, I don't get back what I wrote.
There is only one case where ZFS is better, that is when wrong data is
returned. All other cases are managed by layers below ZFS. Wrong data
returned is not normally called unrecoverable reads.
Richard Elling
2008-08-27 17:17:40 UTC
Permalink
Post by Mattias Pantzare
Post by Richard Elling
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
Either the drives should be loaded with special firmware that
returns errors earlier, or the software LVM should read redundant data
and collect the statistic if the drive is well outside its usual
response latency.
ZFS will handle this case as well.
How is ZFS handling this? Is there a timeout in ZFS?
Not for this case, but if configured to manage redundancy, ZFS will
"read redundant data" from alternate devices.
No, ZFS will not, ZFS waits for the device driver to report an error,
after that it will read from alternate devices.
Yes, ZFS will, ZFS waits for the device driver to report an error,
after that it will read from alternate devices.
Post by Mattias Pantzare
ZFS could detect that there is probably a problem with the device and
read from an alternate device much faster while it waits for the
device to answer.
Rather than complicating ZFS code with error handling code
which is difficult to port or maintain over time, ZFS leverages
the Solaris Fault Management Architecture. There is opportunity
to expand features using the flexible FMA framework. Feel free
to propose additional RFEs.
Post by Mattias Pantzare
You can't do this at any other level than ZFS.
Post by Richard Elling
Post by Mattias Pantzare
Post by Richard Elling
Post by Miles Nordin
One thing other LVM's seem like they may do better
than ZFS, based on not-quite-the-same-scenario tests, is not freeze
filesystems unrelated to the failing drive during the 30 seconds it's
waiting for the I/O request to return an error.
This is not operating in ZFS code.
In what way is freezing a ZFS filesystem not operating in ZFS code?
Notice that he wrote filesystems unrelated to the failing drive.
At the ZFS level, this is dictated by the failmode property.
But that is used after ZFS has detected an error?
I don't understand this question. Could you rephrase to clarify?
Post by Mattias Pantzare
Post by Richard Elling
I find comparing unprotected ZFS configurations with LVMs
using protected configurations to be disingenuous.
I don't think anyone is doing that.
harrumph
Post by Mattias Pantzare
Post by Richard Elling
Post by Mattias Pantzare
What is your definition of unrecoverable reads?
I wrote data, but when I try to read, I don't get back what I wrote.
There is only one case where ZFS is better, that is when wrong data is
returned. All other cases are managed by layers below ZFS. Wrong data
returned is not normally called unrecoverable reads.
It depends on your perspective. T10 has provided a standard error
code for a device to tell a host that it experienced an unrecoverable
read error. However, we still find instances where what we wrote
is not what we read, whether it is detected at the media level or higher
in the software stack. In my pile of broken parts, I have devices
which fail to indicate an unrecoverable read, yet do indeed suffer
from forgetful media. To carry that discussion very far, it quickly
descends into the ability of the device's media checksums to detect
bad data -- even ZFS's checksums. But here is another case where
enterprise-class devices tend to perform better than consumer-grade
devices.
-- richard
Keith Bierman
2008-08-27 18:05:13 UTC
Permalink
Post by Richard Elling
In my pile of broken parts, I have devices
which fail to indicate an unrecoverable read, yet do indeed suffer
from forgetful media.
A long time ago, in a hw company long since dead and buried, I spent
some months trying to find an intermittent error in the last bits of
a complicated floating point application. It only occurred when disk
striping was turned on (but the OS and device codes checked cleanly).
In the end, it turned out that one of the device vendors had modified
the specification slightly (by like 1 nano-sec) and the result was
that least significant bits were often wrong when we drove the disk
cage to its max.

Errors were occurring randomly (e.g. swapping, paging, etc.) but no
other application noticed. As the error was "within the margin of
error" a less stubborn analyst might have not made a serious of
federal cases about the non-determinism ;>

My point is that undetected errors happen all the time; that people
don't notice doesn't mean that they don't happen ...
--
Keith H. Bierman ***@gmail.com | AIM kbiermank
5430 Nassau Circle East |
Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
Carson Gaspar
2008-08-26 18:56:19 UTC
Permalink
Post by Richard Elling
No snake oil. Pulling cables only simulates pulling cables. If you
are having difficulty with cables falling out, then this problem cannot
be solved with software. It *must* be solved with hardware.
But the main problem with "simulating disk failures by pulling cables"
is that the code paths executed during that test are different than those
executed when the disk fails in other ways. It is not simply an issue
of the success or failure of the test, but it is an issue of what you are
testing.
All of that may be true, but it doesn't change the fact that Solaris'
observed behaviour under these conditions is _abysmally_ bad, and for no
good reason.

It might not be a high priority to fix, but it would be nice if one of
the Sun folks would at least acknowledge that something is terribly
wrong here, rather than claiming it's not a problem.
--
Carson
Richard Elling
2008-08-26 20:15:20 UTC
Permalink
Post by Carson Gaspar
Post by Richard Elling
No snake oil. Pulling cables only simulates pulling cables. If you
are having difficulty with cables falling out, then this problem cannot
be solved with software. It *must* be solved with hardware.
But the main problem with "simulating disk failures by pulling cables"
is that the code paths executed during that test are different than those
executed when the disk fails in other ways. It is not simply an issue
of the success or failure of the test, but it is an issue of what you are
testing.
All of that may be true, but it doesn't change the fact that Solaris'
observed behaviour under these conditions is _abysmally_ bad, and for no
good reason.
Please file bugs. That is the best way to get things fixed.
The most appropriate forum for storage driver discussions will
be storage-discuss.
-- richard
Ron Halstead
2008-08-26 20:45:58 UTC
Permalink
Todd, 3 days ago you were asked what mode the BIOS was using, AHCI or IDE compatibility. Which is it? Did you change it? What was the result? A few other posters suggested the same thing but the thread went off into left field and I believe the question / suggestions got lost in the noise.

--ron


This message posted from opensolaris.org
Todd H. Poole
2008-08-27 06:53:27 UTC
Permalink
Howdy Ron,

Right, right - I know I dropped the ball on that one. Sorry, I haven't been able to log into OpenSolaris lately, and thus haven't been able to actually do anything useful... (lol, not to rag on OpenSolaris or anything, but it can also freeze just by logging in... See: http://defect.opensolaris.org/bz/show_bug.cgi?id=1681)

Ok, so, just to give a refresher of what's going on:
When everything is in its default state (standard install of OpenSolaris, standard configuration of ZFS, factory-set BIOS settings, etc.) OpenSolaris will indeed freeze/hang/lock up, and generally become unusable _without exception_ on the hardware I've described above. I'm not confident enough to say that it will _always_ happen on _any_ machine using the 4 drive configuration of RAID-Z with the pci-ide driver and hardware set-up I've described thus far, but since I am not alone in experiencing this (see what myxiplx experienced on his [different] hardware set-up), I don't think it's an isolated case.

The factory-set BIOS setting for the 4 SATA II ports on my motherboard is [Native IDE]. I can change this setting from [Native IDE] to [RAID], [Legacy IDE], and [SATA->AHCI].

Changing the setting to [SATA->AHCI] prevents the machine from booting. There isn't any extra information that I can give aside from the fact that when I'm at the "SunOS Release 5.11 Version snv_86 64-bit" screen where the copyright is listed, the machine hangs right after listing "Hostname: ".

A restart didn't fix anything (that would sometimes fix the login bug I wrote about a few paragraphs up, but it didn't work for this).

By the way: Is there a way to pull up a text-only interface from the log in screen (or during the boot process?) without having to log in (or just sit there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be nice if I could see a bit more information during boot, or if I didn't have to use gnome if I just wanted to get at the CLI anyways... On some OSes, if you want to access TTY1 through 6, you only need to press ESC during boot, or CTRL + ALT + F1 through F6 (or something similar) during the login screen to gain access to other non-GUI login screens...

Anyway, after changing the setting back to [Native IDE], the machine boots fine. And this time, the freeze-on-login bug didn't get me. Now, I know for a fact this motherboard supports SATA II (see link to manufacturer's website in earlier post), and that all 4 of these disks are _definitely_ SATA II disks (see hardware specifications listed in one of my earliest posts), and that I'm using all the right cables and everything... so, I don't know how to explore this any further...

Could it be that when I installed OpenSolaris, I was using the pci-ide (or [Native IDE]) setting on my BIOS, and thus if I were to change it, OpenSolaris might not know how to handle that, and might refuse to boot? Or that maybe OpenSolaris only installed the drivers it thought it would need, and the sata/ahci one wasn't one of them?
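
(I'm guessing I could check whether the ahci driver is even present and bound to anything with something like "ls /kernel/drv/ahci /kernel/drv/amd64/ahci" and "prtconf -D | grep -i ahci", but I'm not sure those are the right paths/options, so correct me if not.)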

Let me know what you think.

-Todd


This message posted from opensolaris.org
Tim
2008-08-27 16:21:14 UTC
Permalink
Post by Todd H. Poole
By the way: Is there a way to pull up a text-only interface from the log in
screen (or during the boot process?) without having to log in (or just sit
there reading about "SunOS Release 5.11 Version snv_86 64-bit")? It would be
nice if I could see a bit more information during boot, or if I didn't have
to use gnome if I just wanted to get at the CLI anyways... On some OSes, if
you want to access TTY1 through 6, you only need to press ESC during boot,
or CTRL + ALT + F1 through F6 (or something similar) during the login screen
to gain access to other non-GUI login screens...
On SXDE/Solaris, there's a dropdown menu that lets you select what type of logon you'd like to use. I haven't touched 2008.11, so I have no idea if it's got anything similar.
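A couple of other approaches should also work on 2008.05 - I'm going from memory here, so treat the exact service name and GRUB syntax as my best guess rather than gospel. You can drop the graphical login entirely with

# svcadm disable -t gdm

(the -t makes it temporary; "svcadm enable gdm" brings it back and leaves you with a plain console login in the meantime). For more detail during boot, press "e" on the GRUB menu entry and append -v to the kernel$ line for a verbose boot, e.g.

kernel$ /platform/i86pc/kernel/$ISADIR/unix -v -B $ZFS-BOOTFS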
Post by Todd H. Poole
Anyway, after changing the setting back to [Native IDE], the machine boots
fine. And this time, the freeze-on-login bug didn't get me. Now, I know for
a fact this motherboard supports SATA II (see link to manufacturer's website
in earlier post), and that all 4 of these disks are _definitely_ SATA II
disks (see hardware specifications listed in one of my earliest posts), and
that I'm using all the right cables and everything... so, I don't know how
to explore this any further...
Could it be that when I installed OpenSolaris, I was using the pci-ide (or
[Native IDE]) setting on my BIOS, and thus if I were to change it,
OpenSolaris might not know hot to handle that, and might refuse to boot? Or
that maybe OpenSolaris only installed the drivers it thought it would need,
and the stat-ahci one wasn't one of them?
Did you do a reboot reconfigure? "reboot -- -r" or "init 6"?
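For reference, something along these lines should do a reconfiguration boot; the prtconf check at the end is just my suggestion for confirming that the new driver actually attached:

# touch /reconfigure && init 6
(or equivalently)
# reboot -- -r
(then, once the box comes back up)
# prtconf -D | grep -i ahci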
Ross
2008-08-27 18:31:41 UTC
Permalink
Forgive me for being a bit wooly with this explanation (I've only recently moved over from Windows), but changing the disk mode from IDE to SATA may well not work without a re-install, or at the very least some messing around with boot settings. I've seen many systems that list SATA disks ahead of IDE ones, so changing the drives to SATA may mean that instead of your OS being on drive 0 and your data on drive 1, you now have the data on drive 0 and the OS on drive 1.

You'll get through the first part of the boot process fine, but the second stage is where you usually have problems, which sounds like what's happening to you. Unfortunately, swapping hard disk controllers (which is what you're doing here) isn't as simple as just making the change and rebooting, and that would be just as true in Windows.

I do think some Solaris drivers need a bit of work, but I suspect the standard SATA ones are pretty good, so there's a fair chance you'll find hot plug works OK in SATA mode.

Ultimately, however, you're trying to get enterprise kinds of performance out of consumer kit, and no matter how good Solaris and ZFS are, they can't guarantee to work with that. I used to have the same opinion as you, but I'm starting to see now that ZFS isn't quite an exact match for traditional RAID controllers. It's close, but you do need to think about the hardware too and make sure it can definitely cope with what you're wanting to do. I think the sales literature is a little misleading in that sense.

Ross


This message posted from opensolaris.org
Tim
2008-08-27 18:38:00 UTC
Permalink
Post by Ross
Forgive me for being a bit wooly with this explanation (I've only recently
moved over from Windows), but changing disk mode from IDE to SATA may well
not work without a re-install, or at the very least messing around with boot
settings. I've seen many systems which list SATA disks in front of IDE
ones, so you changing the drives to SATA may now mean that instead of your
OS being installed on drive 0, and your data on drive 1, you now have the
data on drive 0 and the OS on drive 1.
Solaris does not do this. This is one of the many annoyances I have with Linux. The way it handles /dev is ridiculous. Did you add a new drive? Let's renumber everything!

--Tim
Miles Nordin
2008-08-27 22:33:54 UTC
Permalink
t> Solaris does not do this.

yeah but the locators for local disks are still based on
pci/controller/channel not devid, so the disk will move to a different
device name if he changes BIOS from pci-ide to AHCI because it changes
the driver attachment. This may be the problem preventing his bootup,
rather than the known AHCI bug.

I'm not sure what's required to boot off a root pool whose devices have moved - maybe nothing - but for UFS roots it often required booting off the install media, regenerating /dev (and /devices on sol9), editing vfstab, and so on.
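Roughly, that UFS-root recovery looks something like the following - the device name and the /a mount point are placeholders, so take it as a sketch of the shape of the procedure rather than a recipe:

(boot from the install media into single-user/shell mode)
# mount /dev/dsk/c1t0d0s0 /a
# devfsadm -r /a
(rebuilds /dev and /devices under the alternate root)
# vi /a/etc/vfstab
(fix the device paths to match the new controller numbering)
# bootadm update-archive -R /a
# reboot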

Linux device names don't move around as much if you use LVM2, as some of the distros do by default even for single-device systems. Device names are then based on labels written onto the drive, which is a little scary and adds a lot of confusion, but I think it helps with this moving-device problem and is analogous to what it sounds like ZFS might do on the latest SXCEs that don't put zpool.cache in the boot archive.
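For data pools, at least, you can watch that label search happen directly - using the pool name from elsewhere in this thread (mediapool) as an example, and obviously not something you can do to the pool you booted from:

# zpool export mediapool
# zpool import
(scans the visible devices for ZFS labels and lists any importable pools, whatever their device names are now)
# zpool import mediapool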
Tim
2008-08-27 22:40:56 UTC
Permalink
Post by Miles Nordin
t> Solaris does not do this.
yeah but the locators for local disks are still based on
pci/controller/channel not devid, so the disk will move to a different
device name if he changes BIOS from pci-ide to AHCI because it changes
the driver attachment. This may be the problem preventing his bootup,
rather than the known AHCI bug.
Except he was, and is referring to a non-root disk. If I'm using raw
devices and I unplug my root disk and move it somewhere else, I would expect
to have to update my boot loader.
Post by Miles Nordin
Linux device names don't move as much if you use LVM2, as some of the
distros do by default even for single-device systems. Device names
are then based on labels written onto the drive, which is a little
scary and adds a lot of confusion, but I think helps with this
moving-device problem and is analagous to what it sounds like ZFS
might do on the latest SXCE's that don't put zpool.cache in the boot
archive.
LVM hardly changes the way devices move around in Linux, or its horrendous handling of /dev. You are correct in that it is a step towards masking the ugliness; I, however, do not consider it a fix. Unfortunately it's not used at the majority of the sites I am involved in, and as such isn't any sort of help. The administration overhead it adds is not worth the hassle for the majority of my customers.

--Tim
Miles Nordin
2008-08-27 23:02:57 UTC
Permalink
t> Except he was, and is referring to a non-root disk.

wait, what? his root disk isn't plugged into the pci-ide controller?

t> LVM hardly changes the way devices move around in Linux,

fine, be pedantic. It makes systems boot and mount all their
filesystems including '/' even when you move disks around. agreed
now?

There's a simpler Linux way of doing this which I use on my Linux
systems: mounting by the UUID in the filesystem's superblock. But I
think RedHat is using LVM2 to do it.
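For example, an fstab entry of that style looks like the following - the UUID is a made-up placeholder; blkid(8) reports the real one for a given device:

# blkid /dev/sda1
/dev/sda1: UUID="0f3e4d5c-1234-5678-9abc-def012345678" TYPE="ext3"

and then in /etc/fstab:

UUID=0f3e4d5c-1234-5678-9abc-def012345678  /  ext3  defaults  1 1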

Anyway modern Linux systems don't put names like /dev/sda in
/etc/fstab, and they don't use these names to find the root filesystem
either---they have all that LVM2 stuff in the early userspace.

Solaris seems to be going the same ``mount by label'' direction with
ZFS (except with zpool.cache, devid's, and mpxio, it's a bit of a
hybrid approach---when it goes out searching for labels, and when it
expects devices to be on the same bus/controller/channel, isn't
something I fully understand yet and I expect will only become clear
through experience).
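The labels themselves are easy enough to look at, for what that's worth: zdb -l dumps the labels ZFS wrote on a vdev (the device name below is just an example):

# zdb -l /dev/dsk/c1d0s0
(prints each label's pool name, pool and vdev GUIDs, and the devid/physical path recorded for that device, among other things)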
MC
2008-08-27 06:08:39 UTC
Permalink
Pulling cables only simulates pulling cables. If you
are having difficulty with cables falling out, then this problem cannot
be solved with software. It *must* be solved with hardware.
I don't think anyone is asking for software to fix cables that fall out... they're asking for the OS to not crash, which they perceive to be better than a crash...


This message posted from opensolaris.org
Todd H. Poole
2008-08-27 07:27:21 UTC
Permalink
Howdy James,

While responding to halstead's post (see below), I had to restart several times to complete some testing. I'm not sure if that's important to these commands or not, but I just wanted to put it out there anyway.
Post by James C. McPherson
A few commands that you could provide the output from
(these two show any FMA-related telemetry)
fmadm faulty
fmdump -v
This is the output from both commands:

***@mediaserver:~# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD Major

Fault class : fault.fs.zfs.vdev.io
Description : The number of I/O errors associated with a ZFS device exceeded
acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD
for more information.
Response : The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.



***@mediaserver:~# fmdump -v
TIME UUID SUNW-MSG-ID
Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD
100% fault.fs.zfs.vdev.io

Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719
Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719
FRU: -
Location: -
Post by James C. McPherson
(this shows your storage controllers and what's
connected to them) cfgadm -lav
This is the output from cfgadm -lav

***@mediaserver:~# cfgadm -lav
Ap_Id Receptacle Occupant Condition Information
When Type Busy Phys_Id
usb2/1 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13:1
usb2/2 connected configured ok
Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM)
NConfigs: 1 Config: 0 <no cfg str descr>
unavailable usb-mouse n /devices/***@0,0/pci1458,***@13:2
usb3/1 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,2:1
usb3/2 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,2:2
usb4/1 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,3:1
usb4/2 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,3:2
usb5/1 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,4:1
usb5/2 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,4:2
usb6/1 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:1
usb6/2 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:2
usb6/3 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:3
usb6/4 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:4
usb6/5 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:5
usb6/6 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:6
usb6/7 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:7
usb6/8 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:8
usb6/9 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:9
usb6/10 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,5:10
usb7/1 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,1:1
usb7/2 empty unconfigured ok
unavailable unknown n /devices/***@0,0/pci1458,***@13,1:2

You'll notice that the only thing listed is my USB mouse... is that expected?
Post by James C. McPherson
You'll also find messages in /var/adm/messages which
might prove
useful to review.
If you really want, I can list the output from /var/adm/messages, but it doesn't seem to add anything new to what I've already copied and pasted.
Post by James C. McPherson
First and foremost, for me, this is a stupid thing to
do. You've got common-or-garden PC hardware which almost
*definitely* does not support hot plug of devices. Which is what you're
telling us that you're doing. Would try this with your pci/pci-e
cards in this system? I think not.
I would if I had some sort of set-up that supposedly promised me redundant PCI/PCI-E cards... You might think it's stupid, but how else could one be sure that the back-up PCI/PCI-E card would take over when the primary one died?

Unplugging one of them seems like a fine test to me - it's definitely the worst-case scenario, and if the rig survives that, then I _know_ I would be able to rely on it for redundancy should one of the cards fail (which would most likely occur in a less spectacular fashion than a quick yank anyway).
Post by James C. McPherson
If you absolutely must do something like this, then
please use what's known as "coordinated hotswap" using the
cfgadm(1m) command.
(detect fault in disk c2t3d0, in some way)
# cfgadm -c unconfigure c2::dsk/c2t3d0
# cfgadm -c disconnect c2::dsk/c2t3d0
(go and swap the drive, plugin new drive with same
cable)
# zpool replace -f poolname c2t3d0
What this will do is tell the kernel to do things in
the right order, and - for zpool - tell it to do an
in-place replacement of device c2t3d0 in your pool.
Thanks for the command listings - they'll certainly prove useful if I should ever find myself in a situation where I have to manually swap a disk like you described. Unfortunately though, I'm with Miles Nordin (see below) on this one - I don't want to warn OpenSolaris of what I'm about to do... That would defeat the purpose of the test. Even with technologies (like S.M.A.R.T.) that are designed to give you a bit of a heads-up, as Heikki Suonsivu and Google have noted, they're not very reliable at all (research.google.com/archive/disk_failures.pdf).

And I want this test to be as rough as it gets. I don't want to play nice with this system... I want to drag it through the most tortuous worst-case scenario tests I can imagine, and if it survives with all my test data intact, then (and only then) will I begin to trust it.
Post by James C. McPherson
http://docs.sun.com/app/docs/coll/40.17 (manpages)
http://docs.sun.com/app/docs/coll/47.23 (system admin collection)
http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide
http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide
Oohh... Thank you. Good Links. I'm bookmarking these for future reading. They'll definitely be helpful if we end up choosing to deploy OpenSolaris + ZFS for our media servers.

-Todd


This message posted from opensolaris.org
Richard Elling
2008-08-27 16:47:21 UTC
Permalink
Post by Todd H. Poole
And I want this test to be as rough as it gets. I don't want to play
nice with this system... I want to drag it through the most tortuous
worst-case scenario tests I can imagine, and if it survives with all
my test data intact, then (and only then) will I begin to trust it.
http://youtu.be/naKd9nARAes
:-)
-- richard
Todd H. Poole
2008-08-28 00:24:10 UTC
Permalink
Ah yes - that video is what got this whole thing going in the first place... I referenced it in one of my other posts much earlier. Heh... there's something gruesomely entertaining about brutishly taking a drill or sledge hammer to a piece of precision hardware like that.

But yes, that's the kind of torture test I would like to conduct. However, I'm operating on a limited test budget right now, and I have to get the damn thing working in the first place before I start performing tests I can't easily reverse (I still have yet to fire up Bonnie++ and do some benchmarking), and most definitely before I can put on a show for those who control the purse strings...

But, imagine: walking into... oh say, I dunno... your manager's office, for example, and asking him to beat the hell out of one of your server's hard drives all the while promising him that no data would be lost, and none of his video on demand customers would ever notice an interruption in service. He might think you're crazy, but if it still works at the end of the day, your annual budget just might get a sizable increase to help you make all the other servers "sledge hammer resistant" like the first one. ;)

But that's just an example. That functionality could (and probably does) prove useful almost anywhere.


This message posted from opensolaris.org
Miles Nordin
2008-08-27 18:24:13 UTC
Permalink
Post by James C. McPherson
Would try this with
your pci/pci-e cards in this system? I think not.
thp> Unplugging one of them seems like a fine test to me

I've done it, with 32-bit 5-volt PCI; I forget why. I might have been trying to use a board while bypassing its broken Etherboot ROM. It was something like that.

IIRC it works sometimes, crashes the machine sometimes, and fries the
hardware eventually if you keep doing it long enough.

The exact same three cases are true of cold-plugging a PCI card. It just works a lot more often if you power down first.

Does massively inappropriate hotplugging possibly weaken the hardware
so that it's more likely to pop later? maybe. Can you think of a
good test for that?

Believe it or not, sometimes accurate information is worth more than a motherboard that cost $50 five years ago. Sometimes saving ten minutes is worth more. Or... <cough> recovering an OpenPROM password.

Testing availability claims rather than accepting them on faith, or
rather than gaining experience in a slow, oozing, anecdotal way on
production machinery, is definitely not stupid. Testing them in a way
that compares one system to another is double-un-stupid.
James C. McPherson
2008-08-28 12:52:01 UTC
Permalink
Hi Todd,
sorry for the delay in responding, been head down rewriting
a utility for the last few days.
Post by Todd H. Poole
Howdy James,
While responding to halstead's post (see below), I had to restart several
times to complete some testing. I'm not sure if that's important to these
commands or not, but I just wanted to put it out there anyway.
Post by James C. McPherson
A few commands that you could provide the output from
(these two show any FMA-related telemetry)
fmadm faulty
fmdump -v
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 27 01:07:08 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD Major
Fault class : fault.fs.zfs.vdev.io
Description : The number of I/O errors associated with a ZFS device exceeded
acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD
for more information.
Response : The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
TIME UUID SUNW-MSG-ID
Aug 27 01:07:08.2040 0d9c30f1-b2c7-66b6-f58d-9c6bcb95392a ZFS-8000-FD
100% fault.fs.zfs.vdev.io
Problem in: zfs://pool=mediapool/vdev=bfaa3595c0bf719
Affects: zfs://pool=mediapool/vdev=bfaa3595c0bf719
FRU: -
Location: -
In other emails in this thread you've mentioned the desire to
get an email (or some sort of notification) when Problems Happen(tm)
in your system, and the FMA framework is how we achieve that
in OpenSolaris.



# fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-retire 1.1 active CPU/Memory Retire Agent
disk-transport 1.0 active Disk Transport Agent
eft 1.16 active eft diagnosis engine
fabric-xlate 1.0 active Fabric Ereport Translater
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 2.0 active I/O Retire Agent
snmp-trapgen 1.0 active SNMP Trap Generation Agent
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
zfs-retire 1.0 active ZFS Retire Agent


You'll notice that we've got an SNMP agent there... and you
can acquire a copy of the FMA mib from the Fault Management
community pages (http://opensolaris.org/os/community/fm and
http://opensolaris.org/os/community/fm/mib/).
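If SNMP is more than you want to set up for a home media server, a crude poll-and-mail workaround along these lines also works - treat it as a sketch: the address is a placeholder, mailx has to be able to send mail off the box, and some builds print a header line even when nothing is faulted, so the emptiness test may need adjusting:

#!/bin/sh
# run from cron; mails the output of fmadm faulty whenever it is non-empty
FAULTS=`/usr/sbin/fmadm faulty 2>/dev/null`
if [ -n "$FAULTS" ]; then
        echo "$FAULTS" | /usr/bin/mailx -s "FMA fault on `/usr/bin/hostname`" admin@example.com
fi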
Post by Todd H. Poole
Post by James C. McPherson
(this shows your storage controllers and what's
connected to them) cfgadm -lav
This is the output from cfgadm -lav
Ap_Id Receptacle Occupant Condition Information
When Type Busy Phys_Id
usb2/1 empty unconfigured ok
usb2/2 connected configured ok
Mfg: Microsoft Product: Microsoft 3-Button Mouse with IntelliEye(TM)
NConfigs: 1 Config: 0 <no cfg str descr>
usb3/1 empty unconfigured ok
[snip]
Post by Todd H. Poole
usb7/2 empty unconfigured ok
You'll notice that the only thing listed is my USB mouse... is that expected?
Yup. One of the artefacts of the cfgadm architecture. cfgadm(1m)
works by using plugins - usb, FC, SCSI, SATA, pci hotplug, InfiniBand...
but not IDE.

I think you also were wondering how to tell what controller
instances your disks were using in IDE mode - two basic ways
of achieving this:

/usr/bin/iostat -En

and

/usr/sbin/format

Your IDE disks will attach using the cmdk driver and show up like this:

c1d0
c1d1
c2d0
c2d1

In AHCI/SATA mode they'd show up as

c1t0d0
c1t1d0
c1t2d0
c1t3d0

or something similar, depending on how the bios and the actual
controllers sort themselves out.
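A quick way to get that list without sitting in format's interactive menu, incidentally, is to feed it an empty stdin - it prints the disk selection list and then exits:

# echo | /usr/sbin/format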
Post by Todd H. Poole
Post by James C. McPherson
You'll also find messages in /var/adm/messages which
might prove
useful to review.
If you really want, I can list the output from /var/adm/messages, but it
doesn't seem to add anything new to what I've already copied and pasted.
No need - you've got them if you need them.

[snip]
Post by Todd H. Poole
Post by James C. McPherson
http://docs.sun.com/app/docs/coll/40.17 (manpages)
http://docs.sun.com/app/docs/coll/47.23 (system admin collection)
http://docs.sun.com/app/docs/doc/817-2271 ZFS admin guide
http://docs.sun.com/app/docs/doc/819-2723 devices + filesystems guide
Oohh... Thank you. Good Links. I'm bookmarking these for future reading.
They'll definitely be helpful if we end up choosing to deploy OpenSolaris
+ ZFS for our media servers.
There's a heap of info there; getting started with it can be like trying to drink from a fire hose :)


Best regards,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Ross
2008-08-24 08:04:25 UTC
Permalink
You're seeing exactly the same behaviour I found on my server, using a Supermicro AOC-SAT2-MV8 SATA controller. It's detailed on the forums under the topic "Supermicro AOC-SAT2-MV8 hang when drive removed", but unfortunately that topic split into 3 or 4 pieces, so it's a pain to find.

I also reported it as a bug here:
http://bugs.opensolaris.org/view_bug.do?bug_id=6735931


This message posted from opensolaris.org
Ross
2008-08-24 08:31:39 UTC
Permalink
PS. Does your system definitely support SATA hot swap? Could you, for example, test it under Windows to see if it runs fine there?

I suspect this is a Solaris driver problem, but it would be good to have confirmation that the hardware handles this fine.


This message posted from opensolaris.org
Todd H. Poole
2008-08-24 09:17:09 UTC
Permalink
Hmm... You know, that's a good question. I'm not sure if those SATA II ports support hot swap or not. The motherboard is fairly new, but taking a look at the specifications provided by Gigabyte (http://www.gigabyte.com.tw/Products/Motherboard/Products_Spec.aspx?ProductID=2874) doesn't seem to yield anything. To tell you the truth, I think they're just plain ol' dumb SATA II ports - nothing fancy here.

But that's alright, because hot swap isn't something I'm necessarily chasing after. It would be nice, of course, but the thing we want most is stability during hardware failures. For this particular server, it is _far_ more important for the thing to keep chugging along and blow right through as many hardware failures as it can. If it's still got 3 of those 4 drives (which implies at least 2 data and 1 parity, or 3 data and no parity) then I still want to be able to read and write to those NFS exports like nothing happened. Then, at the end of the day, if we need to bring the machine down in order to install a new disk and resilver the RAID-Z array, that is perfectly acceptable. We could do that around 6:00 or so, when everyone goes home for the day and when it's much more convenient for us and the users, and let the resilvering/repairing operation run overnight.

I also read the PDF summary you included in your link to your other post. And it seems we're seeing similar behavior here. Although, in this case, things are even simpler: there are only 4 drives in the case (not 8), and there is no extra controller card (just the ports on the motherboard)... It's hard to get any more basic than that.

As for testing in other OSes, unfortunately I don't readily have a copy of Windows available. But even if I did, I wouldn't know where to begin: almost all of my experience in server administration has been with Linux. For what it's worth, I have already established the above (that is, the seamless experience) with OpenSuSE 11.0 as the operating system, LVM as the volume manager, mdadm as the RAID manager, and XFS as the filesystem, so I know it can work...

I just want to get it working with OpenSolaris and ZFS. :)


This message posted from opensolaris.org
Ross
2008-08-27 15:38:28 UTC
Permalink
Hi Todd,

Having finally gotten the time to read through this entire thread, I think Ralf said it best. ZFS can provide data integrity, but you're reliant on hardware and drivers for data availability.

In this case, either your SATA controller or the drivers for it don't cope at all well with a device going offline, so what you need is a SATA card that can handle that. Provided you have a controller that can cope with disk errors, it should be able to return the appropriate status information to ZFS, which will in turn ensure your data is OK.

The technique obviously works or Sun's x4500 servers wouldn't be doing anywhere near as well as they are. The problem we all seem to be having is finding white box hardware that supports it.

I suspect your best bet would be to pick up a SAS controller based on the LSI chipsets used in the new x4540 server. There's been a fair bit of discussion here about these, and while there's a limitation in that you will have to manually keep track of drive names, I would expect it to handle disk failures (and pulled disks) much better. You would probably be well advised to ask the folks on the forums running those SAS controllers whether they've been able to pull disks successfully, though.

I think the solution you need is definitely a better disk controller, and your choice is either a plain SAS controller or a RAID controller that can present individual disks in pass-through mode, since those *definitely* are designed to handle failures.

Ross


This message posted from opensolaris.org