Discussion:
Random file system corruption in 3.17 (not BTRFS related...?)
Robert White
2014-10-14 16:54:51 UTC
Howdy,

So I run several gentoo systems and I upgraded two of them to kernel 3.17.0

One using BTRFS for root.
One using ext3 for root (via the ext4 driver)

_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.

On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for add-ons
(like which sites had scripts enabled/disabled in no-script) etc.

On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a "fsck
-fyD /dev/sda3" to repair. (one comment from fsck was that the
pipe/special file "looked like a directory" or some such)

So I can say that corruption is taking place, but I suspect it is _not_
happening in the BTRFS specific code.

(ASIDE: both systems are older amd64 using built-in radeon display
hardware.)

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
David Arendt
2014-10-14 17:22:11 UTC
I haven't noticed any corruption on other filesystems with kernel 3.17.0.
I also didn't experience any hangs, except when trying to mount a
corrupted btrfs, and that caused a hang within less than 10 seconds.
It could be that your problem is unrelated and that the corruption you
are experiencing is due to an unrelated hang followed by a hard
powerdown. Have you been able to capture any btrfs-related kernel panics?
Post by Robert White
Howdy,
So I run several gentoo systems and I upgraded two of them to kernel 3.17.0
One using BTRFS for root.
One using ext3 for root (via the ext4 driver)
_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.
On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for
add-ons (like which sites had scripts enabled/disabled in no-script) etc.
On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a
"fsck -fyD /dev/sda3" to repair. (one comment from fsck was that the
pipe/special file "looked like a directory" or some such)
So I can say that corruption is taking place, but I suspect it is
_not_ happening in the BTRFS specific code.
(ASIDE: both systems are older amd64 using built-in radeon display
hardware.)
Robert White
2014-10-14 20:06:29 UTC
Post by David Arendt
I haven't noticed any corruption on other filesystems with kernel 3.17.0.
I also didn't experience any hangs, except when trying to mount a
corrupted btrfs, and that caused a hang within less than 10 seconds.
It could be that your problem is unrelated and that the corruption you
are experiencing is due to an unrelated hang followed by a hard
powerdown. Have you been able to capture any btrfs-related kernel panics?
My installation is _not_ well suited for capturing panics.

I have not been able to capture any panics on either system and I had to
just switch back to 3.16.3 as the two systems were my firewall (ext4)
and my primary laptop (BTRFS). I didn't want to grind them up with
repeated crashes and corruptions. I only let the firewall fault once
before switching back.

The laptop faulted and hung twice under 3.17.0 before I switched it
back, thinking it was a radeon graphics driver issue. Then I logged into
the firewall via ssh to check something and three shell commands or so
in, it went to lunch (but the firewall layer was still passing packets).

The only actual sign of filesystem corruption on the laptop was the
sudden absence or corruption of the (sqlite3 format) history and
settings files. But firefox was the only thing I'd been actively using.
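
A minimal sketch of how one might check whether an sqlite3-format file such as the history or cookies store is still structurally sound, assuming Python with its standard sqlite3 module is available. The function name is hypothetical, the file paths are whatever your profile uses, and the browser should be closed first. It relies only on SQLite's own "PRAGMA integrity_check" and opens the file read-only, so it cannot make matters worse.

```python
import sqlite3
from pathlib import Path


def sqlite_integrity(path):
    """Return the messages reported by PRAGMA integrity_check.

    A healthy database yields ["ok"]. A file that SQLite cannot open or
    parse at all is reported as a single "unreadable: ..." message
    instead of raising. (Function name is illustrative only.)
    """
    # Open read-only through a URI so the check never creates or
    # modifies the file being inspected.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return [f"unreadable: {exc}"]
    return [row[0] for row in rows]
```

Anything other than a single "ok" suggests the file was damaged and is a candidate for restoring from backup.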

Given the way the firewall jammed up and died, the kind of
corruption (special files don't get updated that much, let alone to link
up a directory), and the fact that it ran fine as a firewall for
several hours then died as soon as I touched the file system, I suspect
that there is something fishy in dcache or the vnode layers.

It was too much too soon on two otherwise stable systems.

I offered this email here because I noticed that people were seeing
"BTRFS corruption" with 3.17 and I'd seen both BTRFS and EXT4 corruption
which suggests that BTRFS _isn't_ particularly culpable.


Duncan
2014-10-14 22:35:14 UTC
Post by Robert White
On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for add-ons
(like which sites had scripts enabled/disabled in no-script) etc.
FWIW, this reply is more toward the firefox corruption than the
why-particulars of the crash.

The prefs.js file in the profile dir holds addon settings and seems to be
particularly sensitive to corruption. At least here, firefox has created
several backups, prefs-1.js thru prefs-7.js, I suppose at upgrade. The
first time I lost settings I restored prefs-7.js (the newest/largest of
the backups) as prefs.js, and only lost a few settings that I had changed
since the last upgrade, which had changed the firefox interface so I had
to change my settings accordingly. The time or two since then that I
hard-crashed and lost my addons, I was able to replace the prefs.js file
from a recent /home backup.

Anyway, it's the prefs.js file that you want to restore. Whether it's
from the last prefs-N.js backup that firefox did, or from your own
backup, prefs.js is it.
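
The restore described above could be sketched roughly as follows, assuming the backups are named prefs-N.js as in this message and that the highest N is the one to prefer (an assumption; "newest" could also be judged by modification time). The function names and the "prefs.js.damaged" file name are hypothetical. Run it only with firefox closed.

```python
import re
import shutil
from pathlib import Path

# Matches the numbered backups mentioned above, e.g. prefs-7.js.
BACKUP_RE = re.compile(r"^prefs-(\d+)\.js$")


def newest_prefs_backup(profile_dir):
    """Return the highest-numbered prefs-N.js in profile_dir, or None."""
    candidates = []
    for entry in Path(profile_dir).iterdir():
        match = BACKUP_RE.match(entry.name)
        if match and entry.is_file():
            candidates.append((int(match.group(1)), entry))
    if not candidates:
        return None
    # Compare numerically so that prefs-10.js sorts after prefs-2.js.
    return max(candidates, key=lambda item: item[0])[1]


def restore_prefs(profile_dir):
    """Copy the newest backup over prefs.js, keeping the damaged file aside."""
    profile = Path(profile_dir)
    backup = newest_prefs_backup(profile)
    if backup is None:
        return None
    target = profile / "prefs.js"
    if target.exists():
        # Keep the damaged file for inspection rather than discarding it.
        shutil.copy2(target, profile / "prefs.js.damaged")
    shutil.copy2(backup, target)
    return backup
```

Settings changed after the chosen backup was written are still lost, as noted above; a recent /home backup remains the better source when one exists.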

As for cookies, history, etc. I didn't notice them going corrupt. I do
run raid1 btrfs and after a crash, do a scrub, which may recover some
files. And I run tight enough security that most cookies are session-
only (and no third-party), so that file won't be written to much, which
probably saves it. I don't know about history. Maybe it was corrupted
and I simply didn't notice it, as I don't use history that often.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Juan Orti Alcaine
2014-10-15 07:08:14 UTC
Post by Robert White
Howdy,
So I run several gentoo systems and I upgraded two of them to kernel 3.17.0
One using BTRFS for root.
One using ext3 for root (via the ext4 driver)
_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.
On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for
add-ons (like which sites had scripts enabled/disabled in no-script)
etc.
On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a
"fsck -fyD /dev/sda3" to repair. (one comment from fsck was that the
pipe/special file "looked like a directory" or some such)
So I can say that corruption is taking place, but I suspect it is
_not_ happening in the BTRFS specific code.
(ASIDE: both systems are older amd64 using built-in radeon display
hardware.)
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.
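
Question marks in a long directory listing typically mean the entry name could be read but its metadata could not. A rough sketch, assuming Python is available, of how one might collect such entries for a bug report; the function name is hypothetical and the walk is read-only.

```python
import os


def unreadable_entries(root):
    """Return paths under root whose directory or metadata cannot be read."""
    failures = []
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            entries = list(os.scandir(current))
        except OSError:
            # The directory itself could not be listed.
            failures.append(current)
            continue
        for entry in entries:
            try:
                # lstat-style call: do not follow symlinks.
                entry.stat(follow_symlinks=False)
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
            except OSError:
                # The name is visible but its metadata is not.
                failures.append(entry.path)
    return sorted(failures)
```

An empty result does not prove the filesystem is healthy; it only means every entry under the given root could be stat'ed.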

My system has 4 drives and 2 fs:
- 1 SSD in single
- 3 HDD in RAID1

I do readonly snapshots every hour of all the subvolumes, so I have
hundreds of snapshots.

Now I'm back in 3.16.4 without any problems. I'm trying to reproduce my
setup in a virtual machine. If the corruption happens again, I'll send
you more data on this problem.

--
Juan Orti
https://miceliux.com

Duncan
2014-10-15 08:53:07 UTC
Post by Juan Orti Alcaine
I've also experienced Btrfs corruptions with 3.17.0
I do readonly snapshots every hour of all the subvolumes, so I have
hundreds of snapshots.
That's a known issue with read-only snapshots in 3.17.0. There's quite a
thread on the list about it.

So I'd suggest either turning off read-only snapshots on 3.17 (which I'm
running here without snapshots, no problem), possibly switching to
writable snapshots as they don't seem to trigger the problem, or as you
mentioned doing already, going back to 3.16.x (x>2 due to another bug,
latest should be good), until the read-only snapshots issue with 3.17.0
is traced down and fixed.

Given the approximately two kernel cycles it took for the widely
reproduced but rather difficult to trace compression-related bug in 3.15
to be reported in 3.15 and traced and fixed in 3.17-rc2 and 3.16.2, I'd
guess a fix for this similarly widely reproduced read-only-snapshot-
related bug should be no later than 3.19-rc3 and 3.18.3, possibly rather
earlier if it proves easier to trace, especially since this one seems to
have been reported and recognized as widely occurring a bit faster than
the compression-related bug. But with testing, etc, it's still likely to
be late in the 3.18-rc cycle before mainline commit, so it'll probably be
rather late in the 3.17.x stable cycle, if it makes it at all. Unless it
gets picked as a long-term support kernel, the full 3.17 stable cycle
might in fact be blacklisted for btrfs due to this bug, much like the
full 3.15 stable cycle ended up being blacklisted due to the compression-
related bug.

So either switch your snapshots to writable if it's not going to
interfere with your use-case, or stay on the 3.16.x, x>2, stable series
until the problem is fixed, hopefully with 3.18.0, tho it might be 3.18.2
or so. I seriously doubt it'll be longer than that, because it's a well
reproduced bug which makes it both high priority and easy to test fixes
for.
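
The version advice in this message can be condensed into a small check. This is only a sketch of the guidance as stated in this thread at the time of writing, not an authoritative list of affected kernels; the function names and the returned strings are hypothetical.

```python
import re


def parse_kernel_version(release):
    """Parse '3.16.4' or '3.17.0-gentoo' into a (major, minor, patch) tuple."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def btrfs_advice(release):
    """Summarize this thread's advice for a given kernel release string."""
    version = parse_kernel_version(release)
    series = version[:2]
    if series == (3, 15):
        # The 3.15 series carried the compression-related bug.
        return "avoid: compression-related bug"
    if series == (3, 16):
        # 3.16.x is suggested only for x > 2.
        return "ok" if version[2] > 2 else "avoid: use 3.16.3 or later"
    if series == (3, 17):
        # Read-only snapshots trigger the problem discussed here.
        return "caution: read-only snapshot bug reported"
    return "unknown: not covered in this thread"
```

The running kernel's release string can be obtained with platform.release() from the standard library and passed to btrfs_advice().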
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

Josef Bacik
2014-10-15 13:46:52 UTC
Post by Robert White
Howdy,
So I run several gentoo systems and I upgraded two of them to kernel 3.17.0
One using BTRFS for root.
One using ext3 for root (via the ext4 driver)
_Both_ systems exhibited strange behavior (long pauses and then hangs
requiring hard-power) within several hours. Both then had random
filesystem damage.
On the BTRFS system much of my browser settings for firefox were
trashed, particularly the cookies and saved configurations for
add-ons (like which sites had scripts enabled/disabled in no-script)
etc.
On the ext3/4 system there were several corruptions including a
pipe/special file with a large non-zero size that required I do a
"fsck -fyD /dev/sda3" to repair. (one comment from fsck was that the
pipe/special file "looked like a directory" or some such)
So I can say that corruption is taking place, but I suspect it is
_not_ happening in the BTRFS specific code.
(ASIDE: both systems are older amd64 using built-in radeon display
hardware.)
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.
- 1 SSD in single
- 3 HDD in RAID1
Did it happen on both fs'es or just one? Thanks,

Josef

Juan Orti Alcaine
2014-10-15 14:05:19 UTC
Post by Josef Bacik
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.
- 1 SSD in single
- 3 HDD in RAID1
Did it happen on both fs'es or just one? Thanks,
Josef
Both filesystems were corrupted. I have / in the SSD and /home in the
HDDs.

I didn't notice anything while working with the system, I only
discovered the problem when booting up after the second or third reboot
and seeing the service failing to start. Could it be something related
to the mount/umount logic?

--
Juan Orti
https://miceliux.com

Josef Bacik
2014-10-15 14:30:40 UTC
Post by Juan Orti Alcaine
Post by Josef Bacik
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.
- 1 SSD in single
- 3 HDD in RAID1
Did it happen on both fs'es or just one? Thanks,
Josef
Both filesystems were corrupted. I have / in the SSD and /home in the
HDDs.
I didn't notice anything while working with the system, I only
discovered the problem when booting up after the second or third reboot
and seeing the service failing to start. Could it be something related
to the mount/umount logic?
We've found it, the Fedora guys are reverting the bad patch now, we'll
get the fix sent back to stable shortly. Sorry about that.

Josef
Juan Orti Alcaine
2014-10-15 14:34:47 UTC
Post by Josef Bacik
I've also experienced Btrfs corruptions with 3.17.0 (Fedora 21 alpha).
It has happened two times, each one after a clean reinstall and a wipe
of the old fs. In less than a day, both installations got corrupted and
the filesystems went readonly. When listing the contents, I saw many
directories with question marks.
- 1 SSD in single
- 3 HDD in RAID1
Did it happen on both fs'es or just one? Thanks,
Josef
Both filesystems were corrupted. I have / in the SSD and /home in the
HDDs.
I didn't notice anything while working with the system, I only
discovered the problem when booting up after the second or third reboot
and seeing the service failing to start. Could it be something related
to the mount/umount logic?
We've found it, the Fedora guys are reverting the bad patch now, we'll
get the fix sent back to stable shortly. Sorry about that.
Thanks to you. Fortunately I have good backups.

--
Juan Orti
https://miceliux.com

Rich Freeman
2014-10-15 19:30:11 UTC
Post by Josef Bacik
We've found it, the Fedora guys are reverting the bad patch now, we'll get
the fix sent back to stable shortly. Sorry about that.
After reverting this commit, can the bad snapshots be
deleted/repaired/etc without wiping and restoring the entire
filesystem? Copying 2.3TB of data isn't a particularly fast
operation...

--
Rich
Josef Bacik
2014-10-15 20:20:15 UTC
Post by Rich Freeman
We've found it, the Fedora guys are reverting the bad patch now, we'll get
the fix sent back to stable shortly. Sorry about that.
After reverting this commit, can the bad snapshots be
deleted/repaired/etc without wiping and restoring the entire
filesystem? Copying 2.3TB of data isn't a particularly fast
operation...
I would certainly like to make fsck repair this sort of problem, let me
reproduce the corruption locally and then make fsck fix it and then you
can use that. Thanks,

Josef

Filipe David Manana
2014-10-17 16:26:41 UTC
Post by Josef Bacik
Post by Rich Freeman
We've found it, the Fedora guys are reverting the bad patch now, we'll get
the fix sent back to stable shortly. Sorry about that.
After reverting this commit, can the bad snapshots be
deleted/repaired/etc without wiping and restoring the entire
filesystem? Copying 2.3TB of data isn't a particularly fast
operation...
I would certainly like to make fsck repair this sort of problem, let me
reproduce the corruption locally and then make fsck fix it and then you can
use that. Thanks,
I just sent out a patch for fsck to fix this issue - i.e. bad
read-only snapshots (inaccessible without errors, impossible to
delete, etc).
It fixes the snapshots if, and only if, you haven't run fsck in repair
mode (--repair) before, as that would have touched back references and
other metadata, because it didn't expect root items to be incorrect
(which is essentially what the snapshots bug caused).

The patch is this one: https://patchwork.kernel.org/patch/5098331/

Also, if you have errors accessing files through a path that doesn't
contain any of the read-only snapshots, it's possible that it's the
corruption bug we had in 3.17 - bad extent map manipulation, that
manifests itself in several ways (e.g. reports:
http://www.spinics.net/lists/linux-btrfs/msg38045.html and
http://www.spinics.net/lists/linux-btrfs/msg37567.html).

Anyway, if you run into further issues, please report them.

thanks
--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."