Drawing lessons from fatal SELinux bug #1054350

Discussion:

Kevin Kofler

2014-01-23 23:55:23 UTC

Hi,

it is time to analyze the fallout from the following catastrophic Fedora 20
regression:
https://bugzilla.redhat.com/show_bug.cgi?id=1054350
"rpm scriptlets are exiting with status 127"

The impact:
* EVERYONE with Fedora 20 installed with SELinux enabled and in enforcing
mode, and who updated to the current stable updates, was hit by this bug.
* The bug completely breaks upgrading any package through both GUI and CLI
tools. Even the fix itself cannot be installed correctly.
* The only possible workaround requires use of the command line. It is
IMPOSSIBLE to fix this using GUI tools installed by default. The
system-config-selinux tool, which can be used to fix this purely through the
GUI, is NOT installed by default in Fedora 20 for some stupid reason
(because somebody decided to make it as painful as possible to disable that
SELinux junk? Now I have to install system-config-selinux first thing post-
install just so I can disable the dreaded thing), and of course cannot be
installed after the fact because of the bug. Normal users do not use
terminals, so they can only reinstall Fedora or (more likely) a competing
distribution (or even operating system)!
* The only possible workaround also requires root access to the machine.
PolicyKit policy allows all users to install official updates by default,
but those users then cannot fix the breakage without bothering an
administrator.
* As per the above, there are several installations that can be considered
BRICKED.
* We are losing users to Ubuntu because of this issue. People are explicitly
saying they are switching to Ubuntu because of this bug (e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1054312#c5 , later confirmed:
https://bugzilla.redhat.com/show_bug.cgi?id=1054312#c10 ), and I am sure
there are many more who are silently doing it without telling us.
* The bug now has 38 (!) duplicates in Bugzilla, plus many complaints on
IRC, mailing lists, comments to other unrelated bugs (the fix for which
cannot be installed due to the SELinux bug) etc.

So it is time to draw some lessons from this issue to prevent such a bug
from ever occurring again!

So, what happened:
* We are shipping SELinux enabled (enforcing) by default, a tool designed to
prevent anything it does not like from happening. (Reread this carefully:
The ONLY thing that tool is designed to do at all is PREVENT things. It does
not have a SINGLE feature other than being a roadblock and an annoyance.)
* SELinux works by shipping a "policy" that effectively tries to specify in
one single place (read: single point of failure!) everything any program in
Fedora (scalability disaster!) ever wants to do (second-guessing its actual
code, i.e., duplication of all logic!). (Note the 3 (!) major antipatterns
in a single-sentence (!) description of how SELinux works!)
* An update to that SELinux policy was shipped that BREAKS the most critical
tools in Fedora, the ones required to update the system and thus install the
fixes for any regressions, including the very regression that caused the
breakage. Any automated workarounds are also blocked by design.
* That update made it out to the stable updates! In other words, the
draconian Update Policies that were enacted in a vain attempt to prevent
such issues from happening utterly failed at catching this bug.

Meanwhile, SELinux is also causing similarly fatal issues in Rawhide:
https://bugzilla.redhat.com/show_bug.cgi?id=1052317
"selinux-policy preventing login through sddm and ssh"
which are still NOT fixed! At least in that case, RPM is apparently not
affected, but if you cannot log in to your system (SDDM is the default
display manager for KDE in Rawhide), it is totally unusable ("bricked")!

So, what needs to happen:
* SELinux must be disabled (or preferably, not installed in the first place,
to avoid wasting space for nothing) by default! Just consider the benefits
(none!) vs. the risks (what you are seeing now: bricked systems in both F20
and Rawhide, the users switching to other distributions). If we want to have
any users left, SELinux needs to go away NOW!
* The Update Policies must be repealed. This regression has shown us that
not only did they totally fail at preventing it, but they are actively
contributing to exposing MORE users to broken updates by delaying regression
fixes. (This kind of regression fix needs to go out DIRECTLY to stable!)

Last time an issue like that happened (the D-Bus regression that broke
updates), a big drama was made that ultimately led to the (flawed) Update
Policies. And even a "catastrophe" that hit only a very small portion of our
users (those running the server part of bind) was used as a(n additional)
justification for the Update Policies, whereas this one now hits ALL users
who merely had the mishap of sticking to our flawed defaults (SELinux
enforcing). Why would we stick our heads in the sand this time?

DISABLE/DROP SELINUX NOW!

Thank you for your consideration,
Kevin Kofler

PS: I still recommend that ALL Fedora users disable SELinux immediately
after installing Fedora. That is the most effective way to avoid ever being
hit by catastrophic breakage such as bug #1054350 or bug #1052317. But we
should not ship with a broken default in the first place!

Adam Williamson

2014-01-24 00:02:40 UTC