Discussion:
[systemd-devel] systemd kills mdmon if it was started manually by user
Andrey Borzenkov
2010-12-04 08:41:26 UTC
Permalink
If a user starts an array manually (e.g. mdadm -A -s) from within a
user session and the array needs mdmon, mdmon becomes part of the user
session control group:

├ user
│ └ root
│ └ 1
│ ├ 1916 login -- root
│ ├ 1930 -bash
│ ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
│ └ 2062 mdmon md127


It is then killed by systemd during shutdown as part of the user
session. This results in a dirty array on the next boot.

Is there any magic that allows a daemon to be exempted from killing?

TIA

-andrey
Christian Parpart
2010-12-04 09:12:23 UTC
Post by Andrey Borzenkov
If user starts array manually (mdadm -A -s as example) from within
user session and array needs mdmon, mdmon becomes part of user session
├ user
│ └ root
│ └ 1
│ ├ 1916 login -- root
│ ├ 1930 -bash
│ ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
│ └ 2062 mdmon md127
It is then killed by systemd during shutdown as part of user session.
It results in dirty array on next boot.
Is there any magic that allows daemon to be exempted from killing?
While your RAID should absolutely not be corrupted on the next reboot
when mdmon receives a SIGTERM, I suspect you're using pam_systemd.so
(/etc/pam.d/system-auth), which automatically creates cgroups per login
session; these in turn get killed when the user has "completely logged out".
That is why your mdmon gets terminated, too.
You can avoid that by adding create-session=0 to it, like:

# grep pam_systemd /etc/pam.d/system-auth
session optional pam_systemd.so create-session=0

Which is the recommended way if you want processes (created by the "user") to
live on even when this user has fully logged out.

Regards,
Christian Parpart.

p.s.: see pam_systemd(8)
Andrey Borzenkov
2010-12-04 12:08:05 UTC
Post by Christian Parpart
Post by Andrey Borzenkov
If user starts array manually (mdadm -A -s as example) from within
user session and array needs mdmon, mdmon becomes part of user session
├ user
│ └ root
│   └ 1
│     ├ 1916 login -- root
│     ├ 1930 -bash
│     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
│     └ 2062 mdmon md127
It is then killed by systemd during shutdown as part of user session.
It results in dirty array on next boot.
Is there any magic that allows daemon to be exempted from killing?
While your raid should absolutely not be corrupted on next reboot
when mdmon receives a SIGTERM,
This won't be corrupted, but it will initiate a rebuild. I have reports
that such a rebuild may take hours, costing performance and temporarily
losing redundancy.
Post by Christian Parpart
I suspect you're using pam_systemd.so
Yes
Post by Christian Parpart
(/etc/pam.d/system-auth), which automatically creates cgroups by login
session, which in turn gets killed when the user has "completely logged out".
That is why your mdmon gets terminated, too.
Sure.
Post by Christian Parpart
# grep pam_systemd /etc/pam.d/system-auth
session     optional    pam_systemd.so create-session=0
But I do want user session to be created; and systemd was specifically
extended to properly terminate user sessions on shutdown. It is just
that mdmon does not belong to user session at all.
Post by Christian Parpart
Which is the recommended way if you want processes (created by the "user") to
live on even when this user has fully logged out.
mdmon does not belong to user. User is not even aware that it is
started. And it is likely not the last case. So systemd does need some
framework which can move such processes out of user session. It
probably needs some sd_daemon API to notify systemd that it is system
level task even if it was started as result of user interaction.
Tomasz Torcz
2010-12-04 12:14:13 UTC
Andrey Borzenkov
2011-01-22 17:55:00 UTC
Post by Andrey Borzenkov
Post by Christian Parpart
(/etc/pam.d/system-auth), which automatically creates cgroups by login
session, which in turn gets killed when the user has "completely logged out".
That is why your mdmon gets terminated, too.
Sure.
Post by Christian Parpart
# grep pam_systemd /etc/pam.d/system-auth
session     optional    pam_systemd.so create-session=0
But I do want user session to be created; and systemd was specifically
extended to properly terminate user sessions on shutdown. It is just
that mdmon does not belong to user session at all.
 Man page talks about kill-session= and kill-user= parameters, which
may be useful to you.
Post by Andrey Borzenkov
Post by Christian Parpart
Which is the recommended way if you want processes (created by the "user") to
live on even when this user has fully logged out.
mdmon does not belong to user. User is not even aware that it is
started. And it is likely not the last case. So systemd does need some
framework which can move such processes out of user session. It
probably needs some sd_daemon API to notify systemd that it is system
level task even if it was started as result of user interaction.
 Well, it is started by user, so it belongs to user. And
systemd has an API to start system-level task as a result of
user interaction: it is called "systemctl start mdmon.service".
mdmon is not a singleton - it is started for every array that needs it
(not each array needs it). Can you pass extra parameters that identify
object mdmon should monitor via systemctl?

Using udev to listen for the new array event and starting mdmon from
there looks promising. I do not know whether it is possible to identify
such arrays at that point, though, nor do I have hardware to test.
Lennart Poettering
2011-01-25 03:44:35 UTC
Post by Andrey Borzenkov
Post by Andrey Borzenkov
mdmon does not belong to user. User is not even aware that it is
started. And it is likely not the last case. So systemd does need some
framework which can move such processes out of user session. It
probably needs some sd_daemon API to notify systemd that it is system
level task even if it was started as result of user interaction.
 Well, it is started by user, so it belongs to user. And
systemd has an API to start system-level task as a result of
user interaction: it is called "systemctl start mdmon.service".
mdmon is not a singleton - it is started for every array that needs it
(not each array needs it). Can you pass extra parameters that identify
object mdmon should monitor via systemctl?
systemd supports instantiated services, for example to deal with the
gettys (e.g. "getty@tty5.service"). It should be trivial to use the same
for mdmon (e.g. "mdmon@md3.service").

Lennart
--
Lennart Poettering - Red Hat, Inc.
Andrey Borzenkov
2011-01-25 03:58:13 UTC
On Tue, Jan 25, 2011 at 6:44 AM, Lennart Poettering
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Andrey Borzenkov
mdmon does not belong to user. User is not even aware that it is
started. And it is likely not the last case. So systemd does need some
framework which can move such processes out of user session. It
probably needs some sd_daemon API to notify systemd that it is system
level task even if it was started as result of user interaction.
 Well, it is started by user, so it belongs to user. And
systemd has an API to start system-level task as a result of
user interaction: it is called "systemctl start mdmon.service".
mdmon is not a singleton - it is started for every array that needs it
(not each array needs it). Can you pass extra parameters that identify
object mdmon should monitor via systemctl?
systemd supports instantiated services, for example to deal with the
That's right, but the names are not known in advance and can change
between reboots. This means such units have to be generated
dynamically, exist until reboot (ramfs?) and be removed when array is
destroyed. Not sure it is really manageable.

And which instance should generate them? mdadm?
Lennart Poettering
2011-01-25 04:28:15 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
systemd supports instantiated services, for example to deal with the
That's right, but the names are not known in advance and can change
between reboots. This means such units have to be generated
dynamically, exist until reboot (ramfs?) and be removed when array is
destroyed. Not sure it is really manageable.
Hmm? It should be sufficient to just write the service template properly
("mdmon@.service") and then instantiate it when needed with "systemctl
start mdmon@xyz.service" or something equivalent. It's a matter of
issuing a single dbus call.
Post by Andrey Borzenkov
And which instance should generate them? mdadm?
I think it is much nicer to spawn the necessary mdadm service instance
from a udev rule, but you could even run it from mdadm by invoking one
dbus call from it.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Andrey Borzenkov
2011-02-04 19:55:06 UTC
On Tue, Jan 25, 2011 at 7:28 AM, Lennart Poettering
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
systemd supports instantiated services, for example to deal with the
That's right, but the names are not known in advance and can change
between reboots. This means such units have to be generated
dynamically, exist until reboot (ramfs?) and be removed when array is
destroyed. Not sure it is really manageable.
Hmm? It should be sufficient to just write the service template properly
issuing a single dbus call.
Post by Andrey Borzenkov
And which instance should generate them? mdadm?
I think it is much nicer to spawn the necessary mdadm service instance
from a udev rule,
Yes, this can be done relatively easily; as a proof of concept:

SUBSYSTEM!="block", GOTO="systemd_md_end"
ACTION!="change", GOTO="systemd_md_end"
KERNEL!="md*", GOTO="systemd_md_end"
ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
LABEL="systemd_md_end"

where mdmon@.service is


[Unit]
Description=mdmon service
BindTo=dev-%i.device
After=dev-%i.device

[Service]
Type=forking
PIDFile=/dev/.mdadm/%i.pid
ExecStart=/sbin/mdmon --takeover %i


With the result

[root@localhost ~]# systemctl status mdmon@md127.service
mdmon@md127.service - mdmon service
          Loaded: loaded (/etc/systemd/system/mdmon@.service)
          Active: active (running) since Tue, 08 Feb 2011 09:43:30 -0500; 5min ago
         Process: 1467 ExecStart=/sbin/mdmon --takeover %i (code=exited, status=0/SUCCESS)
        Main PID: 1468 (mdmon)
          CGroup: name=systemd:/system/mdmon@.service/md127
                  └ 1468 /sbin/mdmon --takeover md127

Setting SYSTEMD_WANTS would be a more elegant solution, but it does not
work with the current systemd implementation. It is capable of starting
requested units only on the "add" event (effectively the very first time
the device becomes plugged), while mdmon must be started on the "change"
event, as only then do we know whether mdmon is required at all.

Running mdmon via systemd in this way opens up an interesting
possibility. E.g. the service could be declared "immortal" and be exempt
from the usual shutdown sequence ... or is that already possible?

Actually it can be implemented even without mdadm patches; apparently
it is possible to suppress normal starting of mdmon by setting
MDADM_NO_MDMON=1
Post by Lennart Poettering
but you could even run it from mdadm by invoking one
dbus call from it.
It may turn out to be necessary still. If a container needs mdmon, the
arrays it contains won't become read-write until mdmon is started. If
mdmon is started asynchronously by udev, there is a window where someone
may try to use the array before it is rw. As a trivial example, a mount
unit which depends on the md device unit.

I do not think the mdadm maintainer will be happy to add a D-Bus
dependency to something that is likely to be included in the initrd
though :) But maybe we could simply try execl("/bin/systemctl", "start",
...) before the current execl("/sbin/mdmon", ...)?
Lennart Poettering
2011-02-08 09:48:43 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
That's right, but the names are not known in advance and can change
between reboots. This means such units have to be generated
dynamically, exist until reboot (ramfs?) and be removed when array is
destroyed. Not sure it is really manageable.
Hmm? It should be sufficient to just write the service template properly
issuing a single dbus call.
Post by Andrey Borzenkov
And which instance should generate them? mdadm?
I think it is much nicer to spawn the necessary mdadm service instance
from a udev rule,
SUBSYSTEM!="block", GOTO="systemd_md_end"
ACTION!="change", GOTO="systemd_md_end"
KERNEL!="md*", GOTO="systemd_md_end"
ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
LABEL="systemd_md_end"
Nah, it's much better to simply use the SYSTEMD_WANTS var on the device.

Something like this:

...., ENV{SYSTEMD_WANTS}="mdmon@%k.service"

That way the device unit will simply have a wants dep on the service
unit, and this is perfectly discoverable.
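Folding the matches from the proof-of-concept rule earlier in the thread into that form would give something like (an untested sketch):

```
SUBSYSTEM=="block", ACTION=="change", KERNEL=="md*", ATTR{md/metadata_version}=="external:[A-Za-z]*", ENV{SYSTEMD_WANTS}="mdmon@%k.service"
```

Though, as the follow-up notes, the systemd of the time only honoured SYSTEMD_WANTS on "add" events, which is exactly the limitation under discussion.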
Post by Andrey Borzenkov
Setting SYSTEMD_WANTS would be more elegant solution, but it does not
work with current systemd implementation. It is capable of starting
requested units only on "add" event (effectively the very first time
device becomes plugged), while mdmon must be started on "change"
event, as only then we know whether mdmon is required at all.
Oha, so you are actually aware of SYSTEMD_WANTS. Hmm. I need to think
about this. Why does md employ the change event? Is this really
necessary? It smells a bit foul.
Post by Andrey Borzenkov
Running mdmon via systemd in this way opens up interesting
possibility. E.g. service could be declared "immortal" and be exempt
from usual shutdown sequence ... or is it possible to do already?
A service needs to conflict with shutdown.target to be shut down when we
go down normally. If your service does not conflict with shutdown.target
then it will stay around and be killed only after systemd is gone and
PID 1 is systemd-shutdown, which then kills all remaining processes
(independent of any idea of "service") and then unmounts all file
systems. Normally all services conflict with shutdown.target implicitly,
which you can turn off by setting DefaultDependencies=.
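As a sketch of that opt-out, the proof-of-concept template from earlier in the thread would only need one extra line; DefaultDependencies=no is what suppresses the implicit Conflicts=shutdown.target (this is an assumption of how one would wire it, not a tested unit):

```ini
[Unit]
Description=mdmon service
DefaultDependencies=no
BindTo=dev-%i.device
After=dev-%i.device

[Service]
Type=forking
PIDFile=/dev/.mdadm/%i.pid
ExecStart=/sbin/mdmon --takeover %i
```

With that, the instance is no longer pulled down by the normal shutdown transaction and only dies in the final killing spree described above.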
Post by Andrey Borzenkov
Actually it can be implemented even without mdadm patches; apparently
it is possible to suppress normal starting of mdmon by setting
MDADM_NO_MDMON=1
At this point mdmon is simply broken: if glibc or mdmon itself (or any
lib it is using) is upgraded, then mdmon will keep referencing the old
.so or binary as long as it is running. This means that the fs these
files are on cannot be remounted r/o. However mdmon insists on being
shutdown only after all fs got remounted ro. So you have a cyclic
ordering loop here: mdmon wants to be shut down after the remount, but
we need to shut it down before the remount.

This is unfixable unless a) mdmon learns reexecution of itself without
losing state (like most init systems do), or b) mdmon would stop
insisting on being shutdown only after the remount.

In my eyes b) is very much preferable: It should be possible to shut
down mdmon like any other service. And if then some md related code
still needs to be run on late shutdown this should be done from a new
process. I would be willing to add some hooks for this, so that we can
execute arbitrary drop-in processes as part of the final shutdown loop.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Andrey Borzenkov
2011-02-08 10:52:41 UTC
On Tue, Feb 8, 2011 at 12:48 PM, Lennart Poettering
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
That's right, but the names are not known in advance and can change
between reboots. This means such units have to be generated
dynamically, exist until reboot (ramfs?) and be removed when array is
destroyed. Not sure it is really manageable.
Hmm? It should be sufficient to just write the service template properly
issuing a single dbus call.
Post by Andrey Borzenkov
And which instance should generate them? mdadm?
I think it is much nicer to spawn the necessary mdadm service instance
from a udev rule,
SUBSYSTEM!="block", GOTO="systemd_md_end"
ACTION!="change", GOTO="systemd_md_end"
KERNEL!="md*", GOTO="systemd_md_end"
ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
LABEL="systemd_md_end"
Nah, it's much better to simply use the SYSTEMD_WANTS var on the device.
That way the device unit will simply have a wants dep on the service
unit, and this is perfectly discoverable.
Post by Andrey Borzenkov
Setting SYSTEMD_WANTS would be more elegant solution, but it does not
work with current systemd implementation. It is capable of starting
requested units only on "add" event (effectively the very first time
device becomes plugged), while mdmon must be started on "change"
event, as only then we know whether mdmon is required at all.
Oha, so you are actually aware of SYSTEMD_WANTS. Hmm. I need to think
about this. Why does md employ the change event? Is this really
necessary? It smells a bit foul.
I am probably the wrong one to ask, but here is what happens when an
array is started (from the udev perspective):

UDEV [1297507039.109828] add /devices/virtual/block/md127 (block)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/virtual/block/md127
SUBSYSTEM=block
DEVNAME=/dev/md127
DEVTYPE=disk
SEQNUM=1742
UDISKS_PRESENTATION_NOPOLICY=1
MAJOR=9
MINOR=127
TAGS=:systemd:

After this event device goes "plugged" and SYSTEMD_WANTS (if any) are
triggered. But at this point we have zero information about array to
decide anything.

UDEV [1297507039.211940] change /devices/virtual/block/md127 (block)
UDEV_LOG=3
ACTION=change
DEVPATH=/devices/virtual/block/md127
SUBSYSTEM=block
DEVNAME=/dev/md127
DEVTYPE=disk
SEQNUM=1743
MD_LEVEL=container
MD_DEVICES=2
MD_METADATA=ddf
MD_UUID=f8362f39:0436b20f:cf338104:afec436e
MD_DEVNAME=ddf0
UDISKS_PRESENTATION_NOPOLICY=1
MAJOR=9
MINOR=127
DEVLINKS=/dev/disk/by-id/md-uuid-f8362f39:0436b20f:cf338104:afec436e
/dev/md/ddf0
TAGS=:systemd:

At this point we know it is container, know that it has external
metadata and know that we need external metadata handler (mdmon). But
it is too late for systemd.
Post by Lennart Poettering
Post by Andrey Borzenkov
Actually it can be implemented even without mdadm patches; apparently
it is possible to suppress normal starting of mdmon by setting
MDADM_NO_MDMON=1
At this point mdmon is simply broken: if glibc or mdmon itself (or any
lib it is using) is upgraded, then mdmon will keep referencing the old
.so or binary as long as it is running. This means that the fs these
files are on cannot be remounted r/o. However mdmon insists on being
shutdown only after all fs got remounted ro. So you have a cyclic
ordering loop here: mdmon wants to be shut down after the remount, but
we need to shut it down before the remount.
Ehh ...

a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)

b) having a binary launched off some fs should not prevent this fs from
being remounted ro - binaries are not opened rw
Post by Lennart Poettering
This is unfixable unless a) mdmon learns reexecution of itself without
losing state (like most init systems do), or b) mdmon would stop
insisting on being shutdown only after the remount.
As far as I can tell, both are true today; but remounting is not
enough, unfortunately.
Post by Lennart Poettering
In my eyes b) is very much preferable: It should be possible to shut
down mdmon like any other service. And if then some md related code
still needs to be run on late shutdown this should be done from a new
process. I would be willing to add some hooks for this, so that we can
execute arbitrary drop-in processes as part of the final shutdown loop.
mdmon is needed to ensure metadata were correctly updated. So it needs
to exist as long as metadata *may* be updated. For practical purposes
it means - until file system is unmounted and flushed to disks. I am
not sure that remounting ro stops all activity (at least, mounting ro
definitely *writes* to device using some filesystems).
Lennart Poettering
2011-02-08 11:07:31 UTC
Post by Andrey Borzenkov
I am probably the wrong one to ask, but here is what happens when
array is started (from udev perspective)
[...]
Post by Andrey Borzenkov
After this event device goes "plugged" and SYSTEMD_WANTS (if any) are
triggered. But at this point we have zero information about array to
decide anything.
[...]
Post by Andrey Borzenkov
At this point we know it is container, know that it has external
metadata and know that we need external metadata handler (mdmon). But
it is too late for systemd.
Kay, do you know why this "change" event is used here? Any chance we can
get rid of it?
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
Actually it can be implemented even without mdadm patches; apparently
it is possible to suppress normal starting of mdmon by setting
MDADM_NO_MDMON=1
At this point mdmon is simply broken: if glibc or mdmon itself (or any
lib it is using) is upgraded, then mdmon will keep referencing the old
.so or binary as long as it is running. This means that the fs these
files are on cannot be remounted r/o. However mdmon insists on being
shutdown only after all fs got remounted ro. So you have a cyclic
ordering loop here: mdmon wants to be shut down after the remount, but
we need to shut it down before the remount.
Ehh ...
a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)
Really, you are saying there is a synchronous way to make mdmon reexec
itself? How does that work?
Post by Andrey Borzenkov
b) having binary launched off some fs should not prevent this fs to be
remounted ro - binaries are not opened rw
If you run a binary and then the package manager replaces it then the
running instance will still refer to the old copy and this will have the
effect that the file isn't actually deleted until the process
exits/execs. And because that is the way it is the kernel will refuse
unmounting of the fs until you terminated/reexeced your process.
Post by Andrey Borzenkov
Post by Lennart Poettering
This is unfixable unless a) mdmon learns reexecution of itself without
losing state (like most init systems do), or b) mdmon would stop
insisting on being shutdown only after the remount.
As far as I can tell, both are true today; but remounting is not
enough, unfortunately.
So, you are saying we can shut down mdmon without ill effects early?
Post by Andrey Borzenkov
Post by Lennart Poettering
In my eyes b) is very much preferable: It should be possible to shut
down mdmon like any other service. And if then some md related code
still needs to be run on late shutdown this should be done from a new
process. I would be willing to add some hooks for this, so that we can
execute arbitrary drop-in processes as part of the final shutdown loop.
mdmon is needed to ensure metadata were correctly updated. So it needs
to exist as long as metadata *may* be updated. For practical purposes
it means - until file system is unmounted and flushed to disks. I am
not sure that remounting ro stops all activity (at least, mounting ro
definitely *writes* to device using some filesystems).
Well, the root file systems cannot be unmounted, only remounted.

So, if there is a way to invoke mdmon so that it flushes all metadata
changes to disk and then immediately terminates, this should be all we
need for a clean solution. We'd then shut down the normal instances of
mdmon like any other daemon and simply invoke this metadata flushing
command as part of late shutdown.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Andrey Borzenkov
2011-02-08 13:54:03 UTC
On Tue, Feb 8, 2011 at 2:07 PM, Lennart Poettering
Post by Lennart Poettering
Post by Andrey Borzenkov
I am probably the wrong one to ask, but here is what happens when
array is started (from udev perspective)
[...]
Post by Andrey Borzenkov
After this event device goes "plugged" and SYSTEMD_WANTS (if any) are
triggered. But at this point we have zero information about array to
decide anything.
[...]
Post by Andrey Borzenkov
At this point we know it is container, know that it has external
metadata and know that we need external metadata handler (mdmon). But
it is too late for systemd.
Kay, do you know why this "change" event is used here? Any chance we can
get rid of it?
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
Actually it can be implemented even without mdadm patches; apparently
it is possible to suppress normal starting of mdmon by setting
MDADM_NO_MDMON=1
At this point mdmon is simply broken: if glibc or mdmon itself (or any
lib it is using) is upgraded, then mdmon will keep referencing the old
.so or binary as long as it is running. This means that the fs these
files are on cannot be remounted r/o. However mdmon insists on being
shutdown only after all fs got remounted ro. So you have a cyclic
ordering loop here: mdmon wants to be shut down after the remount, but
we need to shut it down before the remount.
Ehh ...
a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)
Really, you are saying there is a synchronous way to make mdmon reexec
itself? How does that work?
I am not sure whether it qualifies as synchronous, but "mdmon
--takeover" will kill any existing mdmon for this array and take over
monitoring itself.
Post by Lennart Poettering
Post by Andrey Borzenkov
b) having binary launched off some fs should not prevent this fs to be
remounted ro - binaries are not opened rw
If you run a binary and then the package manager replaces it then the
running instance will still refer to the old copy and this will have the
effect that the file isn't actually deleted until the process
exits/execs. And because that is the way it is the kernel will refuse
unmounting of the fs until you terminated/reexeced your process.
Post by Andrey Borzenkov
Post by Lennart Poettering
This is unfixable unless a) mdmon learns reexecution of itself without
losing state (like most init systems do), or b) mdmon would stop
insisting on being shutdown only after the remount.
As far as I can tell, both are true today; but remounting is not
enough, unfortunately.
So, you are saying we can shut down mdmon without ill effects early?
At least that's what I see. You can shut down mdmon and continue to
work with the file system, even if it is mounted rw. Under some
conditions mount will hang; e.g.

start array
kill mdmon
try to mount

mount will hang. Once you start mdmon, it gets mounted. But if you now

umount
kill mdmon
mount

it is mounted just fine.
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
In my eyes b) is very much preferable: It should be possible to shut
down mdmon like any other service. And if then some md related code
still needs to be run on late shutdown this should be done from a new
process. I would be willing to add some hooks for this, so that we can
execute arbitrary drop-in processes as part of the final shutdown loop.
mdmon is needed to ensure metadata were correctly updated. So it needs
to exist as long as metadata *may* be updated. For practical purposes
it means - until file system is unmounted and flushed to disks. I am
not sure that remounting ro stops all activity (at least, mounting ro
definitely *writes* to device using some filesystems).
Well, the root file systems cannot be unmounted, only remounted.
So, if there is a way to invoke mdmon so that it flushes all metadata
changes to disk and then immediately terminates, this should be all we
need for a clean solution. We'd then shut down the normal instances of
mdmon like any other daemon and simply invoke this metadata flushing
command as part of late shutdown.
Hmm ... it looks like you just need to

start mdmon
do mdadm --wait-clean

After this you can kill mdmon again (assuming the device is no longer in use).
Lennart Poettering
2011-02-08 17:28:22 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)
Really, you are saying there is a synchronous way to make mdmon reexec
itself? How does that work?
I am not sure whether it qualifies as synchronous, but "mdmon
--takeover" will kill any existing mdmon for this array and take over
monitoring itself.
I wonder if this is really fully synchronous, i.e. that a) there is no
point in time where mdmon is not running during this restart and b) the
mdmon --takeover command returns when the new daemon is fully up, and
not right away.
Post by Andrey Borzenkov
Post by Lennart Poettering
Well, the root file systems cannot be unmounted, only remounted.
So, if there is a way to invoke mdmon so that it flushes all metadata
changes to disk and then immediately terminates, this should be all we
need for a clean solution. We'd then shut down the normal instances of
mdmon like any other daemon and simply invoke this metadata flushing
command as part of late shutdown.
Hmm ... it looks like you just need to
start mdmon
do mdadm --wait-clean
After this you can kill mdmon again (assuming the device is no longer in use).
Well, it would be nice if the md utils would offer something doing this
without spawning multiple processes and killing them again.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Dan Williams
2011-10-23 08:00:36 UTC
On Tue, Feb 8, 2011 at 9:28 AM, Lennart Poettering
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)
Really, you are saying there is a synchronous way to make mdmon reexec
itself? How does that work?
I am not sure whether it qualifies as synchronous, but "mdmon
--takeover" will kill any existing mdmon for this array and take over
monitoring itself.
I wonder if this is really fully synchronous, i.e. that a) there is no
point in time where mdmon is not running during this restart and b) the
mdmon --takeover command returns when the new daemon is fully up, and
not right away.
Post by Andrey Borzenkov
Post by Lennart Poettering
Well, the root file systems cannot be unmounted, only remounted.
So, if there is a way to invoke mdmon so that it flushes all metadata
changes to disk and then immediately terminates, this should be all we
need for a clean solution. We'd then shut down the normal instances of
mdmon like any other daemon and simply invoke this metadata flushing
command as part of late shutdown.
Hmm ... it looks like you just need to:
  start mdmon
  do mdadm --wait-clean
After this you can kill mdmon again (assuming the device is no longer in use).
Well, it would be nice if the md utils would offer something doing this
without spawning multiple processes and killing them again.
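The start/wait-clean/stop sequence discussed above could look roughly like the following dry run. This is only a sketch of the thread's proposal: the run() wrapper merely echoes each command so the sketch is safe to execute anywhere, and /dev/md127 is an example device name, not anything mandated by mdadm.

```shell
# Dry-run sketch of the late-shutdown sequence discussed above.
# Drop the run() wrapper to actually execute the commands (needs root
# and a real array); as written, this only prints what would be run.
run() { echo "$@"; }

run mdmon /dev/md127              # 1. make sure a monitor is running (example device)
run mdadm --wait-clean --scan     # 2. block until all arrays are marked clean
run mdadm --stop --scan           # 3. stop the arrays; mdmon then exits on its own
```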
/me wonders why his raid5 resyncs every boot on Fedora 15 and has
found this old thread.

I'm tempted to:

1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)
2/ arrange for mdadm --wait-clean --scan to be called after all
filesystems have been mounted read only

...but a few things strike me. This does not seem to be what was
being proposed above. Systemd does not treat dm devices like a
service and takes care to shut them down explicitly (but in that case
there is an api that it can call). Is it time for a libmd.so, so
systemd can invoke the "--wait-clean --scan" process itself? Probably
simpler to just SIGTERM mdmon and wait for it.

--
Dan
Thomas Jarosch
2011-10-24 08:04:59 UTC
Permalink
Is it time for a libmd.so, so systemd can invoke the "--wait-clean --scan"
process itself? Probably simpler to just SIGTERM mdmon and wait for it.
The mdadm code makes good use of non-reentrant functions like ctime(),
readdir() and others. Luckily systemd is single threaded.

If we provide a "public" interface, that would need fixing though.

Cheers,
Thomas
NeilBrown
2011-10-25 01:40:50 UTC
Permalink
Post by Dan Williams
On Tue, Feb 8, 2011 at 9:28 AM, Lennart Poettering
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Andrey Borzenkov
a) mdmon is perfectly capable of restarting, it is already used to
take over mdmon launched in initrd. The problem is to know when to
restart - i.e. when respective libraries are changed. This is a job
for package management in distribution. It is already employed for
glibc, systemd and some others and can just as well be employed for
mdmon. And this is totally unrelated to systemd :)
Really, you are saying there is a synchronous way to make mdmon reexec
itself? How does that work?
I am not sure whether it qualifies as synchronous, but "mdmon
--takeover" will kill any existing mdmon for this and start monitoring
itself.
I wonder if this is really fully synchronous, i.e. that a) there is no
point in time where mdmon is not running during this restart and b) the
mdmon --takeover command returns when the new daemon is fully up, and
not right away.
Post by Andrey Borzenkov
Post by Lennart Poettering
Well, the root file systems cannot be unmounted, only remounted.
So, if there is a way to invoke mdmon so that it flushes all metadata
changes to disk and immediately terminates, then this should be all we
need for a clean solution. We'd then shut down the normal instances of
mdmon like any other daemon and simply invoke this metadata
flushing command as part of late shutdown.
Hmm ... it looks like you just need to:
  start mdmon
  do mdadm --wait-clean
After this you can kill mdmon again (assuming the device is no longer in use).
Well, it would be nice if the md utils would offer something doing this
without spawning multiple processes and killing them again.
/me wonders why his raid5 resyncs every boot on Fedora 15 and has
found this old thread.
1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)
2/ arrange for mdadm --wait-clean --scan to be called after all
filesystems have been mounted read only
...but a few things strike me. This does not seem to be what was
being proposed above. Systemd does not treat dm devices like a
service and takes care to shut them down explicitly (but in that case
there is an api that it can call). Is it time for a libmd.so, so
systemd can invoke the "--wait-clean --scan" process itself? Probably
simpler to just SIGTERM mdmon and wait for it.
--
Dan
Hi Dan,
could you please explain in a bit more detail exactly what you think it is
that is going wrong for you?

I don't think it is anything like the original problem, as I don't think
you are starting array manually.

I think your problem is that 'mdmon' is being killed too early at shutdown.
Clearly we need to get whatever-kills-user-processes to skip mdmon - maybe by
writing the pid to some magic file that 'ignore_proc' already knows about?

Ultimately we probably want to get udev to start mdmon for us and have
mdadm notice and not start it itself.
We also need to get udev to notice arrays that are being reshaped and to
start the mdadm which monitors the reshape so that mdadm doesn't have to
fork it itself.

That should fix the original problem, but I don't think it addresses your
problem at all.

I don't have a Fedora install so I cannot hunt around to see what is
happening.

I don't like the idea for a 'libmd.so' at all - certainly not until the
problem is properly understood and other solutions (like running
scripts) prove ineffective.

NeilBrown
Lennart Poettering
2011-10-31 11:06:13 UTC
Permalink
Post by Dan Williams
Post by Lennart Poettering
Well, it would be nice if the md utils would offer something doing this
without spawning multiple processes and killing them again.
/me wonders why his raid5 resyncs every boot on Fedora 15 and has
found this old thread.
1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)
This will not help you.

We nowadays jump back into the initrd when we shut down, so that the
initrd disassembles everything it assembled at boot time. This for the
first time enables us to ensure that all layers of our stack are in a
sane state (i.e. fully offline) when we shut down, regardless in which
way you stack it.

However, just excluding mdmon from being killed will not make this work,
simply because jumping into initrd only works sensibly if we can drop
all references to all previous mounts which requires us to have only one
process running at that time, and one process only.

It always boils down to the same thing: mdmon must be something we can
shut down cleanly like every other process. Excluding it from that will
just move the problem around, but not fix it.
Post by Dan Williams
2/ arrange for mdadm --wait-clean --scan to be called after all
filesystems have been mounted read only
Won't help you really either, since we have to kill all processes before
we jump into the initrd to fully disassemble mounts and storage.

There'll always be this chicken and egg problem: we cannot disassemble
all storage until all processes are gone and we are back in the
initrd. But mdmon wants to stay running after we
Post by Dan Williams
...but a few things strike me. This does not seem to be what was
being proposed above. Systemd does not treat dm devices like a
service and takes care to shut them down explicitly (but in that case
there is an api that it can call). Is it time for a libmd.so, so
systemd can invoke the "--wait-clean --scan" process itself? Probably
simpler to just SIGTERM mdmon and wait for it.
We actually try to disassemble md already, i.e. we call the
DM_DEV_REMOVE ioctl for all left-over devices. I am not really
interested to link against libdm itself.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Lennart Poettering
2011-10-31 11:15:20 UTC
Permalink
Post by Lennart Poettering
We actually try to disassemble md already, i.e. we call the
DM_DEV_REMOVE ioctl for all left-over devices. I am not really
interested to link against libdm itself.
Sorry, wasn't fully woken up yet and mixed up dm and md here. Ignore
this sentence...

Lennart
--
Lennart Poettering - Red Hat, Inc.
NeilBrown
2011-11-02 00:44:16 UTC
Permalink
On Mon, 31 Oct 2011 12:06:13 +0100 Lennart Poettering
Post by Lennart Poettering
Post by Dan Williams
Post by Lennart Poettering
Well, it would be nice if the md utils would offer something doing this
without spawning multiple processes and killing them again.
/me wonders why his raid5 resyncs every boot on Fedora 15 and has
found this old thread.
1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)
This will not help you.
We nowadays jump back into the initrd when we shut down, so that the
initrd disassembles everything it assembled at boot time. This for the
first time enables us to ensure that all layers of our stack are in a
sane state (i.e. fully offline) when we shut down, regardless in which
way you stack it.
This sounds particularly elegant.
Is there some part of the filesystem, that survives through the whole process
- from before / is mounted until after it is unmounted?
Presumably this would be /run if anything.

mdmon must be running from the time that / becomes writable until after it
becomes readonly.
If we can have it from before it is mounted until after it is unmounted, that
might be even better.
(It is possible to start a new one which replaces the old one, but it
would be best if that were only used for version upgrades.)

So if mdmon has a 'cwd' and all open files in /run (and the executable
elsewhere in the same filesystem), could it easily survive the 'kill all
processes before unmounting /' thing?
Post by Lennart Poettering
However, just excluding mdmon from being killed will not make this work,
simply because jumping into initrd only works sensibly if we can drop
all references to all previous mounts which requires us to have only one
process running at that time, and one process only.
It always boils down to the same thing: mdmon must be something we can
shut down cleanly like every other process. Excluding it from that will
just move the problem around, but not fix it.
My ideal would be that you just ignore mdmon.
After unmounting '/', you shut down md arrays with "mdadm -Ss" and then mdmon
will spontaneously disappear.
Post by Lennart Poettering
Post by Dan Williams
2/ arrange for mdadm --wait-clean --scan to be called after all
filesystems have been mounted read only
Won't help you really either, since we have to kill all processes before
we jump into the initrd to fully disassemble mounts and storage.
There'll always be this chicken and egg problem: we cannot disassemble
all storage until all processes are gone and we are back in the
initrd. But mdmon wants to stay running after we
Post by Dan Williams
...but a few things strike me. This does not seem to be what was
being proposed above. Systemd does not treat dm devices like a
service and takes care to shut them down explicitly (but in that case
there is an api that it can call). Is it time for a libmd.so, so
systemd can invoke the "--wait-clean --scan" process itself? Probably
simpler to just SIGTERM mdmon and wait for it.
We actually try to disassemble md already, i.e. we call the
DM_DEV_REMOVE ioctl for all left-over devices. I am not really
interested to link against libdm itself.
:-)
I get used to this .. people confusing md and dm, people confusing nfs-client
with nfs-server, people confusing me with some other Mr Brown :-)

NeilBrown
Lennart Poettering
2011-11-02 01:16:15 UTC
Permalink
Post by NeilBrown
Post by Lennart Poettering
We nowadays jump back into the initrd when we shut down, so that the
initrd disassembles everything it assembled at boot time. This for the
first time enables us to ensure that all layers of our stack are in a
sane state (i.e. fully offline) when we shut down, regardless in which
way you stack it.
This sounds particularly elegant.
Is there some part of the filesystem, that survives through the whole process
- from before / is mounted until after it is unmounted?
Presumably this would be /run if anything.
Yes. /run is usually mounted by the initrd these days, and the initrd
itself places its binaries in /run/initramfs/ which systemd then
pivot_root()s into at shutdown.
Post by NeilBrown
mdmon must be running from the time that / becomes writable until after it
becomes readonly.
I'd really prefer if we could somehow make it something that isn't
special and that we could just shut down.
Post by NeilBrown
If we can have it from before it is mounted until after it is unmounted, that
might be even better.
Well, that could work if mdmon is invoked in the initrd only. If mdmon
is always running from the initrd this would solve the issue that it
keeps files on the real root referenced thus making unmounting of /
impossible.

However, there might be complexities here: what happens if the user
creates an MD device during normal operation, so that mdmon is started
at runtime, and not from the initrd?

That said, I'd definitely prefer that if mdmon really wants to avoid
systemd and live independently of it, it does so by being invoked from
the initrd, so that it runs completely independently of all systemd
book-keeping.

If this is what you want, then we could come up with a simple scheme
like "a process owned by root who has +t set on /proc/$PID/stat" is
excluded from systemd's killing.

But again, I really think that mdmon should just be fixed to become a
daemon that can be shut down at any time.
Post by NeilBrown
(It is possible to start a new one which replaces the old one, but it
would be best if that were only used for version upgrades.)
If you do upgrades like that then you end up with a version of mdmon
running that is still referencing the root dir. That means the initrd
disassembling will break.
Post by NeilBrown
So if mdmon has a 'cwd' and all open files in /run (and the executable
elsewhere in the same filesystem), could it easily survive the 'kill all
processes before unmounting /' thing?
Right now no. But if the +t scheme would work for you we could add
that. But you'd need a good story for how to handle upgrades and arrays that
are assembled during runtime (i.e. after initrd).
Post by NeilBrown
Post by Lennart Poettering
However, just excluding mdmon from being killed will not make this work,
simply because jumping into initrd only works sensibly if we can drop
all references to all previous mounts which requires us to have only one
process running at that time, and one process only.
It always boils down to the same thing: mdmon must be something we can
shut down cleanly like every other process. Excluding it from that will
just move the problem around, but not fix it.
My ideal would be that you just ignore mdmon.
After unmounting '/', you shut down md arrays with "mdadm -Ss" and then mdmon
will spontaneously disappear.
That's still a chicken and egg problem. We cannot unmount / until all
references to files on / are dropped. For that we need all processes
running from it terminated. That means mdmon needs to go first, and only
then we can unmount /.

Lennart
--
Lennart Poettering - Red Hat, Inc.
NeilBrown
2011-11-02 02:03:34 UTC
Permalink
Post by Lennart Poettering
Post by NeilBrown
Post by Lennart Poettering
We nowadays jump back into the initrd when we shut down, so that the
initrd disassembles everything it assembled at boot time. This for the
first time enables us to ensure that all layers of our stack are in a
sane state (i.e. fully offline) when we shut down, regardless in which
way you stack it.
This sounds particularly elegant.
Is there some part of the filesystem, that survives through the whole process
- from before / is mounted until after it is unmounted?
Presumably this would be /run if anything.
Yes. /run is usually mounted by the initrd these days, and the initrd
itself places its binaries in /run/initramfs/ which systemd then
pivot_root()s into at shutdown.
Post by NeilBrown
mdmon must be running from the time that / becomes writable until after it
becomes readonly.
I'd really prefer if we could somehow make it something that isn't
special and that we could just shut down.
It must remain running until the array that it manages is read-only and will
never be written to again. Then it can be shut down gracefully.
It may be awkward to shut it down gracefully at the moment - I'm not sure. I
can certainly fix that.
Post by Lennart Poettering
Post by NeilBrown
If we can have it from before it is mounted until after it is unmounted, that
might be even better.
Well, that could work if mdmon is invoked in the initrd only. If mdmon
is always running from the initrd this would solve the issue that it
keeps files on the real root referenced thus making unmounting of /
impossible.
However, there might be complexities here: what happens if the user
creates an MD device during normal operation, so that mdmon is started
at runtime, and not from the initrd?
Each instance of mdmon manages a set of arrays and must remain running
until all of those arrays are readonly (or shut down). This allows it to
record that all writes have completed and mark the array as 'clean' so a
resync isn't needed at next boot.

If a user creates an array while the system is running, it will not have the
root filesystem on it. So between unmounting the last non-root filesystem
and unmounting root it is perfectly OK to stop that mdmon.
Post by Lennart Poettering
That said, I'd definitely prefer that if mdmon really wants to avoid
systemd and live independently of it, it does so by being invoked from
the initrd, so that it runs completely independently of all systemd
book-keeping.
If this is what you want, then we could come up with a simple scheme
like "a process owned by root who has +t set on /proc/$PID/stat" is
excluded from systemd's killing.
You couldn't just do the equivalent of:
  fuser -k /some/filesystem
  umount /some/filesystem

iterating over filesystems with '/' last?

Then anything that only uses the /run filesystem will survive.
Post by Lennart Poettering
But again, I really think that mdmon should just be fixed to become a
daemon that can be shut down at any time.
Post by NeilBrown
(It is possible to start a new one which replaces the old one, but it
would be best if that were only used for version upgrades.)
If you do upgrades like that then you end up with a version of mdmon
running that is still referencing the root dir. That means the initrd
disassembling will break.
True. A version upgrade would need to stash the binary in /run.
It might be better to go the 'remount-readonly - then stop mdmon' route.
Post by Lennart Poettering
Post by NeilBrown
So if mdmon has a 'cwd' and all open files in /run (and the executable
elsewhere in the same filesystem), could it easily survive the 'kill all
processes before unmounting /' thing?
Right now no. But if the +t scheme would work for you we could add
that. But you'd need a good story for how to handle upgrades and arrays that
are assembled during runtime (i.e. after initrd).
Post by NeilBrown
Post by Lennart Poettering
However, just excluding mdmon from being killed will not make this work,
simply because jumping into initrd only works sensibly if we can drop
all references to all previous mounts which requires us to have only one
process running at that time, and one process only.
It always boils down to the same thing: mdmon must be something we can
shut down cleanly like every other process. Excluding it from that will
just move the problem around, but not fix it.
My ideal would be that you just ignore mdmon.
After unmounting '/', you shut down md arrays with "mdadm -Ss" and then mdmon
will spontaneously disappear.
That's still a chicken and egg problem. We cannot unmount / until all
references to files on / are dropped. For that we need all processes
running from it terminated. That means mdmon needs to go first, and only
then we can unmount /.
Lennart
Does, or can, systemd remount '/' readonly before trying to unmount it and
allow some task to run at that point?

I guess it still needs to be able to differentiate processes that are holding
write-access to the filesystem and so need to be killed, from processes that are
only holding read-access and so can be permitted to remain.

Probably easiest for mdmon to just register itself as "Leave this until / is
readonly" - maybe by putting its pid file in
/run/preserve-until-readonly/mdmon-devname.pid

I don't quite get your "+t on /proc/$PID/stat" suggestion:

# chmod +t /proc/self/stat
chmod: changing permissions of `/proc/self/stat': Operation not permitted


NeilBrown
Lennart Poettering
2011-11-02 13:32:25 UTC
Permalink
Post by NeilBrown
Post by Lennart Poettering
I'd really prefer if we could somehow make it something that isn't
special and that we could just shut down.
It must remain running until the array that it manages is read-only and will
never be written to again. Then it can be shut down gracefully.
It may be awkward to shut it down gracefully at the moment - I'm not sure. I
can certainly fix that.
The big thing is that if things are done that way you'll always have the
chicken and egg problem: you really need to shut down mdmon before
unmounting root, but currently you require us to do it in the other
order too.
Post by NeilBrown
Post by Lennart Poettering
Post by NeilBrown
If we can have it from before it is mounted until after it is unmounted, that
might be even better.
Well, that could work if mdmon is invoked in the initrd only. If mdmon
is always running from the initrd this would solve the issue that it
keeps files on the real root referenced thus making unmounting of /
impossible.
However, there might be complexities here: what happens if the user
creates an MD device during normal operation, so that mdmon is started
at runtime, and not from the initrd?
Each instance of mdmon manages a set of arrays and must remain running
until all of those arrays are readonly (or shut down). This allows it to
record that all writes have completed and mark the array as 'clean' so a
resync isn't needed at next boot.
Why doesn't the kernel do that on its own?
Post by NeilBrown
If a user creates an array while the system is running, it will not have the
root filesystem on it. So between unmounting the last non-root filesystem
and unmounting root it is perfectly OK to stop that mdmon.
Well, that complicates things quite a bit, since that way the shutdown
logic has two very different paths.
Post by NeilBrown
Post by Lennart Poettering
That said, I'd definitely prefer that if mdmon really wants to avoid
systemd and live independently of it, it does so by being invoked from
the initrd, so that it runs completely independently of all systemd
book-keeping.
If this is what you want, then we could come up with a simple scheme
like "a process owned by root who has +t set on /proc/$PID/stat" is
excluded from systemd's killing.
You couldn't just do the equivalent of:
  fuser -k /some/filesystem
  umount /some/filesystem
iterating over filesystems with '/' last?
Then anything that only uses the /run filesystem will survive.
What we do right now is this:

kill_all_processes();
do {
        umount_all_file_systems_we_can();
        read_only_mount_all_remaining_file_systems();
} while (we_had_some_success_with_that());
jump_into_initrd();

As long as mdmon references a file from the root disk we cannot umount
it, so the loop wouldn't be effective.
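As a rough model of that loop, here is a dry-run shell sketch. try_umount and remount_ro are mock stand-ins for the real unmount/remount-read-only logic, and the mount list is invented for illustration; nothing here touches actual mounts.

```shell
# Mock of systemd's shutdown loop: keep unmounting until no further
# progress is made, remounting read-only whatever cannot be unmounted.
try_umount() { [ "$1" != "/" ]; }   # pretend everything but / can be unmounted
remount_ro() { :; }                 # no-op placeholder for a ro remount

mounts="/home /var /"
progress=1
while [ "$progress" = 1 ]; do
    progress=0
    remaining=""
    for m in $mounts; do
        if try_umount "$m"; then
            progress=1                    # we made progress this round
        else
            remount_ro "$m"
            remaining="$remaining $m"     # keep it for the next round
        fi
    done
    mounts=$remaining
done
echo "left:$mounts"                       # a busy / is all that remains
```

As in the thread: a process like mdmon holding a file on / keeps the last unmount from ever succeeding, so the loop terminates with / still mounted (read-only at best).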
Post by NeilBrown
Post by Lennart Poettering
Post by NeilBrown
(It is possible to start a new one which replaces the old one but if that was
only used for version upgrades, that would be best).
If you do upgrades like that then you end up with a version of mdmon
running that is still referencing the root dir. That means the initrd
disassembling will break.
True. A version upgrade would need to stash the binary in /run.
It might be better to go the 'remount-readonly - then stop mdmon' route.
It is not sufficient to stash the binary in /run, you'd also need to
include your own libc and in fact every single other library or file you
use.

Why? If a system is upgraded, library files are deleted and replaced by
new ones. If a process stays running with the original libraries mapped,
the file system cannot be remounted read-only, since the file is only
deleted in theory, but needs to be deleted on disk, which can only
happen if the file is not referenced anymore. Hence, if the user does an
upgrade of *any* of the files mdmon has open, we will not be able to
remount the file system those files are from read-only.
Post by NeilBrown
Post by Lennart Poettering
That's still a chicken and egg problem. We cannot unmount / until all
references to files on / are dropped. For that we need all processes
running from it terminated. That means mdmon needs to go first, and only
then we can unmount /.
Lennart
Does, or can, systemd remount '/' readonly before trying to unmount it and
allow some task to run at that point?
Well, we try that as a last resort.
Post by NeilBrown
I guess it still needs to be able to differentiate processes that are holding
write-access to the filesystem and so need to be killed, from processes that are
only holding read-access and so can be permitted to remain.
Basically what I am saying here is that it's a really bad idea that mdmon
insists on staying around until after the file system is unmounted, even
though it itself is running from it. And the fact that mdmon doesn't
have any of those files open for writing doesn't help you very much
here, due to the upgrade/delete issue.
Post by NeilBrown
# chmod +t /proc/self/stat
chmod: changing permissions of `/proc/self/stat': Operation not permitted
Uh oh, I was sure that one could actually change the access mode of
files in /proc. Seems I was wrong. An alternative solution might be to
do argv[0][0]='!' in your code, to tell systemd to exclude your process
from killing. This would be inspired by shells changing the first char
of argv[0] to "-" for login shells.

But again, I believe the right solution is to fix mdmon to make it
something that can be shut down normally at any time. That might mean
that some of its code has to move to the kernel, but otherwise you'll
always have this chicken and egg problem, and you cannot fix it properly.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Kay Sievers
2011-11-02 14:33:29 UTC
Permalink
Post by Lennart Poettering
Post by NeilBrown
Post by Lennart Poettering
I'd really prefer if we could somehow make it something that isn't
special and that we could just shut down.
It must remain running until the array that it manages is read-only and will
never be written to again. Then it can be shut down gracefully.
It may be awkward to shut it down gracefully at the moment - I'm not sure. I
can certainly fix that.
The big thing is that if things are done that way you'll always have the
chicken and egg problem: you really need to shut down mdmon before
unmounting root, but currently you require us to do it in the other
order too.
Yeah, that's just madness.

I talked to Harald, and the currently preferred idea is the version
where we start mdmon in the initramfs and never touch it again, and it
runs until the initramfs unmounts the rootfs and shuts down the box.

In that picture, the mdmon process is conceptually more like a kernel
thread than a userspace process. It can not be updated, can not be
restarted. The only way to update it is to rebuild initramfs and
reboot the box.
Post by Lennart Poettering
Post by NeilBrown
Each instance of mdmon manages a set of arrays and must remain running
until all of those arrays are readonly (or shut down). This allows it to
record that all writes have completed and mark the array as 'clean' so a
resync isn't needed at next boot.
Why doesn't the kernel do that on its own?
Because somebody was naive enough to think that userspace can tear
down the base it lives on, which in reality is just a total mess in
the real world. :)
Post by Lennart Poettering
Post by NeilBrown
True. A version upgrade would need to stash the binary in /run.
It might be better to go the 'remount-readonly - then stop mdmon' route.
It is not sufficient to stash the binary in /run, you'd also need to
include your own libc and in fact every single other library or file you
use.
I don't think any of these update games in a running system make much
sense in the end.
Post by Lennart Poettering
Post by NeilBrown
I guess it still needs to be able to differentiate processes that are holding
write-access to the filesystem and so need to be killed, from processes are
only holding read-access and so can be permitted to remain.
But again, I believe the right solution is to fix mdmon to make it
something that can be shut down normally at any time. That might mean
that some of its code has to move to the kernel, but otherwise you'll
always have this chicken and egg problem, and you cannot fix it properly.
That would be the ideal solution. Having the rootfs depend on a
tool that runs off the rootfs just asks for serious trouble. If all
that can't move to the kernel, the initramfs-only solution with the
above-mentioned constraints seems like the best option.

People who like to put their rootfs on a userspace managed raid device
just get what they asked for. :)

Kay
Lennart Poettering
2011-11-02 15:17:49 UTC
Permalink
Post by Kay Sievers
Post by Lennart Poettering
The big thing is that if things are done that way you'll always have the
chicken and egg problem: you really need to shut down mdmon before
unmounting root, but currently you require us to do it in the other
order too.
Yeah, that's just madness.
I talked to Harald, and the currently preferred idea is the version
where we start mdmon in the initramfs and never touch it again, and it
runs until the initramfs unmounts the rootfs and shuts down the box.
In that picture, the mdmon process is conceptually more like a kernel
thread than a userspace process. It can not be updated, can not be
restarted. The only way to update it is to rebuild initramfs and
reboot the box.
OK, I guess that means we'll need to define a way to recognize
the process then, to avoid killing it by systemd, similar to how we
exclude kernel threads from killing.

Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0] here, to
detect whether something is a process to avoid killing. Question is
which char to choose for that. I am tempted to use '@'.

That means we'd:

a) patch systemd to check whether argv[0][0] of a process is '@' and
the process is owned by root, and exclude it from killing on shutdown.

b) patch mdmon to set argv[0][0] of itself to '@' iff it is running from
the initrd. If it is run from the main system it should not set that and
just be shut down like any other service.

c) make sure that mdmon run from the initrd is never upgraded during
normal operation, only via dracut rebuild and reboot.
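The detection side of a) could be mocked like this. Note this is only a sketch of the convention proposed in this mail, not existing systemd code, and the cmdline value is an example string standing in for the contents of /proc/$PID/cmdline:

```shell
# Mock of the proposed check: spare root-owned processes whose argv[0]
# begins with '@'. The value below is an example, not a real /proc read.
cmdline="@mdmon md127"
case "$cmdline" in
    @*) decision=spare ;;   # daemon marked itself: skip at shutdown
    *)  decision=kill  ;;   # ordinary process: kill at shutdown
esac
echo "$decision"
```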

If this is acceptable I will cook up the patch for a).

Lennart
--
Lennart Poettering - Red Hat, Inc.
Kay Sievers
2011-11-02 15:21:39 UTC
Permalink
Post by Lennart Poettering
Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0] here, to
detect whether something is a process to avoid killing. Question is
Maybe introduce an 'initramfs' cgroup and move the pids there?

Kay
Lennart Poettering
2011-11-02 15:29:13 UTC
Permalink
Post by Kay Sievers
Post by Lennart Poettering
Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0] here, to
detect whether something is a process to avoid killing. Question is
Maybe introduce an 'initramfs' cgroup and move the pids there?
Well, in which hierarchy? I am a bit concerned about having other
subsystems muck with the systemd cgroup hierarchy, before systemd has
set it up.

I can see some elegance in having all code from the initrd that remains
running during boot in a cgroup of its own, but that's probably
orthogonal to finding a way to recognize processes not to kill at
shutdown. Why? Because there's stuff like Plymouth which also stays
around from the initramfs, but actually is something we *do* want to
kill on shutdown.


Lennart
--
Lennart Poettering - Red Hat, Inc.
Williams, Dan J
2011-11-02 22:18:00 UTC
Permalink
On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering
Post by Lennart Poettering
Post by Kay Sievers
Post by Lennart Poettering
Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0] here, to
detect whether something is a process to avoid killing. Question is
Maybe introduce a 'initramfs' cgroup and move the pids there?
Well, in which hierarchy? I am a bit concerned about having other
subsystems muck with the systemd cgroup hierarchy, before systemd has
set it up.
I can see some elegance in having all code from the initrd that remains
running during boot in a cgroup of its own, but that's probably
orthogonal to finding a way to recognize processes not to kill at
shutdown. Why? Because there's stuff like Plymouth which also stays
around from the initramfs, but actually is something we *do* want to
kill on shutdown.
So how about rather than binaries self modifying themselves as "please
don't kill me" with argv[][], shutdown can just avoid processes where
/proc/$PID/cmdline starts with /run/initramfs? Then it's up to where
the initramfs runs the binary to determine which instances it wants
provenance over versus leaving to the init system.

For manually started arrays maybe we should arrange for an
initramfs-started mdmon to spawn new instances for user-started
containers, rather than using the local /sbin/mdmon. Then the "mdadm
-Ss" initiated by /run/initramfs/shutdown can reliably stop any md
device regardless of how it was started.

--
Dan
Lennart Poettering
2011-11-02 23:39:00 UTC
Permalink
Post by Williams, Dan J
On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering
Post by Lennart Poettering
Post by Kay Sievers
Post by Lennart Poettering
Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0][0] here, to
detect whether something is a process to avoid killing. Question is
Maybe introduce a 'initramfs' cgroup and move the pids there?
Well, in which hierarchy? I am a bit concerned about having other
subsystems muck with the systemd cgroup hierarchy, before systemd has
set it up.
I can see some elegance in having all code from the initrd that remains
running during boot in a cgroup of its own, but that's probably
orthogonal to finding a way to recognize processes not to kill at
shutdown. Why? Because there's stuff like Plymouth which also stays
around from the initramfs, but actually is something we *do* want to
kill on shutdown.
So how about rather than binaries self modifying themselves as "please
don't kill me" with argv[][], shutdown can just avoid processes where
/proc/$PID/cmdline starts with /run/initramfs? Then it's up to where
the initramfs runs the binary to determine which instances it wants
provenance over versus leaving to the init system.
Nope, whether something should be excluded from killing during shutdown is
orthogonal to being part of the initramfs. For example, Plymouth
(i.e. the graphical boot splash thingy) is started from the initrd too, but
we definitely want to kill it on shut down.

I am a bit concerned about checks against paths since initrd might play
some namespacing games and the paths might not appear to the main system
the way you'd expect.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Williams, Dan J
2011-11-03 00:28:50 UTC
Permalink
On Wed, Nov 2, 2011 at 4:39 PM, Lennart Poettering
Post by Lennart Poettering
Post by Williams, Dan J
On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering
Post by Lennart Poettering
Post by Kay Sievers
Post by Lennart Poettering
Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0][0] here, to
detect whether something is a process to avoid killing. Question is
Maybe introduce a 'initramfs' cgroup and move the pids there?
Well, in which hierarchy? I am a bit concerned about having other
subsystems muck with the systemd cgroup hierarchy, before systemd has
set it up.
I can see some elegance in having all code from the initrd that remains
running during boot in a cgroup of its own, but that's probably
orthogonal to finding a way to recognize processes not to kill at
shutdown. Why? Because there's stuff like Plymouth which also stays
around from the initramfs, but actually is something we *do* want to
kill on shutdown.
So how about rather than binaries self modifying themselves as "please
don't kill me" with argv[][], shutdown can just avoid processes where
/proc/$PID/cmdline starts with /run/initramfs?  Then it's up to  where
the initramfs runs the binary to determine which instances it wants
provenance over versus leaving to the init system.
Nope, whether something should be excluded from killing during shutdown is
orthogonal to being part of the initramfs. For example, Plymouth
(i.e. the graphical boot splash thingy) is started from the initrd too, but
we definitely want to kill it on shut down.
In the plymouth case the path would be /bin/plymouth, the initramfs
would need to take special care to run mdmon from /run/initramfs to
identify it as needing the initramfs environment to carry out its
shutdown.
Post by Lennart Poettering
I am a bit concerned about checks against paths since initrd might play
some namespacing games and the paths might not appear to the main system
the way you'd expect.
The initramfs needs to be modified to either tell mdmon it is running
from the initramfs or arrange for /proc/$MDMON_PID/cwd to appear to be
from /run/initramfs. I only like the latter because it works with
existing mdmon binaries, but we may need shutdown to always leave
mdmon alone...

For user started md arrays the shutdown sequence still goes:

killall --> umount

...and we would need to express:

killall (but mdmon) --> umount --> mdadm -Ss (stops all not in use arrays)

So maybe we do the argv "@" tagging in all cases and systemd never
kills mdmon but arranges for all (stoppable) md devices to be stopped,
then rely on /run/initramfs/shutdown to handle the rootfs blockdev.

--
Dan
Williams, Dan J
2011-11-02 17:21:29 UTC
Permalink
On Wed, Nov 2, 2011 at 8:17 AM, Lennart Poettering
Post by Lennart Poettering
Post by Kay Sievers
Post by Lennart Poettering
The big thing is that if things are done that way you'll always have the
chicken and egg problem: you really need to shut down mdmon before
unmounting root, but currently you require us to do it in the other
order too.
Yeah, that's just madness.
I talked to Harald, and the currently preferred idea is the version
where we start mdmon in the initramfs and never touch it again, and it
runs until the initramfs unmounts the rootfs and shuts down the box.
In that picture, the mdmon process is conceptually more like a kernel
thread than a userspace process. It can not be updated, can not be
restarted. The only way to update it is to rebuild initramfs and
reboot the box.
OK, I guess that means we'll need to define a way to recognize
the process then, to avoid killing it by systemd, similar to how we
exclude kernel threads from killing.
Kernel threads we detect by checking whether /proc/$PID/cmdline is
empty, hence I'd suggest we use the first char of argv[0][0] here, to
detect whether something is a process to avoid killing. Question is
owned by root and exclude it from killing on shutdown.
the initrd. If it is run from the main system it should not set that and
just be shut down like any other service.
Well, there are two cases to consider:

1/ user starts the array manually and stops it with mdadm -Ss (mdmon
automatically shuts down). No need for a service; mdmon just follows
the lifespan of the array.

2/ user starts the array but then expects it to be around until system shutdown

In the latter case let the initramfs-mdmon take over all arrays with
"mdmon --takeover --all". But if all arrays may eventually be
re-parented to an mdmon instance from /run, why not always start mdmon
from there?
Lennart Poettering
2011-11-02 23:35:56 UTC
Permalink
Post by Williams, Dan J
Post by Lennart Poettering
owned by root and exclude it from killing on shutdown.
the initrd. If it is run from the main system it should not set that and
just be shut down like any other service.
1/ user starts the array manually and stops it with mdadm -Ss (mdmon
automatically shuts down). No need for a service; mdmon just follows
the lifespan of the array.
2/ user starts the array but then expects it to be around until system shutdown
In the latter case let the initramfs-mdmon takeover all arrays with
"mdmon --takeover --all". But if all arrays may eventually be
re-parented to an mdmon instance from /run, why not always start mdmon
from there?
Well I am not sure how mdmon works, but let's say you booted up with an
initrd lacking mdmon. Then, while the machine is up you set up some md
device, and start mdmon for that. At this point it will be independent
of the initrd. But that should be OK since at shutdown time it can be
detached cleanly without any special magic, too, since mdmon is not
stored on that md device. So if you boot from md you need mdmon in the
initrd. If you just use md outside of the root disk, then running mdmon
as a normal service (i.e. one that is shut down like any other) should
be perfectly fine.

This is why I suggested that only mdmon run from the initrd should set
argv[0][0] = '@', because only that one needs the special handling that
it cannot be terminated properly on shut down. The one running from the
normal system can be treated like a standard systemd service.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Williams, Dan J
2011-11-02 18:16:00 UTC
Permalink
Post by Kay Sievers
People who like to put their rootfs on a userspace managed raid device
just get what they asked for. :)
Proper care and feeding of mdmon and userspace managed block devices /
filesystems is a solvable problem. To me the ":)" runs the risk of
implying we don't think we can get this right.

--
Dan
Kay Sievers
2011-11-02 18:49:19 UTC
Permalink
Post by Williams, Dan J
Post by Kay Sievers
People who like to put their rootfs on a userspace managed raid device
just get what they asked for. :)
Proper care and feeding of mdmon and userspace managed block devices /
filesystems is a solvable problem.  To me the ":)" runs the risk of
implying we don't think we can get this right.
It implied that I think it is totally insane what you guys try to
accomplish. Managing the rootfs blockdev with tools contained in the
rootfs itself is just crazy. No smiley this time.

Kay
Williams, Dan J
2011-11-02 19:31:42 UTC
Permalink
Post by Kay Sievers
Post by Williams, Dan J
Post by Kay Sievers
People who like to put their rootfs on a userspace managed raid device
just get what they asked for. :)
Proper care and feeding of mdmon and userspace managed block devices /
filesystems is a solvable problem.  To me the ":)" runs the risk of
implying we don't think we can get this right.
It implied that I think it is totally insane what you guys try to
accomplish. Managing the rootfs blockdev with tools contained in the
rootfs itself is just crazy. No smiley this time.
Yes, much clearer. Which is why the "never let mdmon run from an fs
it is managing" approach is better than the current dance that was implemented
to address the need to drop initramfs memory and get around the lack of
a filesystem (like /run) that persisted from early boot. But
we now run back into the problem of pinning initramfs memory. Does
systemd already expect that the full initramfs sticks around to handle
shutdown? If so then we have come full circle and don't really need
the "mdmon --takeover" functionality versus just letting the
initramfs-mdmon handle the entire lifetime of the rootfs blockdev.

--
Dan
Kay Sievers
2011-11-02 19:51:59 UTC
Permalink
Post by Kay Sievers
Post by Williams, Dan J
Post by Kay Sievers
People who like to put their rootfs on a userspace managed raid device
just get what they asked for. :)
Proper care and feeding of mdmon and userspace managed block devices /
filesystems is a solvable problem.  To me the ":)" runs the risk of
implying we don't think we can get this right.
It implied that I think it is totally insane what you guys try to
accomplish. Managing the rootfs blockdev with tools contained in the
rootfs itself is just crazy. No smiley this time.
Yes, much clearer.  Which is why the "never let mdmon run from an fs
it is managing" is better than the current dance that was implemented
to address the need to drop initramfs memory and get around a lack of
having a filesystem (like /run) that persisted from early boot.  But
we now run back into the problem of pinning initramfs memory.  Does
systemd already expect that the full initramfs sticks around to handle
shutdown?  If so then we have come full circle and don't really need
the "mdmon --takeover" functionality versus just letting the
initramfs-mdmon handle their entire lifetime of the rootfs blockdev.
It all depends on the initramfs implementation. Systemd is not
involved here and has no knowledge about what was left behind; it just
checks if there is a binary in /run provided by the initramfs, and then it
calls this binary instead of just bringing down the box itself.

So far only dracut implements this shutdown logic, which is just a
return to the initramfs to disassemble/shut down everything that was
assembled before the initramfs started the real init.

I wouldn't be surprised if we see more of these use cases from
subsystems which put their rootfs on something that needs to be
managed from userspace.

The pinned memory for the tools in initramfs that stay around in tmpfs
is probably the price to pay for these kinds of setups of the rootfs,
when subsystems want to avoid adding the needed logic to the kernel to
safely shut down the rootfs.

Kay
NeilBrown
2011-11-07 02:52:41 UTC
Permalink
Post by Lennart Poettering
Post by NeilBrown
Each instance of mdmon manages a set of arrays and must remain running
until all of those arrays are readonly (or shut down). This allows it to
record that all writes have completed and mark the array as 'clean' so a
resync isn't needed at next boot.
Why doesn't the kernel do that on its own?
Because the kernel doesn't know about the format of the metadata that
describes the array.
Post by Lennart Poettering
Post by NeilBrown
You couldn't just do the equivalent of
fuser -k /some/filesystem
umount /some/filesystem
iterating over filesystems with '/' last?
Then anything that only uses the /run filesystem will survive.
kill_all_processes();
do {
umount_all_file_systems_we_can();
read_only_mount_all_remaining_file_systems();
} while (we_had_some_success_with_that());
jump_into_initrd();
As long as mdmon references a file from the root disk we cannot umount
it, so the loop wouldn't be effective.
What exactly is "kill_all_processes()"? is it SIGTERM or SIGKILL or both
with a gap or ???

I assume a SIGKILL. I don't mind a SIGTERM and it could be useful to
expedite mdmon cleaning up.

However there is an important piece missing. When you remount,ro a
filesystem, the block device doesn't get told, so it thinks it is still open
read/write. So md cannot tell mdmon that the array is now read-only.
It would make a lot of sense for mdmon to exit after receiving a SIGTERM as
soon as the device is marked read-only. But it just doesn't know.

We can probably fix that, but that doesn't really help for now.

I think I would like:

- add to the above loop "stop any virtual devices that we can".
Exactly how to do that if /proc and /sys are already unmounted
is unclear. Is one or both of these kept around somewhere?

- allow processes to be marked some way so they get SIGTERM but not
SIGKILL. I'm happy adding magic char to argv[0].

We should be able to make it work with those changes - if they are possible.

Thanks,

NeilBrown
Kay Sievers
2011-11-07 03:42:54 UTC
Permalink
However there is an important piece missing.  When you remount,ro a
filesystem, the block device doesn't get told so it thinks it is still open
read/write.  So md cannot tell mdmon that the array is now read-only
That ro/rw flag is visible in /proc/self/mountinfo, shouldn't it be
possible for mdmon to poll() that file and let the kernel wake stuff
up when the ro/rw flag changes, like we do for the usual mount changes
already?

Kay
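[Editorial note: a minimal Python sketch of the polling approach Kay describes, assuming Linux's behavior of reporting mount-table changes (including ro/rw remounts) as POLLERR|POLLPRI on /proc/self/mountinfo. Whether the md-relevant ro/rw state is recoverable from that file at all is exactly what Neil questions in the next message.]

```python
import select

def wait_for_mount_change(timeout_ms):
    """Poll /proc/self/mountinfo; return its contents after a mount-table
    change, or None if the timeout expires first."""
    with open("/proc/self/mountinfo", "rb") as f:
        p = select.poll()
        # Mount-table changes are signalled as an exceptional condition
        # on the mountinfo fd, not as ordinary readability.
        p.register(f.fileno(), select.POLLERR | select.POLLPRI)
        if p.poll(timeout_ms):
            f.seek(0)
            return f.read()   # caller re-parses the ro/rw field per mount
        return None
```

[A daemon like mdmon would loop on this with an infinite timeout and re-check the flags of the filesystems it cares about after each wakeup.]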
NeilBrown
2011-11-07 04:30:37 UTC
Permalink
Post by Kay Sievers
However there is an important piece missing.  When you remount,ro a
filesystem, the block device doesn't get told so it thinks it is still open
read/write.  So md cannot tell mdmon that the array is now read-only
That ro/rw flag is visible in /proc/self/mountinfo, shouldn't it be
possible for mdmon to poll() that file and let the kernel wake stuff
up when the ro/rw flag changes, like we do for the usual mount changes
already?
Kay
The ro/rw flag for file systems is in /proc/self/mountinfo.

However I want the ro/rw flag for the block device.
A block device can be partitioned so it might have multiple filesystems on it.
and it might have swap too.
or a dm table or another md device or an open file descriptor or ....

Yes, I could maybe parse various different files and try to work out what is
going on. But the kernel can easily *know* what is going on.

Making this work "perfectly" would require md dropping its write-access to
member devices when the last write-access to the top level device goes. And
the same for dm and loop and .....

But just filesystems would go a long way to catching the common cases
correctly.

Thanks,
NeilBrown
Lennart Poettering
2011-11-07 12:00:49 UTC
Permalink
Williams, Dan J
2011-11-07 19:09:19 UTC
Permalink
On Mon, Nov 7, 2011 at 4:00 AM, Lennart Poettering
Post by Lennart Poettering
Post by NeilBrown
Post by Lennart Poettering
Why doesn't the kernel do that on its own?
Because the kernel doesn't know about the format of the metadata that
describes the array.
Yupp, my suggestion would be to change that.
It's quite a bit of idiosyncratic code that needs to be duplicated in
kernel space and userspace (since userspace always needs to know how
to parse the metadata for array assembly). All to record a dirty bit
that flips at most every 5 seconds, or a disk failure event which is
even less frequent. Throw in policy constraints like restricting
which block devices can become part of the raid set. Rinse and repeat
for every possible metadata format.

[..]
Post by Lennart Poettering
Post by NeilBrown
What exactly is "kill_all_processes()"?   is it SIGTERM or SIGKILL or both
with a gap or ???
SIGTERM followed by SIGKILL after 5s if the programs do not react to
that in time. But note that this logic only applies to processes which
for some reason managed to escape systemd's usual cgroup-based killing
logic. Normal services are hence already killed at that time, and only
processes which moved themselves out of any cgroup or for which the
service files disabled killing might survive to this point.
So I think mdmon should always try to escape cgroup-based
killing. It follows the lifespan of the array, and if the array is
not stopped by the cgroup exit (or the array lifespan is not
controlled in a service file), then mdmon must keep running.

[..]
Post by Lennart Poettering
<snip>
terminate_all_mount_and_service_units();
kill_all_remaining_processes();
do {
    umount_all_remaining_file_systems_we_can();
    read_only_mount_all_remaining_file_systems();
    detach_all_remaining_loop_devices();
    detach_all_remaining_swap_devices();
    detach_all_remaining_dm_devices();
So I've started putting together an md_detach_all() routine that will
attempt to stop all md devices (via sysfs), for cases where mdmon instances
have missed the initial killall with the argv '@' flagging.

Like the dm implementation it will address all but the root md device.
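[Editorial note: a sketch of what such a routine could look like, assuming md's sysfs convention that writing "inactive" to /sys/block/mdX/md/array_state stops an array that is no longer in use, while a busy device (such as the root blockdev) refuses the write. The sysfs root is parameterized purely so the loop can be exercised outside a real /sys; the real md_detach_all() is C inside systemd's shutdown path, not this Python.]

```python
import glob, os, tempfile

def md_detach_all(sysfs="/sys"):
    """Try to stop every md array by writing 'inactive' to its
    array_state attribute; busy devices (e.g. the root blockdev)
    refuse the write and are skipped."""
    stopped = []
    pattern = os.path.join(sysfs, "block/md*/md/array_state")
    for state in sorted(glob.glob(pattern)):
        try:
            with open(state, "w") as f:
                f.write("inactive")
            stopped.append(state.split(os.sep)[-3])   # e.g. "md0"
        except OSError:
            pass                                      # still in use
    return stopped

# Exercise the loop against a throwaway fake sysfs tree.
fake = tempfile.mkdtemp()
os.makedirs(os.path.join(fake, "block/md0/md"))
with open(os.path.join(fake, "block/md0/md/array_state"), "w") as f:
    f.write("active")
stopped = md_detach_all(fake)
```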
Post by Lennart Poettering
} while (we_had_some_success_with_that());
jump_into_initrd();
The final act of the initramfs is then "mdadm --wait-clean --scan" to
communicate with the rootfs-blockdev-mdmon to be sure the metadata has
been marked clean. All other mdmon instances should have exited
naturally when their md devices stopped, but the "--wait-clean --scan"
will have ensured shutdown can progress safely.
Post by Lennart Poettering
You have relatively flexible control of the first step in this code. The
second step is then the hammer that tries to fix up what this step
didn't accomplish. My suggestion to check argv[0][0] was to avoid the
hammer.
I notice that if the rootfs is on a dm or md device, systemd/shutdown
will always fall through to ultimate_send_signal(), which will not
discriminate against processes flagged with '@'. Since we aren't
stopping the root md device I wonder whether ultimate_send_signal() should
also ignore flagged processes, or whether the failure to stop the root
device is to be expected and shutdown should skip ultimate_send_signal()
if the only remaining work is shutting down the rootfs-blockdev. I'm
leaning towards the latter.

--
Dan
Lennart Poettering
2011-11-08 14:43:37 UTC
Permalink
Post by Williams, Dan J
Post by Lennart Poettering
Post by NeilBrown
What exactly is "kill_all_processes()"?   is it SIGTERM or SIGKILL or both
with a gap or ???
SIGTERM followed by SIGKILL after 5s if the programs do not react to
that in time. But note that this logic only applies to processes which
for some reason managed to escape systemd's usual cgroup-based killing
logic. Normal services are hence already killed at that time, and only
processes which moved themselves out of any cgroup or for which the
service files disabled killing might survive to this point.
So I think mdmon should always try to escape itself from cgroup based
killing. It follows the lifespan of the array, and if the array is
not stopped by the cgroup exit (or the array lifespan is not
controlled in a service file), then mdmon must keep running.
Well, I think when it gets killed by the cgroup-based killer then it
should try to tear down its MD device.

In the mdmon service file use SendSIGKILL=no to disable sending of
SIGKILL after the initial SIGTERM. With KillSignal= you can choose the
signal you first want to be killed with, if you don't want it to be
SIGTERM. With KillMode= you can choose whether only the main process of
the service, all processes of the service, or no processes of the
service shall be killed. With TimeoutSec= you can set the timeout
between the SIGTERM and the SIGKILL. See systemd.service(5) for more
information.
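[Editorial note: put together, a hypothetical unit file using exactly the directives Lennart names might look as follows. The unit name, Type=, and ExecStart path are illustrative assumptions, not a service that shipped with mdadm or systemd.]

```ini
# mdmon@.service -- hypothetical sketch; see systemd.service(5)
[Service]
Type=forking
ExecStart=/sbin/mdmon %i
KillSignal=SIGTERM
SendSIGKILL=no
KillMode=process
TimeoutSec=10
```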
Post by Williams, Dan J
Post by Lennart Poettering
You have relatively flexible control of the first step in this code. The
second step is then the hammer that tries to fix up what this step
didn't accomplish. My suggestion to check argv[0][0] was to avoid the
hammer.
I notice that if the rootfs is on a dm or md device systemd/shutdown
will always fall through to ultimate_send_signal() which will not
stopping the root md device I wonder if ultimate_send_signal() should
also ignore flagged processes, or whether the failure to stop the root
device is to be expected and let shutdown skip ultimate_send_signal()
if the only remaining work is shutting down the rootfs-blockdev. I'm
leaning towards the latter.
The idea was to skip processes flagged with '@' in both the
ultimate_send_signal() and send_signal() calls.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Williams, Dan J
2011-11-08 23:27:00 UTC
Permalink
On Tue, Nov 8, 2011 at 6:43 AM, Lennart Poettering
Post by Lennart Poettering
Post by Williams, Dan J
So I think mdmon should always try to escape itself from cgroup based
killing.  It follows the lifespan of the array, and if the array is
not stopped by the cgroup exit (or the array lifespan is not
controlled in a service file), then mdmon must keep running.
Well, I think when it gets killed by the cgroup-based killer then it
should try to tear down its MD device.
We can easily fall off the complexity cliff trying to tear down the MD
device because it can be pinned by a mounted filesystem or claimed
anywhere in an arbitrary stack of DM or MD devices. I did not
think cgroup exit would umount() filesystems?

[..]
Post by Lennart Poettering
Post by Williams, Dan J
I notice that if the rootfs is on a dm or md device systemd/shutdown
will always fall through to ultimate_send_signal() which will not
stopping the root md device I wonder if ultimate_send_signal() should
also ignore flagged processes, or whether the failure to stop the root
device is to be expected and let shutdown skip ultimate_send_signal()
if the only remaining work is shutting down the rootfs-blockdev.  I'm
leaning towards the latter.
ultimate_send_signal() and send_signal() calls.
Ok, that makes it easier.

--
Dan

Michal Soltys
2011-11-08 00:11:53 UTC
Permalink
Post by Lennart Poettering
kill_all_processes();
do {
umount_all_file_systems_we_can();
read_only_mount_all_remaining_file_systems();
} while (we_had_some_success_with_that());
jump_into_initrd();
As long as mdmon references a file from the root disk we cannot umount
it, so the loop wouldn't be effective.
I've peeked into systemd, and from what I can see, it /only/ jumps back
to the initramfs (prepare_new_root() and pivot_to_new_root()) if the shutdown
"binary" is present in the initramfs. And whether mdmon is still running or
not is not in any way a determinant for the pivot_root(2) call to succeed (or
... ?).

If /run/initramfs/shutdown is not present, then systemd just does
things the old way as far as I can see - it doesn't even attempt to
pivot. And if it doesn't, then it can't umount the root (being itself
tied to it)?

So essentially, if systemd execs /shutdown (after pivoting to
/run/initramfs) - then it's dracut's modules.d/99shutdown, which itself
sources hooks from other modules to do the rest of the cleanup job. And
that should take care of all the remaining stuff (including terminating
mdmon in a graceful way, and then umounting /oldroot). Either way - pretty
simple to add the necessary functionality to dracut.

So wouldn't a systemd cgroup named, say, "immortals" - with mdmon
in it by default - simply suffice? Pivot back as usual, leave mdmon alive, let
dracut (or anything else used for the initramfs) do the rest of the job
(properly).


p.s.
Sorry if I missed something obvious, it was a quick and late peek over
systemd's shutdown.c.
Michal Soltys
2011-11-08 16:46:18 UTC
Permalink
Post by Michal Soltys
I've peeked into systemd, and from what I can see, it /only/ jumps back
to the initramfs (prepare_new_root() and pivot_to_new_root()) if the shutdown
"binary" is present in the initramfs. And whether mdmon is still running or
not is not in any way a determinant for the pivot_root(2) call to succeed (or
... ?).
If /run/initramfs/shutdown is not present, then systemd just does
things the old way as far as I can see - it doesn't even attempt to
pivot. And if it doesn't, then it can't umount the root (being itself
tied to it)?
So essentially, if systemd execs /shutdown (after pivoting to
/run/initramfs) - then it's dracut's modules.d/99shutdown, which itself
sources hooks from other modules to do the rest of cleaning job. And
that should take care of all the remaining stuff (including terminating
mdmon in graceful way, and then umounting /oldroot). Either way - pretty
simple to add the necessary functionality to dracut.
So wouldn't simply a systemd's cgroup named say - immortals - with mdmon
(by default) in it suffice ? Pivot back as usual, leave mdmon alive, let
the dracut (or anything else used for initramfs) do the rest of the job
(properly).
I did some testing today, and it's all working nicely as expected.
Actually I modified "classic" rc scripts I'm using under sysinit to
perform full umount/detach (using similar methods to systemd), with
mdmon happily living through everything. The only things needed after
pivot_root were:

mdmon --takeover --all
telinit U

(so obviously my dracut image had mdmon, telinit and init, and a slightly
adjusted shutdown script).

Then everything from oldroot could be nicely and cleanly umounted.

Even more elegant would be if e.g. mdmon gained an option such as:

--reroot <newroot>

to chroot() and reopen its files under <newroot>, and then systemd would
call

mdmon --reroot /run/initramfs --all --takeover

after - prepare_new_root() and before - pivot_to_new_root()

Then even an existing initramfs image could (probably) be mdmon-agnostic.
Michal Soltys
2011-11-08 20:32:07 UTC
Permalink
Post by Michal Soltys
Then even an existing initramfs image could (probably) be mdmon-agnostic.
Actually:

chroot /run/initramfs mdmon --takeover --all

worked just fine (after preparing the new root - so after all the mount --binds,
and before pivot_root(8)).

So in the context of systemd instead of sysv scripts - a fork / chroot /
exec mdmon / wait, instead of killing it, would do the trick, followed
by pivot_to_new_root().
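[Editorial note: sketched in Python for illustration (the real actors here are C and shell); the fork/chroot/exec shape follows Michal's recipe, with mdmon's /sbin path and the '@' argv[0] flagging carried over as assumptions from earlier in the thread.]

```python
import os

MDMON = "/sbin/mdmon"   # path inside the initramfs chroot; an assumption

def takeover_argv(flag_no_kill=True):
    """argv for the re-exec'd mdmon; an '@' first byte opts it out of
    the shutdown-time killing spree, per the convention discussed."""
    argv0 = "@mdmon" if flag_no_kill else "mdmon"
    return [argv0, "--takeover", "--all"]

def takeover_from_initramfs(newroot="/run/initramfs"):
    """fork/chroot/exec per Michal's recipe; needs root and a populated
    initramfs tree, so this function is not exercised here."""
    pid = os.fork()
    if pid == 0:
        os.chroot(newroot)
        os.chdir("/")
        os.execv(MDMON, takeover_argv())
    os.waitpid(pid, 0)   # mdmon daemonizes; wait only for the foreground child
```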

Actually anything that could benefit from "immortality" in one way or
the other (perhaps udevd, so e.g. lvm doesn't need --noudevsync ? - when
taken over inside dracut's shutdown or anything similar after going back
to the initramfs) that can be pre-chrooted into /run/initramfs and exec'ed,
should work just fine. For the record, udevd worked properly with pivot
survival.
Williams, Dan J
2011-11-08 22:29:54 UTC
Permalink
Post by Michal Soltys
Post by Michal Soltys
Then even an existing initramfs image could (probably) be mdmon-agnostic.
chroot /run/initramfs mdmon --takeover --all
One of the suggestions earlier in the thread is to not mess with takeover
at all. The rootfs md device is always monitored by an mdmon instance
launched from /run/initramfs. The only way to update mdmon is to
recreate the initramfs and reboot (which is similar to the experience
of trying to update raidfoo.ko for the rootfs blockdev).
Lennart Poettering
2011-02-09 14:01:39 UTC
Permalink
Post by Lennart Poettering
Post by Andrey Borzenkov
At this point we know it is container, know that it has external
metadata and know that we need external metadata handler (mdmon). But
it is too late for systemd.
Kay, do you know why this "change" event is used here? Any chance we can
get rid of it?
So, it seems that the "change" event does make some sense here. I have
now added a new property to systemd: if you set SYSTEMD_READY=0 on a
udev device then systemd will consider it unplugged even if it shows up
in the udev tree. If this property is not set for a device, or is set to
1, we will consider the device plugged.

To make this md stuff compatible with systemd we hence just need to set
SYSTEMD_READY=0 during the "new" event and drop it when the device is
fully set up.

Andrey, since you are playing around with this, do you happen to know
which attribute we should check to set SYSTEMD_READY=0 properly? It
would be cool if we could come up with a default rule for inclusion in
our systemd rules file that will ensure the device only shows up when it
is ready.

Lennart
--
Lennart Poettering - Red Hat, Inc.
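[Editorial note: one plausible shape for such a default rule. The md/array_state attribute check below is an assumption about the right readiness indicator - which is exactly the question Lennart poses - not a rule confirmed anywhere in this thread.]

```
# Hypothetical udev rule sketch: hide md devices from systemd until
# the array is actually assembled.
SUBSYSTEM=="block", KERNEL=="md*", ATTR{md/array_state}=="|clear|inactive", ENV{SYSTEMD_READY}="0"
```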
Lennart Poettering
2011-01-07 00:40:24 UTC
Permalink
Post by Andrey Borzenkov
Post by Christian Parpart
Post by Andrey Borzenkov
It is then killed by systemd during shutdown as part of user session.
It results in dirty array on next boot.
Is there any magic that allows daemon to be exempted from killing?
While your raid should absolutely not be corrupted on next reboot
when mdmon receives a SIGTERM,
This won't be corrupted but it will initiate a rebuild. I have reports
that such a rebuild may take hours, costing performance and causing loss of
redundancy.
Well, eventually we need to be able to kill mdmon. Otherwise we might
not be able to remount the root dir r/o. How exactly is mdmon supposed
to behave on shutdown?

Lennart
--
Lennart Poettering - Red Hat, Inc.
Lennart Poettering
2011-01-07 00:38:27 UTC
Permalink
Post by Andrey Borzenkov
If user starts array manually (mdadm -A -s as example) from within
user session and array needs mdmon, mdmon becomes part of user session
Are you suggesting that mdadm forks off mdmon from within the user
session? This is horribly ugly and broken and they shouldn't do that.
Post by Andrey Borzenkov
├ user
│ └ root
│ └ 1
│ ├ 1916 login -- root
│ ├ 1930 -bash
│ ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
│ └ 2062 mdmon md127
It is then killed by systemd during shutdown as part of user session.
Well, only if you enable killing of the complete user session on
logout, which we currently don't do by default.

I wonder if it would make sense to add an option which kills user
sessions on log out only for uid != 0. This might help here, but only
half-way, since sudo would still break. But anyway, I'll add this to the
todo list.
Post by Andrey Borzenkov
It results in dirty array on next boot.
Hmm, that shouldn't happen.
Post by Andrey Borzenkov
Is there any magic that allows daemon to be exempted from killing?
Well, I have been discussing this with Kay and we'll most likely add
something like DontKillOnShutdown=yes or so, which if added to a unit
file will exempt it from killing during the normal service shutdown
phase, and the first killing spree (but not the second, post-umount
killing spree). But that of course would require mdmon to be started
like any other daemon, and not forked off mdadm.
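For illustration, a unit file using such an option might look like the
sketch below. Both the unit itself and the KillOnShutdown=/
DontKillOnShutdown= setting are hypothetical, taken from this
discussion, and are not existing systemd API:

```
# Hypothetical mdmon unit; the shutdown exemption discussed above is
# sketched here as KillOnShutdown=no and does not exist yet.
[Unit]
Description=MD external metadata monitor for md127

[Service]
ExecStart=/sbin/mdmon md127
KillOnShutdown=no
```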

That should mostly fix the problem, but then again I do believe that the
whole idea of mdmon is just borked, since it will necessarily pin pages
from the root fs into memory, which will create all kinds of problems,
for example after upgrades (i.e. mdmon maps libc into memory, libc gets
updated, the old libc is deleted, but it cannot actually be freed on
disk as long as mdmon stays running and pins it, which will block the
ultimate unmounting/remounting of the fs).

Lennart
--
Lennart Poettering - Red Hat, Inc.
Michael Biebl
2011-01-07 01:09:32 UTC
Permalink
Post by Lennart Poettering
Well, I have been discussing this with Kay and we'll most likely add
something like DontKillOnShutdown=yes or so, which if added to a unit
Make that KillOnShutdown=no, please.
--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?
Roman Mamedov
2011-01-07 01:17:46 UTC
Permalink
On Fri, 7 Jan 2011 02:09:32 +0100
Post by Michael Biebl
Post by Lennart Poettering
Well, I have been discussing this with Kay and we'll most likely add
something like DontKillOnShutdown=yes or so, which if added to a unit
Make that KillOnShutdown=no, please.
Agreed :) That reminds me of "hal-disable-polling --enable-polling"
( http://ur1.ca/2rmis )
--
With respect,
Roman
NeilBrown
2011-01-07 01:16:28 UTC
Permalink
Post by Lennart Poettering
Post by Andrey Borzenkov
If user starts array manually (mdadm -A -s as example) from within
user session and array needs mdmon, mdmon becomes part of user session
Are you suggesting that mdadm forks off mdmon from within the user
session? This is horribly ugly and broken and they shouldn't do that.
What alternative would you suggest?

A daemon needs to be running while certain md arrays are running and writable.

NeilBrown
Lennart Poettering
2011-01-07 01:42:01 UTC
Permalink
Post by NeilBrown
Post by Lennart Poettering
Post by Andrey Borzenkov
If user starts array manually (mdadm -A -s as example) from within
user session and array needs mdmon, mdmon becomes part of user session
Are you suggesting that mdadm forks off mdmon from within the user
session? This is horribly ugly and broken and they shouldn't do that.
What alternative would you suggest?
Start it as a normal service like any other. But if you fork off the
daemon from the user session then the daemon will run in a very broken
context: the resource limits of the user apply, the audit trail will
point to the user (i.e. /proc/self/loginuid), the cgroup will be of the
user, and the daemon cannot be supervised like every other daemon. The
daemon will also inherit all the other process properties of the user,
which is almost certainly wrong: the environment block, the signal
mask, and gazillions of other small properties. Of course, a big bunch
of them you can reset in your code, but that's a race you cannot win:
the kernel adds new process properties all the time, and you'd have to
reset each of them manually.

It is really essential that daemons are started from a clean process
environment and are detached from the user session. SysV kind of
provides that for everything started at boot, and in a limited way for
stuff started via /sbin/service. systemd provides that too, and much
more correctly. But just forking things off like that is not a good
solution.

A conceivable, relatively simple solution in a systemd world is to pull
in the mdmon service from the udev device. The udev rule would do all
the necessary matching to figure out whether mdmon is needed or not. If
you care about non-systemd environments, something like this of course
becomes a lot more complex.
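Pulling in the mdmon service from the udev device could be sketched
with udev's SYSTEMD_WANTS property. Everything specific here is an
assumption: the mdmon@.service template name and the metadata match are
illustrative only, not an agreed-upon rule:

```
# Sketch: let udev pull in a (hypothetical) mdmon@.service instance for
# md containers using external metadata. The match condition is
# illustrative; %k expands to the kernel device name (e.g. md127).
SUBSYSTEM=="block", KERNEL=="md*", ATTR{md/metadata_version}=="external:*", \
    ENV{SYSTEMD_WANTS}="mdmon@%k.service"
```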
Post by NeilBrown
A daemon needs to be running while certain md arrays are running and writable.
Well, but auto-spawning it from the user session is not really a usable solution.

Lennart
--
Lennart Poettering - Red Hat, Inc.