Discussion:
[systemd-devel] Allow stop jobs to be killed during shutdown
Tom Horsley
2014-01-24 14:53:28 UTC
There doesn't appear to be any way to convince systemd
to abandon utterly unimportant "stop jobs" during
shutdown and advance to actually important things
like cleanly syncing and un-mounting local hard
disks.

For example, there are bugs like this:

https://bugzilla.redhat.com/show_bug.cgi?id=1023820

Who really cares if we wait on a user daemon to
stop when we are shutting down the whole system?
And this is a bug anyway, there isn't actually
anything to wait on. Bugs like this will reappear
(this is inevitable).

More relevant perhaps is waiting on an NFS filesystem
to unmount when I happen to know the frigging
remote system has gone down and won't be talking to
me, yet systemd insists on waiting practically
forever for the umount.

Couldn't systemd listen for Ctrl-Alt-Delete on the
console keyboard and stop waiting on whatever
stop job(s) it is hung up on at the moment?

There really are times the poor fool sitting in
front of the system has better information than
systemd. (And might like to get his system shutdown
cleanly so he can get home on time :-).
Lennart Poettering
2014-01-24 15:03:49 UTC
Post by Tom Horsley
There doesn't appear to be any way to convince systemd
to abandon utterly unimportant "stop jobs" during
shutdown and advance to actually important things
like cleanly syncing and un-mounting local hard
disks.
https://bugzilla.redhat.com/show_bug.cgi?id=1023820
Who really cares if we wait on a user daemon to
stop when we are shutting down the whole system?
And this is a bug anyway, there isn't actually
anything to wait on. Bugs like this will reappear
(this is inevitable).
More relevant perhaps is waiting on an NFS filesystem
to unmount when I happen to know the frigging
remote system has gone down and won't be talking to
me, yet systemd insists on waiting practically
forever for the umount.
Couldn't systemd listen for Ctrl-Alt-Delete on the
console keyboard and stop waiting on whatever
stop job(s) it is hung up on at the moment?
There really are times the poor fool sitting in
front of the system has better information than
systemd. (And might like to get his system shutdown
cleanly so he can get home on time :-).
It is our job to shut down all services cleanly. A number of services
need this, since they have to bring their files into a safe state
before quitting and mark them as "offline". We cannot just drop that.

Note, however, that we have timeouts on all service shutdown
commands, so when some service hangs it will be forcibly aborted with
SIGKILL after 90s.
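
The stop-then-kill sequence described here can be sketched roughly as follows. This is Python written for illustration, not systemd code: the function name `stop_with_timeout` and its return strings are hypothetical, and the 90-second default only mirrors the figure quoted above.

```python
import signal
import subprocess


def stop_with_timeout(proc: subprocess.Popen, timeout: float = 90.0) -> str:
    """Ask a child process to stop, and kill it if it does not comply in time.

    Returns "terminated" if the process exited within the timeout after
    SIGTERM, or "killed" if SIGKILL had to be sent.
    """
    # Polite request first: the process may catch SIGTERM and clean up.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        # The process hung or ignored SIGTERM; SIGKILL cannot be caught.
        proc.kill()
        proc.wait()
        return "killed"
```

A service that handles SIGTERM promptly never sees the second signal; only one that hangs past the timeout is aborted.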

That said, you can just shut down with "systemctl poweroff -f"
instead of the normal "systemctl poweroff". This will still bring the
file systems in order, but will not bother with shutting down
system services cleanly; it simply sends them SIGTERM and then SIGKILL
after a much shorter timeout.

However, something like that can never be the default; we need to give
services the chance to shut down cleanly and in the right order.

Lennart
--
Lennart Poettering, Red Hat
Reindl Harald
2014-01-24 15:09:59 UTC
Post by Lennart Poettering
It is our job to shut down all services cleanly. A number of services
need this, since they have to bring their files into a safe state
before quitting and mark them as "offline". We cannot just drop that.
Note, however, that we have timeouts on all service shutdown
commands, so when some service hangs it will be forcibly aborted with
SIGKILL after 90s.
That said, you can just shut down with "systemctl poweroff -f"
instead of the normal "systemctl poweroff". This will still bring the
file systems in order, but will not bother with shutting down
system services cleanly; it simply sends them SIGTERM and then SIGKILL
after a much shorter timeout.
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
Then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
and https://bugzilla.redhat.com/show_bug.cgi?id=1023788 should be fixed
much faster, or never make it into a stable release.
Lennart Poettering
2014-01-24 15:15:04 UTC
Post by Reindl Harald
Post by Lennart Poettering
It is our job to shut down all services cleanly. A number of services
need this, since they have to bring their files into a safe state
before quitting and mark them as "offline". We cannot just drop that.
Note, however, that we have timeouts on all service shutdown
commands, so when some service hangs it will be forcibly aborted with
SIGKILL after 90s.
That said, you can just shut down with "systemctl poweroff -f"
instead of the normal "systemctl poweroff". This will still bring the
file systems in order, but will not bother with shutting down
system services cleanly; it simply sends them SIGTERM and then SIGKILL
after a much shorter timeout.
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...

Lennart
--
Lennart Poettering, Red Hat
Ivan Shapovalov
2014-01-24 17:10:22 UTC
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
It is our job to shut down all services cleanly. A number of services
need this, since they have to bring their files into a safe state
before quitting and mark them as "offline". We cannot just drop that.
Note, however, that we have timeouts on all service shutdown
commands, so when some service hangs it will be forcibly aborted with
SIGKILL after 90s.
That said, you can just shut down with "systemctl poweroff -f"
instead of the normal "systemctl poweroff". This will still bring the
file systems in order, but will not bother with shutting down
system services cleanly; it simply sends them SIGTERM and then SIGKILL
after a much shorter timeout.
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advice on how to do that?
I have both the issue (reproducible on every shutdown) and the will to debug it.
--
Ivan Shapovalov / intelfx /
Lennart Poettering
2014-01-24 17:46:06 UTC
Post by Ivan Shapovalov
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and from there try to figure out why
things are stuck, i.e. whether it is systemd --user that really never
exits, or whether it actually exits but PID 1 doesn't notice. Once
you have figured out which of the two cases it is, you'd have to figure
out why that is...

Lennart
--
Lennart Poettering, Red Hat
Andrey Borzenkov
2014-01-24 18:32:40 UTC
On Fri, 24 Jan 2014 18:46:06 +0100
Post by Lennart Poettering
Post by Ivan Shapovalov
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
The user systemd instance does not attempt to exit because it never sees
the signal (RTMIN+24). I'm ready to debug it further but simply do not know
where to look; epoll handling in systemd is black magic to me.
Post by Lennart Poettering
then if you figured out which of the two cases, you'd have to figure out
why that is...
Lennart
Ivan Shapovalov
2014-01-25 16:20:54 UTC
Post by Lennart Poettering
Post by Ivan Shapovalov
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
then if you figured out which of the two cases, you'd have to figure out
why that is...
Lennart
So, in my case it apparently was due to PEBKAC. The user instance, which I
(ab)use for running some desktop services, receives "start exit.target", then
stops all running services and at some point something enqueues "isolate
default.target", which happily outweighs exit.target. As a result, service
shutdown never finishes and the user instance never exits.

Now I've fixed this (in my own scripts) and the user instance seems to exit
normally. The distro is Arch, for the record.
--
Ivan Shapovalov / intelfx /
Andrey Borzenkov
2014-01-26 08:09:23 UTC
On Fri, 24 Jan 2014 18:46:06 +0100
Post by Lennart Poettering
Post by Ivan Shapovalov
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
then if you figured out which of the two cases, you'd have to figure out
why that is...
I finally managed to reproduce it with the user instance running at debug
log level (until now, *any* attempt to add debugging, strace, or whatever
made the problem disappear).

It seems that /bin/kill -RTMIN+24 is itself being killed. I wonder: is
it possible that it is hit by the same SIGTERM that PID 1 uses to stop
***@0.service?

Jan 26 11:53:58 linux-1a7f systemd[1942]: Received SIGTERM from PID 1 (systemd).
Jan 26 11:53:58 linux-1a7f systemd[1942]: Activating special unit exit.target
Jan 26 11:53:58 linux-1a7f systemd[1942]: Trying to enqueue job exit.target/start/replace
Jan 26 11:53:58 linux-1a7f systemd[1942]: Installed new job exit.target/start as 3
Jan 26 11:53:58 linux-1a7f systemd[1942]: Installed new job systemd-exit.service/start as 4
Jan 26 11:53:58 linux-1a7f systemd[1942]: Installed new job shutdown.target/start as 5
Jan 26 11:53:58 linux-1a7f systemd[1942]: Installed new job default.target/stop as 7
Jan 26 11:53:58 linux-1a7f systemd[1942]: Enqueued job exit.target/start as 3
Jan 26 11:53:58 linux-1a7f systemd[1942]: Stopping Default.
Jan 26 11:53:58 linux-1a7f systemd[1942]: default.target changed active -> dead
Jan 26 11:53:58 linux-1a7f systemd[1942]: Job default.target/stop finished, result=done
Jan 26 11:53:58 linux-1a7f systemd[1942]: Stopped target Default.
Jan 26 11:53:58 linux-1a7f systemd[1942]: Starting Shutdown.
Jan 26 11:53:58 linux-1a7f systemd[1942]: shutdown.target changed dead -> active
Jan 26 11:53:58 linux-1a7f systemd[1942]: Job shutdown.target/start finished, result=done
Jan 26 11:53:59 linux-1a7f systemd[1942]: Reached target Shutdown.
Jan 26 11:53:59 linux-1a7f systemd[1942]: Starting Exit the Session...
Jan 26 11:53:59 linux-1a7f systemd[1942]: About to execute: /usr/bin/kill -s 58 $MANAGERPID
Jan 26 11:53:59 linux-1a7f systemd[1942]: Forked /usr/bin/kill as 1951
Jan 26 11:53:59 linux-1a7f systemd[1942]: systemd-exit.service changed dead -> start
Jan 26 11:53:59 linux-1a7f systemd[1942]: Set up jobs progress timerfd.
Jan 26 11:53:59 linux-1a7f systemd[1942]: Collecting default.target
Jan 26 11:53:59 linux-1a7f systemd[1942]: Received SIGCHLD from PID 1943 ((sd-pam)).
Jan 26 11:53:59 linux-1a7f systemd[1942]: Got SIGCHLD for process 1943 ((sd-pam))
Jan 26 11:53:59 linux-1a7f systemd[1942]: Child 1943 died (code=exited, status=0/SUCCESS)
Jan 26 11:53:59 linux-1a7f systemd[1942]: Received SIGCHLD from PID 1951 ((kill)).
Jan 26 11:53:59 linux-1a7f systemd[1942]: Got SIGCHLD for process 1951 ((kill))
Jan 26 11:53:59 linux-1a7f systemd[1942]: Child 1951 died (code=killed, status=15/TERM)
Jan 26 11:53:59 linux-1a7f systemd[1942]: Child 1951 belongs to systemd-exit.service
Jan 26 11:53:59 linux-1a7f systemd[1942]: systemd-exit.service: main process exited, code=killed, status=15/TERM
Jan 26 11:53:59 linux-1a7f systemd[1942]: systemd-exit.service changed start -> dead
Jan 26 11:53:59 linux-1a7f systemd[1942]: Job systemd-exit.service/start finished, result=done
Jan 26 11:53:59 linux-1a7f systemd[1942]: Started Exit the Session.
Jan 26 11:53:59 linux-1a7f systemd[1942]: Closed jobs progress timerfd.
Jan 26 11:53:59 linux-1a7f systemd[1942]: Starting Exit the Session.
Jan 26 11:53:59 linux-1a7f systemd[1942]: exit.target changed dead -> active
Jan 26 11:53:59 linux-1a7f systemd[1942]: Job exit.target/start finished, result=done
Jan 26 11:53:59 linux-1a7f systemd[1942]: Reached target Exit the Session.
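
For reference, the `kill -s 58` in the log is the numeric form of RTMIN+24 on a system where SIGRTMIN is 34 (the usual glibc value; other C libraries may use a different base). A small Python sketch, with a hypothetical helper name, computes the number instead of hard-coding it:

```python
import signal


def rtmin_plus(offset: int) -> int:
    """Return the number of the real-time signal SIGRTMIN+offset here."""
    number = int(signal.SIGRTMIN) + offset
    if number > int(signal.SIGRTMAX):
        raise ValueError(f"SIGRTMIN+{offset} exceeds SIGRTMAX")
    return number


# RTMIN+24 is the signal the user instance is waiting for in this thread.
EXIT_SIGNAL = rtmin_plus(24)
```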
Andrey Borzenkov
2014-01-26 13:18:28 UTC
On Sun, 26 Jan 2014 12:09:23 +0400
Post by Andrey Borzenkov
On Fri, 24 Jan 2014 18:46:06 +0100
Post by Lennart Poettering
Post by Ivan Shapovalov
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
then if you figured out which of the two cases, you'd have to figure out
why that is...
I finally managed to reproduce it with user instance running with debug
level (before *any* attempt to add debugging, strace, whatever resulted
in problem disappearing).
It seems that /bin/kill -RTMIN+24 is being killed itself. I wonder - is
it possible that it is the same SIGTERM that is used by PID 1 to stop
I'm almost sure it is. cg_kill_recursive is in no way atomic, so it can
easily hit a new process that was spawned after the service stop was
initiated.
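
The race can be modelled with a toy simulation. This is not the systemd source: it assumes a kill routine that keeps re-reading the group's process list until a pass finds nothing new, and all function names are made up. The PIDs 1942 (user manager) and 1951 (/usr/bin/kill) are taken from the log.

```python
from typing import Callable, Iterable, Set


def kill_all_in_group(read_pids: Callable[[], Iterable[int]],
                      send_signal: Callable[[int], None]) -> Set[int]:
    """Signal every PID in the group, repeating until no new PID shows up."""
    signalled: Set[int] = set()
    while True:
        # The membership list is re-read on every pass, so processes forked
        # after the first pass are picked up (and signalled) as well.
        pending = [pid for pid in read_pids() if pid not in signalled]
        if not pending:
            return signalled
        for pid in pending:
            send_signal(pid)
            signalled.add(pid)


def simulate() -> Set[int]:
    """Model the user manager forking its exit helper while being stopped."""
    group = {1942}  # the user manager is the only member at first

    def read_pids() -> Iterable[int]:
        return sorted(group)

    def send_signal(pid: int) -> None:
        if pid == 1942:
            # The manager reacts to SIGTERM by forking /usr/bin/kill,
            # which lands in the same control group.
            group.add(1951)

    return kill_all_in_group(read_pids, send_signal)
```

In this model the helper forked in response to the first signal is signalled on the next pass, which matches the `status=15/TERM` reported for PID 1951.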

Unfortunately, setting KillMode=process is not allowed:

Jan 26 17:12:30 linux-1a7f systemd[1]: ***@0.service has PAM enabled. Kill mode must be set to 'control-group'. Refusing.

Probably ***@.service should be exempt from this rule. It is supposed
to handle all services it started itself; it *is* a service manager
after all.
Andrey Borzenkov
2014-01-26 15:21:16 UTC
On Sun, 26 Jan 2014 17:18:28 +0400
Post by Andrey Borzenkov
On Sun, 26 Jan 2014 12:09:23 +0400
Post by Andrey Borzenkov
On Fri, 24 Jan 2014 18:46:06 +0100
Post by Lennart Poettering
Post by Ivan Shapovalov
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order
then bugs like https://bugzilla.redhat.com/show_bug.cgi?id=1023820
I have so far never encountered this issue, but I fear this is a bug
where somebody who can reproduce this needs to sit down and debug a
bit...
Lennart
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
then if you figured out which of the two cases, you'd have to figure out
why that is...
I finally managed to reproduce it with user instance running with debug
level (before *any* attempt to add debugging, strace, whatever resulted
in problem disappearing).
It seems that /bin/kill -RTMIN+24 is being killed itself. I wonder - is
it possible that it is the same SIGTERM that is used by PID 1 to stop
I'm almost sure it is. cg_kill_recursive is in no way atomic, so it can
easily hit new process that was spawned since service stop had been
initiated.
to handle all services started by it itself, it *is* service manager
after all?
I rebuilt systemd without this restriction and set KillMode=process for
***@.service, and this fixed things here.

So there are two problems associated with the user instance.

1. Using KillMode=control-group is wrong. Each service managed by the user
instance has its own requirements for how it is stopped. Just sending
SIGTERM to everything without even trying the service's ExecStop first is
obviously incorrect.

2. ***@.service has a single timeout, but it manages a number of services
that is not known in advance, each needing an unknown timeout. While some
cap on the total timeout looks sensible, only the user can estimate
the value. But ***@.service is a system-level service which the user
cannot configure ...
Tom Gundersen
2014-01-26 16:23:54 UTC
Hi Andrey,
Post by Andrey Borzenkov
Post by Andrey Borzenkov
I'm almost sure it is. cg_kill_recursive is in no way atomic, so it can
easily hit new process that was spawned since service stop had been
initiated.
Thanks for debugging this!
Post by Andrey Borzenkov
Post by Andrey Borzenkov
to handle all services started by it itself, it *is* service manager
after all?
I don't think we want any processes to survive the exit of
***@.service, so KillMode=process feels wrong. However, isn't the
problem that we are going into the "kill control-group" mode too soon,
before ***@.service has had a chance to clean itself up
gracefully?
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Post by Andrey Borzenkov
number of services each needing unknown timeout. While having some
capping to total timeout looks sensible, only user itself may estimate
cannot configure ...
I think it really makes sense to have a system-wide timeout on these
things (possibly a high one), we don't want the user to delay things
without limit. The user already has the possibility of putting their
own limits if they want to (but they must of course be shorter than
the system-wide one).

Cheers,

Tom
Andrey Borzenkov
2014-01-26 17:16:13 UTC
On Sun, 26 Jan 2014 17:23:54 +0100
Post by Tom Gundersen
Post by Andrey Borzenkov
Post by Andrey Borzenkov
to handle all services started by it itself, it *is* service manager
after all?
I don't think we want any processes to survive the exit of
problem that we are going into the "kill control-group" mode too soon,
gracefully?
Yes.
Post by Tom Gundersen
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finishes.

Jan 26 21:00:14 linux-1a7f systemd[1]: Stopping User Manager for 0...
Jan 26 21:00:14 linux-1a7f systemd[1]: About to execute: /usr/bin/kill -15 $MAINPID
Jan 26 21:00:14 linux-1a7f systemd[1]: Forked /usr/bin/kill as 1978
Jan 26 21:00:14 linux-1a7f systemd[1]: ***@0.service changed running -> stop
Jan 26 21:00:14 linux-1a7f systemd[1978]: Executing: /usr/bin/kill -15 1886
Jan 26 21:00:14 linux-1a7f systemd[1886]: Received SIGTERM from PID 1978 (kill).
Jan 26 21:00:14 linux-1a7f systemd[1886]: Activating special unit exit.target
Jan 26 21:00:14 linux-1a7f systemd[1886]: Trying to enqueue job exit.target/start/replace
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job exit.target/start as 9
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job systemd-exit.service/start as 10
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job shutdown.target/start as 11
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job -.slice/stop as 12
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job default.target/stop as 13
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job test.service/stop as 14
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job paths.target/stop as 15
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job timers.target/stop as 16
Jan 26 21:00:14 linux-1a7f systemd[1886]: Installed new job sockets.target/stop as 17
Jan 26 21:00:14 linux-1a7f systemd[1886]: Enqueued job exit.target/start as 9
Jan 26 21:00:14 linux-1a7f systemd[1886]: Stopping Test service with stop delay...
Jan 26 21:00:14 linux-1a7f systemd[1886]: About to execute: /bin/sleep 10
Jan 26 21:00:14 linux-1a7f systemd[1886]: Forked /bin/sleep as 2001
Jan 26 21:00:14 linux-1a7f systemd[1886]: test.service changed exited -> stop
Jan 26 21:00:14 linux-1a7f systemd[2001]: Executing: /bin/sleep 10
Jan 26 21:00:14 linux-1a7f systemd[1886]: Stopping Default.
...
Jan 26 21:00:14 linux-1a7f systemd[1]: Child 1978 died (code=exited, status=0/SUCCESS)
Jan 26 21:00:14 linux-1a7f systemd[1]: Child 1978 belongs to ***@0.service
Jan 26 21:00:14 linux-1a7f systemd[1]: ***@0.service: control process exited, code=exited status=0
Jan 26 21:00:14 linux-1a7f systemd[1]: ***@0.service got final SIGCHLD for state stop
Jan 26 21:00:14 linux-1a7f systemd[1]: ***@0.service changed stop -> stop-sigterm

I believe someone already mentioned this problem. In general, we cannot
assume that ExecStop is synchronous. It may just signal the main process to
exit. systemd should wait until $MAINPID exits (or a timeout expires) before
continuing with further processing.
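
The "wait until $MAINPID exits (or timeout)" behaviour can be sketched as below. It is a Python illustration under stated assumptions, not what systemd does: the helper names are hypothetical, and existence is probed with signal 0, which also reports an unreaped zombie child as still present.

```python
import os
import signal
import time


def pid_exists(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to someone else.
        return True
    return True


def wait_for_exit(pid: int, timeout: float, interval: float = 0.05) -> bool:
    """Poll until the PID is gone; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while pid_exists(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True


def signal_and_wait(pid: int, timeout: float,
                    sig: int = signal.SIGTERM) -> bool:
    """Send a signal, then block until the process exits or the timeout hits."""
    os.kill(pid, sig)
    return wait_for_exit(pid, timeout)
```

A stop command built this way returns only once the main process is really gone, which is the synchronous behaviour the stop-sigterm transition in the log above relies on.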
Post by Tom Gundersen
Post by Andrey Borzenkov
number of services each needing unknown timeout. While having some
capping to total timeout looks sensible, only user itself may estimate
cannot configure ...
I think it really makes sense to have a system-wide timeout on these
things (possibly a high one), we don't want the user to delay things
without limit. The user already has the possibility of putting their
own limits if they want to (but they must of course be shorter than
the system-wide one).
I mostly agree, except that the current 90 seconds looks too short, and this
definitely requires better communication to the user (such as automatically
leaving quiet mode) so that the system does not appear to be hung.

There is also a practical issue: we have two levels, the PID 1 instance and
the user instance (multiple users, actually). Does it make sense to display
each individual user service as it shuts down? This would facilitate
troubleshooting. But then we have interleaved output from multiple
(independent) instances ...
Zbigniew Jędrzejewski-Szmek
2014-01-27 06:43:55 UTC
Post by Andrey Borzenkov
On Sun, 26 Jan 2014 17:23:54 +0100
Post by Tom Gundersen
Post by Andrey Borzenkov
Post by Andrey Borzenkov
to handle all services started by it itself, it *is* service manager
after all?
I don't think we want any processes to survive the exit of
problem that we are going into the "kill control-group" mode too soon,
gracefully?
Yes.
Post by Tom Gundersen
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finished.
Looks like we need a setting like SendKillSignalTo=main-pid|all|control-pid.
Or something like that.

Also, the TimeoutStopSec on ***@.service should probably be increased
to 10 min or so.
Post by Andrey Borzenkov
I believe someone already mentioned this problem. In general, we cannot
assume that ExecStop is synchronous. It may just signal main process to
exit. systemd should wait until $MAINPID exits (or timeout) before
continuing further processing.
ExecStop is required to be synchronous, i.e. the service should be stopped
when it returns. /bin/kill is not going to work here.
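To make the distinction concrete, here is a minimal C sketch of a stop step that is synchronous in the sense described above: it sends a signal and does not return until the process has actually exited, or a timeout expires. The helper names stop_and_wait() and spawn_idle_child() are made up for illustration, and the sketch only works for a direct child of the caller; it is not how systemd implements anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not systemd code): signal a direct child and block
 * until it has been reaped. Returns 0 if the child exited within
 * timeout_ms, -1 on error or timeout. */
int stop_and_wait(pid_t child, int sig, int timeout_ms) {
    assert(timeout_ms >= 0);

    if (kill(child, sig) < 0)
        return -1;

    /* Poll in 10 ms steps; a plain "/bin/kill" would return right here
     * instead, long before the target is gone. */
    for (int waited = 0; waited <= timeout_ms; waited += 10) {
        int status;
        pid_t r = waitpid(child, &status, WNOHANG);
        if (r == child)
            return 0;                 /* child exited and was reaped */
        if (r < 0 && errno != EINTR)
            return -1;                /* not our child, or other error */

        struct timespec ts = { 0, 10 * 1000 * 1000 };
        nanosleep(&ts, NULL);
    }
    return -1;                        /* still running after the timeout */
}

/* Hypothetical demo child: sleeps until a signal terminates it. */
pid_t spawn_idle_child(void) {
    pid_t pid = fork();
    if (pid == 0) {
        for (;;)
            pause();
    }
    return pid;
}
```

A caller would use stop_and_wait(pid, SIGTERM, 2000) and treat a return value of -1 as "escalate". An ExecStop= command that only calls kill corresponds to the first two lines of the function without the loop.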

Zbyszek
Andrey Borzenkov
2014-01-27 12:13:01 UTC
On Mon, 27 Jan 2014 07:43:55 +0100
Post by Zbigniew Jędrzejewski-Szmek
Looks like we need a setting like SendKillSignalTo=main-pid|all|control-pid.
Quoting from this thread
Post by Zbigniew Jędrzejewski-Szmek
Post by Tom Gundersen
I don't think we want any processes to survive the exit of
So why would we need this setting? Final kill should go to all remaining
processes; what is scenario where it would *not*?

So maybe we simply need to make the final kill independent of KillMode.

I'm fine with having separate settings for SIGTERM and SIGKILL; I just do
not see a use case for the latter.

Having KillMode=process and ensuring that the final kill really attempts to
kill everything would be enough to fix this problem. Except that in this
case KillMode probably has to be renamed to something else, like
SendTermSignalTo= ...
Tom Gundersen
2014-01-27 12:17:48 UTC
Post by Andrey Borzenkov
On Mon, 27 Jan 2014 07:43:55 +0100
Post by Zbigniew Jędrzejewski-Szmek
Looks like we need a setting like SendKillSignalTo=main-pid|all|control-pid.
Quoting from this thread
Post by Zbigniew Jędrzejewski-Szmek
Post by Tom Gundersen
I don't think we want any processes to survive the exit of
So why would we need this setting? Final kill should go to all remaining
processes; what is scenario where it would *not*?
So may be we simply need to make final kill independent of KillMode.
I'm fine with having separate setting for SIGTERM and SIGKILL, just do
not see use case for the latter.
Having KillMode=process and ensuring that final kill really attempts to
kill everything would be enough to fix this problem. Except in this
case KillMode probably has to be renamed to something else, like
SendTermSignalTo= ...
I think we could avoid all this by just not using async signals at all
unless something is broken and we need to go on a killing spree.

-t
Tom Gundersen
2014-01-27 12:15:55 UTC
On Mon, Jan 27, 2014 at 7:43 AM, Zbigniew Jędrzejewski-Szmek
Post by Zbigniew Jędrzejewski-Szmek
Post by Andrey Borzenkov
On Sun, 26 Jan 2014 17:23:54 +0100
Post by Tom Gundersen
Post by Andrey Borzenkov
Post by Andrey Borzenkov
to handle all services started by it itself, it *is* service manager
after all?
I don't think we want any processes to survive the exit of
problem that we are going into the "kill control-group" mode too soon,
gracefully?
Yes.
Post by Tom Gundersen
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finished.
Looks like we need a setting like SendKillSignalTo=main-pid|all|control-pid.
Or something like that.
to 10 min or so.
Post by Andrey Borzenkov
I believe someone already mentioned this problem. In general, we cannot
assume that ExecStop is synchronous. It may just signal main process to
exit. systemd should wait until $MAINPID exits (or timeout) before
continuing further processing.
ExecStop is required to be synchronous, i.e. the service should be stopped
when it returns. /bin/kill is not going to work here.
Good point, I had missed that (I assumed there was a timeout). So
something like a synchronous "systemctl --user stop" should do it, no?

-t
Andrey Borzenkov
2014-01-27 15:11:46 UTC
On Mon, 27 Jan 2014 13:15:55 +0100
Post by Tom Gundersen
On Mon, Jan 27, 2014 at 7:43 AM, Zbigniew Jędrzejewski-Szmek
Post by Zbigniew Jędrzejewski-Szmek
Post by Andrey Borzenkov
On Sun, 26 Jan 2014 17:23:54 +0100
Post by Tom Gundersen
Post by Andrey Borzenkov
Post by Andrey Borzenkov
to handle all services started by it itself, it *is* service manager
after all?
I don't think we want any processes to survive the exit of
problem that we are going into the "kill control-group" mode too soon,
gracefully?
Yes.
Post by Tom Gundersen
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finished.
Looks like we need a setting like SendKillSignalTo=main-pid|all|control-pid.
Or something like that.
to 10 min or so.
Post by Andrey Borzenkov
I believe someone already mentioned this problem. In general, we cannot
assume that ExecStop is synchronous. It may just signal main process to
exit. systemd should wait until $MAINPID exits (or timeout) before
continuing further processing.
ExecStop is required to be synchronous, i.e. the service should be stopped
when it returns. /bin/kill is not going to work here.
Good point, I had missed that (I assumed there was a timeout). So
something like a synchronous "systemctl --user stop" should do it, no?
Yes, except "systemd --user" is defined only for the *current* user.
Extending it to "systemd --user=<UID>" would be a solution (it must be a
numerical UID, as nothing more is available in ***@.service). I played
with su, but it does not work with a UID - it wants a user name,
Colin Guthrie
2014-01-27 11:27:31 UTC
Post by Andrey Borzenkov
Post by Tom Gundersen
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finished.
Could you not use the same hack that apache httpd needs?

http://pkgs.fedoraproject.org/cgit/httpd.git/tree/httpd.service#n28
Post by Andrey Borzenkov
ExecStop=/bin/kill -WINCH ${MAINPID}
# We want systemd to give httpd some time to finish gracefully, but still want
# it to kill httpd after TimeoutStopSec if something went wrong during the
# graceful stop. Normally, Systemd sends SIGTERM signal right after the
# ExecStop, which would kill httpd. We are sending useless SIGCONT here to give
# httpd time to finish.
KillSignal=SIGCONT
It's probably not "nice" to do that, but it should give the necessary
time to let things clean up properly...

A proper fix is probably highly desirable tho'!

Col
--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
Tribalogic Limited http://www.tribalogic.net/
Open Source:
Mageia Contributor http://www.mageia.org/
PulseAudio Hacker http://www.pulseaudio.org/
Trac Hacker http://trac.edgewall.org/
Colin Guthrie
2014-01-27 12:15:54 UTC
[Mailing list CC'ed again]
On Mon, 27 Jan 2014 11:27:31 +0000
Post by Colin Guthrie
Post by Andrey Borzenkov
Post by Tom Gundersen
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finished.
Could you not use the same hack that apache httpd needs?
http://pkgs.fedoraproject.org/cgit/httpd.git/tree/httpd.service#n28
No.
systemd user instance needs SIGTERM to start "shutdown" procedure.
systemd system instance does not allow SIGTERM to be sent to the
$MAINPID only. Sending SIGTERM to all processes at the very beginning
is wrong.
ExecStop=/bin/kill -WINCH ${MAINPID}
could be used to tell the user session to start its shutdown procedure,
but rather than -WINCH as in the httpd case, we'd just send SIGTERM here
instead.

So systemd PID 1 would trigger the ExecStop (starting the user session
shutdown) and then skip the normal round of killing thanks to the
KillSignal, wait for a timeout (which is quite long), and only then do a
SIGKILL. That is brutal, but you'd hope the user session would have done
all its work by then and killed as much as is humanly possible (spawned-off
root processes stuck in the user's cgroup notwithstanding...).

But perhaps I'm still missing something and this won't work :(

And of course as mentioned originally it would be nice to provide better
semantics to control this.

Col
Andrey Borzenkov
2014-01-27 12:25:24 UTC
On Mon, 27 Jan 2014 12:15:54 +0000
Post by Colin Guthrie
[Mailing list CC'ed again]
On Mon, 27 Jan 2014 11:27:31 +0000
Post by Colin Guthrie
Post by Andrey Borzenkov
Post by Tom Gundersen
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Does not work. systemd sends SIGTERM as soon as ExecStop finished.
Could you not use the same hack that apache httpd needs?
http://pkgs.fedoraproject.org/cgit/httpd.git/tree/httpd.service#n28
No.
systemd user instance needs SIGTERM to start "shutdown" procedure.
systemd system instance does not allow SIGTERM to be sent to the
$MAINPID only. Sending SIGTERM to all processes at the very beginning
is wrong.
ExecStop=/bin/kill -WINCH ${MAINPID}
could be used to tell the user session to start its shutdown procedure,
but rather than -WINCH as in the httpd case, we'd just send SIGTERM here
instead.
Ah, well. So systemd will not allow us to say KillMode=none but happily
accepts a dummy signal which does nothing. How consistent :)

This could be considered a workaround for a released distro where
***@.service does not do anything useful anyway. Right. Thank you for
the idea!
Post by Colin Guthrie
But perhaps I'm still missing something and this won't work :(
No, I expect it to work. Just losing final "graceful stop" step, but
this should have been handled by systemd user instance already in the
first place.
Colin Guthrie
2014-01-27 13:45:19 UTC
Post by Andrey Borzenkov
Post by Colin Guthrie
Post by Andrey Borzenkov
ExecStop=/bin/kill -WINCH ${MAINPID}
could be used to tell the user session to start its shutdown procedure,
but rather than -WINCH as in the httpd case, we'd just send SIGTERM here
instead.
Ah, well. So systemd will not allow to say KillMode=none but happily
accepts dummy signal which does nothing. How consistent :)
This could be considered as workaround for a released distro where
an idea!
Done a bit of testing with this hack today.

Doesn't seem to cause any problems, and while I didn't have a reliable
reproducer, I have done lots of reboots on my VM and I'd expect at least
one of them to have stalled by now.

So I might push this one in too.

Col
Lennart Poettering
2014-01-27 14:51:34 UTC
Post by Tom Gundersen
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Well, it would, but I am really sure KillMode=mixed would be the better approach...
Post by Tom Gundersen
Post by Andrey Borzenkov
number of services each needing unknown timeout. While having some
capping to total timeout looks sensible, only user itself may estimate
cannot configure ...
I think it really makes sense to have a system-wide timeout on these
things (possibly a high one), we don't want the user to delay things
without limit. The user already has the possibility of putting their
own limits if they want to (but they must of course be shorter than
the system-wide one).
Yeah, I fully agree. We need a timeout here that is mandated by the
system, and cannot be overridden, so that the user cannot find a way to
circumvent kill requests by the admin. However, it certainly makes sense
to make it a bit higher than the systemd user instance's own timeouts.

Lennart
--
Lennart Poettering, Red Hat
Zbigniew Jędrzejewski-Szmek
2014-01-27 15:13:16 UTC
Post by Lennart Poettering
Post by Tom Gundersen
Post by Andrey Borzenkov
I rebuilt systemd without this restriction, set KillMode=process for
So there are two problems associated with user instance.
1. Using KillMode=control-group is wrong. Each service managed by user
instance has own requirements how it is stopped. Just sending everything
SIGTERM without even trying service ExecStop first is obviously
incorrect.
I guess what we want is to first send SIGTERM only to the systemd
--user process, and only after a timeout start sending SIGTERM to all
the processes in the control group? I.e., wouldn't a ExecStop entry in
Well, it would, but I am really sure KillMode=mixed would be the better approach...
Post by Tom Gundersen
Post by Andrey Borzenkov
number of services each needing unknown timeout. While having some
capping to total timeout looks sensible, only user itself may estimate
cannot configure ...
I think it really makes sense to have a system-wide timeout on these
things (possibly a high one), we don't want the user to delay things
without limit. The user already has the possibility of putting their
own limits if they want to (but they must of course be shorter than
the system-wide one).
Yeah, I fully agree. We need a timeout here that is mandated by the
system, and cannot be overridden, so that the user cannot find a way to
circumvent kill requests by the admin. However, it certainly makes sense
to make it a bit higher than the systemd user instance's own timeouts.
A bit higher is probably not enough, since a user instance might need
to shut down a few things in order, and more than one might have to time
out. It'd probably make sense to decrease the timeouts in --user instances
to something substantially smaller than in --system, and then make the
timeout for ***@.service a multiple of that.

Zbyszek
Lennart Poettering
2014-01-27 14:48:50 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Ivan Shapovalov
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
then if you figured out which of the two cases, you'd have to figure out
why that is...
I finally managed to reproduce it with user instance running with debug
level (before *any* attempt to add debugging, strace, whatever resulted
in problem disappearing).
It seems that /bin/kill -RTMIN+24 is being killed itself. I wonder - is
it possible that it is the same SIGTERM that is used by PID 1 to stop
Ah, bummer! Yikes!

Thanks for tracking this down, this really sounds like you nailed the
problem. Now, how to fix it?

Hmm, so, I would claim this is a shortcoming of
KillMode=control-group, which is the default for everything. There has
been an item on the TODO list to maybe introduce a KillMode=mixed
setting, which would send SIGTERM only to the main process, and the
SIGKILL later on to all processes. I am pretty sure that this would
solve the issue at hand quite nicely here, because the systemd user
instance would get a nice chance to clean up its own act, before the
systemd system instance would make tabula rasa...
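The proposal is easiest to compare with the existing modes as a small table of "who receives which signal". The C sketch below is a simplified model of that table, written only to illustrate the description above; the enum and function names are invented here and are not systemd's own types.

```c
#include <assert.h>
#include <signal.h>

/* Illustrative model only; not the types used inside systemd. */
enum kill_mode   { MODE_CONTROL_GROUP, MODE_PROCESS, MODE_MIXED, MODE_NONE };
enum kill_target { TARGET_NOBODY, TARGET_MAIN_ONLY, TARGET_WHOLE_GROUP };

/* Who should receive `sig` for a given mode? `sig` is either the polite
 * first signal (SIGTERM) or the final one (SIGKILL). */
enum kill_target pick_target(enum kill_mode mode, int sig) {
    assert(sig == SIGTERM || sig == SIGKILL);

    switch (mode) {
    case MODE_CONTROL_GROUP:
        /* Current default: everything in the cgroup gets both signals. */
        return TARGET_WHOLE_GROUP;
    case MODE_PROCESS:
        /* Only the main process is ever signalled. */
        return TARGET_MAIN_ONLY;
    case MODE_MIXED:
        /* Proposed: SIGTERM to the main process only, so it can shut its
         * own children down; SIGKILL later to whatever is left. */
        return sig == SIGKILL ? TARGET_WHOLE_GROUP : TARGET_MAIN_ONLY;
    case MODE_NONE:
        return TARGET_NOBODY;
    }
    return TARGET_NOBODY;
}
```

With this model, the user instance case reads as pick_target(MODE_MIXED, SIGTERM) == TARGET_MAIN_ONLY: the helper processes spawned by systemd --user, such as the /bin/kill call from its own stop commands, are not hit by the first SIGTERM.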

Lennart
--
Lennart Poettering, Red Hat
Andrey Borzenkov
2014-01-29 15:29:53 UTC
On Mon, 27 Jan 2014 15:48:50 +0100
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
Post by Ivan Shapovalov
Any advices on how to do that?
I have both the issue (reproducible on each shutdown) and will to debug.
Well, enable the debug shell, and then from there try to figure out why
things are stuck. i.e. whether it is systemd --user that really never
exits. Or whether it actually exits but PID 1 doesn't notice it. And
then if you figured out which of the two cases, you'd have to figure out
why that is...
I finally managed to reproduce it with user instance running with debug
level (before *any* attempt to add debugging, strace, whatever resulted
in problem disappearing).
It seems that /bin/kill -RTMIN+24 is being killed itself. I wonder - is
it possible that it is the same SIGTERM that is used by PID 1 to stop
Ah, bummer! Yikes!
Thanks for tracking this down, this really sounds like you nailed the
problem. Now, how to fix it?
Hmm, so, I would claim this is a shortcoming of
KillMode=control-group, which is the default for everything. There has
been an item on the TODO list to maybe introduce a KillMode=mixed
setting, which would send SIGTERM only to the main process, and the
SIGKILL later on to all processes. I am pretty sure that this would
solve the issue at hand quite nicely here, because the systemd user
instance would get a nice chance to clean up its own act, before the
systemd system instance would make tabula rasa...
I still favor an alternative approach: let systemd wait for the main PID
to exit after ExecStop instead. This is functionally equivalent to the
above, with slight advantages:

- it will probably decrease the number of timeouts, because systemd will go
on its killing spree as soon as the main PID exits, not after the final
timeout.

- it is more generic, as it allows any available method to trigger the
service stop, not just a signal.
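The waiting rule behind this approach can be written down as a tiny predicate. The C sketch below is an illustration only: the function names are invented, and the three booleans stand in for the WaitForMainPIDOnStop= setting and for whether the ExecStop command (control PID) and the main PID are still running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate (not systemd code): may the stop phase move on
 * to the final kill of leftover processes? */
bool stop_phase_done(bool wait_for_main_pid, bool control_alive, bool main_alive) {
    if (control_alive)
        return false;       /* ExecStop command itself has not finished */
    if (wait_for_main_pid && main_alive)
        return false;       /* option set: also wait for the main PID */
    return true;
}

/* Self-check of the truth table the rule is meant to produce. */
void check_stop_rule(void) {
    /* Option off: only the ExecStop command matters. */
    assert(stop_phase_done(false, false, true));
    assert(!stop_phase_done(false, true, false));

    /* Option on: both processes must be gone. */
    assert(!stop_phase_done(true, false, true));
    assert(!stop_phase_done(true, true, false));
    assert(stop_phase_done(true, false, false));
}
```

The predicate would be evaluated each time one of the two processes exits, which is why leftover processes can be reaped as soon as the main PID is gone rather than after the full timeout.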

Comments are welcome :)

From: Andrey Borzenkov <***@gmail.com>
Subject: [PATCH] add WaitForMainPIDOnStop option

WaitForMainPIDOnStop=true will wait for exit of main PID in addition to
command set as ExecStop.

Use it in ***@.service to allow user systemd to complete exit transaction
before starting final kill.

---
man/systemd.service.xml | 13 +++++++++++++
src/core/dbus-service.c | 1 +
src/core/load-fragment-gperf.gperf.m4 | 1 +
src/core/service.c | 15 +++++++++++++--
src/core/service.h | 1 +
units/***@.service.in | 2 ++
6 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/man/systemd.service.xml b/man/systemd.service.xml
index d316ab5..c121c95 100644
--- a/man/systemd.service.xml
+++ b/man/systemd.service.xml
@@ -1016,6 +1016,19 @@ ExecStart=/bin/echo $ONE $TWO ${TWO}
<option>none</option>.</para></listitem>
</varlistentry>

+ <varlistentry>
+ <term><varname>WaitForMainPIDOnStop=</varname></term>
+
+ <listitem><para>Takes a boolean value
+ that specifies whether systemd should
+ additionally wait for the main PID of a service
+ to exit after executing ExecStop command.
+ Default is to wait for completion of ExecStop
+ command only. Defaults to
+ <option>no</option>.</para>
+ </listitem>
+ </varlistentry>
+
</variablelist>

<para>Check
diff --git a/src/core/dbus-service.c b/src/core/dbus-service.c
index 3db9339..35a3c2f 100644
--- a/src/core/dbus-service.c
+++ b/src/core/dbus-service.c
@@ -54,6 +54,7 @@ const sd_bus_vtable bus_service_vtable[] = {
SD_BUS_PROPERTY("RootDirectoryStartOnly", "b", bus_property_get_bool, offsetof(Service, root_directory_start_only), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("RemainAfterExit", "b", bus_property_get_bool, offsetof(Service, remain_after_exit), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("GuessMainPID", "b", bus_property_get_bool, offsetof(Service, guess_main_pid), SD_BUS_VTABLE_PROPERTY_CONST),
+ SD_BUS_PROPERTY("WaitForMainPIDOnStop", "b", bus_property_get_bool, offsetof(Service, stop_waits_for_main_pid), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("MainPID", "u", bus_property_get_pid, offsetof(Service, main_pid), SD_BUS_VTABLE_PROPERTY_EMITS_CHANGE),
SD_BUS_PROPERTY("ControlPID", "u", bus_property_get_pid, offsetof(Service, control_pid), SD_BUS_VTABLE_PROPERTY_EMITS_CHANGE),
SD_BUS_PROPERTY("BusName", "s", NULL, offsetof(Service, bus_name), SD_BUS_VTABLE_PROPERTY_CONST),
diff --git a/src/core/load-fragment-gperf.gperf.m4 b/src/core/load-fragment-gperf.gperf.m4
index 59b2a64..85aaef7 100644
--- a/src/core/load-fragment-gperf.gperf.m4
+++ b/src/core/load-fragment-gperf.gperf.m4
@@ -167,6 +167,7 @@ Service.PermissionsStartOnly, config_parse_bool, 0,
Service.RootDirectoryStartOnly, config_parse_bool, 0, offsetof(Service, root_directory_start_only)
Service.RemainAfterExit, config_parse_bool, 0, offsetof(Service, remain_after_exit)
Service.GuessMainPID, config_parse_bool, 0, offsetof(Service, guess_main_pid)
+Service.WaitForMainPIDOnStop, config_parse_bool, 0, offsetof(Service, stop_waits_for_main_pid)
Service.RestartPreventExitStatus, config_parse_set_status, 0, offsetof(Service, restart_ignore_status)
Service.SuccessExitStatus, config_parse_set_status, 0, offsetof(Service, success_status)
m4_ifdef(`HAVE_SYSV_COMPAT',
diff --git a/src/core/service.c b/src/core/service.c
index d949f7a..8e0eb6c 100644
--- a/src/core/service.c
+++ b/src/core/service.c
@@ -1269,6 +1269,7 @@ static void service_dump(Unit *u, FILE *f, const char *prefix) {
"%sRootDirectoryStartOnly: %s\n"
"%sRemainAfterExit: %s\n"
"%sGuessMainPID: %s\n"
+ "%sWaitForMainPIDOnStop: %s\n"
"%sType: %s\n"
"%sRestart: %s\n"
"%sNotifyAccess: %s\n",
@@ -1279,6 +1280,7 @@ static void service_dump(Unit *u, FILE *f, const char *prefix) {
prefix, yes_no(s->root_directory_start_only),
prefix, yes_no(s->remain_after_exit),
prefix, yes_no(s->guess_main_pid),
+ prefix, yes_no(s->stop_waits_for_main_pid),
prefix, service_type_to_string(s->type),
prefix, service_restart_to_string(s->restart),
prefix, notify_access_to_string(s->notify_access));
@@ -2953,11 +2955,17 @@ static void service_sigchld_event(Unit *u, pid_t pid, int code, int status) {

case SERVICE_START_POST:
case SERVICE_RELOAD:
- case SERVICE_STOP:
/* Need to wait until the operation is
* done */
break;

+ case SERVICE_STOP:
+ /* If requested, wait for both main and control
+ PID to finish */
+ if (s->stop_waits_for_main_pid && !control_pid_good(s))
+ service_enter_signal(s, SERVICE_STOP_SIGTERM, f);
+ break;
+
case SERVICE_START:
if (s->type == SERVICE_ONESHOT) {
/* This was our main goal, so let's go on */
@@ -3116,7 +3124,10 @@ static void service_sigchld_event(Unit *u, pid_t pid, int code, int status) {
break;

case SERVICE_STOP:
- service_enter_signal(s, SERVICE_STOP_SIGTERM, f);
+ /* If requested, wait for both main and control
+ PID to finish */
+ if (!s->stop_waits_for_main_pid || main_pid_good(s) <= 0)
+ service_enter_signal(s, SERVICE_STOP_SIGTERM, f);
break;

case SERVICE_STOP_SIGTERM:
diff --git a/src/core/service.h b/src/core/service.h
index 1992926..919d54b 100644
--- a/src/core/service.h
+++ b/src/core/service.h
@@ -163,6 +163,7 @@ struct Service {
bool root_directory_start_only;
bool remain_after_exit;
bool guess_main_pid;
+ bool stop_waits_for_main_pid;

/* If we shut down, remember why */
ServiceResult result;
diff --git a/units/***@.service.in b/units/***@.service.in
index bfc9b70..ac3caf3 100644
--- a/units/***@.service.in
+++ b/units/***@.service.in
@@ -14,4 +14,6 @@ User=%i
PAMName=systemd-user
Type=notify
ExecStart=-@rootlibexecdir@/systemd --user
+ExecStop=@KILL@ -TERM $MANAGERPID
+WaitForMainPIDOnStop=yes
Slice=user-%i.slice
--
tg: (2d5bdf5..) u/wait_for_mainPID_on_stop (depends on: master)
Lennart Poettering
2014-02-04 23:10:51 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
Thanks for tracking this down, this really sounds like you nailed the
problem. Now, how to fix it?
Hmm, so, I would claim this is a shortcoming of
KillMode=control-group, which is the default for everything. There has
been an item on the TODO list to maybe introduce a KillMode=mixed
setting, which would send SIGTERM only to the main process, and the
SIGKILL later on to all processes. I am pretty sure that this would
solve the issue at hand quite nicely here, because the systemd user
instance would get a nice chance to clean up its own act, before the
systemd system instance would make tabula rasa...
I still favor alternative approach - let systemd wait for main PID
to exit after ExecStop instead. This is functionally equivalent to the
above with slight advantages
I am really not convinced that ExecStop= should be allowed to be
asynchronous. (Which is what you suggest we do, right?) In fact, it's
already problem enough that we pretend we allow ExecReload= to be
asynchronous like that... It's a question of allowing bad code
through... Either people let us shutdown a service, or they do it
themselves, but allowing a crappy (asynchronous) shutdown routine sounds
wrong to me...

At the hackfest in BRU I have now implemented KillMode=mixed, which
should mostly fix the issue... Could you test, please?

Lennart
--
Lennart Poettering, Red Hat
Andrey Borzenkov
2014-02-05 02:56:52 UTC
On Wed, 5 Feb 2014 00:10:51 +0100
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
Thanks for tracking this down, this really sounds like you nailed the
problem. Now, how to fix it?
Hmm, so, I would claim this is a shortcoming of
KillMode=control-group, which is the default for everything. There has
been an item on the TODO list to maybe introduce a KillMode=mixed
setting, which would send SIGTERM only to the main process, and the
SIGKILL later on to all processes. I am pretty sure that this would
solve the issue at hand quite nicely here, because the systemd user
instance would get a nice chance to clean up its own act, before the
systemd system instance would make tabula rasa...
I still favor alternative approach - let systemd wait for main PID
to exit after ExecStop instead. This is functionally equivalent to the
above with slight advantages
I am really not convinced that ExecStop= should be allowed to be
asynchronous. (Which is what you suggest we do, right?)
Yes.
Post by Lennart Poettering
In fact, it's
already problem enough that we pretend we allow ExecReload= to be
asynchronous like that... It's a question of allowing bad code
through...
I do not suggest "send it a signal and pray for it". You send a signal
(or whatever) and wait for MAINPID to exit. MAINPID is *the* service for
systemd. Service exists while it runs; service is stopped when it
exits. I do not understand what is bad about it, sorry.
Post by Lennart Poettering
Either people let us shutdown a service, or they do it
themselves, but allowing a crappy (asynchronous) shutdown routine sounds
wrong to me...
It is synchronous for systemd - it waits until MAINPID exits. Not every
service has a natural synchronous stop implementation - actually, I
suspect that most of them do not and simply use a busy loop waiting for
the same MAINPID. But we already have PID 1, whose job is to wait for
other processes to exit, and it does its job rather well. Why not rely
on it?
Post by Lennart Poettering
At the hackfest in BRU I have now implemented KillMode=mixed, which
should mostly fix the issue... Could you test, please?
Sure, it works, but it does exactly the same thing: it implements an
asynchronous service stop without the benefit of reaping leftover
processes early.
Lennart Poettering
2014-02-05 13:59:14 UTC
Post by Andrey Borzenkov
Post by Lennart Poettering
already problem enough that we pretend we allow ExecReload= to be
asynchronous like that... It's a question of allowing bad code
through...
I do not suggest "send it a signal and pray for it". You send a signal
(or whatever) and wait for MAINPID to exit. MAINPID is *the* service for
systemd. Service exists while it runs; service is stopped when it
exits. I do not understand what is bad about it, sorry.
Well, if it's about sending signals, then we do that for people anyway...

I mean, either people use the standard way to shut down daemons by
sending SIGTERM, and then we'll do it all for them and they need no
configuration at all. Or they have some complex interfacing in place,
but then we should be able to assume that it is synchronous. But the
middle ground, where they shut things down asynchronously but
in a non-standard way, sounds really off to me...

Lennart
--
Lennart Poettering, Red Hat
Djalal Harouni
2014-02-06 00:35:45 UTC
Post by Lennart Poettering
Post by Andrey Borzenkov
Post by Lennart Poettering
Thanks for tracking this down, this really sounds like you nailed the
problem. Now, how to fix it?
Hmm, so, I would claim this is a shortcoming of
KillMode=control-group, which is the default for everything. There has
been an item on the TODO list to maybe introduce a KillMode=mixed
setting, which would send SIGTERM only to the main process, and the
SIGKILL later on to all processes. I am pretty sure that this would
solve the issue at hand quite nicely here, because the systemd user
instance would get a nice chance to clean up its own act, before the
systemd system instance would make tabula rasa...
I still favor alternative approach - let systemd wait for main PID
to exit after ExecStop instead. This is functionally equivalent to the
above with slight advantages
I am really not convinced that ExecStop= should be allowed to be
asynchronous. (Which is what you suggest we do, right?) In fact, it's
already problem enough that we pretend we allow ExecReload= to be
asynchronous like that... It's a question of allowing bad code
through... Either people let us shutdown a service, or they do it
themselves, but allowing a crappy (asynchronous) shutdown routine sounds
wrong to me...
At the hackfest in BRU I have now implemented KillMode=mixed, which
should mostly fix the issue... Could you test, please?
Another bug, perhaps related: I did not have time to confirm it, but the
logic is shared and the proposed KillMode=mixed patch did "fix" it. So:

If "KillUserProcesses=yes" of logind.conf is set and if its conditions are
met for the corresponding user, then
# loginctl terminate-user $user

logs:
Feb 05 23:35:00 fedora-tree-20 systemd[1]: session-1.scope stopping timed out. Killing.
Feb 05 23:35:00 fedora-tree-20 systemd[1]: Stopped Session 1 of user root.
Feb 05 23:35:00 fedora-tree-20 systemd[1]: Unit session-1.scope entered failed state.
...
Feb 05 23:35:03 fedora-tree-20 systemd[1]: Starting User Manager for UID 0...
Feb 05 23:35:03 fedora-tree-20 login[259]: pam_systemd(login:session): Failed to create session: Input/output error
Feb 05 23:35:03 fedora-tree-20 login[259]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
Feb 05 23:35:03 fedora-tree-20 systemd-logind[21]: Failed to start session scope session-1.scope: Unit session-1.scope already exists. org.freedesktop.systemd1.UnitExists
Feb 05 23:35:03 fedora-tree-20 systemd[260]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
...


terminate-user => user_stop() will try to terminate the scope and service
of the user; however, the scope will time out and enter a failed state.

Then try to log in again as the same user: the session id of the
previous session is available again and is re-used to name the
scope of the new session, 'session-1.scope'. That makes
session_start() fail, since it gets the same scope name as the
previous session, which is still in the failed state...

session_start()
=> session_start_scope()
=> manager_start_scope() will fail

The pam_systemd will not register the session, and logind function
results will be wrong...
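
The name collision described above can be reproduced with a small toy model. This is only a sketch: `ToyManager`, `login` and the `timed_out` flag are hypothetical names, and the real manager and logind code paths are considerably more involved.

```python
class UnitExists(Exception):
    """Stands in for org.freedesktop.systemd1.UnitExists."""


class ToyManager:
    def __init__(self):
        self.units = {}  # unit name -> state

    def start_scope(self, name: str) -> None:
        # A unit in the failed state is still loaded, so its name is taken.
        if name in self.units:
            raise UnitExists(f"Unit {name} already exists.")
        self.units[name] = "active"

    def stop_scope(self, name: str, timed_out: bool) -> None:
        if timed_out:
            self.units[name] = "failed"  # lingers instead of going away
        else:
            del self.units[name]


def login(manager: ToyManager, session_id: int) -> bool:
    """Return True if the session scope could be created."""
    try:
        manager.start_scope(f"session-{session_id}.scope")
        return True
    except UnitExists:
        return False


manager = ToyManager()
first = login(manager, 1)                              # first login works
manager.stop_scope("session-1.scope", timed_out=True)  # terminate-user times out
second = login(manager, 1)                             # session id 1 is re-used
print(first, second)
```

The second login fails only because the failed scope is still registered under the re-used name, which matches the "Unit session-1.scope already exists" line in the log.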


Anyway it seems that this also got fixed, is this the correct fix?
did not have time to debug, but after a "git pull" I did a quick
test using bash signal trap, got the correct SIGTERM+SIGHUP but
still we do not wait for session processes...


Lennart please, another thing:

src/core/unit.c:unit_kill_context() in the KILL_CONTROL_GROUP or
KILL_MIXED test:

"sig" can be SIGKILL or during the next call after the first
SIGTERM + SIGHUP , sig for sure will be SIGKILL so we have
cg_kill_recursive() sending a SIGKILL, what if it returns > 0
we'll end up sending another SIGHUP after the SIGKILL...

Not sure, I'll try to test it tomorrow.

Thanks!
--
Djalal Harouni
http://opendz.org
Lennart Poettering
2014-02-06 01:03:12 UTC
Permalink
Post by Djalal Harouni
session_start()
=> session_start_scope()
=> manager_start_scope() will fail
The pam_systemd will not register the session, and logind function
results will be wrong...
Anyway it seems that this also got fixed, is this the correct fix?
did not have time to debug, but after a "git pull" I did a quick
test using bash signal trap, got the correct SIGTERM+SIGHUP but
still we do not wait for session processes...
I'll have a closer look into the GC and killing logic for sessions.
Post by Djalal Harouni
src/core/unit.c:unit_kill_context() in the KILL_CONTROL_GROUP or
"sig" can be SIGKILL or during the next call after the first
SIGTERM + SIGHUP , sig for sure will be SIGKILL so we have
cg_kill_recursive() sending a SIGKILL, what if it returns > 0
we'll end up sending another SIGHUP after the SIGKILL...
Not sure, I'll try to test it tomorrow.
Ah, true! I now conditionalized the SIGHUP to be only sent when "sigkill"
is false. (Note that we don't check if sig == SIGKILL, but really check
for the boolean passed. Sending the SIGHUP after sig == SIGKILL makes
little sense, but there are quite a number of other combinations that
make little sense either and we don't check for them either, for example
setting SendSIGHUP to enabled, but also setting KillSignal=SIGHUP... I
don't think we should try too hard to check for these nonsensical
combinations, since they won't hurt much... That said, sending the SIGHUP
after the final SIGKILL is certainly bogus indeed, so I fixed that...)

Lennart
--
Lennart Poettering, Red Hat
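
A rough sketch of the rule discussed in this exchange: the extra SIGHUP is skipped on the final SIGKILL pass, and the decision is made on the flag passed by the caller, not on the value of the signal. The function and parameter names (`signals_for_process`, `final_kill`) are hypothetical and the model ignores control groups entirely.

```python
from typing import List


def signals_for_process(kill_signal: str, send_sighup: bool,
                        final_kill: bool) -> List[str]:
    """Signals delivered to one process during a single kill pass."""
    sig = "SIGKILL" if final_kill else kill_signal
    sent = [sig]
    # The check looks at the flag, not at whether sig happens to be SIGKILL.
    if send_sighup and not final_kill:
        sent.append("SIGHUP")
    return sent


print(signals_for_process("SIGTERM", send_sighup=True, final_kill=False))
print(signals_for_process("SIGTERM", send_sighup=True, final_kill=True))
# Odd but harmless combinations are deliberately not rejected:
print(signals_for_process("SIGKILL", send_sighup=True, final_kill=False))
print(signals_for_process("SIGHUP", send_sighup=True, final_kill=False))
```

The last two calls show the kind of combination that makes little sense but is left unchecked, as argued in the reply above.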
Tom Horsley
2014-01-24 15:44:17 UTC
Permalink
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.

Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Colin Guthrie
2014-01-24 17:10:46 UTC
Permalink
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.

For me personally, the NFS timeout is a proper pain in the backside. A
little more cleverness there would be appreciated. e.g. can we not just
do lazy umounts by default for NFS (or just e.g. a 5s timeout max on the
regular NFS umount)? Perhaps this isn't possible in the umount loop - or
at least not possible cleanly...

Col
--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
Tribalogic Limited http://www.tribalogic.net/
Open Source:
Mageia Contributor http://www.mageia.org/
PulseAudio Hacker http://www.pulseaudio.org/
Trac Hacker http://trac.edgewall.org/
Michael Biebl
2014-01-24 17:18:44 UTC
Permalink
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.

I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.

Michael
--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?
Lennart Poettering
2014-01-24 17:44:02 UTC
Permalink
Post by Michael Biebl
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...

Lennart
--
Lennart Poettering, Red Hat
Michael Biebl
2014-01-24 18:26:48 UTC
Permalink
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?

Is this animation shown regardless of whether one has booted with
"quiet" or not?
Does it require plymouth?

Cheers,
Michael
--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?
Reindl Harald
2014-01-24 18:30:00 UTC
Permalink
Post by Michael Biebl
Post by Lennart Poettering
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
dunno - quiet is the first i disable
Post by Michael Biebl
Does it require plymouth?
for sure not

"rd.plymouth=0 plymouth.enable=0" on all machines i maintain (around 40 currently)

as well as

[***@srv-rhsoft:~]$ cat /etc/dracut.conf.d/90-plymouth.conf
omit_dracutmodules+="plymouth"


but as said - a value in seconds would be interesting
in case of a service with a large timeout it really matters and
in doubt you dunno how large the timeout of the hanging one is
Zbigniew Jędrzejewski-Szmek
2014-01-24 18:44:08 UTC
Permalink
Post by Michael Biebl
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet.
Post by Michael Biebl
Does it require plymouth?
It's text based.

Zbyszek
Reindl Harald
2014-01-24 18:49:02 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet
one more case why the shiny graphical boot does only harm
in case of troubles and the average user is not aware that
his system could be more verbose than Windows/OSX at boot

the advanced user disables rhgb / quiet but could do this
also only in case things seem to be broken while the
ordinary user is facing a (maybe, or maybe not)
hanging system not talking to him
Michael Biebl
2014-01-24 18:51:44 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet.
Hm, I think those messages should always be shown so the user has a
chance to know what's going on.

IIRC most distros today enable "quiet" by default.
--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?
Andrey Borzenkov
2014-01-24 19:01:42 UTC
Permalink
On Fri, 24 Jan 2014 19:44:08 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet.
Is it possible to automatically switch to more verbose mode as soon as
any problem is seen (like service timeout)?
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Does it require plymouth?
It's text based.
Zbyszek
_______________________________________________
systemd-devel mailing list
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Reindl Harald
2014-01-24 19:09:58 UTC
Permalink
On Fri, 24 Jan 2014 19:44:08 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet.
Is it possible to automatically switch to more verbose mode as soon as
any problem is seen (like service timeout)?
too late, after the timeout is reached it continues
the users problem is the silent waiting *before* the timeout
Andrey Borzenkov
2014-01-24 19:48:38 UTC
Permalink
Post by Reindl Harald
Post by Andrey Borzenkov
On Fri, 24 Jan 2014 19:44:08 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet.
Is it possible to automatically switch to more verbose mode as soon as
any problem is seen (like service timeout)?
too late,
Maybe I was not clear. There is some timeout after which the "running
stars" animation starts. Apparently it is hard-coded and not
configurable. It would be useful if this timeout also triggered a switch
to verbose mode.
Post by Reindl Harald
after the timeout is reached it continues
the users problem is the silent waiting *before* the timeout
Zbigniew Jędrzejewski-Szmek
2014-01-28 04:27:16 UTC
Permalink
Post by Andrey Borzenkov
Post by Reindl Harald
Post by Andrey Borzenkov
On Fri, 24 Jan 2014 19:44:08 +0100
Post by Zbigniew Jędrzejewski-Szmek
Post by Michael Biebl
Post by Lennart Poettering
Post by Michael Biebl
Making the shutdown more verbose in such a situation would imho be a
good idea, showing a countdown or something like that with a note for
which service systemd is currently waiting to be shutdown.
I completely agree with Tom here: In situations where on shutdown (or
boot for that matter) the system blocks for longer than 30-60 secs and
no feedback at all most people will simply assume the system got stuck
and do power-reset.
Yupp, Michal had the same idea, that's why there is the eye-of-sauron
animation in place...
Ah, good to know. That's a start.
I guess my systemd version (v204) is simply too old then?
Is this animation shown regardless of whether one has booted with
"quiet" or not?
With quiet the [OK] lines are not shown, so no, it only works
without quiet.
Is it possible to automatically switch to more verbose mode as soon as
any problem is seen (like service timeout)?
too late,
Maybe I was not clear. There is some timeout after which the "running
stars" animation starts. Apparently it is hard-coded and not
configurable. It would be useful if this timeout also triggered a switch
to verbose mode.
Post by Reindl Harald
after the timeout is reached it continues
the users problem is the silent waiting *before* the timeout
Done.

After a job has been running for more than 5s, or a failure occurs,
cylon eye will be shown. Also status will be printed until bootup or
shutdown is completed.

systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.
[FAILED] Failed to start bad.service.
See 'systemctl status bad.service' for details.
Starting Permit User Sessions...
[ OK ] Started Cleanup of Temporary Directories.
[ OK ] Started Permit User Sessions.
Starting Console Getty...
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.

Fedora release 21 (Rawhide)
Kernel 3.13.0-0.rc5.git0.2.fc21.x86_64 on an x86_64 (console)

fedora-21 login: root
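
The behaviour described in this message (status output switches on once a job has run for more than 5s or a failure occurs, and then stays on) can be sketched as a small state holder. This is a toy model under assumptions: the class and method names are hypothetical, and only the 5s threshold comes from the thread.

```python
STATUS_DELAY_SEC = 5.0  # the delay quoted in the thread


class StatusState:
    """Tracks whether status lines should currently be printed."""

    def __init__(self, quiet: bool):
        self.verbose = not quiet

    def on_job_runtime(self, longest_runtime_sec: float) -> None:
        # A job that runs longer than the delay switches output on.
        if longest_runtime_sec > STATUS_DELAY_SEC:
            self.verbose = True

    def on_failure(self) -> None:
        # Any failed unit also switches output on.
        self.verbose = True


state = StatusState(quiet=True)
state.on_job_runtime(3.0)
print(state.verbose)   # still silent
state.on_job_runtime(5.5)
print(state.verbose)   # switched on
state.on_job_runtime(0.0)
print(state.verbose)   # stays on until boot or shutdown completes
```

Once switched on, nothing in the model turns the output off again, mirroring "status will be printed until bootup or shutdown is completed".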
Michael Biebl
2014-01-28 05:52:21 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Done.
After a job has been running for more than 5s, or a failure occurs,
cylon eye will be shown. Also status will be printed until bootup or
shutdown is completed.
You are awesome, thanks man!
Reindl Harald
2014-01-28 09:51:43 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Post by Reindl Harald
after the timeout is reached it continues
the users problem is the silent waiting *before* the timeout
Done.
After a job has been running for more than 5s, or a failure occurs,
cylon eye will be shown. Also status will be printed until bootup or
shutdown is completed.
systemd 208 running in system mode. (+PAM +LIBWRAP +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ)
Detected virtualization 'systemd-nspawn'.
[FAILED] Failed to start bad.service.
See 'systemctl status bad.service' for details.
Starting Permit User Sessions...
[ OK ] Started Cleanup of Temporary Directories.
[ OK ] Started Permit User Sessions.
Starting Console Getty...
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Fedora release 21 (Rawhide)
Kernel 3.13.0-0.rc5.git0.2.fc21.x86_64 on an x86_64 (console)
fedora-21 login: root
sounds good - thanks for all that fine-tuning!
Lennart Poettering
2014-01-24 17:43:16 UTC
Permalink
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
Post by Colin Guthrie
For me personally, the NFS timeout is a proper pain in the backside. A
little more cleverness there would be appreciated. e.g. can we not just
do lazy umounts by default for NFS (or just e.g. a 5s timeout max on the
regular NFS umount)? Perhaps this isn't possible in the umount loop - or
at least not possible cleanly...
We used to do that in the final umount loop. But that doesn't really
work, since the loop relies on busy file systems returning EBUSY if they
are busy...

Mounts need proper dependencies so that we can unmount them. There's no
way around that really...

Lennart
--
Lennart Poettering, Red Hat
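
To illustrate why a final umount loop depends on EBUSY, here is a toy model of such a loop. It is only a sketch: mounts are plain path strings, "busy" simply means another mount still sits below the path, and none of the function names correspond to systemd's code.

```python
import errno
from typing import Iterable, List, Set, Tuple


def is_busy(path: str, mounted: Set[str]) -> bool:
    """A mount is busy here if another mount still lives underneath it."""
    prefix = path.rstrip("/") + "/"
    return any(other != path and other.startswith(prefix) for other in mounted)


def umount(path: str, mounted: Set[str]) -> int:
    if is_busy(path, mounted):
        return -errno.EBUSY  # tells the loop to retry in a later pass
    mounted.remove(path)
    return 0


def umount_all(paths: Iterable[str]) -> Tuple[List[str], Set[str]]:
    """Retry passes until a whole pass makes no progress."""
    mounted = set(paths)
    order: List[str] = []
    changed = True
    while changed and mounted:
        changed = False
        for path in sorted(mounted):
            if umount(path, mounted) == 0:
                order.append(path)
                changed = True
    return order, mounted


order, left = umount_all(["/mnt", "/mnt/nfs", "/home"])
print(order, left)
```

A lazy unmount would report success immediately even for a file system that is still in use, so a loop like this one would lose the signal it uses to decide what to retry.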
Reindl Harald
2014-01-24 17:45:38 UTC
Permalink
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
Lennart Poettering
2014-01-24 17:53:26 UTC
Permalink
Post by Reindl Harald
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s.

(Oh, and where I wrote "eye of sauron" I meant "cylon". I guess this
reveals my utter ignorance of all things science fiction and fantasy. We
have had this animation in there for a while now, and quite frankly, I
have no idea what this is supposedly inspired by, except it's
something with spaceships, which already turns me off so much I quickly
stop listening...)

Lennart
--
Lennart Poettering, Red Hat
Kay Sievers
2014-01-24 18:21:34 UTC
Permalink
On Fri, Jan 24, 2014 at 6:53 PM, Lennart Poettering
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s.
(Oh, and where I wrote "eye of sauron" I meant "cylon". I guess this
reveals my utter ignorance of all things science fiction and fantasy. We
have this animation in there now since a while, and quite frankly, I
have no idea where this supposedly is inspired from, except it's
something with spaceships, which already turns me off so much I quickly
stop listening...)
http://youtu.be/oYZcZSVYx3M


Kay
Lennart Poettering
2014-01-27 12:23:56 UTC
Permalink
Post by Kay Sievers
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s.
(Oh, and where I wrote "eye of sauron" I meant "cylon". I guess this
reveals my utter ignorance of all things science fiction and fantasy. We
have this animation in there now since a while, and quite frankly, I
have no idea where this supposedly is inspired from, except it's
something with spaceships, which already turns me off so much I quickly
stop listening...)
http://youtu.be/oYZcZSVYx3M
Oh god. I would have preferred to stay ignorant on this one...

Lennart
--
Lennart Poettering, Red Hat
Colin Guthrie
2014-01-25 14:06:55 UTC
Permalink
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s.
What was meant here was that the *user* is not shown for how long the
"cylon" animation will play before systemd gives up and gets aggressive.

So there is a 5s timeout before displaying that, but all it does is tell
you how many jobs are waiting and not how long it's going to wait for them.

If the user sits and watches that animation for 20s they'll likely think
"ahh well this is stuck" and yank the cord, not knowing that things will
be done cleanly if they just wait another 10s.

If we displayed a timeout clock here too, users would be more willing to
wait.

Col
--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
Tribalogic Limited http://www.tribalogic.net/
Open Source:
Mageia Contributor http://www.mageia.org/
PulseAudio Hacker http://www.pulseaudio.org/
Trac Hacker http://trac.edgewall.org/
Koen Kooi
2014-01-25 15:18:09 UTC
Permalink
Post by Colin Guthrie
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s.
What was meant here was that the *user* is not shown for how long the
"cylon" animation will play before systemd gives up and gets aggressive.
So there is a 5s timeout before displaying that, but all it does is tell
you how many jobs are waiting and not how long it's going to wait for them.
If the user sits and watches that animation for 20s they'll likely think
"ahh well this is stuck" and yank the cord, not knowing that things will
be done cleanly if they just wait another 10s.
If we displayed a timeout clock here too, users would be more willing to
wait.
To make matters worse, the cylon eye isn't displayed when you boot with 'quiet' in your kernel command line.

regards,

Koen
Marcos Mello
2014-01-25 17:09:34 UTC
Permalink
Koen Kooi <koen <at> dominion.thruhere.net> writes:
[snip]
Post by Koen Kooi
To make matters worse, the cylon eye isn't displayed when you boot with
'quiet' in your kernel command line.
"quiet systemd.show_status=1" shows the gracious Cylon eye.
Reindl Harald
2014-01-25 17:16:38 UTC
Permalink
Post by Marcos Mello
[snip]
Post by Koen Kooi
To make matters worse, the cylon eye isn't displayed when you boot with
'quiet' in your kernel command line.
"quiet systemd.show_status=1" shows the gracious Cylon eye
so that should be default and extended by a visible counter
manually to add boot-params are useless for the normal user
the advanced one is not using quiet at all
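
For reference, the interaction between "quiet" and "systemd.show_status=1" mentioned above can be sketched as a small command-line parser. This is an assumption-laden model, not systemd's parser: the function name is hypothetical, and the set of accepted boolean spellings is a guess.

```python
def show_status_from_cmdline(cmdline: str, default: bool = True) -> bool:
    """Decide whether status output is shown for a kernel command line."""
    show = default
    explicit = False
    for word in cmdline.split():
        if word == "quiet" and not explicit:
            # "quiet" only changes the default; an explicit setting wins.
            show = False
        elif word.startswith("systemd.show_status="):
            value = word.split("=", 1)[1].lower()
            show = value in ("1", "yes", "true", "on")
            explicit = True
    return show


print(show_status_from_cmdline("ro quiet"))
print(show_status_from_cmdline("ro quiet systemd.show_status=1"))
print(show_status_from_cmdline("ro"))
```

In this model the explicit option overrides "quiet" regardless of the order in which the two appear.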
Zbigniew Jędrzejewski-Szmek
2014-01-27 06:47:15 UTC
Permalink
Post by Reindl Harald
Post by Marcos Mello
[snip]
Post by Koen Kooi
To make matters worse, the cylon eye isn't displayed when you boot with
'quiet' in your kernel command line.
"quiet systemd.show_status=1" shows the gracious Cylon eye
so that should be default and extended by a visible counter
manually to add boot-params are useless for the normal user
the advanced one is not using quiet at all
I now pushed a change to git to display time since a job was started
and the job timeout in the ephemeral status. It turns out that in the
recent rewrite, the timeout logic was borked, so the ephemeral status was
not displayed properly. It should now be displayed more reliably.

Still, nothing is displayed with 'quiet'. This is a separate change to
make I guess.

Zbyszek
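
A small sketch of what an ephemeral status line with elapsed time and job timeout could look like. The exact wording and time format are assumptions for illustration; the function names are hypothetical and not taken from the change mentioned above.

```python
from typing import Optional


def format_span(seconds: int) -> str:
    """Render a duration as e.g. '20s', '2min' or '1min 30s'."""
    minutes, secs = divmod(seconds, 60)
    if minutes and secs:
        return f"{minutes}min {secs}s"
    if minutes:
        return f"{minutes}min"
    return f"{secs}s"


def job_status_line(job_type: str, description: str,
                    elapsed: int, timeout: Optional[int]) -> str:
    """Build one status line showing elapsed time and the job timeout."""
    limit = format_span(timeout) if timeout else "no limit"
    return (f"A {job_type} job is running for {description} "
            f"({format_span(elapsed)} / {limit})")


print(job_status_line("stop", "User Manager for UID 1000", 20, 90))
print(job_status_line("stop", "/mnt/nfs", 20, 30))
print(job_status_line("start", "bad.service", 75, None))
```

With a line like the second one, someone who has watched the animation for 20s can see that only 10s remain, which is the situation described earlier in the thread.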
Michael Biebl
2014-01-27 08:03:24 UTC
Permalink
Post by Zbigniew Jędrzejewski-Szmek
Post by Reindl Harald
Post by Marcos Mello
[snip]
Post by Koen Kooi
To make matters worse, the cylon eye isn't displayed when you boot with
'quiet' in your kernel command line.
"quiet systemd.show_status=1" shows the gracious Cylon eye
so that should be the default, extended by a visible counter
boot params that have to be added manually are useless for the normal user
the advanced one is not using quiet at all
I now pushed a change to git to display time since a job was started
and the job timeout in the ephemeral status. It turns out that in the
recent rewrite, the timeout logic was borked, so the ephemeral status was
not displayed properly. It should now be displayed more reliably.
Thanks!
Post by Zbigniew Jędrzejewski-Szmek
Still, nothing is displayed with 'quiet'. This is a separate change to
make I guess.
That would be awesome. I assume this would cover a "stuck" boot as well?

This is an often-mentioned complaint, see e.g. the recent discussion:

https://lists.debian.org/debian-boot/2014/01/msg00251.html
https://lists.debian.org/debian-boot/2014/01/msg00253.html
https://lists.debian.org/debian-boot/2014/01/msg00255.html
--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?
Lennart Poettering
2014-01-27 12:32:13 UTC
Permalink
Post by Michael Biebl
Post by Zbigniew Jędrzejewski-Szmek
Post by Reindl Harald
so that should be the default, extended by a visible counter
boot params that have to be added manually are useless for the normal user
the advanced one is not using quiet at all
I now pushed a change to git to display time since a job was started
and the job timeout in the ephemeral status. It turns out that in the
recent rewrite, the timeout logic was borked, so the ephemeral status was
not displayed properly. It should now be displayed more reliably.
Thanks!
Post by Zbigniew Jędrzejewski-Szmek
Still, nothing is displayed with 'quiet'. This is a separate change to
make I guess.
That would be awesome. I assume this would cover a "stuck" boot as well?
This is an often-mentioned complaint, see e.g. the recent discussion:
https://lists.debian.org/debian-boot/2014/01/msg00251.html
https://lists.debian.org/debian-boot/2014/01/msg00253.html
https://lists.debian.org/debian-boot/2014/01/msg00255.html
So, there has been this todo list item for a while to support a mode
where a failing service at boot would result in boot status output to be
turned on. Still not sure if such logic would be a good upstream
default, but certainly a good default for more technically-minded
distros such as Debian.

So maybe something like this: In addition to the boolean values for
systemd.show_status= on the kernel cmdline (or ShowStatus= in
system.conf), we'd add a third value called "auto". If that is set
we'd boot up without any status output, until either at least one
service failed, or at least one job reaches its timeout half-way. When
that point is reached we'd continue the entire rest of the boot with
status output enabled. And we wouldn't just turn the logging on, we'd
also explain why we turned it on in one "introductory" message: "Turning
on boot-time status output because of service failure:", or "Turning
on boot-time status output because of nearing job timeout:" or something
like that.
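
To make the idea concrete, here is a rough Python sketch of the "auto" logic described above. It is only a model and not systemd code: the names AutoStatus, Job and job_tick, and the 0.5 threshold for "half-way", are assumptions made for illustration. Only the two introductory messages are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Job:
    unit: str       # unit the job belongs to (hypothetical field names)
    elapsed: float  # seconds since the job was started
    timeout: float  # job timeout in seconds; 0 means no timeout


class AutoStatus:
    """Hypothetical model of a show_status=auto mode."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # fraction of the timeout that triggers output
        self.enabled = False        # boot starts without status output
        self.messages = []          # what would be printed on the console

    def _enable(self, reason):
        # Turn output on once and explain why; later triggers are no-ops.
        if self.enabled:
            return
        self.enabled = True
        self.messages.append(
            "Turning on boot-time status output because of " + reason + ":"
        )

    def service_failed(self):
        # First trigger: at least one service failed.
        self._enable("service failure")

    def job_tick(self, job):
        # Second trigger: a job has used up half of its timeout.
        if job.timeout > 0 and job.elapsed >= job.timeout * self.threshold:
            self._enable("nearing job timeout")


status = AutoStatus()
status.job_tick(Job("example.mount", elapsed=10, timeout=90))  # still quiet
status.job_tick(Job("example.mount", elapsed=45, timeout=90))  # output turns on
```

Once enabled, the flag stays set for the rest of the boot, which matches the "continue the entire rest of the boot with status output enabled" part. A shorter trigger would just be a smaller threshold.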

I'd be happy to see a patch like that merged.

Lennart
--
Lennart Poettering, Red Hat
Tom Gundersen
2014-01-27 12:42:31 UTC
Permalink
On Mon, Jan 27, 2014 at 1:32 PM, Lennart Poettering
Post by Lennart Poettering
So, there has been this todo list item for a while to support a mode
where a failing service at boot would result in boot status output to be
turned on. Still not sure if such logic would be a good upstream
default, but certainly a good default for more technically-minded
distros such as Debian.
So maybe something like this: In addition to the boolean values for
systemd.show_status= on the kernel cmdline (or ShowStatus= in
system.conf), we'd add a third value called "auto". If that is set
we'd boot up without any status output, until either at least one
service failed, or at least one job reaches its timeout half-way.
For people like me who have an attention span of about five seconds,
half-way to the timeout is still a really long time to just sit there.
Maybe just use the same timeout as the eye-of-cylon thingie?
Post by Lennart Poettering
When
that point is reached we'd continue the entire rest of the boot with
status output enabled.
Hm, maybe only do this if something actually failed/reached the
timeout, and not if we just show the eye-of-cylon for a while and then
continue normally?
Post by Lennart Poettering
And we wouldn't just turn the logging on, we'd
also explain why we turned it on in one "introductory" message: "Turning
on boot-time status output because of service failure:", or "Turning
on boot-time status output because of nearing job timeout:" or something
like that.
-t
Lennart Poettering
2014-01-27 14:29:15 UTC
Permalink
Post by Tom Gundersen
Post by Lennart Poettering
So maybe something like this: In addition to the boolean values for
systemd.show_status= on the kernel cmdline (or ShowStatus= in
system.conf), we'd add a third value called "auto". If that is set
we'd boot up without any status output, until either at least one
service failed, or at least one job reaches its timeout half-way.
For people like me who have an attention span of about five seconds,
half-way to the timeout is still a really long time to just sit there.
Maybe just use the same timeout as the eye-of-cylon thingie?
Yeah, maybe. Figuring out good timeouts is probably something one
should do when actually playing around with it and checking how things
"feel" if this is implemented...
Post by Tom Gundersen
Post by Lennart Poettering
When
that point is reached we'd continue the entire rest of the boot with
status output enabled.
Hm, maybe only do this if something actually failed/reached the
timeout, and not if we just show the eye-of-cylon for a while and then
continue normally?
Hmm, possibly, yeah, but we probably should explain that
too... i.e. "Turning off boot-time status output again, since timeout is
resolved" or so... But maybe this ultimately gets too confusing...

Note that this all is just an issue on non-Plymouth systems. If you use
Plymouth then things are much nicer anyway, since the output is always
generated, just not visible on screen until the user hits Esc.

Lennart
--
Lennart Poettering, Red Hat
Andrey Borzenkov
2014-01-27 15:05:04 UTC
Permalink
On Mon, 27 Jan 2014 15:29:15 +0100
Post by Lennart Poettering
Post by Tom Gundersen
Post by Lennart Poettering
So maybe something like this: In addition to the boolean values for
systemd.show_status= on the kernel cmdline (or ShowStatus= in
system.conf), we'd add a third value called "auto". If that is set
we'd boot up without any status output, until either at least one
service failed, or at least one job reaches its timeout half-way.
For people like me who have an attention span of about five seconds,
half-way to the timeout is still a really long time to just sit there.
Maybe just use the same timeout as the eye-of-cylon thingie?
Yeah, maybe. Figuring out good timeouts is probably something one
should do when actually playing around with it and checking how things
"feel" if this is implemented...
Post by Tom Gundersen
Post by Lennart Poettering
When
that point is reached we'd continue the entire rest of the boot with
status output enabled.
Hm, maybe only do this if something actually failed/reached the
timeout, and not if we just show the eye-of-cylon for a while and then
continue normally?
Hmm, possibly, yeah, but we probably should explain that
too... i.e. "Turning off boot-time status output again, since timeout is
resolved" or so... But maybe this ultimately gets too confusing...
Note that this all is just an issue on non-Plymouth systems. If you use
Plymouth then things are much nicer anyway, since the output is always
generated, just not visible on screen until the user hits Esc.
Yes, but you still want to virtually press ESC to attract user
attention to a problem ...
Kay Sievers
2014-01-28 11:25:45 UTC
Permalink
On Mon, Jan 27, 2014 at 7:47 AM, Zbigniew Jędrzejewski-Szmek
Post by Zbigniew Jędrzejewski-Szmek
Post by Reindl Harald
Post by Marcos Mello
[snip]
Post by Koen Kooi
To make matters worse, the cylon eye isn't displayed when you boot with
'quiet' in your kernel command line.
"quiet systemd.show_status=1" shows the gracious Cylon eye
so that should be the default, extended by a visible counter
boot params that have to be added manually are useless for the normal user
the advanced one is not using quiet at all
I now pushed a change to git to display time since a job was started
and the job timeout in the ephemeral status. It turns out that in the
recent rewrite, the timeout logic was borked, so the ephemeral status was
not displayed properly. It should now be displayed more reliably.
Still, nothing is displayed with 'quiet'. This is a separate change to
make I guess.
I've reverted it for now. It breaks booting with dracut, the daemon
reload fails, and we cannot transition into the real rootfs.

Kay
Dominique Michel
2014-01-25 15:41:08 UTC
Permalink
On Sat, 25 Jan 2014 14:06:55 +0000,
Post by Colin Guthrie
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need
to give services the chance to shut down cleanly and in the
right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are
waiting for is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s.
What was meant here was that the *user* is not shown for how long the
"cylon" animation will play before systemd gives up and gets
aggressive.
So there is a 5s timeout before displaying that, but all it does is
tell you how many jobs are waiting and not how long it's going to
wait for them.
If the user sits and watches that animation for 20s they'll likely
think "ahh well this is stuck" and yank the cord, not knowing that
things will be done cleanly if they just wait another 10s.
I fully agree. At my work, 20s are already a lot of time when it is
time to go.
Post by Colin Guthrie
If we displayed a timeout clock here too, users would be more willing
to wait.
It would also be nice to show which process systemd is waiting for,
together with something like
Press Ctrl+C to abort and continue

Most important is the possibility to abort the wait and continue,
because I am sure some people I know will just think "What is this
f. s." and yank the cord anyway.

Dominique
Post by Colin Guthrie
Col
Lennart Poettering
2014-01-27 14:56:52 UTC
Permalink
Post by Dominique Michel
Post by Colin Guthrie
So there is a 5s timeout before displaying that, but all it does is
tell you how many jobs are waiting and not how long it's going to
wait for them.
If the user sits and watches that animation for 20s they'll likely
think "ahh well this is stuck" and yank the cord, not knowing that
things will be done cleanly if they just wait another 10s.
I fully agree. At my work, 20s are already a lot of time when it is
time to go.
Post by Colin Guthrie
If we displayed a timeout clock here too, users would be more willing
to wait.
It would also be nice to show which process systemd is waiting for,
together with something like
Press Ctrl+C to abort and continue
Most important is the possibility to abort the wait and continue,
because I am sure some people I know will just think "What is this
f. s." and yank the cord anyway.
I like the theory of this, but I am not sure how to implement this
nicely, since it is usually not clear what precisely the unit is that
isn't showing up...

Also, I think it's security-sensitive, so it cannot be enabled by
default...

Lennart
--
Lennart Poettering, Red Hat
Colin Guthrie
2014-01-25 14:09:47 UTC
Permalink
Post by Lennart Poettering
Post by Colin Guthrie
For me personally, the NFS timeout is a proper pain in the backside. A
little more cleverness there would be appreciated. e.g. can we not just
do lazy umounts by default for NFS (or just e.g. a 5s timeout max on the
regular NFS umount)? Perhaps this isn't possible in the umount loop - or
at least not possible cleanly...
We used to do that in the final umount loop. But that doesn't really
work, since the loop relies on busy file systems returning EBUSY if they
are busy...
Mounts need proper dependencies so that we can unmount them. There's no
way around that really...
There has to be a way around dealing with the NFS case. It's surely
quite common - certainly it's one I have to deal with a lot.

That said, it's maybe something to solve in the NFS code better, as I
often resume from suspend to a different network and only really notice
my old NFS mounts are still there when I try to use a file->open dialog
which just hangs (as does e.g. "df") if you have such stale mounts and
never seems to recover.

Either way, some better way of handling of this really needs to go in
somewhere in the stack!

Col
--
Colin Guthrie
gmane(at)colin.guthr.ie
http://colin.guthr.ie/

Day Job:
Tribalogic Limited http://www.tribalogic.net/
Open Source:
Mageia Contributor http://www.mageia.org/
PulseAudio Hacker http://www.pulseaudio.org/
Trac Hacker http://trac.edgewall.org/
Reindl Harald
2014-01-24 18:45:37 UTC
Permalink
uhm, the below is the result of people using "reply-all" on
lists while the other side filters out duplicates, so the
off-list reply is the one that survives if it was faster

-------- Original Message --------
Subject: Re: [systemd-devel] Allow stop jobs to be killed during shutdown
Date: Fri, 24 Jan 2014 18:59:56 +0100
Post by Lennart Poettering
Post by Reindl Harald
Post by Lennart Poettering
Post by Colin Guthrie
Post by Tom Horsley
Post by Lennart Poettering
However, something like that can never be the default, we need to give
services the chance to shut down cleanly and in the right order.
I didn't ask for any change to any default, I just asked for
users to be able to make the shutdown process proceed when
they have more information than systemd has about the chances
of success of some random stop job.
Without that, what you *will* get is people pulling the
power plug which has a vastly greater chance of screwing up
the system than not waiting for a single stop job.
Perhaps just displaying the timeout would be useful here.
We do that. Michal's "eye of sauron" animation is shown as soon as
something blocks too long, and the name of the unit we are waiting for
is shown.
but there is nothing saying how long the timeout remains
"displaying the timeout" means a value in seconds
That delay is set to 5s
????

the timeout is "TimeoutStopSec" or "TimeoutStopSec"
Post by Lennart Poettering
Oh, and where I wrote "eye of sauron" I meant "cylon"
irrelevant - whatever it is it says "waiting for service xyz"
but it does *not* say how long it waits until it will give up

the interesting value is "TimeoutStopSec-TimeWaiting"
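
a tiny Python sketch of that arithmetic, just to show what such a display could carry. The function names and the wording of the status line are made up for illustration and are not what systemd prints.

```python
def remaining_seconds(timeout_stop_sec, time_waiting):
    """Seconds left before the stop job is given up on (never negative)."""
    return max(0.0, timeout_stop_sec - time_waiting)


def status_line(unit, timeout_stop_sec, time_waiting):
    # Hypothetical wording: show both the time waited so far and the time left.
    left = remaining_seconds(timeout_stop_sec, time_waiting)
    return "waiting for {}: {}s elapsed, giving up in {}s".format(
        unit, int(time_waiting), int(left)
    )


# Example: a 90s stop timeout, 20s already spent waiting.
print(status_line("xyz.service", 90, 20))
```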