Discussion:
[systemd-devel] Seeking Advice on debugging failure to reboot
Barry Scott
2011-10-06 14:56:43 UTC
Permalink
We can reproducably get an F15 system in a state that it fails to
complete a reboot.

Becuase all the logging, network and getty's has been stopped all
we can see is the kernel messages on the console.

What we need to know is why systemd is not completing the reboot.

What advice do you have on how to debug this situation please?

Barry
Barry Scott
2011-10-06 17:26:21 UTC
Permalink
Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.
Becuase all the logging, network and getty's has been stopped all
we can see is the kernel messages on the console.
What we need to know is why systemd is not completing the reboot.
What advice do you have on how to debug this situation please?
Barry
We have got a bit further along using netconsole to see the kmsg
after systemd turns off syslog.







The obvious difference between a reboot that works and one that
fails is that I see these lines in the good case:







[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot







But this never happens in the bad case.







How do I track down why final.target is not activated?







Barry
Michal Schmidt
2011-10-07 08:48:52 UTC
Permalink
Post by Barry Scott
Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.
Then can you suggest how to reproduce it?
Post by Barry Scott
The obvious difference between a reboot that works and one that
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot
What is printed before these lines? Could you attach the complete log?

Michal
Barry Scott
2011-10-11 10:44:55 UTC
Permalink
Post by Michal Schmidt
Post by Barry Scott
Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.
Then can you suggest how to reproduce it?
Post by Barry Scott
The obvious difference between a reboot that works and one that
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot
What is printed before these lines? Could you attach the complete log?
The full log is here:

http://onelanftp.co.uk/bscott/ntb10117.netconsole.txt

The first boot sequence from lines 1-182 works.

The second from 188-1182 has the problem.

If the kernel is at fault then I would expect to see the
"About to execute: /bin/systemctl --force reboot" line
and a failure to reboot.

But given that line is not output I suspect its more
likely to be an issue with systemd not getting to that
point. And question is what stops this happening?

Barry
Michal Schmidt
2011-10-11 11:37:23 UTC
Permalink
Post by Barry Scott
http://onelanftp.co.uk/bscott/ntb10117.netconsole.txt
The first boot sequence from lines 1-182 works.
The second from 188-1182 has the problem.
If the kernel is at fault then I would expect to see the
"About to execute: /bin/systemctl --force reboot" line
and a failure to reboot.
But given that line is not output I suspect its more
likely to be an issue with systemd not getting to that
point. And question is what stops this happening?
In the second run umount.target is not reached. It is not obvious why.
The log shows only /data, /boot and /media being unmounted in both runs.
There may be a mountpoint that refuses to die. Did you try waiting a few
minutes to see if any timeout kicks in?
It is unfortunate that we only see the messages after syslog is stopped.
Try with "systemd.log_target=console", or just stop rsyslog.service
manually before invoking reboot. This way netconsole should show more of
the events.

Michal
Barry Scott
2011-10-11 15:09:00 UTC
Permalink
Post by Michal Schmidt
Post by Barry Scott
http://onelanftp.co.uk/bscott/ntb10117.netconsole.txt
The first boot sequence from lines 1-182 works.
The second from 188-1182 has the problem.
If the kernel is at fault then I would expect to see the
"About to execute: /bin/systemctl --force reboot" line
and a failure to reboot.
But given that line is not output I suspect its more
likely to be an issue with systemd not getting to that
point. And question is what stops this happening?
In the second run umount.target is not reached. It is not obvious why.
The log shows only /data, /boot and /media being unmounted in both runs.
There may be a mountpoint that refuses to die. Did you try waiting a few
minutes to see if any timeout kicks in?
We do not see any timeout.
Post by Michal Schmidt
It is unfortunate that we only see the messages after syslog is stopped.
Try with "systemd.log_target=console", or just stop rsyslog.service
manually before invoking reboot. This way netconsole should show more of
the events.
I have uploaded full netconsole logs with the options you suggested:

http://onelanftp.co.uk/bscott/netconsole.noreboot.txt
http://onelanftp.co.uk/bscott/netconsole.reboot.txt

Here is the output of mount from the same system after power-fail
reboot:

# mount
/proc on /proc type proc (rw,relatime)
/sys on /sys type sysfs (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=509736k,nr_inodes=127434,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
/dev/mapper/NTBgroup-System4 on / type ext4 (rw,noatime,user_xattr,barrier=1,data=ordered)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
systemd-1 on /dev/mqueue type autofs (rw,relatime,fd=23,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /dev/hugepages type autofs (rw,relatime,fd=24,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=25,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /sys/kernel/debug type autofs (rw,relatime,fd=26,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /sys/kernel/security type autofs (rw,relatime,fd=28,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
tmpfs on /media type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
/dev/mapper/NTBgroup-Data on /data type ext4 (rw,noatime,user_xattr,barrier=1,data=ordered)
/dev/sda1 on /boot type ext3 (rw,noatime,errors=continue,user_xattr,acl,barrier=0,data=ordered)

Barry
Michal Schmidt
2011-10-11 15:35:00 UTC
Permalink
Post by Barry Scott
http://onelanftp.co.uk/bscott/netconsole.noreboot.txt
http://onelanftp.co.uk/bscott/netconsole.reboot.txt
A suspicious moment occurring only in the failing case is:

[ 751.743255] swapoff[7788]: swapoff: /dev/dm-1: swapoff failed: Cannot
allocate memory
...
[ 751.830279] systemd[1]: dev-dm\x2d1.swap changed deactivating -> failed

I wonder if this could prevent the reaching of umount.target.

Michal
Barry Scott
2011-10-11 15:41:53 UTC
Permalink
Post by Michal Schmidt
Post by Barry Scott
http://onelanftp.co.uk/bscott/netconsole.noreboot.txt
http://onelanftp.co.uk/bscott/netconsole.reboot.txt
[ 751.743255] swapoff[7788]: swapoff: /dev/dm-1: swapoff failed: Cannot
allocate memory
...
[ 751.830279] systemd[1]: dev-dm\x2d1.swap changed deactivating -> failed
I wonder if this could prevent the reaching of umount.target.
Aha. I'll work on findout out what is using up all the memory.
Maybe I have a process that has eaten up lots of ram preventing
swapoff from succeeding.

Barry
Barry Scott
2011-10-12 13:46:11 UTC
Permalink
Thanks to Michal's observation that swapoff failed we have now found the root cause.

swapoff is called while all our production processes are still running.

We would have expected systemd to turn off swap after stopping most if not
all processes and thus freeing up as much memory as possible. swapoff fails
becuase there is not enough ram available to load the pages from swap into.

Is this a systemd problem or an issue with the way we wrote our services and targets?

Barry
Michal Schmidt
2011-10-12 14:27:10 UTC
Permalink
Post by Barry Scott
Thanks to Michal's observation that swapoff failed we have now found the root cause.
swapoff is called while all our production processes are still running.
We would have expected systemd to turn off swap after stopping most if not
all processes and thus freeing up as much memory as possible. swapoff fails
becuase there is not enough ram available to load the pages from swap into.
Is this a systemd problem or an issue with the way we wrote our services and targets?
Do you use "DefaultDependencies=false" in any of your service unit
files? That could explain the missing ordering.

Michal
Barry Scott
2011-10-12 14:54:59 UTC
Permalink
Post by Michal Schmidt
Post by Barry Scott
Thanks to Michal's observation that swapoff failed we have now found the root cause.
swapoff is called while all our production processes are still running.
We would have expected systemd to turn off swap after stopping most if not
all processes and thus freeing up as much memory as possible. swapoff fails
becuase there is not enough ram available to load the pages from swap into.
Is this a systemd problem or an issue with the way we wrote our services and targets?
Do you use "DefaultDependencies=false" in any of your service unit
files? That could explain the missing ordering.
No, not in any of our files.

What dependency is supposed to cause the swapoff to be after the production
processes are stopped?

Barry
Michal Schmidt
2011-10-12 15:42:38 UTC
Permalink
Post by Barry Scott
What dependency is supposed to cause the swapoff to be after the production
processes are stopped?
The units' ordering should be something like this:

*.swap Before swap.target Before sysinit.target Before basic.target
Before production.service

This ordering is used for startup. When shutting down, they should get
stopped in the reverse order.

Michal
Barry Scott
2011-10-12 16:54:27 UTC
Permalink
Post by Michal Schmidt
Post by Barry Scott
What dependency is supposed to cause the swapoff to be after the production
processes are stopped?
*.swap Before swap.target Before sysinit.target Before basic.target
Before production.service
This ordering is used for startup. When shutting down, they should get
stopped in the reverse order.
I think I see what the issue is.

My default target contains:

Requires=basic.target
After=basic.target

But it also contains lines like:

Requires=start-signage.service
After=start-signage.service

This means that start-signage can run in parallel with basic.target.

What I need to do is find all the "production" services and tell them
to run after basic.target. I'll work on that and see where I get.

Barry
Barry Scott
2011-10-13 14:18:46 UTC
Permalink
Post by Barry Scott
Post by Michal Schmidt
Post by Barry Scott
What dependency is supposed to cause the swapoff to be after the production
processes are stopped?
*.swap Before swap.target Before sysinit.target Before basic.target
Before production.service
This ordering is used for startup. When shutting down, they should get
stopped in the reverse order.
I think I see what the issue is.
Requires=basic.target
After=basic.target
Requires=start-signage.service
After=start-signage.service
This means that start-signage can run in parallel with basic.target.
What I need to do is find all the "production" services and tell them
to run after basic.target. I'll work on that and see where I get.
Using "systemctl dump" I can see that my production services all
have the suggested properties:

...
-> Unit daemon-network-manager-controller.service:
Requires: basic.target
After: basic.target
References: basic.target

-> Unit daemon-layout.service:
Requires: basic.target
After: basic.target
References: basic.target
...

Does this point to a bug in systemd-26-8.fc15.i686 reboot code?

Barry

Clyde E. Kunkel
2011-10-06 19:03:24 UTC
Permalink
Sorry, inadvertently left off the ML.

-------- Original Message --------
Subject: Re: [systemd-devel] Seeking Advice on debugging failure to reboot
Date: Thu, 06 Oct 2011 15:01:30 -0400
Post by Barry Scott
Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.
Becuase all the logging, network and getty's has been stopped all
we can see is the kernel messages on the console.
What we need to know is why systemd is not completing the reboot.
What advice do you have on how to debug this situation please?
Barry
...
We have got a bit further along using netconsole to see the kmsg
after systemd turns off syslog.
....
The obvious difference between a reboot that works and one that
..
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot
...
But this never happens in the bad case.
...
How do I track down why final.target is not activated?
....
Barry
FWIW, when I modprobe netconsole in F15, 16, or 17 to send syslog msgs
to a remote machine, I have to hit the reset switch to reboot or power
switch to shutdown. The last msg seen on the offending machine is:

[nnnn:nnnn] ehci_hcd 0000:00:ld.7: PCI INT A disabled

where nnnn:nnnn is the time sequence which varies. The msg is not seen
on the remote logger.

Not bz'd yet, trying to gain more understanding of what is controlling
at this point: systemd or something else.
--
Regards,
OldFart
Lennart Poettering
2011-10-10 20:15:55 UTC
Permalink
Post by Barry Scott
We have got a bit further along using netconsole to see the kmsg
after systemd turns off syslog.
....
The obvious difference between a reboot that works and one that
..
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot
If you get that far it's most likely a kernel problem. There were a
number of issues with some newer boards reported in rhbz. It's
surprisingly hard to reboot/poweroff a machine.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Lennart Poettering
2011-10-10 20:20:46 UTC
Permalink
Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.
Becuase all the logging, network and getty's has been stopped all
we can see is the kernel messages on the console.
What we need to know is why systemd is not completing the reboot.
What advice do you have on how to debug this situation please?
My usual recommendation is to boot with "systemd.log_level=debug
systemd.log_target=kmsg" on the kernel cmdline.

Then, place a script like the following in /lib/systemd/system-shutdown
and mark it +x:

#!/bin/bash
mount / -o rw,remount
dmesg > /dmesg.shutdown
mount / -o ro,remount
exit 0

Then, when you shutdown you'll have the full kernel log buffer in
/dmesg.shutdown, right before the machine powers off/reboots. If you get
that far however, it's usually a good indication that the kernel is
borked, rather than systemd.

Lennart
--
Lennart Poettering - Red Hat, Inc.
Loading...