[systemd-devel] Seeking Advice on debugging failure to reboot

Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.
Becuase all the logging, network and getty's has been stopped all
we can see is the kernel messages on the console.
What we need to know is why systemd is not completing the reboot.
What advice do you have on how to debug this situation please?
Barry

We have got a bit further along using netconsole to see the kmsg
after systemd turns off syslog.

The obvious difference between a reboot that works and one that
fails is that I see these lines in the good case:

[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot

But this never happens in the bad case.

How do I track down why final.target is not activated?

Barry

Michal Schmidt

2011-10-07 08:48:52 UTC

Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.

Then can you suggest how to reproduce it?

Post by Barry Scott
The obvious difference between a reboot that works and one that
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot

What is printed before these lines? Could you attach the complete log?

Michal

Barry Scott

2011-10-11 10:44:55 UTC

Post by Barry Scott
We can reproducably get an F15 system in a state that it fails to
complete a reboot.

Then can you suggest how to reproduce it?

What is printed before these lines? Could you attach the complete log?

The full log is here:

http://onelanftp.co.uk/bscott/ntb10117.netconsole.txt

The first boot sequence from lines 1-182 works.

The second from 188-1182 has the problem.

If the kernel is at fault then I would expect to see the
"About to execute: /bin/systemctl --force reboot" line
and a failure to reboot.

But given that line is not output I suspect its more
likely to be an issue with systemd not getting to that
point. And question is what stops this happening?

Barry

Michal Schmidt

2011-10-11 11:37:23 UTC

Post by Barry Scott
http://onelanftp.co.uk/bscott/ntb10117.netconsole.txt
The first boot sequence from lines 1-182 works.
The second from 188-1182 has the problem.
If the kernel is at fault then I would expect to see the
"About to execute: /bin/systemctl --force reboot" line
and a failure to reboot.
But given that line is not output I suspect its more
likely to be an issue with systemd not getting to that
point. And question is what stops this happening?

In the second run umount.target is not reached. It is not obvious why.
The log shows only /data, /boot and /media being unmounted in both runs.
There may be a mountpoint that refuses to die. Did you try waiting a few
minutes to see if any timeout kicks in?
It is unfortunate that we only see the messages after syslog is stopped.
Try with "systemd.log_target=console", or just stop rsyslog.service
manually before invoking reboot. This way netconsole should show more of
the events.

Michal

Barry Scott

2011-10-11 15:09:00 UTC

We do not see any timeout.

Post by Michal Schmidt
It is unfortunate that we only see the messages after syslog is stopped.
Try with "systemd.log_target=console", or just stop rsyslog.service
manually before invoking reboot. This way netconsole should show more of
the events.

I have uploaded full netconsole logs with the options you suggested:

http://onelanftp.co.uk/bscott/netconsole.noreboot.txt
http://onelanftp.co.uk/bscott/netconsole.reboot.txt

Here is the output of mount from the same system after power-fail
reboot:

# mount
/proc on /proc type proc (rw,relatime)
/sys on /sys type sysfs (rw,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=509736k,nr_inodes=127434,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
/dev/mapper/NTBgroup-System4 on / type ext4 (rw,noatime,user_xattr,barrier=1,data=ordered)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu type cgroup (rw,nosuid,nodev,noexec,relatime,cpu)
cgroup on /sys/fs/cgroup/cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
systemd-1 on /dev/mqueue type autofs (rw,relatime,fd=23,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /dev/hugepages type autofs (rw,relatime,fd=24,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=25,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /sys/kernel/debug type autofs (rw,relatime,fd=26,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
systemd-1 on /sys/kernel/security type autofs (rw,relatime,fd=28,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
tmpfs on /media type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
/dev/mapper/NTBgroup-Data on /data type ext4 (rw,noatime,user_xattr,barrier=1,data=ordered)
/dev/sda1 on /boot type ext3 (rw,noatime,errors=continue,user_xattr,acl,barrier=0,data=ordered)

Barry

Michal Schmidt

2011-10-11 15:35:00 UTC

Post by Barry Scott
http://onelanftp.co.uk/bscott/netconsole.noreboot.txt
http://onelanftp.co.uk/bscott/netconsole.reboot.txt

A suspicious moment occurring only in the failing case is:

[ 751.743255] swapoff[7788]: swapoff: /dev/dm-1: swapoff failed: Cannot
allocate memory
...
[ 751.830279] systemd[1]: dev-dm\x2d1.swap changed deactivating -> failed

I wonder if this could prevent the reaching of umount.target.

Michal

Barry Scott

2011-10-11 15:41:53 UTC

Post by Barry Scott
http://onelanftp.co.uk/bscott/netconsole.noreboot.txt
http://onelanftp.co.uk/bscott/netconsole.reboot.txt

[ 751.743255] swapoff[7788]: swapoff: /dev/dm-1: swapoff failed: Cannot
allocate memory
...
[ 751.830279] systemd[1]: dev-dm\x2d1.swap changed deactivating -> failed
I wonder if this could prevent the reaching of umount.target.

Aha. I'll work on findout out what is using up all the memory.
Maybe I have a process that has eaten up lots of ram preventing
swapoff from succeeding.

Barry

Barry Scott

2011-10-12 13:46:11 UTC

Thanks to Michal's observation that swapoff failed we have now found the root cause.

swapoff is called while all our production processes are still running.

We would have expected systemd to turn off swap after stopping most if not
all processes and thus freeing up as much memory as possible. swapoff fails
becuase there is not enough ram available to load the pages from swap into.

Is this a systemd problem or an issue with the way we wrote our services and targets?

Barry

Michal Schmidt

2011-10-12 14:27:10 UTC

Post by Barry Scott
Thanks to Michal's observation that swapoff failed we have now found the root cause.
swapoff is called while all our production processes are still running.
We would have expected systemd to turn off swap after stopping most if not
all processes and thus freeing up as much memory as possible. swapoff fails
becuase there is not enough ram available to load the pages from swap into.
Is this a systemd problem or an issue with the way we wrote our services and targets?

Do you use "DefaultDependencies=false" in any of your service unit
files? That could explain the missing ordering.

Michal

Barry Scott

2011-10-12 14:54:59 UTC

Do you use "DefaultDependencies=false" in any of your service unit
files? That could explain the missing ordering.

No, not in any of our files.

What dependency is supposed to cause the swapoff to be after the production
processes are stopped?

Barry

Michal Schmidt

2011-10-12 15:42:38 UTC

Post by Barry Scott
What dependency is supposed to cause the swapoff to be after the production
processes are stopped?

The units' ordering should be something like this:

*.swap Before swap.target Before sysinit.target Before basic.target
Before production.service

This ordering is used for startup. When shutting down, they should get
stopped in the reverse order.

Michal

Barry Scott

2011-10-12 16:54:27 UTC

Post by Barry Scott
What dependency is supposed to cause the swapoff to be after the production
processes are stopped?

*.swap Before swap.target Before sysinit.target Before basic.target
Before production.service
This ordering is used for startup. When shutting down, they should get
stopped in the reverse order.

I think I see what the issue is.

My default target contains:

Requires=basic.target
After=basic.target

But it also contains lines like:

Requires=start-signage.service
After=start-signage.service

This means that start-signage can run in parallel with basic.target.

What I need to do is find all the "production" services and tell them
to run after basic.target. I'll work on that and see where I get.

Barry

Barry Scott

2011-10-13 14:18:46 UTC

Post by Barry Scott
What dependency is supposed to cause the swapoff to be after the production
processes are stopped?

*.swap Before swap.target Before sysinit.target Before basic.target
Before production.service
This ordering is used for startup. When shutting down, they should get
stopped in the reverse order.

I think I see what the issue is.
Requires=basic.target
After=basic.target
Requires=start-signage.service
After=start-signage.service
This means that start-signage can run in parallel with basic.target.
What I need to do is find all the "production" services and tell them
to run after basic.target. I'll work on that and see where I get.

Using "systemctl dump" I can see that my production services all
have the suggested properties:

...
-> Unit daemon-network-manager-controller.service:
Requires: basic.target
After: basic.target
References: basic.target

-> Unit daemon-layout.service:
Requires: basic.target
After: basic.target
References: basic.target
...

Does this point to a bug in systemd-26-8.fc15.i686 reboot code?

Barry

Clyde E. Kunkel

2011-10-06 19:03:24 UTC

Sorry, inadvertently left off the ML.

-------- Original Message --------
Subject: Re: [systemd-devel] Seeking Advice on debugging failure to reboot
Date: Thu, 06 Oct 2011 15:01:30 -0400

...
We have got a bit further along using netconsole to see the kmsg
after systemd turns off syslog.
....
The obvious difference between a reboot that works and one that
..
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot
...
But this never happens in the bad case.
...
How do I track down why final.target is not activated?
....
Barry

FWIW, when I modprobe netconsole in F15, 16, or 17 to send syslog msgs
to a remote machine, I have to hit the reset switch to reboot or power
switch to shutdown. The last msg seen on the offending machine is:

[nnnn:nnnn] ehci_hcd 0000:00:ld.7: PCI INT A disabled

where nnnn:nnnn is the time sequence which varies. The msg is not seen
on the remote logger.

Not bz'd yet, trying to gain more understanding of what is controlling
at this point: systemd or something else.

--
Regards,
OldFart

Lennart Poettering

2011-10-10 20:15:55 UTC

Post by Barry Scott
We have got a bit further along using netconsole to see the kmsg
after systemd turns off syslog.
....
The obvious difference between a reboot that works and one that
..
[ 135.391206] systemd[1]: final.target changed dead -> active
[ 135.391217] systemd[1]: Job final.target/start finished, result=done
[ 135.391249] systemd[1]: About to execute: /bin/systemctl --force reboot

If you get that far it's most likely a kernel problem. There were a
number of issues with some newer boards reported in rhbz. It's
surprisingly hard to reboot/poweroff a machine.

Lennart

--
Lennart Poettering - Red Hat, Inc.

Lennart Poettering

2011-10-10 20:20:46 UTC