Discussion:
[systemd-devel] [Xen-devel] [PATCH v1] core: mount xenfs, ignore proc-xen.mount (#6442, #6662)
Olaf Hering
2017-11-29 15:54:57 UTC
With all of the above, was it considered to check /sys/hypervisor
alongside, or perhaps even in preference to, /proc/xen?
Yes.
/proc/xen/capabilities is the one and only place that indicates a dom0.
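For illustration, the check boils down to roughly the following (a
minimal sketch only, not the actual systemd code; the function name is
made up):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Sketch: treat the domain as dom0 if /proc/xen/capabilities exists and
 * lists "control_d". This only works once xenfs is mounted on /proc/xen. */
static bool looks_like_dom0(void) {
        char line[128];
        bool dom0 = false;
        FILE *f = fopen("/proc/xen/capabilities", "re");

        if (!f)
                return false;   /* absent: domU, or xenfs not mounted yet */

        if (fgets(line, sizeof(line), f) && strstr(line, "control_d"))
                dom0 = true;

        fclose(f);
        return dom0;
}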

Olaf
Jan Beulich
2017-11-29 16:01:40 UTC
Post by Olaf Hering
With all of the above, was it considered to check /sys/hypervisor
alongside, or perhaps even in preference to, /proc/xen?
Yes.
/proc/xen/capabilities is the one and only place that indicates a dom0.
But in the description you talk about detect_vm() - by its name that
doesn't look to care about Dom0, but whether running on top of
_some_ hypervisor.

Jan
Olaf Hering
2017-11-29 16:07:42 UTC
Post by Jan Beulich
But in the description you talk about detect_vm() - by its name that
doesn't look to care about Dom0, but whether running on top of
_some_ hypervisor.
dom0 has to be handled as "no virtualization" because it has access to
all hardware; all services depending on "native" have to run in dom0.

Actually, it is spelled with 'z': ConditionVirtualization=
https://www.freedesktop.org/software/systemd/man/systemd.unit.html
https://www.freedesktop.org/software/systemd/man/systemd-detect-virt.html

Olaf
Jan Beulich
2017-11-29 16:24:55 UTC
Post by Olaf Hering
Post by Jan Beulich
But in the description you talk about detect_vm() - by its name that
doesn't look to care about Dom0, but whether running on top of
_some_ hypervisor.
dom0 has to be handled as "no virtualization" because it has access to
all hardware; all services depending on "native" have to run in dom0.
Ah, I see. But then I still don't see why, at least on a halfway
recent Xen, /sys/hypervisor/properties/features wouldn't have
the information you're after (and even more precise, because
down the road control domain and hardware domain may be
separate entities).
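For reference, such a check might look roughly like this (a sketch only;
it assumes the sysfs file exposes the submap-0 feature bits as a hex
string and that XENFEAT_dom0 is bit 11, as in Xen's public/features.h):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define XENFEAT_DOM0_BIT 11     /* assumed value of XENFEAT_dom0 */

/* Sketch: read /sys/hypervisor/properties/features and test XENFEAT_dom0. */
static bool xen_features_say_dom0(void) {
        char buf[32] = "";
        unsigned long features;
        FILE *f = fopen("/sys/hypervisor/properties/features", "re");

        if (!f)
                return false;           /* not on Xen, or sysfs entry missing */
        if (!fgets(buf, sizeof(buf), f)) {
                fclose(f);
                return false;
        }
        fclose(f);

        features = strtoul(buf, NULL, 16);      /* file contains e.g. "00000700" */
        return features & (1UL << XENFEAT_DOM0_BIT);
}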

Jan
Olaf Hering
2017-11-30 08:23:55 UTC
Post by Jan Beulich
Ah, I see. But then I still don't see why, at least on a halfway
recent Xen, /sys/hypervisor/properties/features wouldn't have
the information you're after (and even more precise, because
down the road control domain and hardware domain may be
separate entities).
Per discussion in https://github.com/systemd/systemd/pull/6662, the
feature bits should not be used for dom0 detection.

Olaf
Jan Beulich
2017-11-30 08:35:45 UTC
Post by Olaf Hering
Post by Jan Beulich
Ah, I see. But then I still don't see why, at least on a halfway
recent Xen, /sys/hypervisor/properties/features wouldn't have
the information you're after (and even more precise, because
down the road control domain and hardware domain may be
separate entities).
Per discussion in https://github.com/systemd/systemd/pull/6662, the
feature bits should not be used for dom0 detection.
I can't seem to interpret that discussion the way you do. In fact
(as I've said before) using the feature flag is more reliable, as it
being set implies this is the hardware domain (rather than the
more fuzzy "control domain" implied by "control_d").

Wei, your comments there from Oct 27 and 30 are what I think
Olaf refers to. Could you clarify this?

Jan
Olaf Hering
2017-12-01 11:04:55 UTC
On Fri, 1 Dec 2017 10:21:46 +0000
In Olaf's case, he cares about knowing whether the domain runs the
controlling toolstack; he doesn't care whether it is the hardware
domain or not, so my conclusion was that using that flag was wrong.
I think this is not entirely accurate. Right now the term "dom0" is
a mix of "has access to host (IO) hardware" and "runs the toolstack".

ConditionVirtualization= today lacks such details as well.
"xen" means domU, and "none" is dom0, simply to handle "dom0" like "native"
so that all services that want access to "host hardware" can start.

One could argue that passing a PCI device to a domU may also require
.service files to manage that PCI device in some way. The specific case
which triggered all the suggested changes was smartd, which is not
supposed to run in "VMs". If a SATA card is provided to a domU it may
be a good idea to monitor the attached disks as well.

So in some way Jan is correct with his suggestion to use XENFEAT_dom0
instead of "control_d". I will do some research about when it became available
and update the patch description.

Olaf
Lennart Poettering
2017-12-01 17:06:37 UTC
Post by Olaf Hering
On Fri, 1 Dec 2017 10:21:46 +0000
In Olaf's case, he cares about knowing whether the domain runs the
controlling toolstack; he doesn't care whether it is the hardware
domain or not, so my conclusion was that using that flag was wrong.
I think this is not entirely accurate. Right now the term "dom0" is
a mix of "has access to host (IO) hardware" and "runs the toolstack".
ConditionVirtualization= today lacks such details as well.
"xen" means domU, and "none" is dom0, simply to handle "dom0" like "native"
so that all services that want access to "host hardware" can start.
One could argue that passing a PCI device to a domU may also require
.service files to manage that PCI device in some way. The specific case
which triggered all the suggested changes was smartd, which is not
supposed to run in "VMs". If a SATA card is provided to a domU it may
be a good idea to monitor the attached disks as well.
So in some way Jan is correct with his suggestion to use XENFEAT_dom0
instead of "control_d". I will do some research about when it became available
and update the patch description.
To make this clear: systemd is only interested in a very high-level
view on things: all it wants to know is whether we are the payload of
some kind of virtualization, or not. We aren't interested in details,
and what kind of virtualization logic Xen precisely deploys is an
implementation detail from our point of view. If services need to know
in full detail what kind of system they run on, then
ConditionVirtualization=/AssertVirtualization= is really not for them,
and they should just run their own code and figure things out on their
own.

I don't care much where precisely the line is drawn, i.e. when a Xen
environment is considered "host" and when "guest"; I'll let you guys
figure that out. Only for the latter should ConditionVirtualization=guest
hold, for the former it should not.

Lennart
--
Lennart Poettering, Red Hat
Jan Beulich
2017-12-01 11:39:30 UTC
Post by Jan Beulich
Post by Olaf Hering
Post by Jan Beulich
Ah, I see. But then I still don't see why, at least on a halfway
recent Xen, /sys/hypervisor/properties/features wouldn't have
the information you're after (and even more precise, because
down the road control domain and hardware domain may be
separate entities).
Per discussion in https://github.com/systemd/systemd/pull/6662, the
feature bits should not be used for dom0 detection.
I can't seem to interpret that discussion the way you do. In fact
(as I've said before) using the feature flag is more reliable, as it
being set implies this is the hardware domain (rather than the
more fuzzy "control domain" implied by "control_d").
Wei, your comments there from Oct 27 and 30 are what I think
Olaf refers to. Could you clarify this?
Judging from the snippet here I don't quite understand what your
disagreement is. You already said control domain and hardware domain
could be separate entities in the future.
The XENFEAT_dom0 flag currently denotes hardware domain. It boils down
to whether we want Dom0 to mean hardware domain, control domain or just
"A domain that has domid 0".
In Olaf's case, he cares about knowing whether the domain runs the
controlling toolstack; he doesn't care whether it is the hardware
domain or not, so my conclusion was that using that flag was wrong.
I understood it exactly the other way around: The question is
whether to treat things the same as without Xen:
"dom0 has to be handled as "no virtualization" because it has access to
all hardware; all services depending on "native" have to run in dom0."
Sounds like hardware domain to me, not control domain.

Jan
Jan Beulich
2017-12-01 12:11:45 UTC
Suppose at one point we split hardware domain and control domain, which
one will you call Dom0? Which one will get the flag?
There can only be one hardware domain, which will continue to
be the one getting XENFEAT_dom0. There could be any number
of control domains (perhaps with some coordination between
them).

Jan
Jan Beulich
2017-12-01 12:23:16 UTC
Post by Jan Beulich
Suppose at one point we split hardware domain and control domain, which
one will you call Dom0? Which one will get the flag?
There can only be one hardware domain, which will continue to
be the one getting XENFEAT_dom0. There could be any number
of control domains (perhaps with some coordination between
them).
Right. So XENFEAT_dom0 is not really what Olaf needs.
Sigh. What does "has access to all the hardware" translate to
for you?

Jan
Olaf Hering
2017-12-01 12:38:48 UTC
On Fri, 1 Dec 2017 12:29:24 +0000
But Olaf needs to know if some of the services like xenconsoled or
xenstored should be started, and if some of the special file systems
should be mounted, right? Those aren't tied to hardware in any way. In my
view that's the responsibility of the toolstack control domain.
No, I did not intend to make use of ConditionVirtualization= in the
xen*.service files in tools/hotplug/Linux/. That variable cannot be used
for this purpose, and the patch would not change that.

In case you refer to the "proc-xen.mount" change from a few days/weeks ago,
this was all about avoiding the error when xenfs becomes an "API filesystem".
With this suggested change the existing "proc-xen.mount" units would not fail
anymore because /proc/xen is added to the ignore list.

Olaf
Olaf Hering
2017-12-01 15:49:01 UTC
What information do you need? For a moment let's skip using the fuzzy
"Dom0" term and try to be precise. Like "I would like to know if that
domain has access to all hardware" or something else.
That depends on the .service files. This is the list from openSUSE Tumbleweed:

dev-hugepages.mount:ConditionVirtualization=!private-users
haveged.service:ConditionVirtualization=!container
hv_fcopy_daemon.service:ConditionVirtualization=microsoft
hv_kvp_daemon.service:ConditionVirtualization=microsoft
hv_vss_daemon.service:ConditionVirtualization=microsoft
irqd.service:ConditionVirtualization=!container
ksm.service:ConditionVirtualization=no
lxcfs.service:ConditionVirtualization=!container
mcelog.service:ConditionVirtualization=false
ntp-wait.service:ConditionVirtualization=!container
ntpd.service:ConditionVirtualization=!container
rng-tools.service:ConditionVirtualization=!container
smartd.service:ConditionVirtualization=false
sys-fs-fuse-connections.mount:ConditionVirtualization=!private-users
systemd-random-seed.service:ConditionVirtualization=!container
systemd-timesyncd.service:ConditionVirtualization=!container
vgauthd.service:ConditionVirtualization=vmware
vmblock-fuse.service:ConditionVirtualization=vmware
vmtoolsd.service:ConditionVirtualization=vmware
xendriverdomain.service:ConditionVirtualization=xen

I think the relevant services are ksm, mcelog and smartd. Each one likely
wants the "hardware domain" rather than the "toolstack domain". IMO what
systemd today sees as "dom0" should be set based on "XENFEAT_dom0".

Olaf
Jan Beulich
2017-12-01 16:52:28 UTC
Post by Jan Beulich
Post by Jan Beulich
Suppose at one point we split hardware domain and control domain, which
one will you call Dom0? Which one will get the flag?
There can only be one hardware domain, which will continue to
be the one getting XENFEAT_dom0. There could be any number
of control domains (perhaps with some coordination between
them).
Right. So XENFEAT_dom0 is not really what Olaf needs.
Sigh. What does "has access to all the hardware" translate to
for you?
That would mean hardware domain.
But Olaf needs to know if some of the services like xenconsoled or
xenstored should be started, and if some of the special file systems
should be mounted, right? Those aren't tied to hardware in any way. In my
view that's the responsibility of the toolstack control domain.
Can you point me to the start of your discussion with Olaf so that I can
check what the disagreement between you and Olaf is about?
The start of the discussion is the root of this thread. Olaf somewhere in
the middle pointed to another discussion which you appear to have been
involved in.

I'm also not sure there's actual disagreement here - I was merely pointing
out that, strictly following what was written in the patch description,
there may not be a need to consult /proc/xen, and hence no need to
mount it early.

Jan
systemd github import bot
2017-11-29 17:03:01 UTC
Patchset imported to github.
To create a pull request, one of the main developers has to initiate one via:
<https://github.com/systemd/systemd/compare/master...systemd-mailing-devs:20171129153842.14353-1-olaf%40aepfle.de>

--
Generated by https://github.com/haraldh/mail2git
Lennart Poettering
2017-11-29 17:48:03 UTC
The detection of ConditionVirtualisation= relies on the presence of
/proc/xen/capabilities. If the file exists and contains the string
"control_d", the running system is a dom0 and VIRTUALIZATION_NONE should
be set. In case /proc/xen exists, or some sysfs files indicate "xen",
VIRTUALIZATION_XEN should be set to indicate the system is a domU.
With an (old) xenlinux based kernel, /proc/xen/capabilities is always
available and the detection described above always works. But with a
pvops based kernel, xenfs must be mounted on /proc/xen to get
"capabilities". Up to now this was done by a proc-xen.mount unit, which
is part of xen.git. Since the mounting happens "late", other units may
be scheduled before "proc-xen.mount". If these other units make use of
"ConditionVirtualisation=", the virtualization detection returns
incorect results. detect_vm() will set VIRTUALIZATION_XEN because "xen"
is found in sysfs. This value will be cached. Once xenfs is mounted,
the next process that runs detect_vm() will get VIRTUALIZATION_NONE.
This misdetection can be fixed by turning xenfs into an "API
filesystem", which is done by adding it to mount_table[]. That way xenfs
will be mounted early, and detect_vm() returns correct results.
But since the "proc-xen.mount" unit still exists, the startup of that
unit fails because it is rejected due to a conflict with an "API
filesystem" mount. This is done in mount_verify(). The unit gets into
state FAILED. Other Xen toolstack related .service files rely on the
existence of the proc-xen.mount unit. If proc-xen.mount fails, the
toolstack fails too.
To stay compatible with existing and upcoming releases of Xen, add
"/proc/xen" to the list of ignored paths and adjust the check for API
mounts.
Now "proc-xen.mount" is silently accepted; it gets into state "start
condition failed" because "ConditionPathExists=!/proc/xen/capabilities
was not met". But all units depending on "proc-xen.mount" start anyway.
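For illustration, the caching that pins the early (wrong) answer looks
roughly like this (a simplified sketch, not the actual systemd code;
names are illustrative):

#include <stdio.h>
#include <string.h>

enum { VIRT_UNSET = -1, VIRT_NONE = 0, VIRT_XEN = 1 };

/* Simplified probe: dom0 is only recognizable via /proc/xen/capabilities.
 * If xenfs is not mounted yet, the sysfs "xen" hint wins and we report domU. */
static int probe_virt(void) {
        char line[128];
        FILE *f = fopen("/proc/xen/capabilities", "re");

        if (f) {
                int r = (fgets(line, sizeof(line), f) && strstr(line, "control_d"))
                        ? VIRT_NONE : VIRT_XEN;
                fclose(f);
                return r;
        }
        return VIRT_XEN;        /* /sys/hypervisor says "xen", capabilities not visible */
}

/* The result is cached: whoever asks first fixes the answer for everyone.
 * If that happens before proc-xen.mount ran, dom0 is misreported as a domU,
 * which is why the patch mounts xenfs early via mount_table[]. */
int detect_vm(void) {
        static int cached = VIRT_UNSET;

        if (cached == VIRT_UNSET)
                cached = probe_virt();
        return cached;
}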
Could you please submit this through github? We much prefer doing
reviews through github these days.

Also S-o-b is a kernel thing, we don't do that on systemd.
---
I'm not 100% sure if the logic change in mount_verify() has undesired effects.
src/core/mount-setup.c | 3 +++
src/core/mount.c | 2 +-
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/src/core/mount-setup.c b/src/core/mount-setup.c
index d5c0fcddc..df23e71b5 100644
--- a/src/core/mount-setup.c
+++ b/src/core/mount-setup.c
@@ -114,6 +114,8 @@ static const MountPoint mount_table[] = {
           cg_is_legacy_wanted, MNT_FATAL|MNT_IN_CONTAINER },
         { "pstore", "/sys/fs/pstore", "pstore", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV,
           NULL, MNT_NONE },
+        { "xenfs", "/proc/xen", "xenfs", NULL, MS_NOSUID|MS_NOEXEC,
+          NULL, MNT_NONE },
 #if ENABLE_EFI
         { "efivarfs", "/sys/firmware/efi/efivars", "efivarfs", NULL, MS_NOSUID|MS_NOEXEC|MS_NODEV,
           is_efi_boot, MNT_NONE },
@@ -126,6 +128,7 @@ static const MountPoint mount_table[] = {
static const char ignore_paths[] =
/* SELinux file systems */
"/sys/fs/selinux\0"
+ "/proc/xen\0"
/* Container bind mounts */
"/proc/sys\0"
"/dev/console\0"
diff --git a/src/core/mount.c b/src/core/mount.c
index b25bb9cb4..5afe7fcd7 100644
--- a/src/core/mount.c
+++ b/src/core/mount.c
@@ -526,7 +526,7 @@ static int mount_verify(Mount *m) {
                 return -EINVAL;
         }
 
-        if (mount_point_is_api(m->where) || mount_point_ignore(m->where)) {
+        if (mount_point_is_api(m->where) && !mount_point_ignore(m->where)) {
                 log_unit_error(UNIT(m), "Cannot create mount unit for API file system %s. Refusing.", m->where);
                 return -EINVAL;
         }
Lennart
--
Lennart Poettering, Red Hat