[lxc-devel] RFC: Device Namespaces

Hi everyone!

We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices with
diverse I/O) and want to share our solution: device namespaces.

Imagine you could run several instances of your favorite mobile OS or other
distributions in isolated containers, each under the impression of having
exclusive access to device drivers; Interact and switch between them within
a blink, no flashing, no reboot.

Device namespaces are an extension to existing Linux kernel namespaces that
brings lightweight virtualization to Linux-based end-user devices,
primarily mobile devices.
Device namespaces introduce a private and virtual namespace for device
drivers to create the illusion for a process group that it interacts
exclusively with a set of drivers. Device namespaces also introduce the
concepts of an ?active? namespace with which a user interacts, vs
?non-active? namespaces that run in the background, and the ability to
switch between them.[2]

We are planning to prepare individual patches to be submitted to the
relevant maintainers and mailing lists. In the meantime, we already want to
share a set of patches on top of the Android goldfish Kernel 3.4 as well as
a user-space demo, so you can see where we are heading and get an overview
of the approach and see how it works.

We are aware that the patches are not ready for submission in their current
state, and we'd highly appreciate any feedback or suggestions which may
come to your mind once you have a look [3]. Of particular interest is to
elaborate a proper userspace API with respect to existing and future
use-cases. To illustrate a simple use-case we also provide a simple
userspace demo for Android [4].

I will be presenting "The Case for Linux Device Namespace" [5] at LinuxCon
North America 2013 [6]. We will also be attending the Containers Track [7]
at LPC 2013 to present the current state of the patches and discuss the
best course to proceed.

We are looking forward to hear from you!

Thanks,

Oren.

1: http://www.cellrox.com/
2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
3: https://github.com/Cellrox/devns-patches
4: https://github.com/Cellrox/devns-demo
5: http://sched.co/1asN1v7
6: http://events.linuxfoundation.org/events/linuxcon-north-america
7: http://www.linuxplumbersconf.org/2013/ocw/events/LPC2013/tracks/153

--
Oren Laadan
Cellrox Ltd.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130822/c8fec9ab/attachment.html>

Michael J Coss

2013-08-26 15:42:42 UTC

Hi everyone!
We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices
with diverse I/O) and want to share our solution: device namespaces.
Imagine you could run several instances of your favorite mobile OS or
other distributions in isolated containers, each under the impression
of having exclusive access to device drivers; Interact and switch
between them within a blink, no flashing, no reboot.
Device namespaces are an extension to existing Linux kernel namespaces
that brings lightweight virtualization to Linux-based end-user
devices, primarily mobile devices.
Device namespaces introduce a private and virtual namespace for device
drivers to create the illusion for a process group that it interacts
exclusively with a set of drivers. Device namespaces also introduce
the concepts of an ?active? namespace with which a user interacts, vs
?non-active? namespaces that run in the background, and the ability to
switch between them.[2]
We are planning to prepare individual patches to be submitted to the
relevant maintainers and mailing lists. In the meantime, we already
want to share a set of patches on top of the Android goldfish Kernel
3.4 as well as a user-space demo, so you can see where we are heading
and get an overview of the approach and see how it works.
We are aware that the patches are not ready for submission in their
current state, and we'd highly appreciate any feedback or suggestions
which may come to your mind once you have a look [3]. Of particular
interest is to elaborate a proper userspace API with respect to
existing and future use-cases. To illustrate a simple use-case we also
provide a simple userspace demo for Android [4].
I will be presenting "The Case for Linux Device Namespace" [5] at
LinuxCon North America 2013 [6]. We will also be attending the
Containers Track [7] at LPC 2013 to present the current state of the
patches and discuss the best course to proceed.
We are looking forward to hear from you!
Thanks,
Oren.

Great news. I have been working on something similar, and will look over
your patch set. Although, one use case that I want is kind of the
reverse of what you're doing; to run an Android container on a Linux
host, as well as just provide device protection to the host from containers.

--
---Michael J Coss

Andy Lutomirski

2013-08-29 19:06:38 UTC

Have you looked at systemd-logind? It seems to do something similar.

Stéphane Graber

2013-09-03 19:35:34 UTC

Have you looked at systemd-logind? It seems to do something similar.

logind can be used to know the list of existing user sessions and which
have console access, it also creates and to some extent manages cgroups
but it doesn't do anything that the device namespace would at the kernel
level.

The main benefit from having a device namespace in the kernel would be
to only get the uevents and device access for devices that are either
owned or shared with the container. Being able to have fake devices
replace some of the standard ones would also be nice to have.
--
St?phane Graber
Ubuntu developer
http://www.ubuntu.com
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 836 bytes
Desc: Digital signature
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130903/25f3d8a1/attachment.pgp>

Michael J Coss

2013-09-25 18:07:31 UTC

I've been looking at this problem for some time to help solve my very
specific use case. In our case we are using containers to provide
individual "desktops" to a number of users. We want the desktop to run
X, and bind and unbind a display, keyboard, mouse to that X server
running in a particular container, and not be able to grab anyone elses
keyboard, mouse or display unless granted specific access to that from
the owern. To that end, I started worked on a udev solution. I
understand that most containers don't/won't run udev. And systemd won't
even start udev if the container doesn't have the mknod capability which
is a kinda odd cookie but I digress.

Currently the kernel effectively broadcasts uevents to all network
namespaces, and this is an issue. I don't want container A to see
container B's events. It should see only what the admin has set for the
policy for that container. This policy should be handled on the host
for the containers in userspace. This deamon can get the events, and
then forward to the appropriate container(s) those events that are
pertinent, and disregard the rest. To accomplish this, I had to change
the broadcast mechanism, and then provide a forwarding mechanism to
specific network namespaces.

Back in the day, that would have been sufficient. Udev running in the
container would have gotten the add event, and created the appropriate
devices and symlinks, and then cleaned up on remove/change events. With
the introduction of devtmpfs, udev no longer actually creates the device
nodes. It just handles links and name changes. So, I'm still left
with needing to create/manage devtmpfs or some other solution. This
leads me down the path of virtualizing devtmpfs. I've been fooling
around with FUSE, to basically mirror the host /dev (filtered
appropriately), but there are many ugly security, and implementation
details that look bad to me. I have been kicking around the notion that
the device cgroup might provide the list of "acceptable" devices, and
construct a filter/view for devtmpfs based on that.

I do have these changes working on a mostly stock 3.10 kernel, the
kernel changes are pretty small, and the deamon does a pretty minimal
filtering mostly to demonstrate functionality. It does assume that the
containers are running in a separate network namespace, but that's about it.

Of course, that still leaves you with sysfs needing similar treatment.

---Michael J Coss

Serge Hallyn

2013-09-25 20:13:56 UTC

Post by Michael J Coss
I've been looking at this problem for some time to help solve my very
specific use case. In our case we are using containers to provide
individual "desktops" to a number of users. We want the desktop to run
X, and bind and unbind a display, keyboard, mouse to that X server
running in a particular container, and not be able to grab anyone elses
keyboard, mouse or display unless granted specific access to that from
the owern. To that end, I started worked on a udev solution. I
understand that most containers don't/won't run udev. And systemd won't
even start udev if the container doesn't have the mknod capability which
is a kinda odd cookie but I digress.
Currently the kernel effectively broadcasts uevents to all network
namespaces, and this is an issue. I don't want container A to see
container B's events. It should see only what the admin has set for the
policy for that container. This policy should be handled on the host
for the containers in userspace. This deamon can get the events, and
then forward to the appropriate container(s) those events that are
pertinent, and disregard the rest. To accomplish this, I had to change
the broadcast mechanism, and then provide a forwarding mechanism to
specific network namespaces.
Back in the day, that would have been sufficient. Udev running in the
container would have gotten the add event, and created the appropriate
devices and symlinks, and then cleaned up on remove/change events. With
the introduction of devtmpfs, udev no longer actually creates the device
nodes. It just handles links and name changes. So, I'm still left
with needing to create/manage devtmpfs or some other solution. This
leads me down the path of virtualizing devtmpfs. I've been fooling
around with FUSE, to basically mirror the host /dev (filtered

Rather than using FUSE, I'd recommend looking into doing it the same
way as the devpts fs. Might not work out (or be rejected) in the end,
but at first glance it seems the right way to handle it. So each new
instance mount starts empty, changes in one are not reflected in
another, but new devices which the kernel later creates may (subject
to device cgroup of the process which mounted it?) be created in the
new instances.

Post by Michael J Coss
appropriately), but there are many ugly security, and implementation
details that look bad to me. I have been kicking around the notion that
the device cgroup might provide the list of "acceptable" devices, and
construct a filter/view for devtmpfs based on that.
I do have these changes working on a mostly stock 3.10 kernel, the
kernel changes are pretty small, and the deamon does a pretty minimal
filtering mostly to demonstrate functionality. It does assume that the
containers are running in a separate network namespace, but that's about it.
Of course, that still leaves you with sysfs needing similar treatment.
---Michael J Coss
------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60133471&iu=/4140/ostg.clktrk
_______________________________________________
Lxc-devel mailing list
Lxc-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

Amir Goldstein

2013-09-29 18:14:40 UTC

I was thinking it makes sense to tie unique instances of devtmpfs sb to
userns.
If not for any other reason, for the fact that any mount sb already has the
knowledge
of the userns that mounted it.
But also, I think devtmpfs needs to be userns friendly, so it can safely
get the FS_USERNS_DEV_MOUNT flag.

it.

Post by Michael J Coss
Of course, that still leaves you with sysfs needing similar treatment.
---Michael J Coss

------------------------------------------------------------------------------

Post by Michael J Coss
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most

from

Post by Michael J Coss
the latest Intel processors and coprocessors. See abstracts and register

http://pubads.g.doubleclick.net/gampad/clk?id=60133471&iu=/4140/ostg.clktrk

Post by Michael J Coss
_______________________________________________
Lxc-devel mailing list
Lxc-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60133471&iu=/4140/ostg.clktrk
_______________________________________________
Lxc-devel mailing list
Lxc-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130929/022bf81a/attachment.html>

Oren Laadan

2013-08-26 10:11:52 UTC

Hi Serge,

Hi everyone!
We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices

with

diverse I/O) and want to share our solution: device namespaces.
Imagine you could run several instances of your favorite mobile OS or

other

distributions in isolated containers, each under the impression of having
exclusive access to device drivers; Interact and switch between them

within

a blink, no flashing, no reboot.
Device namespaces are an extension to existing Linux kernel namespaces

that

brings lightweight virtualization to Linux-based end-user devices,
primarily mobile devices.
Device namespaces introduce a private and virtual namespace for device
drivers to create the illusion for a process group that it interacts
exclusively with a set of drivers. Device namespaces also introduce the
concepts of an ?active? namespace with which a user interacts, vs
?non-active? namespaces that run in the background, and the ability to
switch between them.[2]

To illustrate the need for device namespaces, consider the use case of
running two containers of your favorite OS (say, Android), on a single
physical phone. As a user, you either work in one container, or in the
other, and you will want to be able to switch between them (just like with
apps on mobile devices: you interact with one application at a time, and
switch between them).

See here for a demo of how it works: http://vimeo.com/60113683

To accomplish this, device namespaces solve two shortcomings of existing
namespaces:

1. A namespace for device drivers: each (Android) container needs a
private view of all devices. This includes logical drivers, like binder (in
Android) but also loop device; and physical devices, like the framebuffer
and the touch-screen.

In other words, device namespaces virtualize the _major/minor_ and the
_state_ of device drivers. With the exception of VFS, network, and PTY
(note: all three offer/are virtual devices), device drivers are otherwise
not isolated between containers.

2. A namespace for interactive scenarios: a namespace can be "active" - it
has access to the hardware, e.g. display and touch-screen. This will be the
container with which the user is interacting right now. Otherwise a
namespace is "non-active" - it still runs in the background, but can
neither alter the display nor receive input from the touch-screen.
Switching to another container means a context switch in the relevant
drivers, so that they restore the state and now "obey" the other namespace.

You can also think about the "active" namespace as foreground, and the
"non-active" as background, akin to foreground/background processes in a
terminal with job-control. Similar to how a terminal delivers input to the
foreground task only but not to the background tasks - this is enforced by
the new device namespace.

More details on this use-case are in the wiki:
https://github.com/Cellrox/devns-patches/wiki/Thinvisor).

We are planning to prepare individual patches to be submitted to the
Looking forward to it, and seeing you at the containers track :)

Same here!

2: https://github.com/Cellrox/devns-patches/wiki/DeviceNamespace
3: https://github.com/Cellrox/devns-patches
4: https://github.com/Cellrox/devns-demo

(Have looked over the wiki, will look over the patches as well)
-serge

Thanks,

Oren.

Eric W. Biederman

2013-09-06 17:50:10 UTC

Post by Oren Laadan
Hi Serge,

Hi everyone!
We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices

with

diverse I/O) and want to share our solution: device namespaces.
Imagine you could run several instances of your favorite mobile OS or

other

distributions in isolated containers, each under the impression of having
exclusive access to device drivers; Interact and switch between them

within

a blink, no flashing, no reboot.
Device namespaces are an extension to existing Linux kernel namespaces

that

To illustrate the need for device namespaces, consider the use case of
running two containers of your favorite OS (say, Android), on a single
physical phone. As a user, you either work in one container, or in the
other, and you will want to be able to switch between them (just like with
apps on mobile devices: you interact with one application at a time, and
switch between them).
See here for a demo of how it works: http://vimeo.com/60113683
To accomplish this, device namespaces solve two shortcomings of existing
1. A namespace for device drivers: each (Android) container needs a
private view of all devices. This includes logical drivers, like binder (in
Android) but also loop device; and physical devices, like the framebuffer
and the touch-screen.
In other words, device namespaces virtualize the _major/minor_ and the
_state_ of device drivers. With the exception of VFS, network, and PTY
(note: all three offer/are virtual devices), device drivers are otherwise
not isolated between containers.
2. A namespace for interactive scenarios: a namespace can be "active" - it
has access to the hardware, e.g. display and touch-screen. This will be the
container with which the user is interacting right now. Otherwise a
namespace is "non-active" - it still runs in the background, but can
neither alter the display nor receive input from the touch-screen.
Switching to another container means a context switch in the relevant
drivers, so that they restore the state and now "obey" the other namespace.
You can also think about the "active" namespace as foreground, and the
"non-active" as background, akin to foreground/background processes in a
terminal with job-control. Similar to how a terminal delivers input to the
foreground task only but not to the background tasks - this is enforced by
the new device namespace.
https://github.com/Cellrox/devns-patches/wiki/Thinvisor).

I think this is going to take some talking, and looking at code.

I think you are talking about having wrappers around your devices so you
can share. Which is not the quite same problem the rest of us have been
thinking of when talking about a device namespace.

My first impression is that this is better solved with more appropriate
abstractions in userspace or in the kernel.

But we can talk at LPC and see what we can hash out.

Eric

Amir Goldstein

2013-09-08 12:28:55 UTC

Post by Oren Laadan
Hi Serge,
On Thu, Aug 22, 2013 at 2:21 PM, Serge Hallyn <serge.hallyn at ubuntu.com

Hi everyone!
We [1] have been working on bringing lightweight virtualization to
Linux-based mobile devices like Android (or other Linux-based devices

with

diverse I/O) and want to share our solution: device namespaces.
Imagine you could run several instances of your favorite mobile OS or

other

distributions in isolated containers, each under the impression of

having

Post by Oren Laadan

exclusive access to device drivers; Interact and switch between them

within

a blink, no flashing, no reboot.
Device namespaces are an extension to existing Linux kernel namespaces

that

the

Post by Oren Laadan

concepts of an ?active? namespace with which a user interacts, vs
?non-active? namespaces that run in the background, and the ability to
switch between them.[2]

with

Post by Oren Laadan
apps on mobile devices: you interact with one application at a time, and
switch between them).
See here for a demo of how it works: http://vimeo.com/60113683
To accomplish this, device namespaces solve two shortcomings of existing
1. A namespace for device drivers: each (Android) container needs a
private view of all devices. This includes logical drivers, like binder

(in

Post by Oren Laadan
Android) but also loop device; and physical devices, like the framebuffer
and the touch-screen.
In other words, device namespaces virtualize the _major/minor_ and the
_state_ of device drivers. With the exception of VFS, network, and PTY
(note: all three offer/are virtual devices), device drivers are otherwise
not isolated between containers.
2. A namespace for interactive scenarios: a namespace can be "active" -

Post by Oren Laadan
has access to the hardware, e.g. display and touch-screen. This will be

the

Post by Oren Laadan
container with which the user is interacting right now. Otherwise a
namespace is "non-active" - it still runs in the background, but can
neither alter the display nor receive input from the touch-screen.
Switching to another container means a context switch in the relevant
drivers, so that they restore the state and now "obey" the other

namespace.

Post by Oren Laadan
You can also think about the "active" namespace as foreground, and the
"non-active" as background, akin to foreground/background processes in a
terminal with job-control. Similar to how a terminal delivers input to

the

Post by Oren Laadan
foreground task only but not to the background tasks - this is enforced

Post by Oren Laadan
the new device namespace.
https://github.com/Cellrox/devns-patches/wiki/Thinvisor).

I think this is going to take some talking, and looking at code.

Hi Eric,

If we can get people to take a quick look at the code before LPC
that could make the LPC discussions more effective.
Even looking at one of the subsystem patches can give a basic
idea of the work we have done:
https://github.com/Cellrox/linux/commits/devns-goldfish-3.4

I think you are talking about having wrappers around your devices so you

can share. Which is not the quite same problem the rest of us have been
thinking of when talking about a device namespace.

We are interested in all problems related to virtualizated view of devices
inside a container, so let our work so far be a starting point to discuss
all of them.

My first impression is that this is better solved with more appropriate
abstractions in userspace or in the kernel.
But we can talk at LPC and see what we can hash out.

Looking forward to that :-)

Amir.

Eric
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130908/0e81be5b/attachment.html>

Eric W. Biederman

2013-09-09 00:51:55 UTC

On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
Hi Eric,
If we can get people to take a quick look at the code before LPC
that could make the LPC discussions more effective.
Even looking at one of the subsystem patches can give a basic
https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
I think you are talking about having wrappers around your devices so you
can share. ?Which is not the quite same problem the rest of us
have been
thinking of when talking about a device namespace.
We are interested in all problems related to virtualizated view of devices
inside a container, so let our work so far be a starting point to
discuss all of them.
My first impression is that this is better solved with more appropriate
abstractions in userspace or in the kernel.

As I read your code, you are solving the problem of one opener of a
device among a group of openers being able to access a device at a time.
Which leads to the question why can't the multiplexing happen in
userspace?

I think with your design it would not be possible to play a song in one
device namespace while doing work in the other. As a security model
that isn't wrong but as someone trying to get work done that could be a
real pain.

The more common concern is to have devices we can use all of the time.

There may be a need for a device namespace and multiplexing access to
hardware devices makes that clearer. So far nothing has risen to the
level of we actually need a device namespace to do X. Especially in an
erra of hotplug and dynamic device numbers.

It is arguable that you could do your kind of device multiplexing with a
fuse device in userspace that implements your desired policy.

And policy is where cell situtation seems to fall down because it hard
codes one specific policy into the kernel, and a policy most situations
don't find useful.

Eric

Amir Goldstein

2013-09-10 07:09:31 UTC

On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
Hi Eric,
If we can get people to take a quick look at the code before LPC
that could make the LPC discussions more effective.
Even looking at one of the subsystem patches can give a basic
https://github.com/Cellrox/linux/commits/devns-goldfish-3.4
I think you are talking about having wrappers around your devices so you
can share. Which is not the quite same problem the rest of us have been
thinking of when talking about a device namespace.
We are interested in all problems related to virtualizated view of devices
inside a container, so let our work so far be a starting point to
discuss all of them.
My first impression is that this is better solved with more appropriate
abstractions in userspace or in the kernel.

As a matter of fact, in our multi persona phone, you *can* hear music played
from background persona, but you *cannot* see images drawn from background
persona.

Post by Eric W. Biederman
The more common concern is to have devices we can use all of the time.
There may be a need for a device namespace and multiplexing access to
hardware devices makes that clearer. So far nothing has risen to the
level of we actually need a device namespace to do X. Especially in an
erra of hotplug and dynamic device numbers.
It is arguable that you could do your kind of device multiplexing with a
fuse device in userspace that implements your desired policy.

I agree about it being arguable :-)
We shall present our arguments on LPC.

Post by Eric W. Biederman
And policy is where cell situtation seems to fall down because it hard
codes one specific policy into the kernel, and a policy most situations
don't find useful.

It's true that for our product, we have made hardcoded policy decisions in
our kernel
patches, but that was just as a proof of concept for the technique.

We do envision being able to dynamically assign a device to a specific devns
(e.g. block,loop) keep a device shared between multi devns (e.g. audio)
and in addition to that, being able to multiplex a device between multi
devns (e.g. framebuffer)

Post by Eric W. Biederman
Eric

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130910/c70a7c6a/attachment.html>

Janne Karhunen

2013-09-25 11:05:32 UTC

On Fri, Sep 6, 2013 at 7:50 PM, Eric W. Biederman
Hi Eric,
If we can get people to take a quick look at the code before LPC
that could make the LPC discussions more effective.

Hi,

I think we are curious enough to experiment with Erics idea of
implementing basic 'device namespace' in userspace (never miss an
opportunity to throw away kernel code). Can anyone point out any
obvious reason why this would not work if we consider bulk of the work
being plain access filtering?

That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?

--
Janne

Eric W. Biederman

2013-09-25 20:23:46 UTC

Post by Janne Karhunen
That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?

I think the practical issue with binder was simply that binder only
allows for a single instance and thus is does not play nicely with
containers.

Eric

Eric W. Biederman

2013-09-25 21:47:05 UTC

Post by Janne Karhunen
That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?

I think the practical issue with binder was simply that binder only
allows for a single instance and thus is does not play nicely with
containers.

It's true that there was a singleton paradigm in binder that had to be
overcome, but I agree with Janne. It really belongs in the IPC namespace,
and I don't see any technical reason not to move it there.

*Blink* I missed the IPC namespace suggestion.

The IPC namespace sounds reasonable.

Of course binder is still in staging because it has implementation and
ABI problems. Little things like a 64bit kernel and a 32bit userspace
don't work particularly well. So while fixing those problems it might
be possible to fix the single container problem as well. It would be a
weird direction for cleanup of binder to come from but I don't see why
that wouldn't work.

Personally until binder is out of staging it seems reasonable to push
for an API that sucks less, or for a more general solution that Androdid
could use instead of binder.

One of the uses of namespaces is to clean up after problematic kernel
design decisions. If we still have the option I would rather fix the
problems than clean up after them.

Eric

Amir Goldstein

2013-09-29 17:56:58 UTC

On Thu, Sep 26, 2013 at 12:47 AM, Eric W. Biederman

On Sep 25, 2013, at 4:23 PM, Eric W. Biederman <ebiederm at xmission.com>

Post by Janne Karhunen
That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?

I think the practical issue with binder was simply that binder only
allows for a single instance and thus is does not play nicely with
containers.

*Blink* I missed the IPC namespace suggestion.
The IPC namespace sounds reasonable.

Binder rewrite for IPC namespace is in the works (by Oren)
We discussed this with Greg and adding namespace support to binder (in
staging) seemed reasonable to him as well.

Post by Eric W. Biederman
Of course binder is still in staging because it has implementation and
ABI problems. Little things like a 64bit kernel and a 32bit userspace
don't work particularly well. So while fixing those problems it might
be possible to fix the single container problem as well. It would be a
weird direction for cleanup of binder to come from but I don't see why
that wouldn't work.
Personally until binder is out of staging it seems reasonable to push
for an API that sucks less, or for a more general solution that Androdid
could use instead of binder.
One of the uses of namespaces is to clean up after problematic kernel
design decisions. If we still have the option I would rather fix the
problems than clean up after them.
Eric
_______________________________________________
Containers mailing list
Containers at lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130929/e84399eb/attachment.html>

Jeremy Andrus

2013-09-25 21:17:08 UTC

Post by Janne Karhunen
That being said, is there a valid reason why binder is part of device
namespace here instead of IPC?

I think the practical issue with binder was simply that binder only
allows for a single instance and thus is does not play nicely with
containers.

Eric W. Biederman

2013-09-25 21:34:54 UTC

From conversations at Linux Plumbers Converence it became fairly clear

that one if not the roughest edge on containers today is dealing with
devices.

- Hotplug does not work.
- There seems to be no implementation that does a much beyond creating
setting up a static set of /dev entries today.
- Containers do not see the appropriate uevents for their container.

One of the more compelling cases I heard was of someone who was running
the a Linux Desktop in container and wanted to just let that container
see the devices needed for his desktop, and not everything else.

Talking with the OpenVZ folks it appears that preserving device numbers
across checkpoint/restart is not currently an issue. However they reuse
the same loopback minor number when they can which would hide this
issue. So while it is clear we don't need to worry about migrating
an application that cares about major/minor numbers of filesystems right
now as the set of application that are migrated increases that situation
may change. As the case with the network device ifindex has shown it is
possible to implement filtering now and later when there is a usecase it
is possible to expand filtering to actual namespace local identifiers.

Thinking about it for the case of container migration the simplest
solution for the rare application that needs something more may be to
figure out how to send a kernel hotplug event. Something to think about
when we encounter them.

So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).

After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.

- We can manually manage a tmpfs with device nodes in userspace.
(But that is deprecated functionality in the mainstream kernel).
- We can manually export a subset of sysfs with bind mounts.
(But that feels hacky, and is essentially incompatible with hotplug).
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).
- There is no way to fake netlink uevents for a container to see them.
(The best we could do is replace udev everywhere with something that
listens on a unix domain socket).
- It would be nice to replace the device cgroup with a comprehensive
solution that really works. (Among other things the device cgroup
does not work in terms of struct device the underlying kernel
abstraction for devices).

We must manage sysfs entries as well device nodes because:
- Seeing more than we should has the real potential to confuse
userspace, especially a userspace that replays uevents.
- Some device control must happens through writing to sysfs files and
if we don't remove all root privileges from a container only by
exporting a subset of sysfs to that container can we limit which
sysfs nodes can be written to.

The current kernel tagged sysfs entry support does not look like a good
match for the impelementing device filtering. The common case will
be allowing devices like /dev/zero, and /dev/null that live in
/sys/devices/virtual and are the devices we are most likely to care
about. Those devices need to live in multiple device namespaces so
everyone can use them. Perhaps exclusive assignment will be the more
common paradigm for device namespaces like it is for network devices in
the network namespace but from what little I can of this problem right now I
don't think so.

I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.

A userspace implementation looks like it can only do about 95% of what
is really needed, but at the same time looks like an easy way to
experiment until the problem is sufficiently well understood.

At the end of the day we need to filter the devices a set of userspace
processes can use and be able to change that set of devices dynamically.
All of the rest of the infrastructure for that lives in the kernel, and
keeping all of the infrastructure in one place where it can be
maintained together is likely to be most maintainable. It looks like
the code is just complicated enough and the use cases just boring enough
that spreading the code to perform container device hotplug and
container device filtering between a dozen userspace tools, and a hadful
of userspace device managers will not be particularly managable at the
end of the day.

In summary the situation with device hoptlug and containers sucks today,
and we need to do something. Running a linux desktop in a container is
a reasonably good example use case. Having one standard common
maintainable implementation would be very useful and the most logical
place for that would be in the kernel. For now we should focus on
simple device filtering and hotplug.

Eric

Greg Kroah-Hartman

2013-09-26 05:33:20 UTC

Post by Eric W. Biederman
So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).

As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
anymore, because it was redundant), I think you need to really think
this through better (pci, memory, cpus, etc.) before you do anything in
the kernel.

Post by Eric W. Biederman
After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.
- We can manually manage a tmpfs with device nodes in userspace.
(But that is deprecated functionality in the mainstream kernel).

Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?

And remember, udev doesn't create device nodes anymore...

Post by Eric W. Biederman
- We can manually export a subset of sysfs with bind mounts.
(But that feels hacky, and is essentially incompatible with hotplug).

True.

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Post by Eric W. Biederman
- There is no way to fake netlink uevents for a container to see them.
(The best we could do is replace udev everywhere with something that
listens on a unix domain socket).

You shouldn't need to do this.

Post by Eric W. Biederman
- It would be nice to replace the device cgroup with a comprehensive
solution that really works. (Among other things the device cgroup
does not work in terms of struct device the underlying kernel
abstraction for devices).

I didn't even know there was a device cgroup.

Which means that if there is one, odds are it's useless.

Post by Eric W. Biederman
- Seeing more than we should has the real potential to confuse
userspace, especially a userspace that replays uevents.

You should never replay uevents. If you don't do that, why can't you
see all of sysfs?

Post by Eric W. Biederman
- Some device control must happens through writing to sysfs files and
if we don't remove all root privileges from a container only by
exporting a subset of sysfs to that container can we limit which
sysfs nodes can be written to.

But you have the issue of controlling devices in a "shared" way, which
isn't going to be usable for almost all devices.

Post by Eric W. Biederman
The current kernel tagged sysfs entry support does not look like a good
match for the impelementing device filtering. The common case will
be allowing devices like /dev/zero, and /dev/null that live in
/sys/devices/virtual and are the devices we are most likely to care
about. Those devices need to live in multiple device namespaces so
everyone can use them. Perhaps exclusive assignment will be the more
common paradigm for device namespaces like it is for network devices in
the network namespace but from what little I can of this problem right now I
don't think so.
I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.

I agree, especially as I don't think this will ever work.

Post by Eric W. Biederman
A userspace implementation looks like it can only do about 95% of what
is really needed, but at the same time looks like an easy way to
experiment until the problem is sufficiently well understood.

95% is probably way better than what you have today, and will fit the
needs of almost everyone today, so why not do it?

I'd argue that those last 5% either are custom solutions that never get
merged, or candidates for true virtulization.

Post by Eric W. Biederman
In summary the situation with device hoptlug and containers sucks today,
and we need to do something. Running a linux desktop in a container is
a reasonably good example use case.

No it isn't. I'd argue that this is a horrible use case, one that you
shouldn't do. Why not just use multi-head machines like people do who
really want to do this, relying on user separation? That's a workable
solution that is quite common and works very well today.

Post by Eric W. Biederman
Having one standard common maintainable implementation would be very
useful and the most logical place for that would be in the kernel.
For now we should focus on simple device filtering and hotplug.

Just listen for libudev stuff, don't try to filter them, or ever
"replay" them, that way lies madness, and lots of nasty race conditions
that is guaranteed to break things.

good luck,

greg k-h

Janne Karhunen

2013-09-26 08:25:56 UTC

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

I suppose so, but now you take the assumption that there is no
need for running multiple Linux variants on the same host (say
Ubuntu and Android side by side). Is this something you would
not like to see done?

--
Janne

Greg Kroah-Hartman

2013-09-26 13:56:04 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

You can do that today without any need for device namespaces, so why is
this an issue here?

greg k-h

Janne Karhunen

2013-09-26 17:01:31 UTC

On Thu, Sep 26, 2013 at 4:56 PM, Greg Kroah-Hartman

Post by Janne Karhunen
I suppose so, but now you take the assumption that there is no
need for running multiple Linux variants on the same host (say
Ubuntu and Android side by side). Is this something you would
not like to see done?

You can do that today without any need for device namespaces, so why is
this an issue here?

I think you misunderstood me. I wasn't so much advocating on the
device namespace part, just the issue at hand (device access
filtering based on which ns happens to be 'active'). We are already
trying to do this in userspace, let's see how that goes.

That being said, our wish would be to support any combination of
OS's and frankly, I'd be slightly annoyed to tell the customer that
they can't do two Androids or we magically run out of bits.

--
Janne

Greg Kroah-Hartman

2013-09-26 17:07:57 UTC

Post by Janne Karhunen
That being said, our wish would be to support any combination of
OS's and frankly, I'd be slightly annoyed to tell the customer that
they can't do two Androids or we magically run out of bits.

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)

Janne Karhunen

2013-09-26 17:56:53 UTC

On Thu, Sep 26, 2013 at 8:07 PM, Greg Kroah-Hartman

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)

Only relevant mobile OS's are of interest ;)

--
Janne

Greg Kroah-Hartman

2013-09-30 16:11:17 UTC

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)

No that's not quite the right way to think about it: The correct
statement is only use a hypervisor if you need different kernels. With
Windows, it happens to be true that you need a different kernel for each
different OS version. However; with Linux, thanks to strong ABI
backwards compatibility, you mostly don't. The way OpenVZ works today
is that it installs a modified kernel which can then bring up every
Linux OS in a separate container. Our use case is the hosters that give
you root login to a virtual private server and allow you to upgrade it
on your own. The reason for using a container rather than a hypervisor
is the old density and elasticity one: 3x the density (i.e. 1/3 the
overhead cost to the hoster) and the boot only needs to start at init,
not bring up of virtual hardware and booting a second kernel.

I understand that some people really like the idea of using OpenVZ for
various things like this, but to claim that because of it we need to
hack up the driver core in the kernel into unimaginable pieces is not
necessarily something that I'll agree with.

But all of this is just words, I have yet to see any patches for any of
this, so I'll just wait until that happens before worrying about it...

thanks,

greg k-h

James Bottomley

2013-09-30 16:33:26 UTC

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)

I don't believe I claimed that. In fact, from 3.9 we can already bring
up an OpenVZ containerised system running different versions of Linux
that you can give someone root access to with no kernel modifications
whatsoever. The user space solution currently works for us because
we're handing out server VPSs, so the device configuration is fixed as
we init the container. However, we do have use cases for dynamic
instead of static device configurations, which is why we're
participating in the debate.

Post by Greg Kroah-Hartman
But all of this is just words, I have yet to see any patches for any of
this, so I'll just wait until that happens before worrying about it...

Well, that's because we're still debating what the best approach is. If
you want a historical parallel: the comments you make above (hack up
the ... kernel into unimaginable pieces) is an almost exact mirror of
the comments that were made rejecting the in-kernel Checkpoint/Restore
patches at the 2010 Kernel Summit ... yet we have it fully functional
today in a form that proved acceptable.

James

James Bottomley

2013-09-30 15:37:19 UTC

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)

Michael J Coss

2013-09-26 17:17:54 UTC

As I mentioned in another part of this thread, my use case is deploying
"linux desktops" to users as containers. The goal is to have the
container run unmodified distros, and to be able to run arbitrary code.
A tall order to be sure, and maybe not realistic, but I'm in research so
its good to think big.

To that end, we would like the container to be manage as if it were a
"real" system. This includes udev. I realize that udev no longer
creates devices but uses devtmpfs, but the event notification needs to
be seen for other parts to the system, and for the rules that udev
actually does. In particular, X uses uevents to detect keyboard, mice
and display connections.

But when a new device is added, we need that information to go to only
the appropriate container. Currently, uevents are broadcasted to all
listeners in all network namespaces. I have a set of patches that
restrict the initial broadcast to only the host namespace. The second
part is a user space deamon that applies policy and forwards the message
to the container's udev. But rather than have to run a modified udev,
by allowing an interface for the host to inject a replay of the original
message to the container's udev, we achieve at least part of our goal.

This still leave devtmpfs, and while I do believe that there are user
space solutions, I think a virtualization of that is a better approach.
The policy needs to be driven by the host, but the view of the synthetic
filesystem should be managed by the kernel.

There are a number of other kernel filesystems that are equally
problematic, sysfs, proc, debugfs, etc. Is it really proposed that all
of these be handled in userspace?. We can get some safety by disallows
some mounts, and using readonly, but a unified policy would be nice.

My kernel patch is just to facility the communication to the container
of the appropriate uevents, and the daemon uses libudev to collect,
apply policy, and forward the appropriate events. And I'm working on a
solution for devtmpfs

---Michael J Coss

Janne Karhunen

2013-10-01 06:19:58 UTC

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

--
Janne

Greg Kroah-Hartman

2013-10-01 17:33:42 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

I really wish I had never seen that patch, and I am glad it was
rejected.

Janne Karhunen

2013-10-01 18:23:51 UTC

On Tue, Oct 1, 2013 at 8:33 PM, Greg Kroah-Hartman

Post by Janne Karhunen

Post by Greg Kroah-Hartman
That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

I really wish I had never seen that patch, and I am glad it was
rejected.

Thanks, I agree. Just wanted to point out the reason and
bring up the discussion.

--
Janne

Andy Lutomirski

2013-10-01 17:27:45 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?
IOW, there seems to be little point in containerizing things if you're
just going to punch a privilege hole in the namespace.

FWIW, I think that the capability evolution rules are crap, but
changing them is a can of worms, and enough people seem to thing the
status quo is acceptable that this is unlikely to ever get fixed.

--Andy

Serge E. Hallyn

2013-10-01 17:53:45 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

-serge

Eric W. Biederman

2013-10-01 19:51:36 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

Michael Warfields example of dynamically assigning serial ports to
containers is a pretty good test case. Serial ports are extremely well
known kernel objects who evolution effectively stopped long ago. When
we need it we have ptys to virtual serial ports when we need it, but in
general unprivileged users are safe to directly use a serial port
device.

Glossing over the details. The general problem is some policy exists
outside of the container that deciedes if an when a container gets a
serial port and stuffs it in.

The expectation is that system containers will then run the udev
rules and send the libuevent event.

To make that all work without kernel modifications requires placing
a faux-udev in the container, that listens for a device assignment from
outside the container and then does exactly what udev would have done.

The problems with this that I see are:

- udev is a moving target making it hard to build a faux-udev that will
work everywhere.

- On distro's running systemd and udev integration is sufficiently tight
that I am not certain a faux-udev is possible or will continue to be
possible.

- There are two other widely deployed solutions for managing hotplug
devices besides udev.

So given these difficulties I do not believe that the evolution of linux
device management is done, and that patches to udev, the kernel or both
will be needed. While it would be good for testing and understanding
the problem I don't think a faux-udev will be a long term maintainable
solution.

I also understand the point that we aren't talking patches yet and just
discussing ideas. Right now it is my hope that if we talk this out we
can figure out a general direction that has a hope of working.

Post by Serge E. Hallyn
From where I am standing faking uevents instead of replacing

udev/mdev/whatever looks simpler and more maintainable.

Eric

Serge Hallyn

2013-10-01 20:46:05 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

I thought the suggestion was that udev on the host would be given
container-specific rules, saying "plop this device into /dev/container1/"
(with /dev/container1 being bind-mounted to $container1_rootfs/dev).

-serge

Michael J Coss

2013-10-02 00:20:32 UTC

Post by Serge Hallyn
I thought the suggestion was that udev on the host would be given
container-specific rules, saying "plop this device into /dev/container1/"
(with /dev/container1 being bind-mounted to $container1_rootfs/dev).
-serge

At least for my use case this isn't sufficient. I need to have the
uevents actually propagated to the container. I'm running an Xserver in
the container, and I need the keyboard/mouse/display add/remove to show
up as udev events so X will use the appropriate devices. Further, I
can't have *all* uevents propagated to *all* containers, because X will
want to use all the devices.

Kernel changes are required to stop the broadcast of uevents to all
kernel socket listeners in all namespaces. And a minor addition is
needed to be able to forward a given event to any listeners within a
given namespace. A user space daemon can filter events and forward them
to the appropriate containers.

You still have fix the /dev in the container, and I put a local dev
directory in /etc/lxc/<container> and bind mount to allow my systemd
container to actually run udev, and have a custom /dev directory.

--
---Michael J Coss

Michael H. Warfield

2013-10-01 22:59:48 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

I think that the "given container-specific rules, saying..." thing was
on my chart of options as the one with the big cloudy shaped object in
the lower right corner labeled "and then a miracle occurs".

The basic part is the mapping from /dev into /dev/lxc/container. That
should be doable based on the rules in the host and a basic udev trigger
along with a simple mapping configuration. The "given
container-specific" part becomes a morass if it gets complicated enough.

What I was envisioning was a very simple system of container specific
{match} and {map} objects. If a name or symlink passed to the daemon
from a udev trigger matched a match, then the name and symlinks and
additional maps would be mapped into the appropriate container
subdirectory. That works real well if the container and host udev rules
are congruent.

The tough part is the "container-specific" rules which was the part I
specifically mentioned that I had no clue how to make happen. That's a
non-trivial task if the container is allowed to make arbitrary udev rule
changes based on what they are allowed to receive from the host (and how
do we trigger the changes in the host when a change is made in the
container).

It's easily doable where the container rules are congruent with the host
rules. Where they are orthogonal gets much more complicated. But...
All that being said, I will take the congruent solution as a starting
point (and that will not be an 80% solution - it will be more like a 95%
solution) and we can argue about the corner cases and deltas after that.
Doable, yes, for some value of doable.

I like what Greg was saying about using libudev but I'm totally in the
dark as to how to effectively hook that or if it would even work in the
container. That one is not in my realm.

Post by Serge Hallyn
-serge

Regards,
Mike
--
Michael H. Warfield (AI4NB) | Desk: (404) 236-2807
Senior Researcher - X-Force | Cell: (678) 463-0932
IBM Security Services | mhw at linux.vnet.ibm.com mhw at wittsend.com
6303 Barfield Road | http://www.iss.net/
Atlanta, Georgia 30328 | http://www.wittsend.com/mhw/
| PGP Key: 0x674627FF
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 482 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20131001/e1eda7c7/attachment.pgp>

Eric W. Biederman

2013-10-02 22:55:51 UTC

Post by Eric W. Biederman
Glossing over the details. The general problem is some policy exists
outside of the container that deciedes if an when a container gets a
serial port and stuffs it in.
The expectation is that system containers will then run the udev
rules and send the libuevent event.

That is what I was trying to describe. We still need something that
lets the software in the container know it needs to do something.

I may be blind but right now short of replacing the internal udev, or
modifying the kernel I don't see a solution for letting software in a
container know there is a new device it can use.

Once we get the notification issue sorted out I think we have enough to
bring up a full desktop environment in a container and be able to say we
don't need anything else from devices unless someone discovers that
checkpoint/restart actually needs minor numbers to be preserved.

Eric

Greg Kroah-Hartman

2013-10-01 20:57:18 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

How is udev a moving target? Use libudev and all should be fine, that's
an ABI you can rely on, right?

Or, if you don't like/want udev, use mdev in your container. Or
something else, what does this have to do with the kernel?

Post by Eric W. Biederman
- On distro's running systemd and udev integration is sufficiently tight
that I am not certain a faux-udev is possible or will continue to be
possible.

That's not a kernel issue, that's a "ouch, this is hard, let's give up"
issue.

Or perhaps it is a "maybe I shouldn't even be trying to do this" type
issue... :)

Post by Eric W. Biederman
- There are two other widely deployed solutions for managing hotplug
devices besides udev.

I know of mdev, what's the other one? The hacked-up shell script that
Android uses? Or something else?

Post by Eric W. Biederman
So given these difficulties I do not believe that the evolution of linux
device management is done, and that patches to udev, the kernel or both
will be needed. While it would be good for testing and understanding
the problem I don't think a faux-udev will be a long term maintainable
solution.

You are saying that for some reason you feel helpless with the way
userspace is going, so we have to change the kernel. That's horrible,
and is not going to be a reason I accept to change the kernel, sorry.

Post by Eric W. Biederman
I also understand the point that we aren't talking patches yet and just
discussing ideas. Right now it is my hope that if we talk this out we
can figure out a general direction that has a hope of working.
From where I am standing faking uevents instead of replacing
udev/mdev/whatever looks simpler and more maintainable.

Have you really looked into this? Numerous people, who understand this
code path and userspace issues, have said it is not a good idea at all.

But hey, what do I know...

I still have yet to see a reason why you can't use libudev today for
something like this.

Anyway, I'm done discussing this as it's pointless this early, I'm going
to refrain for any more pithy comments until someone posts some code, as
this is just wasting people's time at the moment.

greg k-h

Eric W. Biederman

2013-10-02 22:45:46 UTC

I think libudev is a solution to a completely different problem. It is
possible I am blind but I just don't see how libudev even attempts to
solve the problem.

The desire is to plop a distro install into a subdirectory. Fire up a
container around it, and let the distro's userspace do it's thing to
manage hotplug events.

devtmpfs can be faked fairly easily.
I don't know about sysfs.

Sending events that say you have hotplugged is the largest practical
problem.

On the minimal side I think the patch below is enough to let us fake up
uevents for the container and make things work. I have heard the words
faking uevents and is a bad thing. But I have not heard a reason or seen
any attempt at explanation. My guess is that we are simply talking
about different problems.

I would like to see someone wire up all of the userspace bits and see
how well hotplug can be made to work before I walk down the path
represented by this patch but it seems reasonable. But I do have
anecdotal reports from someone who walked a similar path that this is
enough to bring up a full desktop system in a container.

Eric

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 7a6c396a263b..46d05783da82 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -38,6 +38,7 @@ extern void netlink_table_ungrab(void);

#define NL_CFG_F_NONROOT_RECV (1 << 0)
#define NL_CFG_F_NONROOT_SEND (1 << 1)
+#define NL_CFG_F_IMPERSONATE_KERN (1 << 2)

/* optional Netlink kernel configuration parameters */
struct netlink_kernel_cfg {
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 52e5abbc41db..f75e34397df8 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -375,9 +375,12 @@ static int uevent_net_init(struct net *net)
struct uevent_sock *ue_sk;
struct netlink_kernel_cfg cfg = {
.groups = 1,
- .flags = NL_CFG_F_NONROOT_RECV,
+ .flags = NL_CFG_F_NONROOT_RECV | NL_CFG_F_IMPERSONATE_KERN,
};

+ if (net->user_ns != &init_user_ns)
+ return 0;
+
ue_sk = kzalloc(sizeof(*ue_sk), GFP_KERNEL);
if (!ue_sk)
return -ENOMEM;
@@ -399,6 +402,9 @@ static void uevent_net_exit(struct net *net)
{
struct uevent_sock *ue_sk;

+ if (net->user_ns != &init_user_ns)
+ return;
+
mutex_lock(&uevent_sock_mutex);
list_for_each_entry(ue_sk, &uevent_sock_list, list) {
if (sock_net(ue_sk->sk) == net)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0c61b59175dc..71863cc465eb 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1252,7 +1252,7 @@ static int netlink_release(struct socket *sock)

skb_queue_purge(&sk->sk_write_queue);

- if (nlk->portid) {
+ if (sk_hashed(sk)) {
struct netlink_notify n = {
.net = sock_net(sk),
.protocol = sk->sk_protocol,
@@ -1409,11 +1409,21 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
return err;
}

- if (nlk->portid) {
+ if (sk_hashed(sk)) {
if (nladdr->nl_pid != nlk->portid)
return -EINVAL;
} else {
- err = nladdr->nl_pid ?
+ bool autobind = nladdr->nl_pid == 0;
+ if (nladdr->nl_pid == 0 && (nladdr->nl_pad == 0xffff)) {
+ if (!(nl_table[sk->sk_protocol].flags & NL_CFG_F_IMPERSONATE_KERN))
+ return -EPERM;
+ if (net->user_ns == &init_user_ns)
+ return -EPERM;
+ if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+ return -EPERM;
+ autobind = false;
+ }
+ err = !autobind ?
netlink_insert(sk, net, nladdr->nl_pid) :
netlink_autobind(sock);
if (err)
@@ -1467,7 +1477,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
if (nladdr->nl_groups && !netlink_capable(sock, NL_CFG_F_NONROOT_SEND))
return -EPERM;

- if (!nlk->portid)
+ if (!sk_hashed(sk))
err = netlink_autobind(sock);

if (err == 0) {
@@ -2228,7 +2238,7 @@ static int netlink_sendmsg(struct kiocb *kiocb, struct socket *sock,
dst_group = nlk->dst_group;
}

- if (!nlk->portid) {
+ if (!sk_hashed(sk)) {
err = netlink_autobind(sock);
if (err)
goto out;

Michael H. Warfield

2013-10-01 22:19:23 UTC

Post by Janne Karhunen
On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

Which reminds me, one potential reason being..
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?

That's exactly what Michael Warfield is suggesting, fwiw.

Actually, I don't necessarily see that as a problem as much as a
necessity. If a container can decide when it gets a serial port or
other device, I would think that would constitute a security issue and
container isolation violation. Restricting what container can have
access to what has to be determined in the host and, once you've drunk
that koolaid, you might as well stuff it in somewhere. Policy has to be
in the host or you will never get the security corner cases right.

Ultimately, it is the host which is in charge of the hardware and is
managing the containers (it can start them up, shut them down, or manage
them) so, at its base level, is is the responsibility of the host to
manage those devices between the containers.

That being said, there is the additional issue of, what does the
container do when we hand it a device and how do we let it know. That's
now classically the issue of udev and formerly hotplug and their
predecessors...

Post by Eric W. Biederman
The expectation is that system containers will then run the udev
rules and send the libuevent event.

Which makes sense. Something along the line of a socket into the
container to send selected events from the user space daemon in the host
would make some sense there.

Post by Eric W. Biederman
To make that all work without kernel modifications requires placing
a faux-udev in the container, that listens for a device assignment from
outside the container and then does exactly what udev would have done.
- udev is a moving target making it hard to build a faux-udev that will
work everywhere.

Well, it is an it isn't. Yeah the rules have been changing (I'm getting
tired of the "deprecated" rule warnings) but I've seen worse, much
worse.

Post by Eric W. Biederman
- On distro's running systemd and udev integration is sufficiently tight
that I am not certain a faux-udev is possible or will continue to be
possible.

Actually, I think that's a non-issue. IIRC, systemd (now) discontinues
its udev operation when it detects it's in a container. That was at the
heart of the entire Fedora 15/16 in a container meltdown with the broken
versions of systemd trying to run udev in the container. What do we do
in place of it? I don't know.

Post by Eric W. Biederman
- There are two other widely deployed solutions for managing hotplug
devices besides udev.
So given these difficulties I do not believe that the evolution of linux
device management is done, and that patches to udev, the kernel or both
will be needed. While it would be good for testing and understanding
the problem I don't think a faux-udev will be a long term maintainable
solution.
I also understand the point that we aren't talking patches yet and just
discussing ideas. Right now it is my hope that if we talk this out we
can figure out a general direction that has a hope of working.
From where I am standing faking uevents instead of replacing
udev/mdev/whatever looks simpler and more maintainable.
Eric

Mike
--
Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 482 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20131001/e48369f5/attachment.pgp>

Janne Karhunen

2013-10-01 18:36:52 UTC

Post by Janne Karhunen
http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

Can't the daemon live outside the container and shuffle stuff in?
IOW, there seems to be little point in containerizing things if you're
just going to punch a privilege hole in the namespace.

Yeah. I will try to experiment just how much can be 'stuffed
in' without effective caps. It certainly would be better this way.

Post by Andy Lutomirski
FWIW, I think that the capability evolution rules are crap, but
changing them is a can of worms, and enough people seem to thing the
status quo is acceptable that this is unlikely to ever get fixed.

I have noted (Casey almost tried to strangle me during the
last security summit for even daring to talk about it).

--
Janne

Andrey Wagin

2013-10-28 23:31:17 UTC

2013/9/26 Eric W. Biederman <ebiederm at xmission.com>

From conversations at Linux Plumbers Converence it became fairly clear
that one if not the roughest edge on containers today is dealing with
devices.
- Hotplug does not work.
- There seems to be no implementation that does a much beyond creating
setting up a static set of /dev entries today.
- Containers do not see the appropriate uevents for their container.
One of the more compelling cases I heard was of someone who was running
the a Linux Desktop in container and wanted to just let that container
see the devices needed for his desktop, and not everything else.

I had experience of implementing this functionality in OpenVZ kernel.
I had requirements to not modify user-space tools, so that
implementations looks as dirty hack, but even hotplug of devices are
workin there.

....

So the big issues for a device namespace to solve are filtering which
devices a container has access to and being able to dynamically change
which devices those are at run time (aka hotplug).
After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.

I would prefer to think a bit more about userspace solution. We can
try to expand udev functionality.

- We can manually manage a tmpfs with device nodes in userspace.
(But that is deprecated functionality in the mainstream kernel).
- We can manually export a subset of sysfs with bind mounts.
(But that feels hacky, and is essentially incompatible with hotplug).
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).
- There is no way to fake netlink uevents for a container to see them.
(The best we could do is replace udev everywhere with something that
listens on a unix domain socket).

or we can teach udev to listens on a unix domain socket.

The host udev listens netlink. When it gets an event about a new
device, it decides for which containers it must be avaliable, does all
required actions and sends events in containers. Probably the protocol
of notifications must be unified for all udev-like services.

- It would be nice to replace the device cgroup with a comprehensive
solution that really works. (Among other things the device cgroup
does not work in terms of struct device the underlying kernel
abstraction for devices).
- Seeing more than we should has the real potential to confuse
userspace, especially a userspace that replays uevents.
- Some device control must happens through writing to sysfs files and
if we don't remove all root privileges from a container only by
exporting a subset of sysfs to that container can we limit which
sysfs nodes can be written to.

Sorry if a following idea will sound crazy. Can we use fuse
filesystems for filtering sysfs and devtmpfs? When a CT mounts sysfs,
it will mount fuse-sysfs, which is implemented by userspace program on
host system.

* This way allows to emulate the behavior of uevent files in
containers, if we will use unix sockets between udev services.
* Probably a userspace daemon will be more flexible and customizable
than something in kernel

Do we have a use case when a perfomance of sysfs is critical?

Thanks,
Andrey

Michael J Coss

2013-10-29 00:30:01 UTC

Post by Andrey Wagin
I had experience of implementing this functionality in OpenVZ kernel.
I had requirements to not modify user-space tools, so that
implementations looks as dirty hack, but even hotplug of devices are
workin there.

Same here. I want the container to run as much as possible unmodified
code. And in my case, that means an unmodified udev as well. Ideally,
I want those, and only those uevents that are destined for that
container to go there. This requires a few kernel modifications, but
nothing massive. I'm working on getting permission from my management
to post the patch set that implements this here.

Post by Andrey Wagin
I would prefer to think a bit more about userspace solution. We can
try to expand udev functionality.

My changes requires a userspace daemon that runs on the host system to
forward messages after applying whatever policy the admin wants or needs
for uevent to containers. In my case, I need mouse, keyboard and
display events to go the the appropriate container. Others might want
serial device plugin events, whatever.
The daemon listens to the same netlink socket, and then writes to a
simple device that forwards the event to the appropriate container
netlink socket. These are read or not by the udev/systemd whatever
running in the container which does whatever is needed. The daemon in
our case, also handles creating the base devices in a host filesystem
that is bound to the containers /dev directory.

Post by Andrey Wagin
or we can teach udev to listens on a unix domain socket. The host udev
listens netlink. When it gets an event about a new device, it decides
for which containers it must be avaliable, does all required actions
and sends events in containers. Probably the protocol of notifications
must be unified for all udev-like services.

These minimal kernel changes + a host daemon fixes "most" of the
problems. There are a few warts..notable sysfs and devtmpfs.

Post by Andrey Wagin
Sorry if a following idea will sound crazy. Can we use fuse
filesystems for filtering sysfs and devtmpfs? When a CT mounts sysfs,
it will mount fuse-sysfs, which is implemented by userspace program on
host system. * This way allows to emulate the behavior of uevent files
in containers, if we will use unix sockets between udev services. *
Probably a userspace daemon will be more flexible and customizable
than something in kernel Do we have a use case when a perfomance of
sysfs is critical?

I started working on a devtmpfs FUSE. And the issues are many. There's
the performance penalty, the security, etc. It looks possible and might
be doable but in the short term for me, it's easier to have a directory
that the host can modify, and bind mount that directory to the
containers /dev, and just don't use devtmpfs in the container. I do
need a way to stop the mounting of "bad" kernel filesystems to prevent
the adversarial container from harming the host.

FUSE might be a better match for sysfs, but you'd need to have a filter
that manages the massive directed graph, and prunes unnecessary thing
from the graph on a per container basis. My group is working on this
right now to see how bad it will be.

Ultimately, I'd really rather have a containerized sysfs and devtmpfs,
but I suspect that there's going to be a lot of push back on doing that
in the kernel.

--
---Michael J Coss

Amir Goldstein

2013-09-29 19:28:55 UTC

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <

Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?

That sounds like a challenge ;-)
Seriously, as Serge correctly noted, it would not be that different from
devpts
if you start from an empty devtmpfs and populate it with devices that are
"added
in the context of that namespace".
The semantics in which devices are "added in the context of a namespace"
is the missing piece of the puzzle.

What we really like to see is a setns() style API that can be used to
add a device in the context of a namespace in either a "shared" or "private"
mode.
This kind of API is a required building block for us to write device drivers
that are namespace aware in a way that userspace will have enough
flexibility
for dynamic configuration.

We are trying to come up with a proposal for that sort of API.
When we have something decent, we shall post it.

Post by Greg Kroah-Hartman
And remember, udev doesn't create device nodes anymore...

Post by Eric W. Biederman
- We can manually export a subset of sysfs with bind mounts.
(But that feels hacky, and is essentially incompatible with hotplug).

True.

Post by Eric W. Biederman
- We can relay a call of /sbin/hotplug from outside of a container
to inside of a container based on policy.
(But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

You shouldn't need to do this.

I didn't even know there was a device cgroup.
Which means that if there is one, odds are it's useless.

Post by Eric W. Biederman
- Seeing more than we should has the real potential to confuse
userspace, especially a userspace that replays uevents.

You should never replay uevents. If you don't do that, why can't you
see all of sysfs?

But you have the issue of controlling devices in a "shared" way, which
isn't going to be usable for almost all devices.

now I

Post by Eric W. Biederman
don't think so.
I definitely think we should hold off on a kernel level implementation
until we really understand the issues and are ready to implement device
namespaces correctly.

I agree, especially as I don't think this will ever work.

95% is probably way better than what you have today, and will fit the
needs of almost everyone today, so why not do it?
I'd argue that those last 5% either are custom solutions that never get
merged, or candidates for true virtulization.

Just listen for libudev stuff, don't try to filter them, or ever
"replay" them, that way lies madness, and lots of nasty race conditions
that is guaranteed to break things.
good luck,
greg k-h

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130929/fed270b8/attachment.html>

Greg Kroah-Hartman

2013-09-29 20:06:20 UTC

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <gregkh at linuxfoundation.org

Post by Eric W. Biederman
After having thought about this for a bit I don't know if a pure
userspace solution is sufficient or actually a good idea.
- We can manually manage a tmpfs with device nodes in userspace.
? (But that is deprecated functionality in the mainstream kernel).

Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?
That sounds like a challenge ;-)
Seriously, as Serge correctly noted, it would not be that different from devpts
if you start from an empty devtmpfs and populate it with devices that are
"added in the context of that namespace". The semantics in which
devices are "added in the context of a namespace" is the missing piece
of the puzzle.

And the fact that these devices are almost all created before userspace
starts up, is a non-trivial "piece of the puzzle" :)

Good luck,

greg k-h

Michael H. Warfield

2013-09-30 15:36:50 UTC

On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman <gregkh at linuxfoundation.org

And the fact that these devices are almost all created before userspace
starts up, is a non-trivial "piece of the puzzle" :)

That's putting it mildly. As I said in the Containers session at Linux
Plumbers, I agree with you (wrt device namespaces), but we do have (a)
problem(s) to solve. The more I've thought on this, the more I agree
with you and that there's got to be a better way.

I'm not going to address the Android use case issues here which Janne
raised (which are very valid), since I've got other fish to fry and I
haven't even begun to look at the complexities of Android in an LXC
container on a non-android host, much less Android on Android or other
on Android. This may have some applicability to the Android case, I
just haven't thought it through yet. Anything on a common kernel should
work and standard distributions seem to be no problem now, but Android
is a rather unique beast, to say the least.

I will disagree with you on one point, though, from that session. When
I mentioned both persistent and dynamic devices, you said they were
mutually exclusive. It may be a difference in semantics or terminology
but I would beg to differ there, so I'll explain that too...

In my "worst case, real world, right now" scenario of the USB sharing
device and multiple USB serial adapters for serial consoles, I have
several different issues that are illustrative of several problems I'm
trying to overcome.

With this sharing device, you get a "/dev/usbshare" HID device on all
the connected hosts which do NOT have the USB bus that's being shared.
The device that has control of the bus does NOT see the /dev/usbshare
device but does see all the USB devices (the serial port adapters
- /dev/ttyUSB* - in this case) which are connected to it.

So, when you switch the sharing from system A to system B, all the
shared serial devices disappear from A and the /dev/usbshare device
appears, while the usbshare device disappears from system B and all the
usb serial devices appear together. Either system may (and do) have
other static usb serial devices attached so the numbering and order
of /dev/ttyUSB* may vary and can even change depending if a host had
been booted with the usb bus shared to it or not.

Ok... That's the "dynamic" devices I was referring to. They come and
go and may have differing names under differing circumstances. Very
real world dynamic.

Now... For consistency, I have udev rules that map those serial devices
to other names, based on their device USB serial numbers. That naming
convention remains persistent on that system as the devices come and go
and remains consistent between the systems with those rules.

So that's my "dynamic" with "persistent" devices. I have persistent
names on dynamic devices. Perhaps I could have chosen my terminology
better but, that's what I was arguing for in that Plumbers session when
I used those terms.

Now, for the complications... If I wish to (and I most certainly do)
divvy up these serial devices between containers, I have several things
which need to be managed.

The /dev/usbshare device needs to be mapped to ALL containers which may
wish to request the shared bus (plus the host). It's generally only a
very momentary device access and collisions would be extremely rare and
non-harmful in any case (two containers both wanting the bus on the same
host - shrug...). It's actually far less confusing and difficult than
merely the collisions and contention between systems, and that's been
easily managable, given the rarity of cross serial console access (the
real world use case).

The /dev/ttyUSB* devices need to be mapped to their specific containers
with or without removing them from the host and possibly allowing for
multiple containers. Device access is easily managed by the device
driver for multiple access (EBUSY) and not a problem. This could be
more complicated if, for example, we were talking about USB drives, loop
devices, or other devices which multiple access, but that's another
layer of complication.

The "persistent" udev symlinks also need to be mapped to the containers.
I think I can do this equally well in the host as the real devices...

Post by Greg Kroah-Hartman
Good luck,

I'm scratching on an idea that started forming just after that session.
I told Serge that "I think I can do it and it will (should) suck less."
Basically, it exploits some of the properties of devtmpfs to accomplish
some of our goals.

You're right about the user space problem. Something needs to manage
the devices in a coherent manner as devices come and go and as
containers come and go in asynchronous manner. In my mind, the only
place for that is in the host. "Non trivial" is a jaw dropping
understatement and I can see where you feel it would be impossible to
manage in applying namespaces to devtmpfs. That leaves the user space
in the host. I can see where it would be intractable in the kernel.

I may get beat mercilessly for suggesting this but, just as with
cgroups, if we create a subdirectory in devtmpfs for subsystem (LXC) and
container, we can then bind mount that subtree off of devtmpfs to the
container and then the host can map and manipulate the device subtree
into the container (even if the container is denied mknod capability).
That leaves the host to manage all the devices, which actually makes a
LOT of sense (to me) since it should be responsible for the devices and
the overall kernel operations. That would be no different than needing
to configure device passthroughs for KVM / VirtualBox / VMware
hypervisors.

Example... In the host I would have something like this...

/dev/lxc/
romulus
remus
gemini
janus

And then bind mount each of those subdirectories
to /var/lib/lxc/${Container}/rootfs/dev directory. Then map the devices
from the host /dev to the container /dev with mknod in the host and
relative symlinks.

That also (I think) helps me deal with some of the (mis)behavior of
systemd where it contains unconfigurable behavior (mounting devtmpfs)
controlled by "magic cookies" (/dev mounted on another major/minor
from / to disable it mounting devtmpfs). I initially recoiled in horror
of the thought of overloading the devtmpfs subtree with container based
subdirectories, devices, and symlinks but the idea grew on me that this
might be better than what we're dealing with now of mounting tmpfs on
the /dev mount point in all theses containers and then having to
populate them just to prevent systemd from creating collisions with
devtmpfs and the resulting violation of the container isolation.

It DOES still leave the problem of dealing with udev rules in the
container and subsidiary device syslinks in the container which may not
correspond to the rules in the host. That's still problem in my mind
(but already present and miniscule to what we would be solving). I
could pattern match everything coming out of udev in a trigger and map
devices and symlinks into the new subtree in the host but I have no way
to manage propagating the rules in the container down into the processor
in the host or a way to trigger those udev rules in the containers.
Suggestions there might be nice (as well as the cat calls). I'm not
sure I have it clear in my head yet how I would deal with bringing up a
container and then mapping all the required existing devices over to it.
That's your user space problem in a nutshell. That's easy to handle
with udev as things come and go but, when the user space comes after and
udev isn't processing triggers, how do I handle the mappings. That's
also non-trivial in my mind.

Device creation would seem to be pretty trivial. Device removal, not so
much. If I create another node on devtmpfs and that major/minor gets
removed, will it also get removed? I also have to remove the symlinks.
The removal process just feels more complicated in my mind.

Greg, I think you are absolutely right, this needs to be managed in user
space and not in kernel space and we do have the tools to do it. I
think I can do some of it in a way that will suck less compared to how
we're (LXC is) doing it now. I'm just not so sure how comprehensive the
solution will be or how well it will work.

I've still got several other takeaways from that session to put a bow on
before really testing this idea further. I really have not fully
fleshed this idea out and it's going to take me some time. There may
also me some other corner cases I haven't considered. And then there's
Android. Sigh...

And maybe I'm just totally off base and crazy. Wouldn't be the first
time, won't be the last time.

Post by Greg Kroah-Hartman
greg k-h

Regards,
Mike
--
Michael H. Warfield (AI4NB) | (770) 985-6132 | mhw at WittsEnd.com
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 482 bytes
Desc: This is a digitally signed message part
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20130930/77e74032/attachment.pgp>

Eric W. Biederman

2013-10-03 00:44:02 UTC

Post by Amir Goldstein
What we really like to see is a setns() style API that can be used to
add a device in the context of a namespace in either a "shared" or
"private" mode.

I think you mean an "ip link set dev FOO netns XXX" style API.

Right now one of the best suggestions on the table is:

mkdir -p /dev/container/X
ln /dev/zero /dev/container/X/zero
ln /dev/null /dev/container/X/null
...

With /dev/container/X mounted on /dev for container X.

Which seems to cover putting a device in a namespace, while allowing
things to still be reasonably managed.

There are a few other variations on that scheme but nothing that says we
must have kernel support or to create any kind of kernel context beyond
which directory the device nodes live in.

Post by Amir Goldstein
This kind of API is a required building block for us to write device
drivers that are namespace aware in a way that userspace will have
enough flexibility for dynamic configuration.
We are trying to come up with a proposal for that sort of API. When
we have something decent, we shall post it.

I really think what you need to write are special drivers that
facilitate your use case.

For the networking stack we wound up adding veth pairs, and macvlan
devices, to handle the common sharing modes.

Outside of your sharing situation I am not seeing any need or any
advantage of creating devices that are modified to be sharable and I am
seeing a lot of disadvantages to implementing things that way. The
biggest is that you seem to working independent of the subsystem
maintainers of those devices which is generally a poor idea.

Unprivileged creation of device nodes we can handle if it can be shown
that it is safe to create device nodes.

As I understand your problem you are trying to multiplex a device by
building a device with a built in stop light. Where one opener can
write and the other openers are stopped/dropped. That sounds very
similar to macvlan, or ethernet bridging. From the patches you have
floated I suspect it would be very simple to build and just need a
little bit of glue.

Eric

Eric W. Biederman

2013-10-03 00:59:36 UTC

I really think what you need to write are special drivers that
facilitate your use case.

Even more practically if you can write special drivers it removes a
level of policy from the kernel, and allows those special drivers to
use at other times for other occassions.

Eric

Amir Goldstein

2013-10-03 08:58:39 UTC

Post by Amir Goldstein
What we really like to see is a setns() style API that can be used to
add a device in the context of a namespace in either a "shared" or
"private" mode.

I think you mean an "ip link set dev FOO netns XXX" style API.

correct.

Post by Eric W. Biederman
mkdir -p /dev/container/X
ln /dev/zero /dev/container/X/zero
ln /dev/null /dev/container/X/null
...
With /dev/container/X mounted on /dev for container X.
Which seems to cover putting a device in a namespace, while allowing
things to still be reasonably managed.
There are a few other variations on that scheme but nothing that says we
must have kernel support or to create any kind of kernel context beyond
which directory the device nodes live in.

I really think what you need to write are special drivers that
facilitate your use case.
For the networking stack we wound up adding veth pairs, and macvlan
devices, to handle the common sharing modes.
Outside of your sharing situation I am not seeing any need or any
advantage of creating devices that are modified to be sharable and I am
seeing a lot of disadvantages to implementing things that way. The
biggest is that you seem to working independent of the subsystem
maintainers of those devices which is generally a poor idea.
Unprivileged creation of device nodes we can handle if it can be shown
that it is safe to create device nodes.
As I understand your problem you are trying to multiplex a device by
building a device with a built in stop light. Where one opener can
write and the other openers are stopped/dropped. That sounds very
similar to macvlan, or ethernet bridging. From the patches you have
floated I suspect it would be very simple to build and just need a
little bit of glue.

Excellent! let's focus the discussion on a new device driver we want to
write
which is namespace aware. let's call this device driver valarm-dev.
Similarly to Android's alarm-dev, valarm-dev can be used to request RTC
wakeup calls
from user space and get/set RTC values, but with valarm-dev, every container
may use different values for current time.

As you can see in our patch set, we already have a version of alarm-dev
that maintains
its state inside a context, instead of in global variable, so it is capable
of providing
different context per namespace.

And now for the 1M$ question: per *which* namespace do we attribute the
current realtime clock time?
To UTS namespace (because T historically stands for Time)? To device
namespace?
Even if device namespace would exist, we do not want to tie the policy
decision of "separate time"
to a very wide definition of "separate devices".

So what we want to create, is an API for device driver writers, that will
enable to write a namespace
aware device and allow userspace to configure when the namespace aware
device context is unshared.

We would like to share with you our very initial thoughts about how this
will be implemented:
- Extend register_pernet_subsys/device(ops) API
to register_perns_subsys/device(nstype, ops) API
- Extend pernet_operations to perns_operations that include optional
migrate() and/or unshare() ops
- Let valarm-dev register_peruser_subsys/device(&alarm_userns_ops)
- Implement a new syscall (or netlink command if it makes more sense)
setdevns(int dev_fd, int ns_fd, int nstype, int flags)
- Unlike the netlink set netns case, this API is not used solely to *move*
a device to a different namespace,
but also to *unshare* a device context between namespaces, for those
devices that resigtered unshare() ops.

This is our missing piece of the puzzle.
After that, whether we make changes to existing drivers (e.g. evdev) or
write new virtualized drivers (e.g. vevdev)
is a technicality. We care not which way to go, whichever way seems more
maintainable.

What do you think of this master plan?

P.S. Please try to refrain from addressing the validity of the use case of
alarm-dev in particular,
as we do not wish to get engage "Android sucks" wars.
We simply want to present the case for improving the namespace
infrastructure to cater the needs
of device driver writers that wish to tailor their drivers for containers
based products.

Cheers,
Amir.

Post by Eric W. Biederman
Eric

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxcontainers.org/pipermail/lxc-devel/attachments/20131003/ce3b25ce/attachment.html>

Eric W. Biederman

2013-10-03 09:17:17 UTC