[fuse-devel] device namespaces

Discussion:

riya khanna

2014-09-24 04:34:46 UTC

(Please pardon multiple emails, artifact of merging all separate
conversations)

Thanks for your feedback!

Letting the kernel know about what devices a container could access (based
on device cgroups) and having devtmpfs in the kernel create device nodes
for a container that map to corresponding CUSE nodes is what I thought of.
For example, "echo 29:0 > /proc/<pid>/devices" would prepare a virtual
framebuffer (based on real fb0 SCREENINFO properties) for this process
provided permissions allow this operation. To view the framebuffer, the
CUSE based virtual device would talk to the actual hardware. Since
namespaces would have a different view of the underlying devices, "sysfs" has
to be made aware of this as well.
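
To illustrate the "29:0" notation: the string is a major:minor pair, and character major 29 is the Linux framebuffer. Below is a minimal Python sketch of how such a request string could be parsed and validated. The helper name parse_devnum is hypothetical, and /proc/<pid>/devices is only the interface proposed in this message, not an existing kernel file, so the sketch does not touch it.

```python
import os

def parse_devnum(text):
    """Parse a 'MAJOR:MINOR' string into (major, minor, dev_t).

    Hypothetical helper for illustration only; it does not write to any
    kernel interface.
    """
    major_s, sep, minor_s = text.strip().partition(":")
    if not sep or not major_s.isdigit() or not minor_s.isdigit():
        raise ValueError("expected MAJOR:MINOR, got %r" % text)
    major, minor = int(major_s), int(minor_s)
    # os.makedev combines the pair into the kernel's dev_t encoding.
    return major, minor, os.makedev(major, minor)

major, minor, dev = parse_devnum("29:0")
print(major, minor, os.major(dev), os.minor(dev))
```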

Please let me know your inputs. Thanks again!

-Riya

Hi,
I'm a newbie trying to come up with a fuse/cuse-based solution to
device namespace virtualization.

The cuse or fuse process would best run with the permissions of the
container. Even for an unprivileged container it could connect to
bind-mounts of say /dev/null etc for any passthrough access.

Fwiw I find the thought of allowing use of cuse from a container
(well, an unprivileged container at least) more than a little bit
frightening from a security perspective. If a process does an ioctl on
a cuse-based device then the process implementing the device can get a
very broad ability to read and write in the initiator's address space.
If the device were to show up automagically in devtmpfs and a process
on the host could be tricked into opening the device, then that sounds
like a great vector for an attack. Just something to keep in mind.

Yup. You'd like to think that having the devices be owned by uid
100000 would be a clue, but a script might not notice. The fs should
only be mounted in the container's fs, but that can of course be
reached through /proc/pid/root. Now an unpriv user shouldn't be able to
chroot into there without starting a new user namespace - leaving the
victim no longer privileged and so no more harmful than the user was to
begin with.

I don't think it matters if the user is unprivileged if you're using
cuse to implement the devices. In order for it to work the unprivileged
user would need read/write access to /dev/cuse, and once it has that
there seem to be no restrictions on what cuse functionality it can
make use of.

When the user creates a device, cuse calls device_add() for the new
device, which is going to create a node in devtmpfs which is owned by
global root. At that point I see nothing that would stop a process in
the host from opening the file and doing ioctls. It looks like it would
even be possible to use cuse to claim a well-known major/minor pair for
your device if it wasn't already claimed (e.g. the driver was a module
and not loaded).
I didn't spend a lot of time looking at the code, so it's possible I
missed something, but if I didn't then giving unprivileged users access
to /dev/cuse seems like a very bad idea.

Ok, agreed. The original author mainly mentioned fuse. I thought fuse
couldn't create device nodes though.

Yeah, but since he did mention cuse I thought I'd throw out a warning.
With fuse it is technically possible to have device nodes, but it's
usually prevented for unprivileged users by the suid helper (fusermount)
adding MS_NODEV to the mountflags. With my patches for fuse in user
namespaces the kernel will add nodev for any userns mount, and from a
security perspective I don't see any way around that.
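
As a rough way to see the nodev behaviour described here, the following Python sketch checks whether a mount carries the nodev option, either by asking the kernel through statvfs or by scanning /proc/mounts-style text. The function names and the sample mount line are assumptions made for illustration; they are not taken from fusermount or from the patches mentioned.

```python
import os

def is_nodev(path):
    """True if the filesystem containing `path` is mounted with nodev."""
    return bool(os.statvfs(path).f_flag & os.ST_NODEV)

def nodev_from_mounts(mounts_text, mountpoint):
    """Scan /proc/mounts-style text for `mountpoint`.

    Returns True/False for the nodev option, or None if the mount point
    is not listed. Later lines override earlier ones, since a later
    mount on the same path sits on top.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            result = "nodev" in fields[3].split(",")
    return result

# Made-up sample line in /proc/mounts format.
sample = "examplefs /mnt/example fuse.examplefs rw,nosuid,nodev,relatime 0 0\n"
print(nodev_from_mounts(sample, "/mnt/example"))
```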
Seth
_______________________________________________
lxc-devel mailing list
http://lists.linuxcontainers.org/listinfo/lxc-devel

Eric W. Biederman

2014-09-24 05:04:30 UTC

The solution hugely depends on what you are trying to do with it.

The situation today is that device nodes are slowly fading out. In
another 20 years Linux may not have any device nodes at all.

Therefore the question becomes what are you trying to support.

If it is just filtering of existing device nodes, we can do a pretty
good approximation with bind mounts.
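
A minimal sketch of that bind-mount filtering, in Python: it only builds the list of commands that would expose a chosen set of host device nodes inside a container root, and does not run them. The function name, the rootfs path and the choice of nodes are all hypothetical, and the plan assumes <rootfs>/dev already exists and that the commands would be run with sufficient privilege.

```python
import os

def bind_mount_plan(rootfs, allowed_nodes):
    """Return the commands that would bind-mount each allowed host
    device node into `rootfs`. Nothing is executed here."""
    plan = []
    for node in allowed_nodes:
        target = os.path.join(rootfs, node.lstrip("/"))
        # A bind mount needs an existing target, so create an empty file.
        plan.append(["touch", target])
        plan.append(["mount", "--bind", node, target])
    return plan

for cmd in bind_mount_plan("/var/lib/example/rootfs", ["/dev/null", "/dev/zero"]):
    print(" ".join(cmd))
```

Devices that are left out of the list simply never appear in the container's /dev, which is the filtering effect.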

If you want to emulate a device you can use normal fuse (not cuse),
as a normal fuse file will support arbitrary ioctls.

There are a few cases where it is desirable to emulate what devpts
does for allowing arbitrary users to create virtual devices in the
kernel, loop devices in particular.

Ultimately, given the existence of device hotplug, I don't see any call
for being able to create device nodes with well-known device numbers
(fundamentally what a device namespace would be about).

The conversation last year was about people wanting to multiplex devices
that don't have multiplexer support in the kernel. If that is your
desire, I think it is entirely reasonable to add support for
multiplexing to the kernel, device type by device type, or potentially
just to use fuse or cuse to implement your multiplexer in userspace,
but that has the potential to be unusably slow.

Eric

riya khanna

2014-09-24 05:32:27 UTC

My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)? Not everything should be
accessible (or even visible) from a container all the time (we have seen
people come up with different use cases for this). However, bind-mounting
takes away this flexibility. I agree that assigning fixed device numbers is
clearly not a long-term solution. Emulation for safe and flexible
multiplexing, like you suggested either using CUSE/FUSE or something like
devpts, is what I'm exploring.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Serge Hallyn

2014-09-24 16:37:40 UTC

Isolation is provided by the devices cgroup. You want something more
than isolation.

Eric W. Biederman

2014-09-24 17:43:12 UTC

Post by Serge Hallyn
Isolation is provided by the devices cgroup. You want something more
than isolation.

Post by riya khanna
My use case for having device namespaces is device isolation. Isn't that what
namespaces are there for (as I understand)?

Namespaces fundamentally provide for using the same ``global'' name
in different contexts. This allows them to be used for isolation
and process migration (because you can take the same name from
machine to machine).

Unless someone cares about device numbers at a namespace level
the work is done.

The mount namespace exists to deal with file names.
The devices cgroup will limit which devices you can access (although
I can't ever imagine a case where the mount namespace would be
insufficient).
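
To make the devices cgroup part concrete: in the cgroup v1 devices controller, each entry has the form "type major:minor access", for example "c 29:0 rw", where type is a, b or c, either number may be "*", and access is some combination of r, w and m (mknod). The Python sketch below is a simplified model of matching an access request against one such entry. The names DevRule, parse_rule and allows are hypothetical, and this is not the kernel's actual matching code.

```python
from typing import NamedTuple

class DevRule(NamedTuple):
    dev_type: str   # 'a' (all), 'b' (block) or 'c' (char)
    major: str      # decimal number or '*'
    minor: str      # decimal number or '*'
    access: str     # subset of 'rwm'

def parse_rule(text):
    """Parse an entry such as 'c 29:0 rw' into a DevRule."""
    dev_type, numbers, access = text.split()
    major, minor = numbers.split(":")
    if dev_type not in ("a", "b", "c"):
        raise ValueError("bad device type: %r" % dev_type)
    if not access or set(access) - set("rwm"):
        raise ValueError("bad access string: %r" % access)
    return DevRule(dev_type, major, minor, access)

def allows(rule, dev_type, major, minor, access):
    """Simplified check: does `rule` cover this device and access?"""
    if rule.dev_type != "a" and rule.dev_type != dev_type:
        return False
    if rule.major != "*" and int(rule.major) != major:
        return False
    if rule.minor != "*" and int(rule.minor) != minor:
        return False
    return set(access) <= set(rule.access)

fb_rule = parse_rule("c 29:0 rw")
print(allows(fb_rule, "c", 29, 0, "r"))   # read on the framebuffer node
print(allows(fb_rule, "c", 29, 0, "m"))   # mknod is not in the rule
```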