RFC(v2): Audit Kernel Container IDs
Richard Guy Briggs
2017-10-12 14:14:00 UTC
Containers are a userspace concept. The kernel knows nothing of them.

The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.

Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.

The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.

Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration. At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
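For illustration only, a registration might produce an auxiliary record along these lines. Every field name and value below is hypothetical; the text above fixes only the information the record must carry (op, container ID, first PID, namespace dev:inode tuples), not its syntax:

```
type=AUDIT_CONTAINER msg=audit(1507817640.182:42): op=register opid=1342
  contid=6FC6C2EA4F2244B1B0135C444A95B2A4
  netns=4:4026531992 utsns=4:4026531838 ipcns=4:4026531839
```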

Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.

Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.

Mimic setns(2) and return an error if the process has already initiated
threading or forked, since this registration should happen before the
orchestrator starts the process executing and hence the process should
not yet have any threads or children. If this is deemed overly
restrictive, switch all threads and children to the new containerID.

Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.

Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.

Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]

Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.

When a container ceases to exist because the last process in that
container has exited, and hence the last namespace has been destroyed
and its refcount has dropped to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.

A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.

(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children into same container

- RGB

--
Richard Guy Briggs <***@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Steve Grubb
2017-10-12 15:45:24 UTC
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The requirements for common criteria around containers should be very closely
modeled on the requirements for virtualization. It would be the container
manager that is responsible for logging the resource assignment events.
Post by Richard Guy Briggs
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Wouldn't CAP_AUDIT_WRITE be sufficient? After all, this is for auditing.
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
This would be in addition to the normal audit fields.
Post by Richard Guy Briggs
Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.
Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.
Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all threads and children to the new containerID.
Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.
Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
In the virtualization requirements, we only log removal of resources when
something is removed by intention. If the VM shuts down, the manager issues a
VIRT_CONTROL stop event and the user space utilities know this means all
resources have been unassigned.
Post by Richard Guy Briggs
Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.
When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount dropping to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.
A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.
I'm thinking that there needs to be a clear delineation between what the
container manager is responsible for and what the kernel needs to do. The
kernel needs the registration system and to associate an identifier with
events inside the container.

But would the container manager be mostly responsible for auditing the events
described here:

https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events

Also, we can already audit exit, unshare, setns, and clone. If the kernel just
sticks the identifier on them, isn't that sufficient?

-Steve
Post by Richard Guy Briggs
(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children
into same container
- RGB
--
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs
2017-10-19 19:57:47 UTC
Post by Steve Grubb
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The requirements for common criteria around containers should be very closely
modeled on the requirements for virtualization. It would be the container
manager that is responsible for logging the resource assignment events.
I suspect we are in violent agreement here.
Post by Steve Grubb
Post by Richard Guy Briggs
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Wouldn't CAP_AUDIT_WRITE be sufficient? After all, this is for auditing.
No, because then any process with that capability (vsftpd) could change
its own container ID. This is discussed more in other parts of the
thread...
Post by Steve Grubb
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
This would be in addition to the normal audit fields.
It was intended that this be an auxiliary record, but this issue is
being debated in threads about other upstream issues currently so I
won't cover that here.
Post by Steve Grubb
Post by Richard Guy Briggs
Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.
Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.
Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all threads and children to the new containerID.
Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.
Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
In the virtualization requirements, we only log removal of resources when
something is removed by intention. If the VM shuts down, the manager issues a
VIRT_CONTROL stop event and the user space utilities know this means all
resources have been unassigned.
Ok, this assumes the orchestrator is waiting on that child process (and
that it is in turn waiting on all its children) so it knows when that
job has exited naturally or errored out. I don't know if there is any
consensus or best practice with orchestrators out there now. The kernel
should know, so it seemed reasonable to report what was known. Besides,
in this case, I was talking specifically about namespace creation and
destruction rather than containers.
Post by Steve Grubb
Post by Richard Guy Briggs
Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.
When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount dropping to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.
A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.
I'm thinking that there needs to be a clear delineation between what the
container manager is responsible for and what the kernel needs to do. The
kernel needs the registration system and to associate an identifier with
events inside the container.
Agreed this needs to be defined much better than it is.
Post by Steve Grubb
But would the container manager be mostly responsible for auditing the events
https://github.com/linux-audit/audit-documentation/wiki/SPEC-Virtualization-Manager-Guest-Lifecycle-Events
I'm having trouble fitting all these events into the container model,
but I recognize the importance of continuing to try to do so, or at
least of being able to justify deviations from this SPEC.
Post by Steve Grubb
Also, we can already audit exit, unshare, setns, and clone. If the kernel just
sticks the identifier on them, isn't that sufficient?
I think this last one is incomplete without a way to identify the
namespaces involved.
Post by Steve Grubb
-Steve
Post by Richard Guy Briggs
(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children
into same container
- RGB
- RGB

--
Richard Guy Briggs <***@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Aleksa Sarai
2017-10-19 23:11:33 UTC
Post by Richard Guy Briggs
Post by Steve Grubb
Post by Richard Guy Briggs
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Wouldn't CAP_AUDIT_WRITE be sufficient? After all, this is for auditing.
No, because then any process with that capability (vsftpd) could change
its own container ID. This is discussed more in other parts of the
thread...
Not if we make the container ID append-only (to support nesting), or
write-once (the other idea thrown around). In that case, you can't move
"out" from a particular container ID, you can only go "deeper". These
semantics don't make sense for generic containers, but since the point
of this facility is *specifically* for audit I imagine that not being
able to move a process from a sub-container's ID is a benefit.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Aleksa Sarai
2017-10-19 23:15:25 UTC
Post by Aleksa Sarai
Post by Richard Guy Briggs
Post by Steve Grubb
Post by Richard Guy Briggs
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container.  This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID.  A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Wouldn't CAP_AUDIT_WRITE be sufficient? After all, this is for auditing.
No, because then any process with that capability (vsftpd) could change
its own container ID.  This is discussed more in other parts of the
thread...
Not if we make the container ID append-only (to support nesting), or
write-once (the other idea thrown around). In that case, you can't move
"out" from a particular container ID, you can only go "deeper". These
semantics don't make sense for generic containers, but since the point
of this facility is *specifically* for audit I imagine that not being
able to move a process from a sub-container's ID is a benefit.
[This assumes it's CAP_AUDIT_CONTROL which is what we are discussing in
a sister thread.]
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Steve Grubb
2017-10-20 02:25:54 UTC
Post by Aleksa Sarai
Post by Richard Guy Briggs
Post by Steve Grubb
Post by Richard Guy Briggs
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Wouldn't CAP_AUDIT_WRITE be sufficient? After all, this is for auditing.
No, because then any process with that capability (vsftpd) could change
its own container ID. This is discussed more in other parts of the
thread...
For the record, I changed my mind. CAP_AUDIT_CONTROL is the correct
capability.
Post by Aleksa Sarai
Not if we make the container ID append-only (to support nesting), or
write-once (the other idea thrown around).
Well...I like to use lessons learned if they can be applied. In the normal
world without containers we have uid, auid, and session_id. uid is who you are
now, auid is how you got into the system, session_id distinguishes individual
auids. We have a default auid of -1 for system objects and a real number for
people.

I think there should be the equivalent of auid and session_id but tailored for
containers. Loginuid == container id. It can be set, overridden, or appended
to (we'll figure this out later) in very limited circumstances.
Container_session == session which is tamper-proof. This way things can enter
a container with the same ID but under a different session. And everything
else gets to inherit the original ID. This way we can trace actions to
something that entered the container rather than normal system activity in the
container.

What a security officer wants to know is what did people do inside the
system / container. The system objects we typically don't care about. Sure
they might get hacked and then work on behalf of someone, but they would
almost always pop a shell so that they can have freedom. That should set off
an AVC or create other activity that gets picked up.

-Steve
Post by Aleksa Sarai
In that case, you can't move "out" from a particular container ID, you can
only go "deeper". These semantics don't make sense for generic containers,
but since the point of this facility is *specifically* for audit I imagine
that not being able to move a process from a sub-container's ID is a
benefit.
Casey Schaufler
2017-10-12 16:33:49 UTC
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.
Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.
Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all threads and children to the new containerID.
Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.
Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.
When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount dropping to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.
A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.
(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children into same container
- RGB
--
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
--
Linux-audit mailing list
https://www.redhat.com/mailman/listinfo/linux-audit
Richard Guy Briggs
2017-10-17 00:33:40 UTC
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
There is such a thing, but the kernel doesn't know about it yet. This
same situation exists for loginuid and sessionid which are userspace
concepts that the kernel tracks for the convenience of userspace. As
for its name, I'm not particularly picky, so if you don't like
CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID. It really
needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
don't want to give the ability to set a containerID to any process that
is able to do audit logging (such as vsftpd) and similarly we don't want
to give the orchestrator the ability to control the setup of the audit
daemon.
Post by Casey Schaufler
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
Issue a new auxiliary record AUDIT_CONTAINER_INFO for each valid
container ID present on an auditable action or event.
Forked and cloned processes inherit their parent's container ID,
referenced in the process' task_struct.
Mimic setns(2) and return an error if the process has already initiated
threading or forked since this registration should happen before the
process execution is started by the orchestrator and hence should not
yet have any threads or children. If this is deemed overly restrictive,
switch all threads and children to the new containerID.
Trust the orchestrator to judiciously use and restrict CAP_CONTAINER_ADMIN.
Log the creation of every namespace, inheriting/adding its spawning
process' containerID(s), if applicable. Include the spawning and
spawned namespace IDs (device and inode number tuples).
[AUDIT_NS_CREATE, AUDIT_NS_DESTROY] [clone(2), unshare(2), setns(2)]
Note: At this point it appears only network namespaces may need to track
container IDs apart from processes since incoming packets may cause an
auditable event before being associated with a process.
Log the destruction of every namespace when it is no longer used by any
process, include the namespace IDs (device and inode number tuples).
[AUDIT_NS_DESTROY] [process exit, unshare(2), setns(2)]
Issue a new auxiliary record AUDIT_NS_CHANGE listing (opt: op=$action)
the parent and child namespace IDs for any changes to a process'
namespaces. [setns(2)]
Note: It may be possible to combine AUDIT_NS_* record formats and
distinguish them with an op=$action field depending on the fields
required for each message type.
When a container ceases to exist because the last process in that
container has exited and hence the last namespace has been destroyed and
its refcount dropping to zero, log the fact.
(This latter is likely needed for certification accountability.) A
container object may need a list of processes and/or namespaces.
A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.
(v2)
- switch from u64 to u128 UUID
- switch from "signal" and "trigger" to "register"
- restrict registration to single process or force all threads and children into same container
- RGB
- RGB

--
Richard Guy Briggs <***@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Casey Schaufler
2017-10-17 01:10:42 UTC
Permalink
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
There is such a thing, but the kernel doesn't know about it yet.
Then how can it be the kernel's place to control access to a
container resource, that is, the containerID?
Post by Richard Guy Briggs
This
same situation exists for loginuid and sessionid which are userspace
concepts that the kernel tracks for the convenience of userspace.
Ah, no. Loginuid identifies a user, which is a kernel concept in
that a user is defined by the uid. The session ID has well defined
kernel semantics. You're trying to say that the containerID is an
opaque value that is meaningless to the kernel, but you still want
the kernel to protect it. How can the kernel know if it is protecting
it correctly?
Post by Richard Guy Briggs
As
for its name, I'm not particularly picky, so if you don't like
CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID. It really
needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
don't want to give the ability to set a containerID to any process that
is able to do audit logging (such as vsftpd) and similarly we don't want
to give the orchestrator the ability to control the setup of the audit
daemon.
Sorry, but what aspect of the kernel security policy is this
capability supposed to protect? That's what capabilities are
for, not the undefined support of undefined user-space behavior.

If it's audit behavior, you want CAP_AUDIT_CONTROL. If it's
more than audit behavior you have to define what system security
policy you're dealing with in order to pick the right capability.

We get this request pretty regularly. "I need my own capability
because I have a niche thing that isn't part of the system security
policy but that is important!" Fit the containerID into the
system security policy, and if that results in using CAP_SYS_ADMIN,
oh well.
Richard Guy Briggs
2017-10-19 00:05:27 UTC
Permalink
Post by Casey Schaufler
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
There is such a thing, but the kernel doesn't know about it yet.
Then how can it be the kernel's place to control access to a
container resource, that is, the containerID.
Ok, let me try to address your objections.

The kernel can know enough that, if the ID is already set, it will not
allow it to be set again; or, if the user doesn't have permission to set
it, the user will be denied this action. How is this different from
loginuid and sessionid?
Post by Casey Schaufler
Post by Richard Guy Briggs
This
same situation exists for loginuid and sessionid which are userspace
concepts that the kernel tracks for the convenience of userspace.
Ah, no. Loginuid identifies a user, which is a kernel concept in
that a user is defined by the uid.
This simple explanation doesn't help me. What makes that a kernel
concept? The fact that it is stored and compared in more than one
place?
Post by Casey Schaufler
The session ID has well defined kernel semantics. You're trying to say
that the containerID is an opaque value that is meaningless to the
kernel, but you still want the kernel to protect it. How can the
kernel know if it is protecting it correctly?
How so? A userspace process triggers this. Does the kernel know what
these values mean? Does it do anything with them other than report
them or allow audit to filter them? It is given some instructions on
how to treat it.

This is what we're trying to do with the containerID.
Post by Casey Schaufler
Post by Richard Guy Briggs
As
for its name, I'm not particularly picky, so if you don't like
CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID. It really
needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
don't want to give the ability to set a containerID to any process that
is able to do audit logging (such as vsftpd) and similarly we don't want
to give the orchestrator the ability to control the setup of the audit
daemon.
Sorry, but what aspect of the kernel security policy is this
capability supposed to protect? That's what capabilities are
for, not the undefined support of undefined user-space behavior.
Similarly, loginuids and sessionIDs are only used for audit tracking and
filtering.
Post by Casey Schaufler
If it's audit behavior, you want CAP_AUDIT_CONTROL. If it's
more than audit behavior you have to define what system security
policy you're dealing with in order to pick the right capability.
It isn't audit behaviour (yet), it is audit reporting information, a
level above simply writing logs and a level below controlling daemon
behaviour.
Post by Casey Schaufler
We get this request pretty regularly. "I need my own capability
because I have a niche thing that isn't part of the system security
policy but that is important!" Fit the containerID into the
system security policy, and if that results in using CAP_SYS_ADMIN,
oh well.
There's far too much piled into CAP_SYS_ADMIN already, which is making
capabilities less and less useful. I realize that capabilities are
limited compared with netlink message types, but this falls in between
the abilities needed by CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE.

I'll continue on Steve Grubb's comment...
- RGB

--
Richard Guy Briggs <***@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Casey Schaufler
2017-10-19 13:32:30 UTC
Permalink
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
There is such a thing, but the kernel doesn't know about it yet.
Then how can it be the kernel's place to control access to a
container resource, that is, the containerID.
Ok, let me try to address your objections.
The kernel can know enough that if it is already set to not allow it to
be set again. Or if the user doesn't have permission to set it that the
user be denied this action. How is this different from loginuid and
sessionid?
Post by Casey Schaufler
Post by Richard Guy Briggs
This
same situation exists for loginuid and sessionid which are userspace
concepts that the kernel tracks for the convenience of userspace.
Ah, no. Loginuid identifies a user, which is a kernel concept in
that a user is defined by the uid.
This simple explanation doesn't help me. What makes that a kernel
concept? The fact that it is stored and compared in more than one
place?
Post by Casey Schaufler
The session ID has well defined kernel semantics. You're trying to say
that the containerID is an opaque value that is meaningless to the
kernel, but you still want the kernel to protect it. How can the
kernel know if it is protecting it correctly?
How so? A userspace process triggers this. Does the kernel know what
these values mean? Does it do anything with them other than report
them or allow audit to filter them? It is given some instructions on
how to treat it.
This is what we're trying to do with the containerID.
Post by Casey Schaufler
Post by Richard Guy Briggs
As
for its name, I'm not particularly picky, so if you don't like
CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID. It really
needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
don't want to give the ability to set a containerID to any process that
is able to do audit logging (such as vsftpd) and similarly we don't want
to give the orchestrator the ability to control the setup of the audit
daemon.
Sorry, but what aspect of the kernel security policy is this
capability supposed to protect? That's what capabilities are
for, not the undefined support of undefined user-space behavior.
Similarly, loginuids and sessionIDs are only used for audit tracking and
filtering.
Tell me again why you're not reusing either of these?
Post by Richard Guy Briggs
Post by Casey Schaufler
If it's audit behavior, you want CAP_AUDIT_CONTROL. If it's
more than audit behavior you have to define what system security
policy you're dealing with in order to pick the right capability.
It isn't audit behaviour (yet), it is audit reporting information, a
level above simply writing logs and a level below controlling daemon
behaviour.
You are changing audit information. That's CAP_AUDIT_CONTROL.
Post by Richard Guy Briggs
Post by Casey Schaufler
We get this request pretty regularly. "I need my own capability
because I have a niche thing that isn't part of the system security
policy but that is important!" Fit the containerID into the
system security policy, and if that results in using CAP_SYS_ADMIN,
oh well.
There's far too much piled into CAP_SYS_ADMIN already, which is making
capabilities less and less useful.
No. The value of capabilities is in separating privilege from DAC.
Granularity is a bonus. The current granularity is too fine in some
cases and too coarse in others.
Post by Richard Guy Briggs
I realize that capabilities are
limited compared with netlink message types, but this falls in between
the abilities needed by CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE.
There is *nothing* about your use that makes a compelling
argument for a new capability. If you can't decide between
CAP_AUDIT_CONTROL and CAP_AUDIT_WRITE, require both.
Paul Moore
2017-10-19 15:51:13 UTC
Permalink
Post by Casey Schaufler
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
There is such a thing, but the kernel doesn't know about it yet.
Then how can it be the kernel's place to control access to a
container resource, that is, the containerID.
Ok, let me try to address your objections.
The kernel can know enough that if it is already set to not allow it to
be set again. Or if the user doesn't have permission to set it that the
user be denied this action. How is this different from loginuid and
sessionid?
Post by Casey Schaufler
Post by Richard Guy Briggs
This
same situation exists for loginuid and sessionid which are userspace
concepts that the kernel tracks for the convenience of userspace.
Ah, no. Loginuid identifies a user, which is a kernel concept in
that a user is defined by the uid.
This simple explanation doesn't help me. What makes that a kernel
concept? The fact that it is stored and compared in more than one
place?
Post by Casey Schaufler
The session ID has well defined kernel semantics. You're trying to say
that the containerID is an opaque value that is meaningless to the
kernel, but you still want the kernel to protect it. How can the
kernel know if it is protecting it correctly?
How so? A userspace process triggers this. Does the kernel know what
these values mean? Does it do anything with them other than report
them or allow audit to filter them? It is given some instructions on
how to treat it.
This is what we're trying to do with the containerID.
Post by Casey Schaufler
Post by Richard Guy Briggs
As
for its name, I'm not particularly picky, so if you don't like
CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID. It really
needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
don't want to give the ability to set a containerID to any process that
is able to do audit logging (such as vsftpd) and similarly we don't want
to give the orchestrator the ability to control the setup of the audit
daemon.
Sorry, but what aspect of the kernel security policy is this
capability supposed to protect? That's what capabilities are
for, not the undefined support of undefined user-space behavior.
Similarly, loginuids and sessionIDs are only used for audit tracking and
filtering.
Tell me again why you're not reusing either of these?
Ah, granularity arguments, welcome back old friend :)

Once again, we're still trying to sort all this out so I reserve the
right to change my mind, but my current thinking is as follows ...
CAP_AUDIT_WRITE exists to control which applications can submit
userspace generated audit records to the kernel, CAP_AUDIT_CONTROL
exists to control which applications can manage the in-kernel audit
configuration (e.g. filter rules) and the current task's loginuid
value. Reusing CAP_AUDIT_WRITE here would allow any application that
can submit userspace audit records the ability to change the audit
container ID; this would be bad, we don't allow CAP_AUDIT_WRITE to
change the loginuid, it would be even worse to allow it to change the
audit container ID. Reusing CAP_AUDIT_CONTROL is less bad than
CAP_AUDIT_WRITE, but it gets sticky once we get to the part where we
want to support auditd instances in containers, complete with their own
queues, filtering rules, etc. Perhaps we could use CAP_AUDIT_CONTROL
to guard the audit container ID value, but we would always want to do
that check in the init userns in order to prevent container bound
processes from manipulating their own audit container ID.
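A rough userspace model of the policy described above (the struct and
field names are purely illustrative; in the kernel this would be roughly
an ns_capable(&init_user_ns, CAP_AUDIT_CONTROL) check): the capability
must be held relative to the init user namespace, so a process already
confined to a container's user namespace cannot set its own ID even if
it holds CAP_AUDIT_CONTROL there:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical credential model: setting the audit container ID needs
 * CAP_AUDIT_CONTROL evaluated against the init user namespace. */
struct cred_model {
    bool has_cap_audit_control;  /* holds the capability in its userns */
    bool in_init_userns;         /* credentials rooted in init_user_ns */
};

int may_set_audit_containerid(const struct cred_model *c)
{
    if (!c->in_init_userns)
        return -EPERM;  /* container-bound: a capability held in a child
                           user namespace does not count */
    if (!c->has_cap_audit_control)
        return -EPERM;
    return 0;
}
```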
--
paul moore
www.paul-moore.com
Steve Grubb
2017-10-17 01:42:52 UTC
Permalink
Post by Richard Guy Briggs
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's not such thing as
a container, how can you be asking for a capability to manage
them?
There is such a thing, but the kernel doesn't know about it yet. This
same situation exists for loginuid and sessionid which are userspace
concepts that the kernel tracks for the convenience of userspace. As
for its name, I'm not particularly picky, so if you don't like
CAP_CONTAINER_* then I'm fine with CAP_AUDIT_CONTAINERID. It really
needs to be distinct from CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we
don't want to give the ability to set a containerID to any process that
is able to do audit logging (such as vsftpd) and similarly we don't want
to give the orchestrator the ability to control the setup of the audit
daemon.
A long time ago, we were debating what should guard against rogue processes
from setting the loginuid. Casey argued that the ability to set the loginuid
means they have the ability to control the audit trail. That means that it
should be guarded by CAP_AUDIT_CONTROL. I think the same logic applies today.

The ability to arbitrarily set a container ID means the process has the
ability to indirectly control the audit trail.

-Steve
Simo Sorce
2017-10-17 12:31:09 UTC
Permalink
Post by Steve Grubb
Post by Richard Guy Briggs
There is such a thing, but the kernel doesn't know about it
yet. This same situation exists for loginuid and sessionid which
are userspace concepts that the kernel tracks for the convenience
of userspace. As for its name, I'm not particularly picky, so if
you don't like CAP_CONTAINER_* then I'm fine with
CAP_AUDIT_CONTAINERID. It really needs to be distinct from
CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we don't want to give
the ability to set a containerID to any process that is able to do
audit logging (such as vsftpd) and similarly we don't want to give
the orchestrator the ability to control the setup of the audit
daemon.
A long time ago, we were debating what should guard against rogue
processes from setting the loginuid. Casey argued that the ability to
set the loginuid means they have the ability to control the audit
trail. That means that it should be guarded by CAP_AUDIT_CONTROL. I
think the same logic applies today.
The difference is that with loginuid, you needed to give the processes
that are able to audit the ability to change it as well. You do not want
to tie the ability to change container IDs to the ability to audit. You
want to be able to do audit work (within the container) without allowing
it to change the container ID.
Of course if we made container id a write-once property maybe there is
no need for controls at all, but I'm pretty sure there will be
situations where write-once may not be usable in practice.
Post by Steve Grubb
The ability to arbitrarily set a container ID means the process has
the ability to indirectly control the audit trail.
The container ID can also be used for authorization purposes (by other
processes on the host), not just audit; I think this is why a separate
control has been proposed.

Simo.
--
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc
Casey Schaufler
2017-10-17 14:59:43 UTC
Post by Simo Sorce
Post by Steve Grubb
Post by Richard Guy Briggs
There is such a thing, but the kernel doesn't know about it
yet.  This same situation exists for loginuid and sessionid which
are userspace concepts that the kernel tracks for the convenience
of userspace.  As for its name, I'm not particularly picky, so if
you don't like CAP_CONTAINER_* then I'm fine with
CAP_AUDIT_CONTAINERID.  It really needs to be distinct from
CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we don't want to give
the ability to set a containerID to any process that is able to do
audit logging (such as vsftpd) and similarly we don't want to give
the orchestrator the ability to control the setup of the audit
daemon.
A long time ago, we were debating what should guard against rogue
processes setting the loginuid. Casey argued that the ability to
set the loginuid means they have the ability to control the audit
trail. That means that it should be guarded by CAP_AUDIT_CONTROL. I
think the same logic applies today.
The difference is that with loginuid you needed to give processes able
to audit the ability to change it as well. You do not want to tie the
ability to change container IDs to the ability to audit. You want to be
able to do audit stuff (within the container) without allowing it to
change the container ID.
Without a *kernel* policy on containerIDs you can't say what
security policy is being exempted. Without that you can't say what
capability is (or isn't) appropriate. You need a reason to have
a capability check that makes sense in the context of the kernel
security policy. Since we don't know what a container is in the
kernel, that's pretty hard. We don't create "fuzzy" capabilities
based on the trendy application behavior of the moment. If the
behavior is not related to audit, there's no reason for it, and
if it is, CAP_AUDIT_CONTROL works just fine. If this doesn't work
in your application security model I suggest that is where you
need to make changes.
Post by Simo Sorce
Of course if we made container id a write-once property maybe there is
no need for controls at all, but I'm pretty sure there will be
situations where write-once may not be usable in practice.
Post by Steve Grubb
The ability to arbitrarily set a container ID means the process has
the ability to indirectly control the audit trail.
The container ID can also be used for authorization purposes (by other
processes on the host), not just audit; I think this is why a separate
control has been proposed.
Simo.
Simo Sorce
2017-10-17 15:28:40 UTC
Post by Casey Schaufler
Post by Simo Sorce
Post by Steve Grubb
Post by Richard Guy Briggs
There is such a thing, but the kernel doesn't know about it
yet.  This same situation exists for loginuid and sessionid which
are userspace concepts that the kernel tracks for the convenience
of userspace.  As for its name, I'm not particularly picky, so if
you don't like CAP_CONTAINER_* then I'm fine with
CAP_AUDIT_CONTAINERID.  It really needs to be distinct from
CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we don't want to give
the ability to set a containerID to any process that is able to do
audit logging (such as vsftpd) and similarly we don't want to give
the orchestrator the ability to control the setup of the audit
daemon.
A long time ago, we were debating what should guard against rogue
processes setting the loginuid. Casey argued that the ability to
set the loginuid means they have the ability to control the audit
trail. That means that it should be guarded by CAP_AUDIT_CONTROL. I
think the same logic applies today.
The difference is that with loginuid you needed to give processes able
to audit the ability to change it as well. You do not want to tie the
ability to change container IDs to the ability to audit. You want to be
able to do audit stuff (within the container) without allowing it to
change the container ID.
Without a *kernel* policy on containerIDs you can't say what
security policy is being exempted.
The policy has been basically stated earlier.

A way to track a set of processes from a specific point in time
forward. The name used is "container id", but it could be anything.
This marker is mostly used by user space to track process hierarchies
without races, these processes can be very privileged, and must not be
allowed to change the marker themselves when granted the current common
capabilities.

Is this a good enough description? If not, can you clarify your
expectations?
Post by Casey Schaufler
Without that you can't say what capability is (or isn't)
appropriate.
See if the above is sufficient please.
Post by Casey Schaufler
You need a reason to have a capability check that makes sense in the
context of the kernel security policy.
I think the proposal had a reason; we may debate whether that reason
is good enough.
Post by Casey Schaufler
Since we don't know what a container is in the kernel,
Please do not fixate on the word container.
Post by Casey Schaufler
that's pretty hard. We don't create "fuzzy" capabilities
based on the trendy application behavior of the moment. If the
behavior is not related to audit, there's no reason for it, and
if it is, CAP_AUDIT_CONTROL works just fine. If this doesn't work
in your application security model I suggest that is where you
need to make changes.
The authors of the proposal came to the conclusion that kernel
assistance is needed. It would be nice to discuss the merits of it.
If you do not understand why the request has been made, it would be more
useful to ask specific questions to understand what the ask is and why.

Pushing back is fine, if you have understood the problem and have valid
arguments against a kernel level solution (and possibly suggestions for
a working user space solution), otherwise you are not adding value to
the discussion.

Simo.
--
Simo Sorce
Sr. Principal Software Engineer
Red Hat, Inc
James Bottomley
2017-10-17 15:44:51 UTC
Post by Simo Sorce
Post by Casey Schaufler
Without a *kernel* policy on containerIDs you can't say what
security policy is being exempted.
The policy has been basically stated earlier.
A way to track a set of processes from a specific point in time
forward. The name used is "container id", but it could be anything.
This marker is mostly used by user space to track process hierarchies
without races, these processes can be very privileged, and must not
be allowed to change the marker themselves when granted the current
common capabilities.
Is this a good enough description? If not, can you clarify your
expectations?
I think you mean you want to be able to apply a label to a process
which is inherited across forks.  The label should only be susceptible
to modification by something possessing a capability (which one TBD).
 The idea is that processes spawned into a container would be labelled
by the container orchestration system.  It's unclear what should happen
to processes using nsenter after the fact, but policy for that should
be up to the orchestration system.

The label will be used as a tag for audit information.

I think you were missing label inheritance above.

The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection.  I actually think this
means the label should be write once (once you've set it, you can't
change it) and orchestration systems should begin as unlabelled
processes allowing them to do arbitrary forks.

For nested containers, I actually think the label should be
hierarchical, so you can add a label for the new nested container but
it still also contains its parent's label as well.
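
The fork-inheritance, write-once, and hierarchical semantics described above could be modeled roughly like this (a toy Python sketch; `Task`, `set_container_label`, and all label values are invented for illustration and are not a kernel API):

```python
class Task:
    """Toy model of a process carrying an audit container label chain."""

    def __init__(self, parent=None):
        # fork(): a child inherits its parent's label chain.
        self.labels = list(parent.labels) if parent else []
        self._label_set = False  # write-once guard for this task

    def set_container_label(self, label):
        # Write-once: once this task has set a label it may never change
        # it. Nesting appends to, and therefore preserves, the chain
        # inherited from the parent.
        if self._label_set:
            raise PermissionError("container label is write-once")
        self.labels.append(label)
        self._label_set = True

orchestrator = Task()            # starts unlabelled, free to fork
child = Task(parent=orchestrator)
child.set_container_label("c1")  # first-level container
inner = Task(parent=child)       # a nested orchestrator forks again
inner.set_container_label("c1-nested")
print(inner.labels)              # ['c1', 'c1-nested']
```

A nested container thus carries its parent's label alongside its own, while no task can rewrite a label it has already set.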

James
Casey Schaufler
2017-10-17 16:43:18 UTC
Post by James Bottomley
Post by Simo Sorce
Post by Casey Schaufler
Without a *kernel* policy on containerIDs you can't say what
security policy is being exempted.
The policy has been basically stated earlier.
A way to track a set of processes from a specific point in time
forward. The name used is "container id", but it could be anything.
This marker is mostly used by user space to track process hierarchies
without races, these processes can be very privileged, and must not
be allowed to change the marker themselves when granted the current
common capabilities.
Is this a good enough description? If not, can you clarify your
expectations?
I think you mean you want to be able to apply a label to a process
which is inherited across forks.
That would be PTAGS. I agree that such a general mechanism
could be very useful for a variety of purposes, not just
containers. I do not agree that a single integer (e.g. a
containerID) warrants more than a trivial mechanism.
Post by James Bottomley
The label should only be susceptible
to modification by something possessing a capability (which one TBD).
I think the reason we're going to have crying and gnashing
of teeth is that, whatever capability is used, there will always be
an issue of the granted capability being less specific than the
application security model would like.

And no, we're not going down the 330 capabilities road. It's been
done in the UNIX world. Application security models hate that
just as much as they hate the coarser granularity.
Post by James Bottomley
The idea is that processes spawned into a container would be labelled
by the container orchestration system.  It's unclear what should happen
to processes using nsenter after the fact, but policy for that should
be up to the orchestration system.
I'm fine with that. The user space policy can be anything y'all like.
Post by James Bottomley
The label will be used as a tag for audit information.
Deep breath ...

Which *is* a kernel security policy mechanism. Since the "label"
is part of the audit information that the kernel is guaranteeing
changing it would be covered by CAP_AUDIT_CONTROL. If the kernel
does not use the "label" for any other purpose this is the only
capability that makes sense for it.
Post by James Bottomley
I think you were missing label inheritance above.
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection.  
Yes. This is a consequence of the capability granularity. There is
no way we can make the capability granularity sufficiently fine to
prevent this. No one wants the 330 capabilities that Data General
had in their secure UNIX system.
Post by James Bottomley
I actually think this
means the label should be write once (once you've set it, you can't
change it) and orchestration systems should begin as unlabelled
processes allowing them to do arbitrary forks.
For nested containers, I actually think the label should be
hierarchical, so you can add a label for the new nested container but
it still also contains its parent's label as well.
You can't support this reasonably with a single containerID.
You want PTAGS for this. I know that there is resistance to
requiring anything beyond what's in the base kernel (and for
good reasons) for containers. Especially something that is
pending future work. But let's not jam something into the base
kernel that isn't really going to address the issue.
Post by James Bottomley
James
Steve Grubb
2017-10-17 17:15:00 UTC
Post by Casey Schaufler
Post by James Bottomley
The idea is that processes spawned into a container would be labelled
by the container orchestration system. It's unclear what should happen
to processes using nsenter after the fact, but policy for that should
be up to the orchestration system.
I'm fine with that. The user space policy can be anything y'all like.
I think there should be a login event.
Post by Casey Schaufler
Post by James Bottomley
The label will be used as a tag for audit information.
Deep breath ...
Which *is* a kernel security policy mechanism. Since the "label"
is part of the audit information that the kernel is guaranteeing
changing it would be covered by CAP_AUDIT_CONTROL. If the kernel
does not use the "label" for any other purpose this is the only
capability that makes sense for it.
I agree. The ability to set the container label grants the ability to evade
rules or modify audit rules. CAP_AUDIT_CONTROL makes sense to me.
Post by Casey Schaufler
Post by James Bottomley
I think you were missing label inheritance above.
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection.
Yes. We have the same problem with loginuid. There are restrictions on who can
change it once set. And then we made an immutable flag so that people that
want a hard guarantee can get that.

-Steve
James Bottomley
2017-10-17 17:57:43 UTC
Post by Steve Grubb
Post by Casey Schaufler
Post by James Bottomley
The idea is that processes spawned into a container would be
labelled by the container orchestration system.  It's unclear
what should happen to processes using nsenter after the fact, but
policy for that should be up to the orchestration system.
I'm fine with that. The user space policy can be anything y'all like.
I think there should be a login event.
I thought you wanted this for containers?  Container creation doesn't
have login events.  In an unprivileged orchestration system it may be
hard to synthetically manufacture them.

James
Steve Grubb
2017-10-18 00:23:01 UTC
Post by Steve Grubb
Post by Casey Schaufler
Post by James Bottomley
The idea is that processes spawned into a container would be
labelled by the container orchestration system. It's unclear
what should happen to processes using nsenter after the fact, but
policy for that should be up to the orchestration system.
I'm fine with that. The user space policy can be anything y'all like.
I think there should be a login event.
I thought you wanted this for containers? Container creation doesn't
have login events. In an unprivileged orchestration system it may be
hard to synthetically manufacture them.
I realize this. This work is very similar to problems we solved 12 years
ago. We'll figure out what the right name is for it down the road. But the
concept is the same. If something enters a container, we need to know about
it. It needs to get tagged and be associated with the container. The way this
was solved for the loginuid problem was to add a session identifier so that
new logins of the same loginuid can coexist and we can trace actions back to a
specific login. I'd think we can apply lessons learned from a while back to
make container identification act similarly.
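
The loginuid/session precedent can be illustrated with a toy query (the record format here is invented for illustration and is not real audit log output): the (auid, ses) pair ties each event to one specific login even when the same user is logged in twice, and an audit container ID would tag events entering a container in the same way.

```python
# Toy audit log; the auid/ses field names mimic audit's loginuid and
# session ID, but the records themselves are invented.
events = [
    {"auid": 1000, "ses": 1, "syscall": "open"},
    {"auid": 1000, "ses": 2, "syscall": "execve"},  # second login, same user
    {"auid": 1000, "ses": 1, "syscall": "write"},
]

def events_for_login(log, auid, ses):
    # Two concurrent logins of the same loginuid stay distinguishable
    # because each carries its own session ID.
    return [e for e in log if e["auid"] == auid and e["ses"] == ses]

print(len(events_for_login(events, 1000, 1)))  # 2
```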

-Steve
Paul Moore
2017-10-18 20:56:06 UTC
On Tue, Oct 17, 2017 at 11:44 AM, James Bottomley
Post by James Bottomley
Post by Simo Sorce
Post by Casey Schaufler
Without a *kernel* policy on containerIDs you can't say what
security policy is being exempted.
The policy has been basically stated earlier.
A way to track a set of processes from a specific point in time
forward. The name used is "container id", but it could be anything.
This marker is mostly used by user space to track process hierarchies
without races, these processes can be very privileged, and must not
be allowed to change the marker themselves when granted the current
common capabilities.
Is this a good enough description? If not, can you clarify your
expectations?
I think you mean you want to be able to apply a label to a process
which is inherited across forks. The label should only be susceptible
to modification by something possessing a capability (which one TBD).
The idea is that processes spawned into a container would be labelled
by the container orchestration system. It's unclear what should happen
to processes using nsenter after the fact, but policy for that should
be up to the orchestration system.
The label will be used as a tag for audit information.
I think you were missing label inheritance above.
That is a pretty good summary of what we want to do, and what Richard
and I have discussed while brainstorming this offline. The details
may not have translated well into those initial emails from Richard,
but I think you've got the idea, even if some of the smaller details
are still TBD. FWIW, right now I'm not as worried about the exact
capability or the size of the audit container ID, I think those things
will sort themselves out as we progress through the implementation,
especially once we get to the next stage when we start to allow copies
of the audit records to be routed to audit daemons running inside
containers (note well that I said "copies", the host system still sees
all).
Post by James Bottomley
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection. I actually think this
means the label should be write once (once you've set it, you can't
change it) ...
Richard and I have talked about a write once approach, but the
thinking was that you may want to allow a nested container
orchestrator (Why? I don't know, but people always want to do the
craziest things.) and a write-once policy makes that impossible. If
we punt on the nested orchestrator, I believe we can seriously think
about a write-once policy to simplify things.

A bit off topic, but I've also wondered about not even implementing
read access, just to help ensure the audit container ID wouldn't be
abused, but I'm not sure how practical that will be. Something else
to sort out during the RFC phase of the implementation with the
container orchestrators.
Post by James Bottomley
... and orchestration systems should begin as unlabelled
processes allowing them to do arbitrary forks.
My current thinking is that the default state is to start unlabeled (I
just vomited a little into my SELinux hat); in other words
init/systemd/PID-1 in the host system starts with an "unset" audit
container ID. This not only helps define the host system (anything
that has an unset audit container ID) but provides a blank slate for
the orchestrator(s).
Post by James Bottomley
For nested containers, I actually think the label should be
hierarchical, so you can add a label for the new nested container but
it still also contains its parent's label as well.
I haven't made up my mind on this completely just yet, but I'm
currently of the mindset that supporting multiple audit container IDs
on a given process is not a good idea.
--
paul moore
www.paul-moore.com
Aleksa Sarai
2017-10-18 23:46:18 UTC
Post by Paul Moore
Post by James Bottomley
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection. I actually think this
means the label should be write once (once you've set it, you can't
change it) ...
Richard and I have talked about a write once approach, but the
thinking was that you may want to allow a nested container
orchestrator (Why? I don't know, but people always want to do the
craziest things.) and a write-once policy makes that impossible. If
we punt on the nested orchestrator, I believe we can seriously think
about a write-once policy to simplify things.
Nested containers are a very widely used use-case (see LXC system
containers, inside of which people run other container runtimes). So I
would definitely consider it something that "needs to be supported in
some way". While the LXC guys might be a *tad* crazy, the use-case isn't. :P
Post by Paul Moore
Post by James Bottomley
... and orchestration systems should begin as unlabelled
processes allowing them to do arbitrary forks.
My current thinking is that the default state is to start unlabeled (I
just vomited a little into my SELinux hat); in other words
init/systemd/PID-1 in the host system starts with an "unset" audit
container ID. This not only helps define the host system (anything
that has an unset audit container ID) but provides a blank slate for
the orchestrator(s).
Post by James Bottomley
For nested containers, I actually think the label should be
hierarchical, so you can add a label for the new nested container but
it still also contains its parent's label as well.
I haven't made up my mind on this completely just yet, but I'm
currently of the mindset that supporting multiple audit container IDs
on a given process is not a good idea.
As long as creating a new "container" (that is, changing a process's
"audit container ID") is an audit event, then I think that having a
hierarchy be explicit is not necessary (userspace audit can figure out
the hierarchy quite easily -- but also there are cases where thinking of
it as being hierarchical isn't necessarily correct).
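
The reconstruction described above could look roughly like this (hypothetical ID-change records, not a real audit record format): each change event carries the old and new IDs, which is enough for userspace to derive the nesting after the fact.

```python
# Hypothetical ID-change audit events; an old value of None means the
# process was previously unlabelled (i.e. part of the host).
change_events = [
    {"old": None, "new": "lxc-1"},
    {"old": "lxc-1", "new": "docker-a"},  # runtime nested inside lxc-1
    {"old": "lxc-1", "new": "docker-b"},
]

def derive_parents(events):
    # Map each new container ID to the ID it was set from; that old
    # value is its parent in the nesting hierarchy.
    return {e["new"]: e["old"] for e in events}

parents = derive_parents(change_events)
print(parents["docker-a"])  # lxc-1
```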
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Eric W. Biederman
2017-10-19 00:43:25 UTC
Post by Paul Moore
Post by James Bottomley
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection. I actually think this
means the label should be write once (once you've set it, you can't
change it) ...
Richard and I have talked about a write once approach, but the
thinking was that you may want to allow a nested container
orchestrator (Why? I don't know, but people always want to do the
craziest things.) and a write-once policy makes that impossible. If
we punt on the nested orchestrator, I believe we can seriously think
about a write-once policy to simplify things.
Nested containers are a very widely used use-case (see LXC system containers,
inside of which people run other container runtimes). So I would definitely
consider it something that "needs to be supported in some way". While the LXC
guys might be a *tad* crazy, the use-case isn't. :P
Of course some of that gets to running auditd inside a container which
we don't have yet either.

So I think to start it is perfectly fine to figure out the non-nested
case first and what makes sense there. Then to sort out the nested
container case.

The solution might be that a process gets at most one id per ``audit
namespace''.

Eric
Eric W. Biederman
2017-10-19 16:25:10 UTC
On Wed, Oct 18, 2017 at 8:43 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Paul Moore
Post by James Bottomley
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection. I actually think this
means the label should be write once (once you've set it, you can't
change it) ...
Richard and I have talked about a write once approach, but the
thinking was that you may want to allow a nested container
orchestrator (Why? I don't know, but people always want to do the
craziest things.) and a write-once policy makes that impossible. If
we punt on the nested orchestrator, I believe we can seriously think
about a write-once policy to simplify things.
Nested containers are a very widely used use-case (see LXC system containers,
inside of which people run other container runtimes). So I would definitely
consider it something that "needs to be supported in some way". While the LXC
guys might be a *tad* crazy, the use-case isn't. :P
No worries, we're all a little crazy in our own special ways ;)
Kidding aside, thanks for explaining the use case.
Post by Eric W. Biederman
Of course some of that gets to running auditd inside a container which
we don't have yet either.
So I think to start it is perfectly fine to figure out the non-nested
case first and what makes sense there. Then to sort out the nested
container case.
The solution might be that a process gets at most one id per ``audit
namespace''.
In an attempt to stay on-topic, let's try to stick with "audit
container ID" or "container ID" if you must. I really want to avoid
the term "audit namespace" simply because the term "namespace" implies
some things which we aren't planning on doing.
This is 100% on topic. I am saying that unless we are planning to have
auditd running in a container with its own set of rules you probably
don't care about nested containers. Last time I heard a discussion
about that the term in use was audit namespace. So I was referring to
that support when I said audit namespace, even if the end result only
loosely fits the term namespace.

I could be wrong of course. I don't fully understand what is driving
the desire to connect audit and containers. But my naive guess is that,
from an audit perspective, you don't care about nested containers
unless there is also a nested auditd looking at it from a nested
perspective.

So far we have established with the term container that we are talking
about a running instance of processes, not a filesystem instance that
Docker and friends ship around. Beyond that I am not certain what you
care about.

Eric
Paul Moore
2017-10-19 17:47:59 UTC
On Thu, Oct 19, 2017 at 12:25 PM, Eric W. Biederman
Post by Eric W. Biederman
On Wed, Oct 18, 2017 at 8:43 PM, Eric W. Biederman
Post by Eric W. Biederman
Post by Paul Moore
Post by James Bottomley
The security implications are that anything that can change the label
could also hide itself and its doings from the audit system and thus
would be used as a means to evade detection. I actually think this
means the label should be write once (once you've set it, you can't
change it) ...
Richard and I have talked about a write once approach, but the
thinking was that you may want to allow a nested container
orchestrator (Why? I don't know, but people always want to do the
craziest things.) and a write-once policy makes that impossible. If
we punt on the nested orchestrator, I believe we can seriously think
about a write-once policy to simplify things.
Nested containers are a very widely used use-case (see LXC system containers,
inside of which people run other container runtimes). So I would definitely
consider it something that "needs to be supported in some way". While the LXC
guys might be a *tad* crazy, the use-case isn't. :P
No worries, we're all a little crazy in our own special ways ;)
Kidding aside, thanks for explaining the use case.
Post by Eric W. Biederman
Of course some of that gets to running auditd inside a container which
we don't have yet either.
So I think to start it is perfectly fine to figure out the non-nested
case first and what makes sense there. Then to sort out the nested
container case.
The solution might be that a process gets at most one id per ``audit
namespace''.
In an attempt to stay on-topic, let's try to stick with "audit
container ID" or "container ID" if you must. I really want to avoid
the term "audit namespace" simply because the term "namespace" implies
some things which we aren't planning on doing.
This is 100% on topic. I am saying that unless we are planning to have
auditd running in a container with its own set of rules you probably
don't care about nested containers. Last time I heard a discussion
about that the term in use was audit namespace. So I was referring to
that support when I said audit namespace, even if the end result only
loosely fits the term namespace.
My "stay on-topic" comment is directed at, and limited to, your choice
of terminology, not the discussion about container nesting. I'm
purposefully not using the term "audit namespace" to refer to anything
that Richard has presented, and I'm kindly asking you to do the same,
it simply doesn't fit.
Post by Eric W. Biederman
I could be wrong of course. I don't fully understand what is driving
the desire to connect audit and containers. But my naive guess is that,
from an audit perspective, you don't care about nested containers
unless there is also a nested auditd looking at it from a nested
perspective.
Two motivations that are clear to me: the first is the desire to be
able to associate events in the audit log with a container (much like
how the session ID helped us associate events with a login session),
the second is the desire for users to run an audit daemon instance in
their containers to capture audit events generated by their container.
There is also a security certification motivation, see some of Steve's
comments for more on that.
--
paul moore
www.paul-moore.com
Casey Schaufler
2017-10-17 16:10:43 UTC
Post by Simo Sorce
Post by Casey Schaufler
Post by Simo Sorce
Post by Steve Grubb
Post by Richard Guy Briggs
There is such a thing, but the kernel doesn't know about it
yet.  This same situation exists for loginuid and sessionid which
are userspace concepts that the kernel tracks for the convenience
of userspace.  As for its name, I'm not particularly picky, so if
you don't like CAP_CONTAINER_* then I'm fine with
CAP_AUDIT_CONTAINERID.  It really needs to be distinct from
CAP_AUDIT_WRITE and CAP_AUDIT_CONTROL since we don't want to give
the ability to set a containerID to any process that is able to do
audit logging (such as vsftpd) and similarly we don't want to give
the orchestrator the ability to control the setup of the audit
daemon.
A long time ago, we were debating what should guard against rogue
processes setting the loginuid. Casey argued that the ability to
set the loginuid means they have the ability to control the audit
trail. That means that it should be guarded by CAP_AUDIT_CONTROL. I
think the same logic applies today.
The difference is that with loginuid you needed to give processes able
to audit the ability to change it as well. You do not want to tie the
ability to change container IDs to the ability to audit. You want to be
able to do audit stuff (within the container) without allowing it to
change the container ID.
Without a *kernel* policy on containerIDs you can't say what
security policy is being exempted.
The policy has been basically stated earlier.
No. The expected user space behavior has been stated.
Post by Simo Sorce
A way to track a set of processes from a specific point in time
forward. The name used is "container id", but it could be anything.
Then you want Jose Bollo's PTAGS. It's insane to add yet another
arbitrary ID to the task for a special purpose. Add a general tagging
mechanism instead. We could add a gazillion new IDs, each with its
own capability, if we head down this road.
Post by Simo Sorce
This marker is mostly used by user space to track process hierarchies
without races, these processes can be very privileged, and must not be
allowed to change the marker themselves when granted the current common
capabilities.
Let's be clear. What happens in user space stays in user space.
The kernel does not give a fig about user space policy. There has
to be a kernel policy involved that a capability can exempt.
Post by Simo Sorce
Is this a good enough description? If not, can you clarify your
expectations?
The kernel enforces kernel policy. Capabilities provide a mechanism
to mark a process as exempt from some aspect of kernel policy. If
you don't have a kernel policy, you don't get a capability. Clear?
Post by Simo Sorce
Post by Casey Schaufler
Without that you can't say what capability is (or isn't)
appropriate.
See if the above is sufficient please.
Post by Casey Schaufler
You need a reason to have a capability check that makes sense in the
context of the kernel security policy.
I think the proposal had a reason, we may debate on whether that reason
is good enough.
Post by Casey Schaufler
Since we don't know what a container is in the kernel,
Please do not fixate on the word container.
Post by Casey Schaufler
that's pretty hard. We don't create "fuzzy" capabilities
based on the trendy application behavior of the moment. If the
behavior is not related to audit, there's no reason for it, and
if it is, CAP_AUDIT_CONTROL works just fine. If this doesn't work
in your application security model I suggest that is where you
need to make changes.
The authors of the proposal came to the conclusion that kernel
assistance is needed. It would be nice to discuss the merits of it.
If you do not understand why the request has been made, it would be more
useful to ask specific questions to understand what the ask is and why.
I understand pretty darn well.
Post by Simo Sorce
Pushing back is fine, if you have understood the problem and have valid
arguments against a kernel level solution (and possibly suggestions for
a working user space solution), otherwise you are not adding value to
the discussion.
The presumption is that the request is reasonable. Adding a capability
in support of an undefined behavior is unreasonable. Based on the discussion,
CAP_AUDIT_CONTROL is completely rational. I understand that it would be
difficult to support your application privilege model. I would like to look
into helping out with that, but have too many burning knives in the air
just now.
Post by Simo Sorce
Simo.
Paul Moore
2017-10-18 19:58:13 UTC
Permalink
Post by Simo Sorce
The container Id can be used also for authorization purposes (by other
processes on the host), not just audit, I think this is why a separate
control has been proposed.
Apologies, but I'm just now getting a chance to work my way through
this thread, and I wanted to make a quick comment on this point ...

The audit container ID (note I said "audit container ID" not
"container ID") is intended strictly for use by the audit subsystem at
this point. Allowing other uses opens the door to a larger set of
problems we are trying to avoid (e.g. handling migration across
hosts). We would love to have a generic kernel facility that the
audit subsystem could use to identify containers, but we don't, and
previous attempts have failed, so we have to create our own. We are
intentionally trying to limit its scope in an attempt to limit
problems. If a more general solution appears in the future I think we
would make every effort to migrate to that; keeping this initial
effort small should make that easier.
--
paul moore
www.paul-moore.com
Mickaël Salaün
2017-12-09 10:20:48 UTC
Permalink
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
Here is an idea to avoid privilege problems or the need for a new
capability: make it automatic. What makes a container a container seems
to be the use of at least a namespace. What about automatically creating
and assigning an ID to a process when it enters a namespace different from
that of its parent process? This delegates the (permission)
responsibility to the use of namespaces (e.g. /proc/sys/user/max_* limits).

One interesting side effect of this approach would be the ability to
identify which processes are in the same set of namespaces, even if not
spawned in the container but entered after its creation (i.e. using
setns), by creating container IDs as a (deterministic) checksum of the
/proc/self/ns/* IDs.
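The checksum idea can be sketched in userspace Python (in the proposal the kernel would do this internally; the namespace list and the choice of hash are assumptions here, not part of the RFC):

```python
import hashlib
import os

# Namespace files a process exposes under /proc/self/ns/;
# the exact set used for the checksum is an assumption.
NS_FILES = ("cgroup", "ipc", "mnt", "net", "pid", "user", "uts")

def ns_checksum(pairs):
    """Derive a deterministic 128-bit ID from (device, inode) pairs.

    `pairs` is an iterable of (st_dev, st_ino) tuples, one per
    namespace.  Sorting first makes the result order-independent, so
    two processes sharing the same set of namespaces get the same ID.
    """
    h = hashlib.md5()
    for dev, ino in sorted(pairs):
        h.update(dev.to_bytes(8, "little"))
        h.update(ino.to_bytes(8, "little"))
    return h.hexdigest()

def current_ns_id():
    """Checksum the calling process's own namespaces (Linux only)."""
    pairs = []
    for name in NS_FILES:
        st = os.stat("/proc/self/ns/" + name)
        pairs.append((st.st_dev, st.st_ino))
    return ns_checksum(pairs)
```

Two processes in an identical set of namespaces compute identical IDs, and entering or leaving any one namespace changes the result, which is what makes setns-joiners identifiable.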

Since the concern is to identify a container, I think the ability to
audit the switch from one container ID to another is enough. I don't
think we need nested IDs.

As a side note, you may want to take a look at the Linux-VServer's XID.

Regards,
Mickaël
Casey Schaufler
2017-12-09 18:28:08 UTC
Permalink
Post by Mickaël Salaün
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
Here is an idea to avoid privilege problems or the need for a new
capability: make it automatic. What makes a container a container seems
to be the use of at least a namespace.
You might think so, but I am assured that you can have a container
without using namespaces. Intel's "Clear Containers", which use
virtualization technology, are one example. I have considered creating
"Smack Containers" using mandatory access control technology, more
to press the point that "containers" is a marketing concept, not
technology.
Post by Mickaël Salaün
What about automatically creating
and assigning an ID to a process when it enters a namespace different from
that of its parent process? This delegates the (permission)
responsibility to the use of namespaces (e.g. /proc/sys/user/max_* limits).
That gets ugly when you have a container that uses user, filesystem,
network and whatever else namespaces. If all containers used the same
set of namespaces I think this would be a fine idea, but they don't.
Post by Mickaël Salaün
One interesting side effect of this approach would be the ability to
identify which processes are in the same set of namespaces, even if not
spawned in the container but entered after its creation (i.e. using
setns), by creating container IDs as a (deterministic) checksum of the
/proc/self/ns/* IDs.
Since the concern is to identify a container, I think the ability to
audit the switch from one container ID to another is enough. I don't
think we need nested IDs.
Because a container doesn't have to use namespaces to be a container
you still need a mechanism for a process to declare that it is in fact
in a container, and to identify the container.
Post by Mickaël Salaün
As a side note, you may want to take a look at the Linux-VServer's XID.
Regards,
Mickaël
Eric Paris
2017-12-11 16:30:57 UTC
Permalink
Post by Casey Schaufler
Post by Mickaël Salaün
What about automatically creating
and assigning an ID to a process when it enters a namespace different from
that of its parent process? This delegates the (permission)
responsibility to the use of namespaces (e.g. /proc/sys/user/max_* limits).
That gets ugly when you have a container that uses user, filesystem,
network and whatever else namespaces. If all containers used the same
set of namespaces I think this would be a fine idea, but they don't.
Post by Mickaël Salaün
One interesting side effect of this approach would be the ability to
identify which processes are in the same set of namespaces, even if not
spawned in the container but entered after its creation (i.e. using
setns), by creating container IDs as a (deterministic) checksum of the
/proc/self/ns/* IDs.
Since the concern is to identify a container, I think the ability to
audit the switch from one container ID to another is enough. I don't
think we need nested IDs.
Because a container doesn't have to use namespaces to be a container
you still need a mechanism for a process to declare that it is in fact
in a container, and to identify the container.
I like the idea but I'm still tossing it around in my head (and
thinking about Casey's statement too). Let's say we have a 'docker-like'
container with pid=100 netns=X,userns=Y,mountns=Z. If I'm on the host
in all init namespaces and I run
nsenter -t 100 -n ip link set eth0 promisc on
How should this be logged? Did this command run in its own 'container'
unrelated to the 'docker-like' container?

-Eric
Casey Schaufler
2017-12-11 16:52:43 UTC
Permalink
Post by Eric Paris
Post by Casey Schaufler
Because a container doesn't have to use namespaces to be a container
you still need a mechanism for a process to declare that it is in fact
in a container, and to identify the container.
I like the idea but I'm still tossing it around in my head (and
thinking about Casey's statement too). Let's say we have a 'docker-like'
container with pid=100 netns=X,userns=Y,mountns=Z. If I'm on the host
in all init namespaces and I run
nsenter -t 100 -n ip link set eth0 promisc on
How should this be logged? Did this command run in its own 'container'
unrelated to the 'docker-like' container?
Jose Bollo's PTAGS ( https://gitlab.com/jobol/ptags ) would be
perfect. Any time you declare something to be a container or
enter a namespace you slap a tag on it. Identifying nested
containers would be easy; you'd have multiple tags.

PTAGS unfortunately needs module stacking, but how hard could that be?
Post by Eric Paris
-Eric
Steve Grubb
2017-12-11 19:37:05 UTC
Permalink
Post by Eric Paris
Post by Casey Schaufler
Because a container doesn't have to use namespaces to be a container
you still need a mechanism for a process to declare that it is in fact
in a container, and to identify the container.
I like the idea but I'm still tossing it around in my head (and
thinking about Casey's statement too). Let's say we have a 'docker-like'
container with pid=100 netns=X,userns=Y,mountns=Z. If I'm on the host
in all init namespaces and I run
nsenter -t 100 -n ip link set eth0 promisc on
How should this be logged?
If it is a normal process, then everything would match the init namespace and
you wouldn't have entered a container. If it were a container, any generated
event should have the container ID from registration attached to it.
Post by Eric Paris
Did this command run in its own 'container' unrelated to the 'docker-like'
container?
That should be determined by what's in the task struct.

-Steve

Richard Guy Briggs
2017-12-11 15:10:57 UTC
Permalink
Post by Mickaël Salaün
Post by Casey Schaufler
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
The registration is a pseudo filesystem (proc, since PID tree already
exists) write of a u8[16] UUID representing the container ID to a file
representing a process that will become the first process in a new
container. This write might place restrictions on mount namespaces
required to define a container, or at least careful checking of
namespaces in the kernel to verify permissions of the orchestrator so it
can't change its own container ID. A bind mount of nsfs may be
necessary in the container orchestrator's mntNS.
Note: Use a 128-bit scalar rather than a string to make compares faster
and simpler.
Require a new CAP_CONTAINER_ADMIN to be able to carry out the
registration.
Hang on. If containers are a user space concept, how can
you want CAP_CONTAINER_ANYTHING? If there's no such thing as
a container, how can you be asking for a capability to manage
them?
Post by Richard Guy Briggs
At that time, record the target container's user-supplied
container identifier along with the target container's first process
(which may become the target container's "init" process) process ID
(referenced from the initial PID namespace), all namespace IDs (in the
form of a nsfs device number and inode number tuple) in a new auxiliary
record AUDIT_CONTAINER with a qualifying op=$action field.
Here is an idea to avoid privilege problems or the need for a new
capability: make it automatic. What makes a container a container seems
to be the use of at least a namespace. What about automatically creating
and assigning an ID to a process when it enters a namespace different from
that of its parent process? This delegates the (permission)
responsibility to the use of namespaces (e.g. /proc/sys/user/max_* limits).
A container doesn't imply a namespace and vice versa.
Post by Mickaël Salaün
One interesting side effect of this approach would be the ability to
identify which processes are in the same set of namespaces, even if not
spawned in the container but entered after its creation (i.e. using
setns), by creating container IDs as a (deterministic) checksum of the
/proc/self/ns/* IDs.
This would be really helpful, but it isn't the case.
Post by Mickaël Salaün
Since the concern is to identify a container, I think the ability to
audit the switch from one container ID to another is enough. I don't
think we need nested IDs.
Since container namespace membership is arbitrary between container
orchestrators, this needs a registration process and a way for the
container orchestrator to know the ID.


I completely agree with Casey here.
Post by Mickaël Salaün
As a side note, you may want to take a look at the Linux-VServer's XID.
Regards,
Mickaël
- RGB

--
Richard Guy Briggs <***@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Eric W. Biederman
2017-10-12 17:59:57 UTC
Permalink
Post by Richard Guy Briggs
A namespace cannot directly migrate from one container to another but
could be assigned to a newly spawned container. A namespace can be
moved from one container to another indirectly by having that namespace
used in a second process in another container and then ending all the
processes in the first container.
Ugh no. The semantics here are way too mushy. We need a clean crisp
unambiguous definition or it will be impossible to get this correct and
impossible to use for any security purpose.

I understand the challenge. Some of the container managers share
namespaces between containers. Leading to things that are not really
contained.

Please make this concept like an indelible dye. Once you are stained
with it you can not escape. If you don't meet all of the criteria you
aren't stained.

The justification that I heard, and that seems legitimate, is that it is
not timely and it is hard to make the connection between the distinct
unshare, setns, and clone events and what is happening in the kernel.

With that justification definitely the network namespace needs to be
stained if it is appropriate.

I also don't see why this can't be a special dedicated audit message.
I just looked at the code in the kernel and nlmsg_type is a u16. There
are only a handful of audit message types defined. There is absolutely
no reason to bring proc into this.

I have the same reservation as the others about defining a new cap for
this. It should be enough to make setting the container id a one time
thing for a set of processes and namespaces.
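The write-once semantics suggested above can be modeled in userspace Python (class and method names are hypothetical; in the real proposal the ID would live in the task_struct and be inherited across fork/clone):

```python
class Task:
    """Minimal model of a task carrying a write-once audit container ID."""

    UNSET = None

    def __init__(self, parent=None):
        # Children inherit the parent's ID at fork/clone time.
        self.container_id = parent.container_id if parent else Task.UNSET

    def set_container_id(self, cid):
        """Set the ID exactly once; any later attempt is rejected."""
        if self.container_id is not Task.UNSET:
            raise PermissionError("audit container ID already set")
        self.container_id = cid

# Usage: the orchestrator stamps the container's first process, and
# every descendant carries the same ID with no way to shed it.
orchestrator = Task()
first_process = Task(parent=orchestrator)
first_process.set_container_id(0x1234)
grandchild = Task(parent=first_process)  # inherits 0x1234
```

The "indelible" property falls out of the two rules: inheritance is unconditional and the setter is one-shot, so no capability is needed to *keep* the ID, only (in the proposal) to set it in the first place.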

If this is going to be security it needs to be very simple and very well defined.

Eric
Alan Cox
2017-10-13 13:43:11 UTC
Permalink
On Thu, 12 Oct 2017 10:14:00 -0400
Post by Richard Guy Briggs
Containers are a userspace concept. The kernel knows nothing of them.
The Linux audit system needs a way to be able to track the container
provenance of events and actions. Audit needs the kernel's help to do
this.
Since the concept of a container is entirely a userspace concept, a
registration from the userspace container orchestration system initiates
this. This will define a point in time and a set of resources
associated with a particular container with an audit container ID.
I don't think this has anything to do with containers directly. If I
read it right you need a subtree of stuff to be assigned a (possibly
irrevocable) magic identifier that you can use for other purposes.

Traditional Unix in the more 'secure' space had that decades ago in the
form of luid. At login time you did a setluid() and that set an
irrevocable tag on the session which was (traditionally) the uid of the
login process so that audit and other related tools always knew how to
tie the process back to the login session.

That doesn't quite work by itself (if you log in you'd get luid set and
not be able to change it for the container), but it seems something
similarly trivial like a "setauditid(void)" would do the trick providing
the kernel picked the UUID randomly [otherwise I can copy another known
UUID to confuse or hide].

As you say a container is a userspace concept. So IMHO any audit
interface should be about auditing and what needs tracking, not about
containers. If the container management tool wants to set a suitable tag
then let it. If not then it doesn't.

Then it's as simple as checking CAP_AUDIT_WRITE to see if you are allowed
to setauditid(), generating a random uuid and a matching getauditid() to
copy it back.
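The setauditid()/getauditid() pair does not exist; as a userspace sketch of the proposed semantics (all names hypothetical), with the "kernel" picking the UUID so a caller cannot smuggle in a known one:

```python
import uuid

class AuditIdState:
    """Model of a per-session audit ID that the kernel picks at random."""

    def __init__(self):
        self._audit_id = None

    def setauditid(self):
        """One-shot: generate and latch a random UUID; fail if already set.

        Because the value is chosen randomly by the "kernel" rather than
        supplied by the caller, a process cannot copy another session's
        known UUID to confuse or hide in the audit trail.
        """
        if self._audit_id is not None:
            raise PermissionError("audit ID already set")
        self._audit_id = uuid.uuid4()
        return self._audit_id

    def getauditid(self):
        """Read back the latched UUID, or None if never set."""
        return self._audit_id
```

This mirrors the setluid() precedent Alan describes: the tag is set once at a well-defined point and is irrevocable for the life of the session.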

Alan