Discussion:
[PATCH v2] xattr: Enable security.capability in user namespaces
(too old to reply)
Stefan Berger
2017-07-11 15:10:01 UTC
Permalink
From: Stefan Berger <***@linux.vnet.ibm.com>

This patch enables security.capability in user namespaces but also
takes a more general approach to enabling extended attributes in user
namespaces.

The following rules describe the approach using security.foo as a
'user namespace enabled' extended attribute:

Reading of extended attributes:

1a) Reading security.foo from a user namespace will read
***@uid=<uid> of the parent user namespace instead with uid
being the mapping of root in that parent user namespace. An
exception is if root is mapped to uid 0 on the host, and in this case
we will read security.foo directly.
--> reading security.foo will read ***@uid=1000 for uid
mapping of root to 1000.

1b) If ***@uid=<uid> is not available, the security.foo of the
parent namespace is tried to be read. This procedure is repeated up to
the init user namespace. This step only applies for reading of extended
attributes and provides the same behavior as older system where the
host's extended attributes applied to user namespaces.

2) All ***@uid=<uid> with valid uid mapping in the user namespace
can be read. The uid within the user namespace will be mapped to the
corresponding uid on the host and that uid will be used in the name of
the extended attribute.
-> reading ***@uid=1 will read ***@uid=1001 for uid
mapping of root to 1000, size of at least 2.

All ***@uid=<uid> can be read (by root) on the host with values
of <uid> also being subject to checking for valid mappings.

3) No other security.foo* can be read.

The same rules for reading apply to writing and removing of user
namespace enabled extended attributes.

When listing extended attributes of a file, only those are presented
to the user namespace that have a valid mapping. Besides that, names
of the extended attributes are adjusted to represent the mapping.
This means that if root is mapped to uid 1000 on the host, the
***@uid=1000 will be listed as security.foo in the user
namespace, ***@uid=1001 becomes ***@uid=1 and so on.

Signed-off-by: Stefan Berger <***@linux.vnet.ibm.com>
Signed-off-by: Serge Hallyn <***@hallyn.com>
Reviewed-by: Serge Hallyn <***@hallyn.com>
---
fs/xattr.c | 509 +++++++++++++++++++++++++++++++++++++++++++++--
security/commoncap.c | 36 +++-
security/selinux/hooks.c | 9 +-
3 files changed, 523 insertions(+), 31 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 464c94b..eacad9e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -133,20 +133,440 @@ xattr_permission(struct inode *inode, const char *name, int mask)
return inode_permission(inode, mask);
}

+/*
+ * A list of extended attributes that are supported in user namespaces
+ */
+static const char *const userns_xattrs[] = {
+ XATTR_NAME_CAPS,
+ NULL
+};
+
+/*
+ * xattrs_is_userns_supported - Check whether an xattr is supported in userns
+ *
+ * @name: full name of the extended attribute
+ * @prefix: do a prefix match (true) or a full match (false)
+ *
+ * This function returns < 0 if not supported, an index into userns_xattrs[]
+ * otherwise.
+ */
+static int
+xattr_is_userns_supported(const char *name, int prefix)
+{
+ int i;
+
+ if (!name)
+ return -1;
+
+ for (i = 0; userns_xattrs[i]; i++) {
+ if (prefix) {
+ if (!strncmp(userns_xattrs[i], name,
+ strlen(userns_xattrs[i])))
+ return i;
+ } else {
+ if (!strcmp(userns_xattrs[i], name))
+ return i;
+ }
+ }
+ return -1;
+}
+
+/*
+ * xattr_write_uid - print a string in the format of "%***@uid=%u", which
+ * includes a prefix string
+ *
+ * @uid: the uid
+ * @prefix: prefix string; may be NULL
+ *
+ * This function returns a buffer with the string, or a NULL pointer in
+ * case of out-of-memory error.
+ */
+static char *
+xattr_write_uid(uid_t uid, const char *prefix)
+{
+ size_t buflen;
+ char *buffer;
+
+ buflen = sizeof("@uid=") - 1 + sizeof("4294967295") - 1 + 1;
+ if (prefix)
+ buflen += strlen(prefix);
+
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer)
+ return NULL;
+
+ if (uid == 0)
+ *buffer = 0;
+ else
+ sprintf(buffer, "%***@uid=%u",
+ (prefix) ? prefix : "",
+ uid);
+
+ return buffer;
+}
+
+/*
+ * xattr_parse_uid_from_kuid - parse string in the format @uid=<uid>; consider
+ * user namespaces and check mappings
+ *
+ * @uidstr : string in the format "@uid=<uid>"
+ * @userns : the user namespace to consult for uid mappings
+ * @n_uidstr : returned pointer holding the rewritten @uid=<uid> string with
+ * the uid remapped
+ *
+ * This function returns an error code or 0 in case of success. In case
+ * of success, 'n_uidstr' will hold a valid string.
+ */
+static int
+xattr_parse_uid_from_kuid(const char *uidstr, struct user_namespace *userns,
+ char **n_uidstr)
+{
+ int n;
+ uid_t muid, p_uid;
+ char d;
+ kuid_t tuid;
+
+ *n_uidstr = NULL;
+
+ n = sscanf(uidstr, "@uid=%u%c", &p_uid, &d);
+ if (n != 1)
+ return -EINVAL;
+
+ /* do we have a mapping of the uid? */
+ tuid = KUIDT_INIT(p_uid);
+ muid = from_kuid(userns, tuid);
+ if (muid == -1)
+ return -ENOENT;
+
+ *n_uidstr = xattr_write_uid(muid, NULL);
+ if (!*n_uidstr)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * xattr_parse_uid_make_kuid - parse string in the format @uid=<uid>; consider
+ * user namespaces and check mappings
+ *
+ * @uidstr : string in the format "@uid=<uid>"
+ * @userns : the user namespace to consult for uid mappings
+ * @N_uidstr : returned pointer holding the rewritten @uid=<uid> string with
+ * the uid remapped
+ *
+ * This function returns an error code or 0 in case of success. In case
+ * of success, 'n_uidstr' will hold a valid string.
+ */
+static int
+xattr_parse_uid_make_kuid(const char *uidstr, struct user_namespace *userns,
+ char **n_uidstr)
+{
+ int n;
+ uid_t p_uid;
+ char d;
+ kuid_t tuid;
+
+ *n_uidstr = NULL;
+
+ n = sscanf(uidstr, "@uid=%u%c", &p_uid, &d);
+ if (n != 1)
+ return -EINVAL;
+
+ tuid = make_kuid(userns, p_uid);
+ if (!uid_valid(tuid))
+ return -ENOENT;
+
+ *n_uidstr = xattr_write_uid(__kuid_val(tuid), NULL);
+ if (!*n_uidstr)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * xattr_rewrite_userns_xattr - Rewrite and filter an extended attribute
+ * considering user namespace uid mappings and
+ * user namespace support extended attributes
+ *
+ * @name: full name of the extended attribute
+ *
+ * This function returns NULL if the name is to be filtered. Otherwise it can
+ * return the input buffer or a new buffer that the caller needs to free. The
+ * new buffer contains a rewritten extended attribute whose string length may
+ * exceed that of the given name.
+ */
+static char *
+xattr_rewrite_userns_xattr(char *name)
+{
+ int idx, error;
+ size_t len = 0, buflen;
+ char *buffer, *n_uidstr;
+
+ /* prefix-match name against supported attributes */
+ idx = xattr_is_userns_supported(name, true);
+ if (idx < 0) {
+ /* only rewrite those in userns_xattr[*] */
+ return name;
+ }
+
+ /* exact match ? */
+ len = strlen(userns_xattrs[idx]);
+ if (name[len] == 0)
+ return NULL;
+
+ /*
+ * We must have a name[len] == '@'.
+ */
+ error = xattr_parse_uid_from_kuid(&name[len], current_user_ns(),
+ &n_uidstr);
+ if (error)
+ return NULL;
+
+ buflen = len + strlen(n_uidstr) + 1;
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer) {
+ kfree(n_uidstr);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ name[len] = 0;
+
+ snprintf(buffer, buflen, "%s%s", name, n_uidstr);
+
+ name[len] = '@';
+
+ kfree(n_uidstr);
+
+ return buffer;
+}
+
+/*
+ * xattr_list_contains - check whether an xattr list already contains a needle
+ *
+ * @list : 0-byte separated strings
+ * @listlen : length of the list
+ * @needle : the needle to search for
+ */
+static int
+xattr_list_contains(const char *list, size_t listlen, const char *needle)
+{
+ size_t o = 0;
+
+ while (o < listlen) {
+ if (!strcmp(&list[o], needle))
+ return true;
+ o += strlen(&list[o]) + 1;
+ }
+ return false;
+}
+
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * Besides that we filter out those with @uid=<uid> when there is no mapping
+ * for that uid in the current user namespace.
+ *
+ * @list: list of 0-byte separated xattr names
+ * @size: the size of the list; may be 0 to determine needed list size
+ * @list_maxlen: allocated buffer size of list
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
+ return size;
+
+ if (size) {
+ nlist = kmalloc(list_maxlen, GFP_KERNEL);
+ if (!nlist)
+ return -ENOMEM;
+ }
+
+ s_off = d_off = 0;
+ while (s_off < size || size == 0) {
+ name = &list[s_off];
+
+ len = strlen(name);
+ if (!len)
+ break;
+
+ if (xattr_is_userns_supported(name, false) >= 0)
+ newname = name;
+ else {
+ newname = xattr_rewrite_userns_xattr(name);
+ if (IS_ERR(newname)) {
+ d_off = PTR_ERR(newname);
+ goto out_free;
+ }
+ }
+ if (newname && !xattr_list_contains(nlist, d_off, newname)) {
+ nlen = strlen(newname);
+
+ if (nlist) {
+ if (nlen + 1 > list_maxlen)
+ break;
+ strcpy(&nlist[d_off], newname);
+ }
+
+ d_off += nlen + 1;
+ if (newname != name)
+ kfree(newname);
+ }
+ s_off += len + 1;
+ }
+ if (nlist)
+ memcpy(list, nlist, d_off);
+out_free:
+ kfree(nlist);
+
+ return d_off;
+}
+
+/*
+ * xattr_userns_name - modify the name of a user namespace supported
+ * extended attribute
+ *
+ * In a user namespace we prevent read/write accesses to the host's
+ * security.foo to protect these extended attributes.
+ *
+ * Reading:
+ * 1a) Reading security.foo from a user namespace will read
+ * ***@uid=<uid> of the parent user namespace instead with uid
+ * being the mapping of root in that parent user namespace. An
+ * exception is if root is mapped to uid 0 on the host, and in this case
+ * we will read security.foo directly.
+ * -> reading security.foo will read ***@uid=1000 for a uid
+ * mapping of root to 1000.
+ *
+ * 1b) If ***@uid=<uid> is not available, the security.foo of the
+ * parent namespace is tried to be read. This procedure is repeated up to
+ * the init user namespace. This step only applies for reading of extended
+ * attributes and provides the same behavior as older systems where the
+ * host's extended attributes applied to user namespaces.
+ *
+ * 2) All ***@uid=<uid> with valid uid mappings in the user namespace
+ * an be read. The uid within the user namespace will be mapped to the
+ * corresponding uid on the host and that uid will be used in the name of
+ * the extended attribute.
+ * -> reading ***@uid=1 will read ***@uid=1001 for a uid
+ * mapping of root to 1000, size of at least 2.
+ *
+ * All ***@uid=<uid> can be read (by root) on the host with values
+ * of <uid> also being subject to checking for valid mappings.
+ *
+ * 3) No other security.foo* can be read.
+ *
+ * Writing and removing:
+ * The same rules for reading apply to writing and removing, except for 1b).
+ *
+ * This function returns a buffer with either the original name or the
+ * user namespace adjusted name of the extended attribute.
+ *
+ * @name: the full name of the extended attribute, e.g. security.foo
+ */
+char *
+xattr_userns_name(const char *name, struct user_namespace *userns)
+{
+ size_t buflen;
+ char *buffer, *n_uidstr;
+ kuid_t root_uid = make_kuid(userns, 0);
+ int idx, error;
+ size_t len;
+
+ /* only security.foo will be changed here - prefix match here */
+ idx = xattr_is_userns_supported(name, true);
+ if (idx < 0)
+ goto out_copy;
+
+ /* read security.foo? --> read ***@uid=<uid> instead */
+ len = strlen(userns_xattrs[idx]);
+ if (name[len] == 0) {
+ /*
+ * init_user_ns or userns with root mapped to uid 0
+ * may read security.foo directly
+ */
+ if (userns == &init_user_ns ||
+ __kuid_val(root_uid) == 0)
+ goto out_copy;
+
+ if (!uid_valid(root_uid))
+ return ERR_PTR(-EINVAL);
+
+ buffer = xattr_write_uid(__kuid_val(root_uid), name);
+ if (!buffer)
+ return ERR_PTR(-ENOMEM);
+
+ return buffer;
+ }
+
+ /*
+ * We must have name[len] == '@'.
+ */
+ error = xattr_parse_uid_make_kuid(&name[len],
+ userns,
+ &n_uidstr);
+ if (error)
+ return ERR_PTR(error);
+
+ /* name[len] == '@' */
+ buflen = len + strlen(n_uidstr) + 1;
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer) {
+ kfree(n_uidstr);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ snprintf(buffer, len + 1, "%s", name);
+ snprintf(&buffer[len], buflen - len, "%s", n_uidstr);
+ kfree(n_uidstr);
+
+ return buffer;
+
+out_copy:
+ buffer = kstrdup(name, GFP_KERNEL);
+ if (!buffer)
+ return ERR_PTR(-ENOMEM);
+
+ return buffer;
+}
+
int
__vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
const void *value, size_t size, int flags)
{
const struct xattr_handler *handler;
+ char *newname;
+ int ret;

+ newname = xattr_userns_name(name, current_user_ns());
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+ name = newname;
handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->set)
- return -EOPNOTSUPP;
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->set) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
if (size == 0)
value = ""; /* empty EA, do not remove */
- return handler->set(handler, dentry, inode, name, value, size, flags);
+ ret = handler->set(handler, dentry, inode, name, value, size, flags);
+
+out:
+ kfree(newname);
+ return ret;
}
EXPORT_SYMBOL(__vfs_setxattr);

@@ -301,14 +721,39 @@ ssize_t
__vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
void *value, size_t size)
{
- const struct xattr_handler *handler;
+ const struct xattr_handler *handler = NULL;
+ char *newname = NULL;
+ int ret, userns_supt_xattr;
+ struct user_namespace *userns = current_user_ns();
+
+ userns_supt_xattr = (xattr_is_userns_supported(name, false) >= 0);
+
+ do {
+ kfree(newname);
+
+ newname = xattr_userns_name(name, userns);
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+
+ if (!handler) {
+ name = newname;
+ handler = xattr_resolve_name(inode, &name);
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->get) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ }
+ ret = handler->get(handler, dentry, inode, name, value, size);
+ userns = userns->parent;
+ } while ((ret == -ENODATA) && userns && userns_supt_xattr);

- handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->get)
- return -EOPNOTSUPP;
- return handler->get(handler, dentry, inode, name, value, size);
+out:
+ kfree(newname);
+ return ret;
}
EXPORT_SYMBOL(__vfs_getxattr);

@@ -328,8 +773,16 @@ vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)

if (!strncmp(name, XATTR_SECURITY_PREFIX,
XATTR_SECURITY_PREFIX_LEN)) {
- const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
- int ret = xattr_getsecurity(inode, suffix, value, size);
+ int ret;
+ const char *suffix;
+ char *newname = xattr_userns_name(name, current_user_ns());
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+
+ suffix = newname + XATTR_SECURITY_PREFIX_LEN;
+
+ ret = xattr_getsecurity(inode, suffix, value, size);
+ kfree(newname);
/*
* Only overwrite the return value if a security module
* is actually active.
@@ -360,6 +813,9 @@ vfs_listxattr(struct dentry *dentry, char *list, size_t size)
if (size && error > size)
error = -ERANGE;
}
+ if (error > 0)
+ error = xattr_list_userns_rewrite(list, error, size);
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_listxattr);
@@ -369,13 +825,28 @@ __vfs_removexattr(struct dentry *dentry, const char *name)
{
struct inode *inode = d_inode(dentry);
const struct xattr_handler *handler;
+ char *newname;
+ int ret;

+ newname = xattr_userns_name(name, current_user_ns());
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+ name = newname;
handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->set)
- return -EOPNOTSUPP;
- return handler->set(handler, dentry, inode, name, NULL, 0, XATTR_REPLACE);
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->set) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ ret = handler->set(handler, dentry, inode, name, NULL, 0, XATTR_REPLACE);
+
+out:
+ kfree(newname);
+
+ return ret;
}
EXPORT_SYMBOL(__vfs_removexattr);

diff --git a/security/commoncap.c b/security/commoncap.c
index 7abebd7..c842690 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -660,15 +660,23 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
int cap_inode_setxattr(struct dentry *dentry, const char *name,
const void *value, size_t size, int flags)
{
- if (!strcmp(name, XATTR_NAME_CAPS)) {
- if (!capable(CAP_SETFCAP))
+ if (strncmp(name, XATTR_SECURITY_PREFIX,
+ sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+ return 0;
+
+ if (strncmp(name, XATTR_NAME_CAPS,
+ sizeof(XATTR_NAME_CAPS) - 1) == 0) {
+ struct inode *inode = d_backing_inode(dentry);
+
+ if (!inode)
+ return -EINVAL;
+ if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
return -EPERM;
+
return 0;
}

- if (!strncmp(name, XATTR_SECURITY_PREFIX,
- sizeof(XATTR_SECURITY_PREFIX) - 1) &&
- !capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
@@ -686,15 +694,23 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
*/
int cap_inode_removexattr(struct dentry *dentry, const char *name)
{
- if (!strcmp(name, XATTR_NAME_CAPS)) {
- if (!capable(CAP_SETFCAP))
+ if (strncmp(name, XATTR_SECURITY_PREFIX,
+ sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+ return 0;
+
+ if (strncmp(name, XATTR_NAME_CAPS,
+ sizeof(XATTR_NAME_CAPS) - 1) == 0) {
+ struct inode *inode = d_backing_inode(dentry);
+
+ if (!inode)
+ return -EINVAL;
+ if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
return -EPERM;
+
return 0;
}

- if (!strncmp(name, XATTR_SECURITY_PREFIX,
- sizeof(XATTR_SECURITY_PREFIX) - 1) &&
- !capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 819fd68..702c225 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3091,8 +3091,13 @@ static int selinux_inode_setotherxattr(struct dentry *dentry, const char *name)

if (!strncmp(name, XATTR_SECURITY_PREFIX,
sizeof XATTR_SECURITY_PREFIX - 1)) {
- if (!strcmp(name, XATTR_NAME_CAPS)) {
- if (!capable(CAP_SETFCAP))
+ if (!strncmp(name, XATTR_NAME_CAPS,
+ sizeof(XATTR_NAME_CAPS) - 1)) {
+ struct inode *inode = d_backing_inode(dentry);
+
+ if (!inode)
+ return -EINVAL;
+ if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
return -EPERM;
} else if (!capable(CAP_SYS_ADMIN)) {
/* A different attribute in the security namespace.
--
2.7.4
Serge E. Hallyn
2017-07-11 17:20:02 UTC
Permalink
er.kernel.org>
Content-Length: 19839
Lines: 700
X-UID: 24770
Status: RO
This patch enables security.capability in user namespaces but also
takes a more general approach to enabling extended attributes in user
namespaces.
The following rules describe the approach using security.foo as a
1a) Reading security.foo from a user namespace will read
being the mapping of root in that parent user namespace. An
exception is if root is mapped to uid 0 on the host, and in this case
we will read security.foo directly.
mapping of root to 1000.
parent namespace is tried to be read. This procedure is repeated up to
the init user namespace. This step only applies for reading of extended
attributes and provides the same behavior as older system where the
host's extended attributes applied to user namespaces.
can be read. The uid within the user namespace will be mapped to the
corresponding uid on the host and that uid will be used in the name of
the extended attribute.
mapping of root to 1000, size of at least 2.
of <uid> also being subject to checking for valid mappings.
3) No other security.foo* can be read.
The same rules for reading apply to writing and removing of user
namespace enabled extended attributes.
When listing extended attributes of a file, only those are presented
to the user namespace that have a valid mapping. Besides that, names
of the extended attributes are adjusted to represent the mapping.
This means that if root is mapped to uid 1000 on the host, the
---
fs/xattr.c | 509 +++++++++++++++++++++++++++++++++++++++++++++--
security/commoncap.c | 36 +++-
security/selinux/hooks.c | 9 +-
3 files changed, 523 insertions(+), 31 deletions(-)
diff --git a/fs/xattr.c b/fs/xattr.c
index 464c94b..eacad9e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -133,20 +133,440 @@ xattr_permission(struct inode *inode, const char *name, int mask)
return inode_permission(inode, mask);
}
+/*
+ * A list of extended attributes that are supported in user namespaces
+ */
+static const char *const userns_xattrs[] = {
+ XATTR_NAME_CAPS,
+ NULL
+};
+
+/*
+ * xattrs_is_userns_supported - Check whether an xattr is supported in userns
+ *
+ *
+ * This function returns < 0 if not supported, an index into userns_xattrs[]
+ * otherwise.
+ */
+static int
+xattr_is_userns_supported(const char *name, int prefix)
+{
+ int i;
+
+ if (!name)
+ return -1;
+
+ for (i = 0; userns_xattrs[i]; i++) {
+ if (prefix) {
+ if (!strncmp(userns_xattrs[i], name,
+ strlen(userns_xattrs[i])))
+ return i;
I think you here need to also check that the next char is either
+ } else {
+ if (!strcmp(userns_xattrs[i], name))
+ return i;
+ }
+ }
+ return -1;
+}
+
+/*
+ * includes a prefix string
+ *
+ *
+ * This function returns a buffer with the string, or a NULL pointer in
+ * case of out-of-memory error.
+ */
+static char *
+xattr_write_uid(uid_t uid, const char *prefix)
+{
+ size_t buflen;
+ char *buffer;
+
+ if (prefix)
+ buflen += strlen(prefix);
+
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer)
+ return NULL;
+
+ if (uid == 0)
+ *buffer = 0;
Do you need to print out the prefix here?
+ else
+ (prefix) ? prefix : "",
+ uid);
+
+ return buffer;
+}
Stefan Berger
2017-07-12 00:20:01 UTC
Permalink
Post by Serge E. Hallyn
er.kernel.org>
Content-Length: 19839
Lines: 700
X-UID: 24770
Status: RO
This patch enables security.capability in user namespaces but also
takes a more general approach to enabling extended attributes in user
namespaces.
The following rules describe the approach using security.foo as a
1a) Reading security.foo from a user namespace will read
being the mapping of root in that parent user namespace. An
exception is if root is mapped to uid 0 on the host, and in this case
we will read security.foo directly.
mapping of root to 1000.
parent namespace is tried to be read. This procedure is repeated up to
the init user namespace. This step only applies for reading of extended
attributes and provides the same behavior as older system where the
host's extended attributes applied to user namespaces.
can be read. The uid within the user namespace will be mapped to the
corresponding uid on the host and that uid will be used in the name of
the extended attribute.
mapping of root to 1000, size of at least 2.
of <uid> also being subject to checking for valid mappings.
3) No other security.foo* can be read.
The same rules for reading apply to writing and removing of user
namespace enabled extended attributes.
When listing extended attributes of a file, only those are presented
to the user namespace that have a valid mapping. Besides that, names
of the extended attributes are adjusted to represent the mapping.
This means that if root is mapped to uid 1000 on the host, the
---
fs/xattr.c | 509 +++++++++++++++++++++++++++++++++++++++++++++--
security/commoncap.c | 36 +++-
security/selinux/hooks.c | 9 +-
3 files changed, 523 insertions(+), 31 deletions(-)
diff --git a/fs/xattr.c b/fs/xattr.c
index 464c94b..eacad9e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -133,20 +133,440 @@ xattr_permission(struct inode *inode, const char *name, int mask)
return inode_permission(inode, mask);
}
+/*
+ * A list of extended attributes that are supported in user namespaces
+ */
+static const char *const userns_xattrs[] = {
+ XATTR_NAME_CAPS,
+ NULL
+};
+
+/*
+ * xattrs_is_userns_supported - Check whether an xattr is supported in userns
+ *
+ *
+ * This function returns < 0 if not supported, an index into userns_xattrs[]
+ * otherwise.
+ */
+static int
+xattr_is_userns_supported(const char *name, int prefix)
+{
+ int i;
+
+ if (!name)
+ return -1;
+
+ for (i = 0; userns_xattrs[i]; i++) {
+ if (prefix) {
+ if (!strncmp(userns_xattrs[i], name,
+ strlen(userns_xattrs[i])))
+ return i;
I think you here need to also check that the next char is either
I have the checks for '@' and '\0' done by the caller. With the current
support of only security.capability I don't think we need to check for '.'.
Post by Serge E. Hallyn
+ } else {
+ if (!strcmp(userns_xattrs[i], name))
+ return i;
+ }
+ }
+ return -1;
+}
+
+/*
+ * includes a prefix string
+ *
+ *
+ * This function returns a buffer with the string, or a NULL pointer in
+ * case of out-of-memory error.
+ */
+static char *
+xattr_write_uid(uid_t uid, const char *prefix)
+{
+ size_t buflen;
+ char *buffer;
+
+ if (prefix)
+ buflen += strlen(prefix);
+
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer)
+ return NULL;
+
+ if (uid == 0)
+ *buffer = 0;
Do you need to print out the prefix here?
Right. Fixed.
Post by Serge E. Hallyn
+ else
+ (prefix) ? prefix : "",
+ uid);
+
+ return buffer;
+}
Thanks.

Stefan
Serge E. Hallyn
2017-07-12 00:50:01 UTC
Permalink
Post by Serge E. Hallyn
Post by Stefan Berger
diff --git a/fs/xattr.c b/fs/xattr.c
index 464c94b..eacad9e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -133,20 +133,440 @@ xattr_permission(struct inode *inode, const char *name, int mask)
return inode_permission(inode, mask);
}
+/*
+ * A list of extended attributes that are supported in user namespaces
+ */
+static const char *const userns_xattrs[] = {
+ XATTR_NAME_CAPS,
+ NULL
+};
+
+/*
+ * xattrs_is_userns_supported - Check whether an xattr is supported in userns
+ *
+ *
+ * This function returns < 0 if not supported, an index into userns_xattrs[]
+ * otherwise.
+ */
+static int
+xattr_is_userns_supported(const char *name, int prefix)
+{
+ int i;
+
+ if (!name)
+ return -1;
+
+ for (i = 0; userns_xattrs[i]; i++) {
+ if (prefix) {
+ if (!strncmp(userns_xattrs[i], name,
+ strlen(userns_xattrs[i])))
+ return i;
I think you here need to also check that the next char is either
current support of only security.capability I don't think we need to
check for '.'.
Ah - ok, thanks.
Serge E. Hallyn
2017-07-12 03:50:01 UTC
Permalink
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
+ return size;
+
+ if (size) {
+ nlist = kmalloc(list_maxlen, GFP_KERNEL);
+ if (!nlist)
+ return -ENOMEM;
+ }
+
+ s_off = d_off = 0;
+ while (s_off < size || size == 0) {
+ name = &list[s_off];
+
+ len = strlen(name);
+ if (!len)
+ break;
+
+ if (xattr_is_userns_supported(name, false) >= 0)
+ newname = name;
+ else {
+ newname = xattr_rewrite_userns_xattr(name);
Why are you doing this here? If we get here it means that
xattr_is_userns_supported() returned < 0, meaning name is
not userns-supported. So xattr_rewrite_userns_xattr() will
just return name. Am I missing something?
Post by Stefan Berger
+ if (IS_ERR(newname)) {
+ d_off = PTR_ERR(newname);
+ goto out_free;
+ }
+ }
+ if (newname && !xattr_list_contains(nlist, d_off, newname)) {
Now here, if name was recalculated to @newname, and @newname is
found in the nlist, that should raise an error right? Something
fishy is going on?
Post by Stefan Berger
+ nlen = strlen(newname);
+
+ if (nlist) {
+ if (nlen + 1 > list_maxlen)
d_off needs to be set to -ERANGE here.
Post by Stefan Berger
+ break;
+ strcpy(&nlist[d_off], newname);
+ }
+
+ d_off += nlen + 1;
+ if (newname != name)
+ kfree(newname);
+ }
+ s_off += len + 1;
+ }
+ if (nlist)
+ memcpy(list, nlist, d_off);
+ kfree(nlist);
+
+ return d_off;
+}
Stefan Berger
2017-07-12 11:40:01 UTC
Permalink
Post by Serge E. Hallyn
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
+ return size;
+
+ if (size) {
+ nlist = kmalloc(list_maxlen, GFP_KERNEL);
+ if (!nlist)
+ return -ENOMEM;
+ }
+
+ s_off = d_off = 0;
+ while (s_off < size || size == 0) {
+ name = &list[s_off];
+
+ len = strlen(name);
+ if (!len)
+ break;
+
+ if (xattr_is_userns_supported(name, false) >= 0)
+ newname = name;
+ else {
+ newname = xattr_rewrite_userns_xattr(name);
Why are you doing this here? If we get here it means that
xattr_is_userns_supported() returned < 0, meaning name is
not userns-supported. So xattr_rewrite_userns_xattr() will
just return name. Am I missing something?
xattr_is_userns_support(name, false) does a _full string match_ rather
than a prefix match and will only return >= 0 for security.capability.
This case handles the hosts's security.capability which 'shines
through' for read and needs to be listed. Only in this case we set
newname=name.

In the else branch we handle ***@uid=1000 and rewrite
that to security.capability for root mapping to uid=1000.
Post by Serge E. Hallyn
Post by Stefan Berger
+ if (IS_ERR(newname)) {
+ d_off = PTR_ERR(newname);
+ goto out_free;
+ }
+ }
+ if (newname && !xattr_list_contains(nlist, d_off, newname)) {
found in the nlist, that should raise an error right? Something
fishy is going on?
If security.capability is set on a file but the container doesn't have
***@uid=1000, we still need to list the former here.
However, we end up with duplicates if security.capability is there and
***@uid=1000 is also there and root is mapped to
uid=1000. Both would be shown as security.capability inside the
container. In this case we need to filter.

I think the code is correct. More problematic is a memory leak in the
error case. Will fix that.
Post by Serge E. Hallyn
Post by Stefan Berger
+ nlen = strlen(newname);
+
+ if (nlist) {
+ if (nlen + 1 > list_maxlen)
d_off needs to be set to -ERANGE here.
Fixed.
Post by Serge E. Hallyn
Post by Stefan Berger
+ break;
+ strcpy(&nlist[d_off], newname);
+ }
+
+ d_off += nlen + 1;
+ if (newname != name)
+ kfree(newname);
+ }
+ s_off += len + 1;
+ }
+ if (nlist)
+ memcpy(list, nlist, d_off);
+ kfree(nlist);
+
+ return d_off;
+}
Serge E. Hallyn
2017-07-12 17:40:01 UTC
Permalink
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
+ return size;
+
+ if (size) {
+ nlist = kmalloc(list_maxlen, GFP_KERNEL);
+ if (!nlist)
+ return -ENOMEM;
+ }
+
+ s_off = d_off = 0;
+ while (s_off < size || size == 0) {
+ name = &list[s_off];
+
+ len = strlen(name);
+ if (!len)
+ break;
+
+ if (xattr_is_userns_supported(name, false) >= 0)
+ newname = name;
+ else {
+ newname = xattr_rewrite_userns_xattr(name);
Why are you doing this here? If we get here it means that
xattr_is_userns_supported() returned < 0, meaning name is
not userns-supported. So xattr_rewrite_userns_xattr() will
just return name. Am I missing something?
xattr_is_userns_support(name, false) does a _full string match_
rather than a prefix match and will only return >= 0 for
security.capability. This case handles the hosts's
security.capability which 'shines through' for read and needs to be
listed. Only in this case we set newname=name.
Ah, right.

I think it would be worth #defining XATTR_PREFIX_SEARCH and
XATTR_FULLNAME_SEARCH or something. Or maybe not, maybe I was
just being dense.
Post by Stefan Berger
rewrite that to security.capability for root mapping to uid=1000.
Post by Serge E. Hallyn
Post by Stefan Berger
+ if (IS_ERR(newname)) {
+ d_off = PTR_ERR(newname);
+ goto out_free;
+ }
+ }
+ if (newname && !xattr_list_contains(nlist, d_off, newname)) {
found in the nlist, that should raise an error right? Something
fishy is going on?
If security.capability is set on a file but the container doesn't
here. However, we end up with duplicates if security.capability is
mapped to uid=1000. Both would be shown as security.capability
inside the container. In this case we need to filter.
Gotcha, thanks.
Post by Stefan Berger
I think the code is correct. More problematic is a memory leak in
the error case. Will fix that.
Great.
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
+ nlen = strlen(newname);
+
+ if (nlist) {
+ if (nlen + 1 > list_maxlen)
d_off needs to be set to -ERANGE here.
Fixed.
Great, thanks.

-serge
James Morris
2017-07-12 08:10:02 UTC
Permalink
Why not strlen() here?
--
James Morris
<***@namei.org>
Eric W. Biederman
2017-07-12 13:40:02 UTC
Permalink
Post by Stefan Berger
This patch enables security.capability in user namespaces but also
takes a more general approach to enabling extended attributes in user
namespaces.
The following rules describe the approach using security.foo as a
1a) Reading security.foo from a user namespace will read
being the mapping of root in that parent user namespace. An
exception is if root is mapped to uid 0 on the host, and in this case
we will read security.foo directly.
mapping of root to 1000.
parent namespace is tried to be read. This procedure is repeated up to
the init user namespace. This step only applies for reading of extended
attributes and provides the same behavior as older system where the
host's extended attributes applied to user namespaces.
can be read. The uid within the user namespace will be mapped to the
corresponding uid on the host and that uid will be used in the name of
the extended attribute.
mapping of root to 1000, size of at least 2.
of <uid> also being subject to checking for valid mappings.
3) No other security.foo* can be read.
The same rules for reading apply to writing and removing of user
namespace enabled extended attributes.
When listing extended attributes of a file, only those are presented
to the user namespace that have a valid mapping. Besides that, names
of the extended attributes are adjusted to represent the mapping.
This means that if root is mapped to uid 1000 on the host, the
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.

From the replies to this it doesn't look like Serge has reviewed this
version either.

I am disappointed that all of my concerns about technical feasibility
remain unaddressed.

I hope my reading and review of the code goes better than my reading of
it's introduction.

Eric
Post by Stefan Berger
---
fs/xattr.c | 509 +++++++++++++++++++++++++++++++++++++++++++++--
security/commoncap.c | 36 +++-
security/selinux/hooks.c | 9 +-
3 files changed, 523 insertions(+), 31 deletions(-)
diff --git a/fs/xattr.c b/fs/xattr.c
index 464c94b..eacad9e 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -133,20 +133,440 @@ xattr_permission(struct inode *inode, const char *name, int mask)
return inode_permission(inode, mask);
}
+/*
+ * A list of extended attributes that are supported in user namespaces
+ */
+static const char *const userns_xattrs[] = {
+ XATTR_NAME_CAPS,
+ NULL
+};
+
+/*
+ * xattrs_is_userns_supported - Check whether an xattr is supported in userns
+ *
+ *
+ * This function returns < 0 if not supported, an index into userns_xattrs[]
+ * otherwise.
+ */
+static int
+xattr_is_userns_supported(const char *name, int prefix)
+{
+ int i;
+
+ if (!name)
+ return -1;
+
+ for (i = 0; userns_xattrs[i]; i++) {
+ if (prefix) {
+ if (!strncmp(userns_xattrs[i], name,
+ strlen(userns_xattrs[i])))
+ return i;
+ } else {
+ if (!strcmp(userns_xattrs[i], name))
+ return i;
+ }
+ }
+ return -1;
+}
+
+/*
+ * includes a prefix string
+ *
+ *
+ * This function returns a buffer with the string, or a NULL pointer in
+ * case of out-of-memory error.
+ */
+static char *
+xattr_write_uid(uid_t uid, const char *prefix)
+{
+ size_t buflen;
+ char *buffer;
+
+ if (prefix)
+ buflen += strlen(prefix);
+
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer)
+ return NULL;
+
+ if (uid == 0)
+ *buffer = 0;
+ else
+ (prefix) ? prefix : "",
+ uid);
+
+ return buffer;
+}
+
+/*
+ * user namespaces and check mappings
+ *
+ * the uid remapped
+ *
+ * This function returns an error code or 0 in case of success. In case
+ * of success, 'n_uidstr' will hold a valid string.
+ */
+static int
+xattr_parse_uid_from_kuid(const char *uidstr, struct user_namespace *userns,
+ char **n_uidstr)
+{
+ int n;
+ uid_t muid, p_uid;
+ char d;
+ kuid_t tuid;
+
+ *n_uidstr = NULL;
+
+ if (n != 1)
+ return -EINVAL;
+
+ /* do we have a mapping of the uid? */
+ tuid = KUIDT_INIT(p_uid);
+ muid = from_kuid(userns, tuid);
+ if (muid == -1)
+ return -ENOENT;
+
+ *n_uidstr = xattr_write_uid(muid, NULL);
+ if (!*n_uidstr)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * user namespaces and check mappings
+ *
+ * the uid remapped
+ *
+ * This function returns an error code or 0 in case of success. In case
+ * of success, 'n_uidstr' will hold a valid string.
+ */
+static int
+xattr_parse_uid_make_kuid(const char *uidstr, struct user_namespace *userns,
+ char **n_uidstr)
+{
+ int n;
+ uid_t p_uid;
+ char d;
+ kuid_t tuid;
+
+ *n_uidstr = NULL;
+
+ if (n != 1)
+ return -EINVAL;
+
+ tuid = make_kuid(userns, p_uid);
+ if (!uid_valid(tuid))
+ return -ENOENT;
+
+ *n_uidstr = xattr_write_uid(__kuid_val(tuid), NULL);
+ if (!*n_uidstr)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * xattr_rewrite_userns_xattr - Rewrite and filter an extended attribute
+ * considering user namespace uid mappings and
+ * user namespace support extended attributes
+ *
+ *
+ * This function returns NULL if the name is to be filtered. Otherwise it can
+ * return the input buffer or a new buffer that the caller needs to free. The
+ * new buffer contains a rewritten extended attribute whose string length may
+ * exceed that of the given name.
+ */
+static char *
+xattr_rewrite_userns_xattr(char *name)
+{
+ int idx, error;
+ size_t len = 0, buflen;
+ char *buffer, *n_uidstr;
+
+ /* prefix-match name against supported attributes */
+ idx = xattr_is_userns_supported(name, true);
+ if (idx < 0) {
+ /* only rewrite those in userns_xattr[*] */
+ return name;
+ }
+
+ /* exact match ? */
+ len = strlen(userns_xattrs[idx]);
+ if (name[len] == 0)
+ return NULL;
+
+ /*
+ */
+ error = xattr_parse_uid_from_kuid(&name[len], current_user_ns(),
+ &n_uidstr);
+ if (error)
+ return NULL;
+
+ buflen = len + strlen(n_uidstr) + 1;
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer) {
+ kfree(n_uidstr);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ name[len] = 0;
+
+ snprintf(buffer, buflen, "%s%s", name, n_uidstr);
+
+
+ kfree(n_uidstr);
+
+ return buffer;
+}
+
+/*
+ * xattr_list_contains - check whether an xattr list already contains a needle
+ *
+ */
+static int
+xattr_list_contains(const char *list, size_t listlen, const char *needle)
+{
+ size_t o = 0;
+
+ while (o < listlen) {
+ if (!strcmp(&list[o], needle))
+ return true;
+ o += strlen(&list[o]) + 1;
+ }
+ return false;
+}
+
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
+ return size;
+
+ if (size) {
+ nlist = kmalloc(list_maxlen, GFP_KERNEL);
+ if (!nlist)
+ return -ENOMEM;
+ }
+
+ s_off = d_off = 0;
+ while (s_off < size || size == 0) {
+ name = &list[s_off];
+
+ len = strlen(name);
+ if (!len)
+ break;
+
+ if (xattr_is_userns_supported(name, false) >= 0)
+ newname = name;
+ else {
+ newname = xattr_rewrite_userns_xattr(name);
+ if (IS_ERR(newname)) {
+ d_off = PTR_ERR(newname);
+ goto out_free;
+ }
+ }
+ if (newname && !xattr_list_contains(nlist, d_off, newname)) {
+ nlen = strlen(newname);
+
+ if (nlist) {
+ if (nlen + 1 > list_maxlen)
+ break;
+ strcpy(&nlist[d_off], newname);
+ }
+
+ d_off += nlen + 1;
+ if (newname != name)
+ kfree(newname);
+ }
+ s_off += len + 1;
+ }
+ if (nlist)
+ memcpy(list, nlist, d_off);
+ kfree(nlist);
+
+ return d_off;
+}
+
+/*
+ * xattr_userns_name - modify the name of a user namespace supported
+ * extended attribute
+ *
+ * In a user namespace we prevent read/write accesses to the host's
+ * security.foo to protect these extended attributes.
+ *
+ * 1a) Reading security.foo from a user namespace will read
+ * being the mapping of root in that parent user namespace. An
+ * exception is if root is mapped to uid 0 on the host, and in this case
+ * we will read security.foo directly.
+ * mapping of root to 1000.
+ *
+ * parent namespace is tried to be read. This procedure is repeated up to
+ * the init user namespace. This step only applies for reading of extended
+ * attributes and provides the same behavior as older systems where the
+ * host's extended attributes applied to user namespaces.
+ *
+ * an be read. The uid within the user namespace will be mapped to the
+ * corresponding uid on the host and that uid will be used in the name of
+ * the extended attribute.
+ * mapping of root to 1000, size of at least 2.
+ *
+ * of <uid> also being subject to checking for valid mappings.
+ *
+ * 3) No other security.foo* can be read.
+ *
+ * The same rules for reading apply to writing and removing, except for 1b).
+ *
+ * This function returns a buffer with either the original name or the
+ * user namespace adjusted name of the extended attribute.
+ *
+ */
+char *
+xattr_userns_name(const char *name, struct user_namespace *userns)
+{
+ size_t buflen;
+ char *buffer, *n_uidstr;
+ kuid_t root_uid = make_kuid(userns, 0);
+ int idx, error;
+ size_t len;
+
+ /* only security.foo will be changed here - prefix match here */
+ idx = xattr_is_userns_supported(name, true);
+ if (idx < 0)
+ goto out_copy;
+
+ len = strlen(userns_xattrs[idx]);
+ if (name[len] == 0) {
+ /*
+ * init_user_ns or userns with root mapped to uid 0
+ * may read security.foo directly
+ */
+ if (userns == &init_user_ns ||
+ __kuid_val(root_uid) == 0)
+ goto out_copy;
+
+ if (!uid_valid(root_uid))
+ return ERR_PTR(-EINVAL);
+
+ buffer = xattr_write_uid(__kuid_val(root_uid), name);
+ if (!buffer)
+ return ERR_PTR(-ENOMEM);
+
+ return buffer;
+ }
+
+ /*
+ */
+ error = xattr_parse_uid_make_kuid(&name[len],
+ userns,
+ &n_uidstr);
+ if (error)
+ return ERR_PTR(error);
+
+ buflen = len + strlen(n_uidstr) + 1;
+ buffer = kmalloc(buflen, GFP_KERNEL);
+ if (!buffer) {
+ kfree(n_uidstr);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ snprintf(buffer, len + 1, "%s", name);
+ snprintf(&buffer[len], buflen - len, "%s", n_uidstr);
+ kfree(n_uidstr);
+
+ return buffer;
+
+ buffer = kstrdup(name, GFP_KERNEL);
+ if (!buffer)
+ return ERR_PTR(-ENOMEM);
+
+ return buffer;
+}
+
int
__vfs_setxattr(struct dentry *dentry, struct inode *inode, const char *name,
const void *value, size_t size, int flags)
{
const struct xattr_handler *handler;
+ char *newname;
+ int ret;
+ newname = xattr_userns_name(name, current_user_ns());
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+ name = newname;
handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->set)
- return -EOPNOTSUPP;
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->set) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
if (size == 0)
value = ""; /* empty EA, do not remove */
- return handler->set(handler, dentry, inode, name, value, size, flags);
+ ret = handler->set(handler, dentry, inode, name, value, size, flags);
+
+ kfree(newname);
+ return ret;
}
EXPORT_SYMBOL(__vfs_setxattr);
@@ -301,14 +721,39 @@ ssize_t
__vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
void *value, size_t size)
{
- const struct xattr_handler *handler;
+ const struct xattr_handler *handler = NULL;
+ char *newname = NULL;
+ int ret, userns_supt_xattr;
+ struct user_namespace *userns = current_user_ns();
+
+ userns_supt_xattr = (xattr_is_userns_supported(name, false) >= 0);
+
+ do {
+ kfree(newname);
+
+ newname = xattr_userns_name(name, userns);
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+
+ if (!handler) {
+ name = newname;
+ handler = xattr_resolve_name(inode, &name);
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->get) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ }
+ ret = handler->get(handler, dentry, inode, name, value, size);
+ userns = userns->parent;
+ } while ((ret == -ENODATA) && userns && userns_supt_xattr);
- handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->get)
- return -EOPNOTSUPP;
- return handler->get(handler, dentry, inode, name, value, size);
+ kfree(newname);
+ return ret;
}
EXPORT_SYMBOL(__vfs_getxattr);
@@ -328,8 +773,16 @@ vfs_getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
if (!strncmp(name, XATTR_SECURITY_PREFIX,
XATTR_SECURITY_PREFIX_LEN)) {
- const char *suffix = name + XATTR_SECURITY_PREFIX_LEN;
- int ret = xattr_getsecurity(inode, suffix, value, size);
+ int ret;
+ const char *suffix;
+ char *newname = xattr_userns_name(name, current_user_ns());
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+
+ suffix = newname + XATTR_SECURITY_PREFIX_LEN;
+
+ ret = xattr_getsecurity(inode, suffix, value, size);
+ kfree(newname);
/*
* Only overwrite the return value if a security module
* is actually active.
@@ -360,6 +813,9 @@ vfs_listxattr(struct dentry *dentry, char *list, size_t size)
if (size && error > size)
error = -ERANGE;
}
+ if (error > 0)
+ error = xattr_list_userns_rewrite(list, error, size);
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_listxattr);
@@ -369,13 +825,28 @@ __vfs_removexattr(struct dentry *dentry, const char *name)
{
struct inode *inode = d_inode(dentry);
const struct xattr_handler *handler;
+ char *newname;
+ int ret;
+ newname = xattr_userns_name(name, current_user_ns());
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+ name = newname;
handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->set)
- return -EOPNOTSUPP;
- return handler->set(handler, dentry, inode, name, NULL, 0, XATTR_REPLACE);
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->set) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ ret = handler->set(handler, dentry, inode, name, NULL, 0, XATTR_REPLACE);
+
+ kfree(newname);
+
+ return ret;
}
EXPORT_SYMBOL(__vfs_removexattr);
diff --git a/security/commoncap.c b/security/commoncap.c
index 7abebd7..c842690 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -660,15 +660,23 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
int cap_inode_setxattr(struct dentry *dentry, const char *name,
const void *value, size_t size, int flags)
{
- if (!strcmp(name, XATTR_NAME_CAPS)) {
- if (!capable(CAP_SETFCAP))
+ if (strncmp(name, XATTR_SECURITY_PREFIX,
+ sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+ return 0;
+
+ if (strncmp(name, XATTR_NAME_CAPS,
+ sizeof(XATTR_NAME_CAPS) - 1) == 0) {
+ struct inode *inode = d_backing_inode(dentry);
+
+ if (!inode)
+ return -EINVAL;
+ if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
return -EPERM;
+
return 0;
}
- if (!strncmp(name, XATTR_SECURITY_PREFIX,
- sizeof(XATTR_SECURITY_PREFIX) - 1) &&
- !capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
@@ -686,15 +694,23 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
*/
int cap_inode_removexattr(struct dentry *dentry, const char *name)
{
- if (!strcmp(name, XATTR_NAME_CAPS)) {
- if (!capable(CAP_SETFCAP))
+ if (strncmp(name, XATTR_SECURITY_PREFIX,
+ sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+ return 0;
+
+ if (strncmp(name, XATTR_NAME_CAPS,
+ sizeof(XATTR_NAME_CAPS) - 1) == 0) {
+ struct inode *inode = d_backing_inode(dentry);
+
+ if (!inode)
+ return -EINVAL;
+ if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
return -EPERM;
+
return 0;
}
- if (!strncmp(name, XATTR_SECURITY_PREFIX,
- sizeof(XATTR_SECURITY_PREFIX) - 1) &&
- !capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN))
return -EPERM;
return 0;
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 819fd68..702c225 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3091,8 +3091,13 @@ static int selinux_inode_setotherxattr(struct dentry *dentry, const char *name)
if (!strncmp(name, XATTR_SECURITY_PREFIX,
sizeof XATTR_SECURITY_PREFIX - 1)) {
- if (!strcmp(name, XATTR_NAME_CAPS)) {
- if (!capable(CAP_SETFCAP))
+ if (!strncmp(name, XATTR_NAME_CAPS,
+ sizeof(XATTR_NAME_CAPS) - 1)) {
+ struct inode *inode = d_backing_inode(dentry);
+
+ if (!inode)
+ return -EINVAL;
+ if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
return -EPERM;
} else if (!capable(CAP_SYS_ADMIN)) {
/* A different attribute in the security namespace.
Serge E. Hallyn
2017-07-12 17:10:02 UTC
Permalink
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
Post by Eric W. Biederman
Post by Eric W. Biederman
From the replies to this it doesn't look like Serge has reviewed this
version either.
I am disappointed that all of my concerns about technical feasibility
remain unaddressed.
Can you re-state those, or give a link to them?

I'd really like to get to a point where unprivileged containers can start
using filecaps - at this point if that means having an extra temporary
file format based on my earlier patchset while we hash this out, that
actually seems worthwhile. But it would of course be ideal if we could
do the name based caps right in the first place.

-serge
James Morris
2017-07-12 22:30:01 UTC
Permalink
Post by Serge E. Hallyn
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
Please try and keep technical discussions public or at least document them
when reposting the patches.
--
James Morris
<***@namei.org>
Eric W. Biederman
2017-07-13 00:50:01 UTC
Permalink
Post by James Morris
Post by Serge E. Hallyn
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
Please try and keep technical discussions public or at least document them
when reposting the patches.
Yes please.

A public discussion helps to understand what the challenges are.

Eric
Serge E. Hallyn
2017-07-13 01:10:01 UTC
Permalink
Post by Eric W. Biederman
Post by James Morris
Post by Serge E. Hallyn
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
I wasn't clear here. There were a lot of changes between the first
mention of the approach and the posting of v1.

My point was that I did in fact agree to Reviewed-by, and the fact that
I've found a few more things to point out only reflects my missing them
before. I don't think my name is being mis-used.
Post by Eric W. Biederman
Post by James Morris
Please try and keep technical discussions public or at least document them
when reposting the patches.
Yes please.
A public discussion helps to understand what the challenges are.
Yes, all discussion (that I can find in my mbox) since v1 has been public.

-serge
Eric W. Biederman
2017-07-12 23:30:01 UTC
Permalink
Post by Serge E. Hallyn
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
Post by Eric W. Biederman
Post by Eric W. Biederman
From the replies to this it doesn't look like Serge has reviewed this
version either.
I am disappointed that all of my concerns about technical feasibility
remain unaddressed.
Can you re-state those, or give a link to them?
Well I only posted about one substantive comment on the last round
so it should be easy to find that said.

The big question is how does this intereact with filesystems
xattr implementations?

There is the potential that we create many more security xattrs this
way. How does that scale? With more names etc.
What happens if we have one xattr per uid for 1000+ uids?

How does this interact with filesystems optimization of xattr names?
For some filesystems they optmize the xattr names, and don't store the
entire thing.
Post by Serge E. Hallyn
I'd really like to get to a point where unprivileged containers can start
using filecaps - at this point if that means having an extra temporary
file format based on my earlier patchset while we hash this out, that
actually seems worthwhile. But it would of course be ideal if we could
do the name based caps right in the first place.
This whole new version has set my review back to square one
unfortunately.

Eric
Stefan Berger
2017-07-13 00:50:01 UTC
Permalink
Post by Eric W. Biederman
Post by Serge E. Hallyn
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
Post by Eric W. Biederman
Post by Eric W. Biederman
From the replies to this it doesn't look like Serge has reviewed this
version either.
I am disappointed that all of my concerns about technical feasibility
remain unaddressed.
Can you re-state those, or give a link to them?
Well I only posted about one substantive comment on the last round
so it should be easy to find that said.
The big question is how does this intereact with filesystems
xattr implementations?
There is the potential that we create many more security xattrs this
way. How does that scale? With more names etc.
It doesn't scale. Shared filesystems are a problem if many containers
use them.

'man listxattr' also mentions this here as a BUG:

" As noted in xattr(7), the VFS imposes a limit of 64 kB on the size of
the extended attribute name list returned by listxattr(7). If the
total size of attribute names attached to a file exceeds this limit,
it is no longer possible to retrieve the list of attribute names."

A simple test on ext4:

#> touch foo
#> for ((i = 0; i < 200; i++)); do setfattr -n user.foo${i} -v hello
foo; done

user.foo126 was the last one created...

Depending on the size of the data the xattrs are writing, the limit is
reached sooner. Writing 'hellohello' only goes up to 'user.foo112'.
Maybe one could try to encode the data more efficiently or as Serge did
write the uid on the xattr value side, but either way, it won't scale
due to that VFS limit.

Stefan
Post by Eric W. Biederman
What happens if we have one xattr per uid for 1000+ uids?
How does this interact with filesystems optimization of xattr names?
For some filesystems they optmize the xattr names, and don't store the
entire thing.
Post by Serge E. Hallyn
I'd really like to get to a point where unprivileged containers can start
using filecaps - at this point if that means having an extra temporary
file format based on my earlier patchset while we hash this out, that
actually seems worthwhile. But it would of course be ideal if we could
do the name based caps right in the first place.
This whole new version has set my review back to square one
unfortunately.
Eric
Theodore Ts'o
2017-07-13 01:20:01 UTC
Permalink
I'm really confused what problem that is trying to be solved, here,
but it **feels** really, really wrong.

Why do we need to store all of this state on a per-file basis, instead
of some kind of per-file system or per-container data structure?

And how many of these ***@uid=bar xattrs do you expect there
to be? How many "foo", and how many "bar"?

Maybe I missed the full write up, in which case please send me a link
to the full writeup --- ideally in the form of a design doc that
explains the problem statement, gives some examples of how it's going
to be used, what were the other alternatives that were considered, and
why they were rejected, etc.

Thanks,

- Ted
Serge E. Hallyn
2017-07-13 02:40:02 UTC
Permalink
Post by Theodore Ts'o
I'm really confused what problem that is trying to be solved, here,
but it **feels** really, really wrong.
Hi,

The intro to my original patch might help (or maybe not), as it
has a different motivating text:

http://lkml.org/lkml/2016/11/19/158

We want file capabilities to be supported in unprivileged containers,
so that a piece of software can count on them being available rather
than having to supporting multiple ways of getting+dropping privilege
(for instance, being installed as uid 1000 with cap_net_raw=pe, versus
being installed setuid-root and being expected to do PR_SET_KEEPCAPS
and setuid).

If subuids 10000-20000 are delegated to uid 1001 on the host, and uid
1001 sets up a container with subuid 100000 mapped to container uid 0,
then the container root should be able to write file capabilities
which affect (that is, delegate container root's privilege to) all ids
over which it has privilege (all uids mapped into the container), but
should not have privilege over any uids not mapped into the container.
With regular file capabilities, this is impossible, since any filecap
he writes can then be exercised on the host by uid 1000.

The point of this set (and the ones before it) is to make it so that
the filecap written by the container root is tagged on disk as belonging
to subuid 100000.
Post by Theodore Ts'o
Why do we need to store all of this state on a per-file basis, instead
of some kind of per-file system or per-container data structure?
This needs to be writeable by an unprivileged user, with no help from
the admin. AFAICS that rules out per-fs data structure.

Note we are not assuming a filesystem per container. The typical case
is (for instance) ~/.local/share/lxc/c1/rootfs being the root of
container c1's filesystem. Mounting a filesystem from inside a user
namespace is still mostly science fiction today.
Post by Theodore Ts'o
to be? How many "foo", and how many "bar"?
For now I'm expecting two foos - security and ima. The '@uid=bar' is
generic enough that it *can* be re-used for a different kind of
property if we decide to later, but I have no intention of adding
anything.

Casey has mentioned 'smack=', but i think only to keep the option open.
I don't believe he has concrete plans.
Post by Theodore Ts'o
Maybe I missed the full write up, in which case please send me a link
to the full writeup --- ideally in the form of a design doc that
explains the problem statement, gives some examples of how it's going
to be used, what were the other alternatives that were considered, and
why they were rejected, etc.
As I'd mentioned in an even older patch, http://lkml.org/lkml/2016/5/18/622 ,
I had considered using a completely separate xattr name, but that would
have required invasive userspace changes.

There's no design doc as such, mainly a progressive series of patches to
lkml. I am very seriously considering writing a paper to detail both
this design and the user ns design in general, as it has become clear
(in unrelated conversations) there is still a lot of confusiong out
there regarding uid namespaces and targeted capabilities. But it's not
written yet.

-serge
Eric W. Biederman
2017-07-13 12:20:04 UTC
Permalink
Post by Theodore Ts'o
I'm really confused what problem that is trying to be solved, here,
but it **feels** really, really wrong.
Why do we need to store all of this state on a per-file basis, instead
of some kind of per-file system or per-container data structure?
to be? How many "foo", and how many "bar"?
Maybe I missed the full write up, in which case please send me a link
to the full writeup --- ideally in the form of a design doc that
explains the problem statement, gives some examples of how it's going
to be used, what were the other alternatives that were considered, and
why they were rejected, etc.
The concise summary:

Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.

User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.

We currently have two proposals on the table. One is to bump the
revision number of security.capable and add more information in that xattr.
The other is to use a sligthly different capability name.

We are currently evaluating between the two proposals.

Given that it appears the IMA xattrs will want similar treatment coming
up with a good pattern to follow is part of the analysis here.

Eric
Theodore Ts'o
2017-07-13 16:50:01 UTC
Permalink
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
variant of the form:

***@uid=15000

It's not like there will be:

***@uid=1000
***@uid=2000

Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.

So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?

So this hack of using ***@uid=1000 is *only* useful when the
subcontainer root wants to create the privileged executable. You
still have to do things the other way.

So can we make perhaps the assertion that *either*:

security.foo

exists, *or*

***@uid=BAR

exists, but never both? And there BAR is exclusive to only one
instances?

Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.

-Ted
Stefan Berger
2017-07-13 17:10:02 UTC
Permalink
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
A file shared by 2 containers, one mapping root to uid=1000, the other
mapping root to uid=2000, will show these two xattrs on the host
(init_user_ns) once these containers set xattrs on that file.
Post by Theodore Ts'o
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
Note that we do keep compatibility to existing behavior. The
security.foo of the host is visible inside any container for as long as
the container root user doesn't set its own security.foo on that file,
which then hides it. Does that address this concern?
Post by Theodore Ts'o
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
In the current implementation BAR is visible inside of any instance that
'covers' this uid with the mapping range. Above example of
***@uid=1000 appears as security.foo inside the container with
root mapping to uid 1000 (@uid=0 is suppressed) but also appears as
***@uid=100 with root uid mapping to 900 (and range of at least
101).
Post by Theodore Ts'o
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Can you define what 'scalable' means for you in this context?
From what I can see sharing a filesystem between multiple containers
doesn't 'scale well' for virtualizing the xattrs primarily because of
size limitations of xattrs per file.

Stefan
Post by Theodore Ts'o
-Ted
Eric W. Biederman
2017-07-13 17:50:01 UTC
Permalink
Post by Stefan Berger
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
A file shared by 2 containers, one mapping root to uid=1000, the other
mapping root to uid=2000, will show these two xattrs on the host
(init_user_ns) once these containers set xattrs on that file.
There is an interesting solution for shared directory trees containing
executables.

Overlayfs is needed if you need those directory trees to be writable and
for the files to show up as owned by uid 0. An overlayfs will have to
do something with the security.capable attribute. So ignoring that case.

If you don't care about the ownership of the files, and read only is
acceptable, and you still don't want to give these executables
capabilities in the initial user namespace. What you can do is
make everything owned by some non-zero uid including the security
capability. Call this non-zero uid image-root.

When the container starts it creates two nested user namespaces first
with image-root mapped to 0. Then with the containers choice of uid
mapped to 0 image-root unmapped. This will ensure the capability
attributes work for all containers that share that root image. And it
ensures the file are read-only from the container.

So I don't think there is ever a case where we would share a filesystem
image where we would need to set multiple security attributes on a file.
Post by Stefan Berger
Post by Theodore Ts'o
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Can you define what 'scalable' means for you in this context?
From what I can see sharing a filesystem between multiple containers
doesn't 'scale well' for virtualizing the xattrs primarily because of
size limitations of xattrs per file.
Worse than that I believe you will find that filesystems are built on
the assumption that there will be a small number of xattrs per file.
So even if the vfs limitations were lifted the filesystem performance
would suffer.

Even if the filesystem performed well I believe there are other issues
with stat, and simply not having so much meta-data that adminstrators
and tools get confused.

So I believe there are some very good fundamental reasons why we want to
limit the amount of meta-data per file.

Eric
Theodore Ts'o
2017-07-13 19:20:02 UTC
Permalink
Post by Eric W. Biederman
Post by Stefan Berger
Can you define what 'scalable' means for you in this context?
From what I can see sharing a filesystem between multiple containers
doesn't 'scale well' for virtualizing the xattrs primarily because of
size limitations of xattrs per file.
Worse than that I believe you will find that filesystems are built on
the assumption that there will be a small number of xattrs per file.
So even if the vfs limitations were lifted the filesystem performance
would suffer.
That's why I've been pushing here. If people try to do

***@uid=1000
***@uid=2000
***@uid=3000
***@uid=4000
***@uid=5000
***@uid=6000
***@uid=7000
***@uid=8000
***@uid=9000
.
.
.

... where the values of all of these will be the same, this is going
to be *awful* even if the file system can support it.

So maybe we are better off if we define an xattr

***@guest-container

... so the property is that it is ignored by the host ("real")
container, and in all of the subcontainers, it will be used if the
local container root is trying to execute the file.

Now, this doesn't support the solution of the "turtles all the way
down" insane containers configuraiton.

E.g., where in one container we boot a RHEL distro, which then
launches another container running another RHEL distro, and repeat
this 1000 times, for a very deeply nested subsubsubsubsub...container.

I think this is insane and we shouldn't support this, but I know there
are people who think this is a perfectly sane thing to do.


The other thing this doesn't support is someone who wants to use IMA,
and where the every single IMA is using a different signed HMAC:

***@uid=1000
***@uid=2000
***@uid=3000
***@uid=4000
***@uid=5000
***@uid=6000
***@uid=7000
***@uid=8000
***@uid=9000
.
.
.

Where each security IMA could either be a 32 byte HMAC, or worse, a
256 byte RSA signed signature. Now let's assume there are 10,000
containers, each of which needs a separate RSA signature. This sounds
insane.... but I've seen things that I've thought were more insane
coming out of containerland, so it would be nice if we can get
something signed in blood promisng that no, we would *never* do
something that insane, or better yet, make it impossible to do from an
architectural standpoint.

- Ted
Serge E. Hallyn
2017-07-13 19:50:01 UTC
Permalink
Post by Theodore Ts'o
Post by Eric W. Biederman
Post by Stefan Berger
Can you define what 'scalable' means for you in this context?
From what I can see sharing a filesystem between multiple containers
doesn't 'scale well' for virtualizing the xattrs primarily because of
size limitations of xattrs per file.
Worse than that I believe you will find that filesystems are built on
the assumption that there will be a small number of xattrs per file.
So even if the vfs limitations were lifted the filesystem performance
would suffer.
That's why I've been pushing here. If people try to do
.
.
.
... where the values of all of these will be the same, this is going
to be *awful* even if the file system can support it.
Typically users will be allocated a single range of ids, for instance
100000-200000. We might therefore consider putting a range in the uid=,
i.e. ***@uid=100000-200000. I don't think that's really
needed, but it's an option.

Consider that the executable will be owned by some kuid+kgid. If we
have all the xattrs you list above, then who would we have actually
owning the file? If we're chown'ing it anyway (to be root-owned but
not seutid-root), then this discussion is moot, because we'll have
to re-write the xattr after the chown. So for this to matter, we
would have an fs owned by either uid nobody in the container, or
by some special user (mapped to 100000 in the container perhaps)
which is always special-case-mapped into the container.
Post by Theodore Ts'o
So maybe we are better off if we define an xattr
... so the property is that it is ignored by the host ("real")
container, and in all of the subcontainers, it will be used if the
local container root is trying to execute the file.
In the previous discussion we considered having '***@uid='
with no following integer, meaning that it would take effect in all
user namespaces which do not have kuid 0 as root.

This could be useful for cases like docker hosts, but note that writing
this has to require either global CAP_SETFCAP, or CAP_SETFCAP in a
user namespace that has every kuid except 0 mapped. If joe, uid 1000,
has subuids 100000-20000 delegated to him, then he must not be allowed
to write something that can affect someone with kuids 300000-400000.

-serge
Serge E. Hallyn
2017-07-13 21:20:02 UTC
Permalink
Post by Eric W. Biederman
If you don't care about the ownership of the files, and read only is
acceptable, and you still don't want to give these executables
capabilities in the initial user namespace. What you can do is
make everything owned by some non-zero uid including the security
capability. Call this non-zero uid image-root.
When the container starts it creates two nested user namespaces first
with image-root mapped to 0. Then with the containers choice of uid
mapped to 0 image-root unmapped. This will ensure the capability
attributes work for all containers that share that root image. And it
ensures the file are read-only from the container.
So I don't think there is ever a case where we would share a filesystem
image where we would need to set multiple security attributes on a file.
Neat idea. In fact, you can take it a step further and still have the
files be owned by valid uids in the containers. The parent ns just
needs to have its *root* map to a common kuid not mapped into the child
namespaces, but the files can be owned by another kuid which *is* mapped
into the child containers.
James Morris
2017-07-18 07:10:02 UTC
Permalink
A file shared by 2 containers, one mapping root to uid=1000, the other mapping
root to uid=2000, will show these two xattrs on the host (init_user_ns) once
these containers set xattrs on that file.
I may be missing something here, but what happens when say the uid=2000
container and associated user is deleted from the system, then another is
created with the same uid?

Won't this mean that you have unexpected capabilities turning up in the
new container?
--
James Morris
<***@namei.org>
Stefan Berger
2017-07-18 12:20:02 UTC
Permalink
Post by James Morris
A file shared by 2 containers, one mapping root to uid=1000, the other mapping
root to uid=2000, will show these two xattrs on the host (init_user_ns) once
these containers set xattrs on that file.
I may be missing something here, but what happens when say the uid=2000
container and associated user is deleted from the system, then another is
created with the same uid?
Won't this mean that you have unexpected capabilities turning up in the
new container?
Yes, that's right. I don't know any solution for that. We would have to
walk the filesystems and find all 'stale' xattrs with such a uid. This
is independent of whether the uid is encoded on the name side, as in
this patch, or on the value side, as in Serge's original proposal. And
uids of a mapped container root user don't necessarily have to have an
account on the host so that an account deletion could trigger that.

Stefan
Eric W. Biederman
2017-07-18 13:40:01 UTC
Permalink
Post by James Morris
A file shared by 2 containers, one mapping root to uid=1000, the other mapping
root to uid=2000, will show these two xattrs on the host (init_user_ns) once
these containers set xattrs on that file.
I may be missing something here, but what happens when say the uid=2000
container and associated user is deleted from the system, then another is
created with the same uid?
Won't this mean that you have unexpected capabilities turning up in the
new container?
Yes, that's right. I don't know any solution for that. We would have to walk the
filesystems and find all 'stale' xattrs with such a uid. This is independent of
whether the uid is encoded on the name side, as in this patch, or on the value
side, as in Serge's original proposal. And uids of a mapped container root user
don't necessarily have to have an account on the host so that an account
deletion could trigger that.
This problem is actually independent of this piece of code entirely.
Any lingering files owned by that uid have the same issue.

Uids are are persistent system property. It must be arranged that they
are managed carefully. If you want to reuse a uid userdel or the
equivalent must be able to go out and delete everything they have owned.

Which is to say fundamentally this is userspaces's responsibility.

I don't see this as being particularly subtle or novel. We already
track which uids and which subordinate uids go together. I will nack
anything that allows multiple capability attributes per file as we have
established they are unnecessary. So I don't see any scenarios where
this is likely to come up that it would not be a problem without these
uid tagged security capabilities.

Eric
Serge E. Hallyn
2017-07-18 23:20:02 UTC
Permalink
Post by Eric W. Biederman
Post by James Morris
A file shared by 2 containers, one mapping root to uid=1000, the other mapping
root to uid=2000, will show these two xattrs on the host (init_user_ns) once
these containers set xattrs on that file.
I may be missing something here, but what happens when say the uid=2000
container and associated user is deleted from the system, then another is
created with the same uid?
Won't this mean that you have unexpected capabilities turning up in the
new container?
Yes, that's right. I don't know any solution for that. We would have to walk the
filesystems and find all 'stale' xattrs with such a uid. This is independent of
whether the uid is encoded on the name side, as in this patch, or on the value
side, as in Serge's original proposal. And uids of a mapped container root user
don't necessarily have to have an account on the host so that an account
deletion could trigger that.
This problem is actually independent of this piece of code entirely.
Any lingering files owned by that uid have the same issue.
In particular, any setuid-root files in that container have the precisely
analogous issue.

-serge
Eric W. Biederman
2017-07-13 17:30:02 UTC
Permalink
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Yep. That is what it looks like from here.

Which is why I asked the question about scalability of the xattr
implementations. It looks like trying to accomodate the general
case just gets us in trouble, and sets unrealistic expectations.

Which strongly suggests that Serge's previous version that
just reved the format of security.capable so that a uid field could
be added is likely to be the better approach.

I want to see what Serge and Stefan have to say but the case looks
pretty clear cut at the moment.

Eric
Stefan Berger
2017-07-13 17:40:02 UTC
Permalink
Post by Eric W. Biederman
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Yep. That is what it looks like from here.
Which is why I asked the question about scalability of the xattr
implementations. It looks like trying to accomodate the general
case just gets us in trouble, and sets unrealistic expectations.
Which strongly suggests that Serge's previous version that
just reved the format of security.capable so that a uid field could
be added is likely to be the better approach.
I want to see what Serge and Stefan have to say but the case looks
pretty clear cut at the moment.
The approach of virtualizing the xattrs on the name-side, which is what
this patch does, provides a more general approach than to virtualizing
it on the value side, which is what Serge does in his other patch for
security.capability alone. With the virtualizing on-the-value side
virtualizing the xattr becomes an exercise that needs to be repeated for
every xattr name that one would want to virtualize. With this patch you
would just add another xattr name to a list, a one-line patch in the
end. Xattr with prefixes like trusted.* need a bit more work but this
can be woven in as well
(https://github.com/stefanberger/linux/commit/397b1a3b24045c67405fc83465e544fc865d402f).

For virtualizing the xattrs on the 'value' side I was looking for
whether there's something like a 'wrapper' structure around the actual
value of the xattr so that that wrapper could be extended to support
different values at different uids and applied to any xattr.
Unfortunately there's no such 'wrapper'.

Stefan
Post by Eric W. Biederman
Eric
Eric W. Biederman
2017-07-13 18:00:01 UTC
Permalink
Post by Stefan Berger
Post by Eric W. Biederman
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Yep. That is what it looks like from here.
Which is why I asked the question about scalability of the xattr
implementations. It looks like trying to accomodate the general
case just gets us in trouble, and sets unrealistic expectations.
Which strongly suggests that Serge's previous version that
just reved the format of security.capable so that a uid field could
be added is likely to be the better approach.
I want to see what Serge and Stefan have to say but the case looks
pretty clear cut at the moment.
The approach of virtualizing the xattrs on the name-side, which is
what this patch does, provides a more general approach than to
virtualizing it on the value side, which is what Serge does in his
other patch for security.capability alone. With the virtualizing
on-the-value side virtualizing the xattr becomes an exercise that
needs to be repeated for every xattr name that one would want to
virtualize. With this patch you would just add another xattr name to a
list, a one-line patch in the end. Xattr with prefixes like trusted.*
need a bit more work but this can be woven in as well
(https://github.com/stefanberger/linux/commit/397b1a3b24045c67405fc83465e544fc865d402f).
Reusable code has merit, as it reduces the maintenance burden.

My big question right now is can you implement Ted's suggested
restriction. Only one security.foo or ***@... attribute ?

The maintenance gains are definitely worth taking if they do not
penalize the common case.
Post by Stefan Berger
For virtualizing the xattrs on the 'value' side I was looking for
whether there's something like a 'wrapper' structure around the actual
value of the xattr so that that wrapper could be extended to support
different values at different uids and applied to any
xattr. Unfortunately there's no such 'wrapper'.
Different values at different uids currently appear to be undesirable.
At least for security.capable it does not appear to be useful.

A wrapper structure is also a reasonable suggestion. Put it's magic
number/version code where the existing version code is away we go.

Do you know of cases where we will truly want to have different
attributes for different containers?

The case that I can think of for IMA is that the signatures want to be
conntected to a key that goes with the filesystem image (so not a system
key) but that would not be something that would need to be changed
between containers.

Eric
Serge E. Hallyn
2017-07-13 19:50:01 UTC
Permalink
Post by Eric W. Biederman
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Yep. That is what it looks like from here.
Which is why I asked the question about scalability of the xattr
implementations. It looks like trying to accomodate the general
case just gets us in trouble, and sets unrealistic expectations.
Which strongly suggests that Serge's previous version that
just reved the format of security.capable so that a uid field could
be added is likely to be the better approach.
I want to see what Serge and Stefan have to say but the case looks
pretty clear cut at the moment.
I'm fine with that. Now, we'll be doing the enforcement at xattr
write time, meaning someone *can* come up with an fs image with >1
such xattrs. Which is *fine*, I believe, it won't break anything
security-wise, and our goal is only to stop users from thinking it
is legitimate two write multiple such xattrs, so that they don't later
bug the fs folks like Ted saying "hey why can't I write 1000 of these,
I think that's a bug."

So at xattr write time,

1. if there is already an xattr, and it is either the global
non-namespaced xattr, or it has kuid=X where X is the kuid
mapped to root in a parent of the container, then we refuse
the write
2. if there is already an xattr, and it is for a kuid=X where
X is mapped into the container, then we overwrite the existing
xattr.

At read/use time, we use the rules we have now.

Does that seem reasonable?

-serge
Eric W. Biederman
2017-07-13 21:30:02 UTC
Permalink
Post by Serge E. Hallyn
Post by Eric W. Biederman
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes. Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
Otherwise, I suspect that the architecture is going to turn around and
bite us in the *ss eventually, because someone will want to do
something crazy and the solution will not be scalable.
Yep. That is what it looks like from here.
Which is why I asked the question about scalability of the xattr
implementations. It looks like trying to accomodate the general
case just gets us in trouble, and sets unrealistic expectations.
Which strongly suggests that Serge's previous version that
just reved the format of security.capable so that a uid field could
be added is likely to be the better approach.
I want to see what Serge and Stefan have to say but the case looks
pretty clear cut at the moment.
I'm fine with that. Now, we'll be doing the enforcement at xattr
write time, meaning someone *can* come up with an fs image with >1
such xattrs. Which is *fine*, I believe, it won't break anything
security-wise, and our goal is only to stop users from thinking it
is legitimate two write multiple such xattrs, so that they don't later
bug the fs folks like Ted saying "hey why can't I write 1000 of these,
I think that's a bug."
So at xattr write time,
1. if there is already an xattr, and it is either the global
non-namespaced xattr, or it has kuid=X where X is the kuid
mapped to root in a parent of the container, then we refuse
the write
2. if there is already an xattr, and it is for a kuid=X where
X is mapped into the container, then we overwrite the existing
xattr.
At read/use time, we use the rules we have now.
Does that seem reasonable?
That sounds like it would keep us to one xattr of any given type so yes.

It occurs to me while I am writing this that this is also important
for ima/evm. There is an xattr that has a hash of all of the other
security relevant xattrs. Without a limit on the number of xattrs
calculating that security xattr could become time prohibitive.


Eric
Eric W. Biederman
2017-07-14 00:50:01 UTC
Permalink
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.

Eric
Stefan Berger
2017-07-14 11:40:02 UTC
Permalink
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr on
the host. Is that what we want? For limiting the number of xattrs and
getting that functionality (override IMA signature for example) the
former seems better...

For the former I now have the topmost patch here:
https://github.com/stefanberger/linux/commits/xattr_for_userns.v3

Stefan
Post by Eric W. Biederman
Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Eric W. Biederman
2017-07-14 12:20:04 UTC
Permalink
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr on
the host. Is that what we want?
Most definitely. If a more privileged use has set secure.capable that
is better.
Post by Stefan Berger
For limiting the number of xattrs and
getting that functionality (override IMA signature for example) the
former seems better...
I don't know about IMA. But my feeling is that we will only be dealing
with a single signing key, so I don't see how having multiple IMA xattrs
make sense. Could you explain that to me?
Post by Stefan Berger
https://github.com/stefanberger/linux/commits/xattr_for_userns.v3
Thank you.

Eric
Stefan Berger
2017-07-14 12:50:03 UTC
Permalink
Post by Eric W. Biederman
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr on
the host. Is that what we want?
Most definitely. If a more privileged use has set secure.capable that
is better.
Post by Stefan Berger
For limiting the number of xattrs and
getting that functionality (override IMA signature for example) the
former seems better...
I don't know about IMA. But my feeling is that we will only be dealing
with a single signing key, so I don't see how having multiple IMA xattrs
make sense. Could you explain that to me?
Admittedly I would need to construct and example where the user inside
the container doesn't want to share the public key with the host on a
file mounted from the host for some reason.

An example related to security.capability could be a Fedora Docker
container where the container is distributed with the ping tool
installed. The ping tool is installed with cap_net_admin,cap_net_raw+ep.
On a normal Fedora container I cannot use this tool due to my
capabilities bounding set not including cap_net_admin. So, I overwrite
this and set only cap_net_raw+ep and I can use for pinging. I may loose
some functionality on the way due to the lost cap_net_admin but I can
now use the tool. I guess the point is one can override the capabilities
set of a distributed container if the container is started with less
capabilities.

Stefan
Post by Eric W. Biederman
Post by Stefan Berger
https://github.com/stefanberger/linux/commits/xattr_for_userns.v3
Thank you.
Eric
Serge E. Hallyn
2017-07-14 13:40:02 UTC
Permalink
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.

-serge
Stefan Berger
2017-07-14 15:30:02 UTC
Permalink
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and thus
have security.ima. Now for some reason I want to re-sign some or all
files inside that container. How would I do that ? Would I need to get
rid of security.ima first, possibly by copying each file, deleting the
original file, and renaming the copied file to the original name, or
should I just be able to write out a new signature, thus creating
***@uid=1000 besides the security.ima ?

Stefan
Serge E. Hallyn
2017-07-14 17:40:03 UTC
Permalink
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and
thus have security.ima. Now for some reason I want to re-sign some
or all files inside that container. How would I do that ? Would I
need to get rid of security.ima first, possibly by copying each
file, deleting the original file, and renaming the copied file to
the original name, or should I just be able to write out a new
security.ima ?
Stefan
Hi Mimi,

what do you think makes most sense for IMA?

-serge
Eric W. Biederman
2017-07-14 18:30:01 UTC
Permalink
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and
thus have security.ima. Now for some reason I want to re-sign some
or all files inside that container. How would I do that ? Would I
need to get rid of security.ima first, possibly by copying each
file, deleting the original file, and renaming the copied file to
the original name, or should I just be able to write out a new
security.ima ?
Stefan
Hi Mimi,
what do you think makes most sense for IMA?
I am going to give my two cents since I have been thinking about this.

First I think this entire scheme plays hobs with the security.evm
attribute as security.evm needs to know the names of the xattrs to
protect.

I forget which attributes has a hash and what has a message
athentication code.

If there is an attribute with a simple file hash I think it only make
sense for the kernel to touch it, and I don't see any sense in having
multiples.

If there is an attribute with a message authentication code (roughly a
signed hash) it makes sense to have that to be tied to the kernel key
ring that controlls the keys. (Which probably means a per user
namespace thing at some point). But again pretty untouchable otherwise.

Which brings us to the semantic question of would it be nice to have
stacked IMA/EVM on the same file.

I really don't think we do. I think allowing multiple keys for
different part of trusting files is easy enough that we should have no
need to fight over which keys do which.

Looking at integrity.h I see signature_v2_hdr that has a keyid. Any use
case I can think of for distributing a distribution image with ima/evm
xattrs will need to use asymmetric keys aka public/private keypairs so
that the originator of the content does not give away their private
keys.

Given that usefully we are talking about content that should be
connected to keys in one way or another I don't believe it even makes
sense at this point to attempt to use uids for dealing with ima and
evm content.

Further looking Serge's previous patch is 300 lines of code Setfan's
patch that provides the possibility of code resuse is 500 lines of code.

Increasingly it is looking to me that code reuse rather than concept
reuse is a false economy. The code does not get smaller. The semantic
differences make it problematic. Possibly to the problematic to the
point where significant pieces may not be reused. The format breaks
assumptions for other parts of the code like security.evm. The format
by multiple names instead of a single attribute requires more disk
access so is less efficient.

In short I am seeing more code that runs slower and is harder to
maintain. Please point out where I am wrong.

Eric
Mimi Zohar
2017-07-14 18:50:02 UTC
Permalink
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and
thus have security.ima. Now for some reason I want to re-sign some
or all files inside that container. How would I do that ? Would I
need to get rid of security.ima first, possibly by copying each
file, deleting the original file, and renaming the copied file to
the original name, or should I just be able to write out a new
security.ima ?
Stefan
Hi Mimi,
what do you think makes most sense for IMA?
If I'm understanding the discussion correctly, this isn't an issue for
layered copy on write filesystems, as each fs layer could have it's
own set of xattrs.  The underlying and layered xattrs should be able
to co-exist.  Use the layered xattr if it exists, but fall back to
using the underlying xattr if it doesn't.

The concern is with a shared filesystems.  In that case, for IMA it
would make sense to support a native and a namespace xattr.  If due to
xattr space limitations we have to limit the number of xattrs, then we
should limit it to two - a native and a namespace version, with a
"uid=" tag - first namespace gets permission to write the namespace
xattr.  Again, like in the layered case, if the namespace xattr
doesn't exist, fall back to using the native xattr.

This allows most files to use the underlying xattrs, but allows a few
files to be re-signed inside the namespace, as needed.  For the
layered filesystem case, this would allow mutable file hashes to be
written.  (Unclear as to how shared filesystems would work in this
case.)

Mimi
James Bottomley
2017-07-14 19:00:02 UTC
Permalink
Post by Mimi Zohar
The concern is with a shared filesystems.  In that case, for IMA it
would make sense to support a native and a namespace xattr.  If due
to xattr space limitations we have to limit the number of xattrs,
then we should limit it to two - a native and a namespace version,
with a "uid=" tag - first namespace gets permission to write the
namespace xattr.  Again, like in the layered case, if the namespace
xattr doesn't exist, fall back to using the native xattr.
Just on this point: if we're really concerned about the need on shared
filesystems to have multiple IMA signatures per file, might it not make
sense simply to support multiple signatures within the security.ima
xattr? The rules for writing signature updates within user namespaces
would be somewhat complex (say only able to replace a signature for
which you demonstrate you possess the key) but it would lead to an
implementation which would work for traditional shared filesystems
(like NFS) as well as containerised bind mounts.

James
Mimi Zohar
2017-07-14 20:10:02 UTC
Permalink
Post by James Bottomley
Post by Mimi Zohar
The concern is with a shared filesystems.  In that case, for IMA it
would make sense to support a native and a namespace xattr.  If due
to xattr space limitations we have to limit the number of xattrs,
then we should limit it to two - a native and a namespace version,
with a "uid=" tag - first namespace gets permission to write the
namespace xattr.  Again, like in the layered case, if the namespace
xattr doesn't exist, fall back to using the native xattr.
Just on this point: if we're really concerned about the need on shared
filesystems to have multiple IMA signatures per file, might it not make
sense simply to support multiple signatures within the security.ima
xattr? The rules for writing signature updates within user namespaces
would be somewhat complex (say only able to replace a signature for
which you demonstrate you possess the key) but it would lead to an
implementation which would work for traditional shared filesystems
(like NFS) as well as containerised bind mounts.
Writing security.ima requires being root with CAP_SYS_ADMIN
privileges.  I wouldn't want to give root within the namespace
permission to over write or just extend the native security.ima.

Mimi
James Bottomley
2017-07-14 20:50:02 UTC
Permalink
Post by Mimi Zohar
Post by James Bottomley
Post by Mimi Zohar
The concern is with a shared filesystems.  In that case, for IMA
it would make sense to support a native and a namespace xattr.
 If due to xattr space limitations we have to limit the number of
xattrs, then we should limit it to two - a native and a namespace
version, with a "uid=" tag - first namespace gets permission to
write the namespace xattr.  Again, like in the layered case, if
the namespace xattr doesn't exist, fall back to using the native
xattr.
Just on this point: if we're really concerned about the need on
shared filesystems to have multiple IMA signatures per file, might
it not make sense simply to support multiple signatures within the
security.ima xattr? The rules for writing signature updates within
user namespaces would be somewhat complex (say only able to replace
a signature for which you demonstrate you possess the key) but it
would lead to an implementation which would work for traditional
shared filesystems (like NFS) as well as containerised bind mounts.
Writing security.ima requires being root with CAP_SYS_ADMIN
privileges.  I wouldn't want to give root within the namespace
permission to over write or just extend the native security.ima.
but why?  That's partly the point of all of this: some security.
attributes can't be written by container root without some supervision
(the capability ones are the hugely problematic ones from this point of
view), but for some there's no reason they shouldn't be.  What would be
the reason that root in a container shouldn't be able to write the ima
xattr the same as host root could on its filesystem?

James
Theodore Ts'o
2017-07-14 21:40:02 UTC
Permalink
Post by James Bottomley
but why?  That's partly the point of all of this: some security.
attributes can't be written by container root without some supervision
(the capability ones are the hugely problematic ones from this point of
view), but for some there's no reason they shouldn't be.  What would be
the reason that root in a container shouldn't be able to write the ima
xattr the same as host root could on its filesystem?
So I'm happy to say, "Ix-nay on nested containerization; that way lies
insanity". But my understanding is that there will be people who want
to run containers in containers in containers in containers... and
this is what scares me.

What if someone in the Nth layer of containerization wants to allow
the container root in the (N+1)th layer to set file capabilities that
will not be honored in the Nth layer of containerization?

Again I think that this is insane, and I'm happy for the answer to be,
"No, that's not supported". That the "Host container" can have
capabilities that it won't honor, but will be honored by all
subcontainers, but that same thing can't be done between a
subsubsub-container and its child subsubsubsub-container.

Are we OK with that? Because how we would encode this in the xattr
seems to be to be hopelessly not scalable.

- Ted
Eric W. Biederman
2017-07-14 23:40:01 UTC
Permalink
Post by Theodore Ts'o
Post by James Bottomley
but why?  That's partly the point of all of this: some security.
attributes can't be written by container root without some supervision
(the capability ones are the hugely problematic ones from this point of
view), but for some there's no reason they shouldn't be.  What would be
the reason that root in a container shouldn't be able to write the ima
xattr the same as host root could on its filesystem?
So I'm happy to say, "Ix-nay on nested containerization; that way lies
insanity". But my understanding is that there will be people who want
to run containers in containers in containers in containers... and
this is what scares me.
I am happy to say we need to bound the space we take in an inode.
So a design that needs more space the more containers you have is
suspicious.

I am not fond of decisions that don't allow nesting of containers. That
just paints us into a corner. I am in favor of things that require
little or bounded space.

Generally I will frown at a decision that won't allow nesting, because
nesting of containers happens naturally
Post by Theodore Ts'o
What if someone in the Nth layer of containerization wants to allow
the container root in the (N+1)th layer to set file capabilities that
will not be honored in the Nth layer of containerization?
That works perfectly well with the design we have today. And it only
needs a single security.capability attribute. The actual design is
associated with the security.capability attribute (either in the
attribute or in the most recent iteration in the attribute name *scowl*)
is to have the uid (from the filesystems point of view) of the root user
of a user namespace. Running that executable will give you those
capabilities if the uid matches the root user in your user namespace (or
a parent user namespace).

As anyone can who can modify a file can remove a security.capable
attribute just like anyone who can modify a file can remove the setuid
bit this works fine is is sufficient. Though perhaps a little different.
Post by Theodore Ts'o
Again I think that this is insane, and I'm happy for the answer to be,
"No, that's not supported". That the "Host container" can have
capabilities that it won't honor, but will be honored by all
subcontainers, but that same thing can't be done between a
subsubsub-container and its child subsubsubsub-container.
Are we OK with that? Because how we would encode this in the xattr
seems to be to be hopelessly not scalable.
That really isn't an issue right now.

The real question right now is what to do with security.ima and
security.evm. As it was proposed that we share a common code base for
them. Right now it looks to me like the semantics are sufficiently
different that it doesn't make sense to share code between the two
implementations. At which point all reason for storing any of this in
the xattr name goes away. So we just have a single xattr.

Right now I am very much in favor of security xattrs continuing to have
well known names. That easily limits how much space in the inode you
can have, and it makes thinking about things easier. It doesn't
preclude having acls in your xattr. That is exactly how posix
acls are implemented. But I am not going to build generic support for
them, and I really don't expect they will be needed.

Eric
Mimi Zohar
2017-07-14 23:40:01 UTC
Permalink
Post by James Bottomley
Post by Mimi Zohar
Post by James Bottomley
Post by Mimi Zohar
The concern is with a shared filesystems.  In that case, for IMA
it would make sense to support a native and a namespace xattr.
 If due to xattr space limitations we have to limit the number of
xattrs, then we should limit it to two - a native and a namespace
version, with a "uid=" tag - first namespace gets permission to
write the namespace xattr.  Again, like in the layered case, if
the namespace xattr doesn't exist, fall back to using the native
xattr.
Just on this point: if we're really concerned about the need on
shared filesystems to have multiple IMA signatures per file, might
it not make sense simply to support multiple signatures within the
security.ima xattr? The rules for writing signature updates within
user namespaces would be somewhat complex (say only able to replace
a signature for which you demonstrate you possess the key) but it
would lead to an implementation which would work for traditional
shared filesystems (like NFS) as well as containerised bind mounts.
Writing security.ima requires being root with CAP_SYS_ADMIN
privileges.  I wouldn't want to give root within the namespace
permission to over write or just extend the native security.ima.
but why?  That's partly the point of all of this: some security.
attributes can't be written by container root without some supervision
(the capability ones are the hugely problematic ones from this point of
view), but for some there's no reason they shouldn't be.  What would be
the reason that root in a container shouldn't be able to write the ima
xattr the same as host root could on its filesystem?
Let's describe the different scenarios of a shared filesystem (not
layered).
1. The kernel updating the file hash as the file changes, written as
the security.ima xattr.
2. Root vs root in the namespace explicitly writing the security.ima
xattr. 
 
In the first case, neither root nor root in the namespace calculates
and writes the file hash as security.ima.  The kernel itself updates
the file hash.  Since the file hash is the same value, it doesn't need
to be written out as two separate security.ima xattrs.

The second case is where root or root in the namespace explicitly
writes out a file signature as security.ima.  Here different entities
might want to sign the file with different keys.  Up to now, only root
with CAP_SYS_ADMIN can write out security.ima.  That shouldn't change.
Nor should root in the namespace be allowed to change the native's
view, but should be only allowed to change its own view of the xattr.
Allowing root in the namespace to change the native's view isn't safe.

Remember writing security.ima also triggers security.evm to be
updated.  Root in the namespace should never be permitted to cause the
native's security.evm to be updated, only the namespaces's version.

Mimi
Eric W. Biederman
2017-07-15 00:10:01 UTC
Permalink
Post by James Bottomley
Post by Mimi Zohar
Post by James Bottomley
Post by Mimi Zohar
The concern is with a shared filesystems.  In that case, for IMA
it would make sense to support a native and a namespace xattr.
 If due to xattr space limitations we have to limit the number of
xattrs, then we should limit it to two - a native and a namespace
version, with a "uid=" tag - first namespace gets permission to
write the namespace xattr.  Again, like in the layered case, if
the namespace xattr doesn't exist, fall back to using the native
xattr.
Just on this point: if we're really concerned about the need on
shared filesystems to have multiple IMA signatures per file, might
it not make sense simply to support multiple signatures within the
security.ima xattr? The rules for writing signature updates within
user namespaces would be somewhat complex (say only able to replace
a signature for which you demonstrate you possess the key) but it
would lead to an implementation which would work for traditional
shared filesystems (like NFS) as well as containerised bind mounts.
Writing security.ima requires being root with CAP_SYS_ADMIN
privileges.  I wouldn't want to give root within the namespace
permission to over write or just extend the native security.ima.
but why?  That's partly the point of all of this: some security.
attributes can't be written by container root without some supervision
(the capability ones are the hugely problematic ones from this point of
view), but for some there's no reason they shouldn't be.  What would be
the reason that root in a container shouldn't be able to write the ima
xattr the same as host root could on its filesystem?
Mimi said she ``native''. It competely makes sense for the things that
the container doesn't ``own'' to not be allowed to be written/updated by
the container.

James you are making the case here for root in the container to write
to the ima and evm attributes that are for the container.

So I don't actually see any disagreement here except perhaps for
terminology.

Eric
Theodore Ts'o
2017-07-14 19:30:03 UTC
Permalink
Post by Mimi Zohar
If I'm understanding the discussion correctly, this isn't an issue for
layered copy on write filesystems, as each fs layer could have it's
own set of xattrs.  The underlying and layered xattrs should be able
to co-exist.  Use the layered xattr if it exists, but fall back to
using the underlying xattr if it doesn't.
Note that this assumes that it is possible to "copy up" the xattrs
without necessarily "copying up" all of the data blocks. This might
be true for some layers, but I don't believe it is currently true for
overlayfs, for example.

- Ted
Mimi Zohar
2017-07-14 19:50:02 UTC
Permalink
Post by Theodore Ts'o
Post by Mimi Zohar
If I'm understanding the discussion correctly, this isn't an issue for
layered copy on write filesystems, as each fs layer could have it's
own set of xattrs.  The underlying and layered xattrs should be able
to co-exist.  Use the layered xattr if it exists, but fall back to
using the underlying xattr if it doesn't.
Note that this assumes that it is possible to "copy up" the xattrs
without necessarily "copying up" all of the data blocks. This might
be true for some layers, but I don't believe it is currently true for
overlayfs, for example.
Ok, so for the use case scneario where the container owner is willing
to use the public key distributed with the files, then only those
files that are new or modified in the overlay would need to be signed
with a key local to the overlay.  In the worst case scenario, where
the container owner is only willing to trust their own public key, I
guess we can live with having to copy up the files.

Mimi
Mimi Zohar
2017-07-14 19:30:02 UTC
Permalink
Post by Eric W. Biederman
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and
thus have security.ima. Now for some reason I want to re-sign some
or all files inside that container. How would I do that ? Would I
need to get rid of security.ima first, possibly by copying each
file, deleting the original file, and renaming the copied file to
the original name, or should I just be able to write out a new
security.ima ?
Stefan
Hi Mimi,
what do you think makes most sense for IMA?
I am going to give my two cents since I have been thinking about this.
First I think this entire scheme plays hobs with the security.evm
attribute as security.evm needs to know the names of the xattrs to
protect.
I forget which attributes has a hash and what has a message
athentication code.
security.ima contains either a file hash or a signature. (file data)
security.evm contains either a signature or an hmac of the security
xattrs and other file metadata. (file meta-data)

The same rules would apply to security.evm, as described in my
response.  Based on it's view of the security xattrs, either the
native or namespace security.evm would be updated.
Post by Eric W. Biederman
If there is an attribute with a simple file hash I think it only make
sense for the kernel to touch it, and I don't see any sense in having
multiples.
Only files that are in the IMA-appraisal policy is the file hash
calculated and written out as security.ima.  Depending this policy,
does the security.ima exist.  So if the file is in policy for both the
native and namespace policies, agreed the same hash doesn't need to be
written as two different xattrs.
Post by Eric W. Biederman
If there is an attribute with a message authentication code (roughly a
signed hash) it makes sense to have that to be tied to the kernel key
ring that controlls the keys. (Which probably means a per user
namespace thing at some point). But again pretty untouchable otherwise.
Right, the namespace would require it's own EVM key.
Post by Eric W. Biederman
Which brings us to the semantic question of would it be nice to have
stacked IMA/EVM on the same file.
I really don't think we do. I think allowing multiple keys for
different part of trusting files is easy enough that we should have no
need to fight over which keys do which.
We definitely want to support different policies on the native and in
the namespace with different keys and keyrings.

Refer to Mehmet Kaaylap's recent post, which refers to a PoC version
of IMA namespacing - kernsec.org/pipermail/linux-security-module-
archive/2017-July/002286.html.
Post by Eric W. Biederman
Looking at integrity.h I see signature_v2_hdr that has a keyid. Any use
case I can think of for distributing a distribution image with ima/evm
xattrs will need to use asymmetric keys aka public/private keypairs so
that the originator of the content does not give away their private
keys.
Agreed.
Post by Eric W. Biederman
Given that usefully we are talking about content that should be
connected to keys in one way or another I don't believe it even makes
sense at this point to attempt to use uids for dealing with ima and
evm content.
We need to resolve the xattr issue in order to namespace IMA-
appraisal. 

Mimi
Post by Eric W. Biederman
Further looking Serge's previous patch is 300 lines of code Setfan's
patch that provides the possibility of code resuse is 500 lines of code.
Increasingly it is looking to me that code reuse rather than concept
reuse is a false economy. The code does not get smaller. The semantic
differences make it problematic. Possibly to the problematic to the
point where significant pieces may not be reused. The format breaks
assumptions for other parts of the code like security.evm. The format
by multiple names instead of a single attribute requires more disk
access so is less efficient.
In short I am seeing more code that runs slower and is harder to
maintain. Please point out where I am wrong.
Eric W. Biederman
2017-07-15 00:20:02 UTC
Permalink
Post by Mimi Zohar
Post by Eric W. Biederman
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and
thus have security.ima. Now for some reason I want to re-sign some
or all files inside that container. How would I do that ? Would I
need to get rid of security.ima first, possibly by copying each
file, deleting the original file, and renaming the copied file to
the original name, or should I just be able to write out a new
security.ima ?
Stefan
Hi Mimi,
what do you think makes most sense for IMA?
I am going to give my two cents since I have been thinking about this.
First I think this entire scheme plays hobs with the security.evm
attribute as security.evm needs to know the names of the xattrs to
protect.
I forget which attributes has a hash and what has a message
athentication code.
security.ima contains either a file hash or a signature. (file data)
security.evm contains either a signature or an hmac of the security
xattrs and other file metadata. (file meta-data)
The same rules would apply to security.evm, as described in my
response.  Based on it's view of the security xattrs, either the
native or namespace security.evm would be updated.
Post by Eric W. Biederman
If there is an attribute with a simple file hash I think it only make
sense for the kernel to touch it, and I don't see any sense in having
multiples.
Only files that are in the IMA-appraisal policy is the file hash
calculated and written out as security.ima.  Depending this policy,
does the security.ima exist.  So if the file is in policy for both the
native and namespace policies, agreed the same hash doesn't need to be
written as two different xattrs.
Post by Eric W. Biederman
If there is an attribute with a message authentication code (roughly a
signed hash) it makes sense to have that to be tied to the kernel key
ring that controlls the keys. (Which probably means a per user
namespace thing at some point). But again pretty untouchable otherwise.
Right, the namespace would require it's own EVM key.
Post by Eric W. Biederman
Which brings us to the semantic question of would it be nice to have
stacked IMA/EVM on the same file.
I really don't think we do. I think allowing multiple keys for
different part of trusting files is easy enough that we should have no
need to fight over which keys do which.
We definitely want to support different policies on the native and in
the namespace with different keys and keyrings.
Refer to Mehmet Kaaylap's recent post, which refers to a PoC version
of IMA namespacing - kernsec.org/pipermail/linux-security-module-
archive/2017-July/002286.html.
Post by Eric W. Biederman
Looking at integrity.h I see signature_v2_hdr that has a keyid. Any use
case I can think of for distributing a distribution image with ima/evm
xattrs will need to use asymmetric keys aka public/private keypairs so
that the originator of the content does not give away their private
keys.
Agreed.
Post by Eric W. Biederman
Given that usefully we are talking about content that should be
connected to keys in one way or another I don't believe it even makes
sense at this point to attempt to use uids for dealing with ima and
evm content.
We need to resolve the xattr issue in order to namespace IMA-
appraisal. 
Mimi I have two questions:

a) Is the keyid enough to distinguish the security.ima and security.evm
xattrs of one container from another container and from native? Or
do we have some important security xattrs that are associated with
keys that don't have a keyid?

b) Can we reasonably live with a limitation that the native and the
namespace'd policies don't intersect? Or in the case of an
interesection the native policy is the only one that is executed?

I submit that if the answer is keyids are always present, and we can
live with the native policy taking precedence over the container policy
then we have a solution to the IMA xattrs.

Eric
Mimi Zohar
2017-07-16 11:30:02 UTC
Permalink
Post by Eric W. Biederman
Post by Mimi Zohar
Post by Eric W. Biederman
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and
thus have security.ima. Now for some reason I want to re-sign some
or all files inside that container. How would I do that ? Would I
need to get rid of security.ima first, possibly by copying each
file, deleting the original file, and renaming the copied file to
the original name, or should I just be able to write out a new
security.ima ?
Stefan
Hi Mimi,
what do you think makes most sense for IMA?
I am going to give my two cents since I have been thinking about this.
First I think this entire scheme plays hobs with the security.evm
attribute as security.evm needs to know the names of the xattrs to
protect.
I forget which attributes has a hash and what has a message
athentication code.
security.ima contains either a file hash or a signature. (file data)
security.evm contains either a signature or an hmac of the security
xattrs and other file metadata. (file meta-data)
The same rules would apply to security.evm, as described in my
response.  Based on it's view of the security xattrs, either the
native or namespace security.evm would be updated.
Post by Eric W. Biederman
If there is an attribute with a simple file hash I think it only make
sense for the kernel to touch it, and I don't see any sense in having
multiples.
Only files that are in the IMA-appraisal policy is the file hash
calculated and written out as security.ima.  Depending this policy,
does the security.ima exist.  So if the file is in policy for both the
native and namespace policies, agreed the same hash doesn't need to be
written as two different xattrs.
Post by Eric W. Biederman
If there is an attribute with a message authentication code (roughly a
signed hash) it makes sense to have that to be tied to the kernel key
ring that controlls the keys. (Which probably means a per user
namespace thing at some point). But again pretty untouchable otherwise.
Right, the namespace would require it's own EVM key.
Post by Eric W. Biederman
Which brings us to the semantic question of would it be nice to have
stacked IMA/EVM on the same file.
I really don't think we do. I think allowing multiple keys for
different part of trusting files is easy enough that we should have no
need to fight over which keys do which.
We definitely want to support different policies on the native and in
the namespace with different keys and keyrings.
Refer to Mehmet Kaaylap's recent post, which refers to a PoC version
of IMA namespacing - kernsec.org/pipermail/linux-security-module-
archive/2017-July/002286.html.
Post by Eric W. Biederman
Looking at integrity.h I see signature_v2_hdr that has a keyid. Any use
case I can think of for distributing a distribution image with ima/evm
xattrs will need to use asymmetric keys aka public/private keypairs so
that the originator of the content does not give away their private
keys.
Agreed.
Post by Eric W. Biederman
Given that usefully we are talking about content that should be
connected to keys in one way or another I don't believe it even makes
sense at this point to attempt to use uids for dealing with ima and
evm content.
We need to resolve the xattr issue in order to namespace IMA-
appraisal. 
a) Is the keyid enough to distinguish the security.ima and security.evm
xattrs of one container from another container and from native? Or
do we have some important security xattrs that are associated with
keys that don't have a keyid?
b) Can we reasonably live with a limitation that the native and the
namespace'd policies don't intersect? Or in the case of an
interesection the native policy is the only one that is executed?
I submit that if the answer is keyids are always present, and we can
live with the native policy taking precedence over the container policy
then we have a solution to the IMA xattrs.
IMA-measurement is hierachical, meaning that the measurement policy
determines whether the measurement exists in the native, the
container, or both measurement lists.

One of the main namespacing use cases for IMA-appraisal is the ability
to limit running an executable to a particular container.  So unlike
IMA-measurement, which is hierarchical, the IMA-appraisal namespace
policy takes precedence over the native policy.

Mimi
Serge E. Hallyn
2017-07-26 03:10:02 UTC
Permalink
Post by Mimi Zohar
Post by Eric W. Biederman
Which brings us to the semantic question of would it be nice to have
stacked IMA/EVM on the same file.
I really don't think we do. I think allowing multiple keys for
different part of trusting files is easy enough that we should have no
need to fight over which keys do which.
We definitely want to support different policies on the native and in
the namespace with different keys and keyrings.
Ok, so Stefan's code to support userspace in a container reading
security.ima and getting back the value for ***@uid=1000
(if 1000 is the kuid of the container's root user) is in fact
useful to IMA?
Mimi Zohar
2017-07-26 14:00:01 UTC
Permalink
Post by Serge E. Hallyn
Post by Mimi Zohar
Post by Eric W. Biederman
Which brings us to the semantic question of would it be nice to have
stacked IMA/EVM on the same file.
I really don't think we do. I think allowing multiple keys for
different part of trusting files is easy enough that we should have no
need to fight over which keys do which.
We definitely want to support different policies on the native and in
the namespace with different keys and keyrings.
Ok, so Stefan's code to support userspace in a container reading
(if 1000 is the kuid of the container's root user) is in fact
useful to IMA?
Definitely!  Root within the namespace needs to be able to read and
write security.ima in order to (re)sign files, with a specific key
known to that container.  Stefan's code provides different views of
the security xattrs.

Mimi

Eric W. Biederman
2017-07-14 19:00:02 UTC
Permalink
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and thus have
security.ima. Now for some reason I want to re-sign some or all files inside
that container. How would I do that ? Would I need to get rid of security.ima
first, possibly by copying each file, deleting the original file, and renaming
the copied file to the original name, or should I just be able to write out a
This gets us into some interesting territory, where the semantics of
these attributes matters.

The implementation of security.capable implements the security killpriv
hooks. Anyone merely by changing the file can cause the security
capability to go away. So it makes sense from the security.capable side
that anyone who has the capable_wrt_inode_uidgid(CAP_SETFCAP) will be
able to clear and set security.capable.

The integrity xattrs do not. Which results in very big semantic
difference between these two kinds of attributes. I am insufficiently
familiar with the rules for security.ima and security.evm to understand
what those rules should be.

That may be enough that we can not share code between these two cases.

Eric
Stefan Berger
2017-07-14 19:30:02 UTC
Permalink
Post by Eric W. Biederman
Post by Serge E. Hallyn
Post by Stefan Berger
Post by Eric W. Biederman
Post by Eric W. Biederman
My big question right now is can you implement Ted's suggested
We need to raw-list the xattrs and do the check before writing them. I am fairly sure this can be done.
The latter.
That case would prevent a container user from overriding the xattr
on the host. Is that what we want? For limiting the number of xattrs
Not really. If the file is owned by a uid mapped into the container,
then the container root can chown the file which will clear the file
capability, after which he can set a new one. If the file is not
owned by a uid mapped into the container, then container root could
not set a filecap anyway.
Let's say I installed a container where all files are signed and thus have
security.ima. Now for some reason I want to re-sign some or all files inside
that container. How would I do that ? Would I need to get rid of security.ima
first, possibly by copying each file, deleting the original file, and renaming
the copied file to the original name, or should I just be able to write out a
This gets us into some interesting territory, where the semantics of
these attributes matters.
The implementation of security.capable implements the security killpriv
hooks. Anyone merely by changing the file can cause the security
capability to go away. So it makes sense from the security.capable side
that anyone who has the capable_wrt_inode_uidgid(CAP_SETFCAP) will be
able to clear and set security.capable.
The integrity xattrs do not. Which results in very big semantic
difference between these two kinds of attributes. I am insufficiently
familiar with the rules for security.ima and security.evm to understand
what those rules should be.
That may be enough that we can not share code between these two cases.
On the host I can simply overwrite capabilities. I think the same model
should apply to the virtualized world. The difference still is that
removing an xattr, if written on the host, may only be possible by copy
+ file move to original filename.

On IMA, when appending a letter to an executable, the executable doesn't
run anymore when appraisal is used, but the signature is still there and
needs to be re-written. Though I think this aspect on how they disappear
doesn't matter as much if they can simply be overwritten.

Some things could certainly be solved with flags indicating behaviors of
xattrs for as long as these flags only affect the reading, listing, and
re-writing of the virtualized xattrs, which is what the patch does. For
example a flag for security.capability could say that only a single
'security.capability(@uid=<uid>)?' may exist while security.ima could
have two.

Stefan
Post by Eric W. Biederman
Eric
Serge E. Hallyn
2017-07-13 21:30:02 UTC
Permalink
Post by Stefan Berger
For virtualizing the xattrs on the 'value' side I was looking for
whether there's something like a 'wrapper' structure around the
actual value of the xattr so that that wrapper could be extended to
support different values at different uids and applied to any xattr.
Unfortunately there's no such 'wrapper'.
I believe my very first implementation did essentially this - it used
the not uncommon structure of (mostly making this up):

struct ns_vfs_cap {
int magic;
int ncaps;
struct ns_vfs_cap_data data[0];
};

with (ncaps * sizeof(ns_vfs_cap_data)) following that.

I didn't like it.

-serge
Serge E. Hallyn
2017-07-13 21:20:01 UTC
Permalink
Post by Theodore Ts'o
Post by Eric W. Biederman
Today we have the xattr security.capable that holds a set of
capabilities that an application gains when executed. AKA setuid root exec
without actually being setuid root.
User namespaces have the concept of capabilities that are not global but
are limited to their user namespace. We do not currently have
filesystem support for this concept.
So correct me if I am wrong; in general, there will only be one
Except.... if you have an Distribution root directory which is shared
by many containers, you would need to put the xattrs in the overlay
inodes.
Is that a problem? Essentially people who would try to do the
above also want to use 'shiftfs' stackable filesystem, which would
presumably eventually do this for you.
Post by Theodore Ts'o
Worse, each time you launch a new container, with a new
subuid allocation, you will have to iterate over all files with
capabilities and do a copy-up operations on the xattrs in overlayfs.
So that's actually a bit of a disaster.
Only if you create the container rootfs as a copy.

Note that generally they would want to walk the fs in that case anyway, to chown
the files into the container. And said chown would clear out any existing file
capabilities (and suid/sgid bits).

On the other hand, unprivileged lxc containers are created by
untarring the distro image straight into the mapped user namespace.
So no chowning is needed, and - once we we have this properly supported -
the filecaps should be automatically written correctly for the container.
Post by Theodore Ts'o
So for distribution overlays, you will need to do things a different
way, which is to map the distro subdirectory so you know that the
capability with the global uid 0 should be used for the container
"root" uid, right?
subcontainer root wants to create the privileged executable. You
still have to do things the other way.
security.foo
exists, *or*
exists, but never both? And there BAR is exclusive to only one
instances?
I think that's fine.

-serge
Serge E. Hallyn
2017-07-13 00:50:01 UTC
Permalink
Post by Eric W. Biederman
Post by Serge E. Hallyn
Post by Eric W. Biederman
It doesn't look like this is coming through Serge so I don't see how
the Signed-off-by tag is legtimate.
This is mostly explained by the fact that there have been a *lot* of
changes, many of them discussed in private emails.
Post by Eric W. Biederman
Post by Eric W. Biederman
From the replies to this it doesn't look like Serge has reviewed this
version either.
I am disappointed that all of my concerns about technical feasibility
remain unaddressed.
Can you re-state those, or give a link to them?
Well I only posted about one substantive comment on the last round
so it should be easy to find that said.
Ok so you are likely referring to http://lkml.org/lkml/2017/6/23/551 ,
thanks. I had actually read that differently when you sent it, and
thought it was more to do with the suggestion of putting the nsid
tags in the middle of the xattr name versus putting it on the end.
As far as that is concerned, note that no other tags besides uid=
are currently supported, and only security.capability is being
namespaced.
Post by Eric W. Biederman
The big question is how does this intereact with filesystems
xattr implementations?
There is the potential that we create many more security xattrs this
way. How does that scale? With more names etc.
What happens if we have one xattr per uid for 1000+ uids?
Well, that's not the intent here. The goal is *not* to make one fs image
that satisfies 200k possible uid mappings. The goal is to reconcile the
support for an unprivileged user to set uid agnostic (within the
container) file capabilities with the uid namespace's design goals -
namely root in a container is privileged over the container but
completely unprivileged wrt the host.

This is in part why in my previous version I only allowed a single
namespaced fscap. But I don't think that we have to enforce a
single fscap - I think it's fair to tell users "go ahead and shoot
yourself in the foot" performance-wise, if they insist on doing
this.

The goal of now putting the root kuid in the name is not to support
multiple containers, but to have common code supporting
security.capability and security.ima, and maybe a few more.
Post by Eric W. Biederman
How does this interact with filesystems optimization of xattr names?
For some filesystems they optmize the xattr names, and don't store the
entire thing.
This I have no idea on. Stefan, have you looked at this?
Post by Eric W. Biederman
Post by Serge E. Hallyn
I'd really like to get to a point where unprivileged containers can start
using filecaps - at this point if that means having an extra temporary
file format based on my earlier patchset while we hash this out, that
actually seems worthwhile. But it would of course be ideal if we could
do the name based caps right in the first place.
This whole new version has set my review back to square one
unfortunately.
Well it is a whole new approach and whole new patch, so of course that's
to be expected :(

-serge
Vivek Goyal
2017-07-12 18:00:02 UTC
Permalink
On Tue, Jul 11, 2017 at 11:05:11AM -0400, Stefan Berger wrote:

[..]
Post by Stefan Berger
@@ -301,14 +721,39 @@ ssize_t
__vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
void *value, size_t size)
{
- const struct xattr_handler *handler;
+ const struct xattr_handler *handler = NULL;
+ char *newname = NULL;
+ int ret, userns_supt_xattr;
+ struct user_namespace *userns = current_user_ns();
+
+ userns_supt_xattr = (xattr_is_userns_supported(name, false) >= 0);
+
Hi Stephan,
Post by Stefan Berger
+ do {
+ kfree(newname);
+
+ newname = xattr_userns_name(name, userns);
^^^
Will name be pointing to a freed string in second iteration of loop.
Post by Stefan Berger
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+
+ if (!handler) {
+ name = newname;
Here we assign name and at the beginning of second iteration we free
newname.

Also I am not sure why do we do this assignment only if handler is NULL.

BTW, I set cap_sys_admin on a file outside usernamespace and then launched
user namespace (mapping 1000 to 0). And then tried to do getcap on file
and I am not seeing security.capability set by host. Not sure what am I
doing wrong. getxattr() seems to return -ENODATA. Still debugging it.

Also, have we resovled the question of stacked filesystem like overlayfs.
There we are switching creds to mounter's creds when doing operations on
underlying filesystem. I am concenrned does that mean, we will get and
return security.capability to caller in usernamespace instead of
***@uid=1000.

Vivek
Post by Stefan Berger
+ handler = xattr_resolve_name(inode, &name);
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->get) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ }
+ ret = handler->get(handler, dentry, inode, name, value, size);
+ userns = userns->parent;
+ } while ((ret == -ENODATA) && userns && userns_supt_xattr);
- handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->get)
- return -EOPNOTSUPP;
- return handler->get(handler, dentry, inode, name, value, size);
+ kfree(newname);
+ return ret;
}
EXPORT_SYMBOL(__vfs_getxattr);
Thanks
Vivek
Stefan Berger
2017-07-12 19:20:02 UTC
Permalink
Post by Vivek Goyal
[..]
Post by Stefan Berger
@@ -301,14 +721,39 @@ ssize_t
__vfs_getxattr(struct dentry *dentry, struct inode *inode, const char *name,
void *value, size_t size)
{
- const struct xattr_handler *handler;
+ const struct xattr_handler *handler = NULL;
+ char *newname = NULL;
+ int ret, userns_supt_xattr;
+ struct user_namespace *userns = current_user_ns();
+
+ userns_supt_xattr = (xattr_is_userns_supported(name, false) >= 0);
+
Hi Stephan,
Post by Stefan Berger
+ do {
+ kfree(newname);
+
+ newname = xattr_userns_name(name, userns);
^^^
Will name be pointing to a freed string in second iteration of loop.
Fixing for v3.
Post by Vivek Goyal
Post by Stefan Berger
+ if (IS_ERR(newname))
+ return PTR_ERR(newname);
+
+ if (!handler) {
+ name = newname;
Here we assign name and at the beginning of second iteration we free
newname.
Also I am not sure why do we do this assignment only if handler is NULL.
The handler shouldn't change but this optimization isn't helpful. Fixed
through this patch:

https://github.com/stefanberger/linux/commit/10828401b29a13f8c56f8fad0c0fb2690e4af878
Post by Vivek Goyal
BTW, I set cap_sys_admin on a file outside usernamespace and then launched
user namespace (mapping 1000 to 0). And then tried to do getcap on file
and I am not seeing security.capability set by host. Not sure what am I
doing wrong. getxattr() seems to return -ENODATA. Still debugging it.
This was a regression due to the bug in the loop. I didn't have a test
case (with runc) for it, now I do.
Post by Vivek Goyal
Also, have we resovled the question of stacked filesystem like overlayfs.
There we are switching creds to mounter's creds when doing operations on
underlying filesystem. I am concenrned does that mean, we will get and
return security.capability to caller in usernamespace instead of
I would have to test this, otherwise I don't know. I'll try it out with
Docker.

Stefan
Post by Vivek Goyal
Vivek
Post by Stefan Berger
+ handler = xattr_resolve_name(inode, &name);
+ if (IS_ERR(handler)) {
+ ret = PTR_ERR(handler);
+ goto out;
+ }
+ if (!handler->get) {
+ ret = -EOPNOTSUPP;
+ goto out;
+ }
+ }
+ ret = handler->get(handler, dentry, inode, name, value, size);
+ userns = userns->parent;
+ } while ((ret == -ENODATA) && userns && userns_supt_xattr);
- handler = xattr_resolve_name(inode, &name);
- if (IS_ERR(handler))
- return PTR_ERR(handler);
- if (!handler->get)
- return -EOPNOTSUPP;
- return handler->get(handler, dentry, inode, name, value, size);
+ kfree(newname);
+ return ret;
}
EXPORT_SYMBOL(__vfs_getxattr);
Thanks
Vivek
Eric W. Biederman
2017-07-14 23:50:01 UTC
Permalink
Post by Stefan Berger
This patch enables security.capability in user namespaces but also
takes a more general approach to enabling extended attributes in user
namespaces.
The following rules describe the approach using security.foo as a
1a) Reading security.foo from a user namespace will read
being the mapping of root in that parent user namespace. An
exception is if root is mapped to uid 0 on the host, and in this case
we will read security.foo directly.
mapping of root to 1000.
parent namespace is tried to be read. This procedure is repeated up to
the init user namespace. This step only applies for reading of extended
attributes and provides the same behavior as older system where the
host's extended attributes applied to user namespaces.
can be read. The uid within the user namespace will be mapped to the
corresponding uid on the host and that uid will be used in the name of
the extended attribute.
mapping of root to 1000, size of at least 2.
of <uid> also being subject to checking for valid mappings.
3) No other security.foo* can be read.
The same rules for reading apply to writing and removing of user
namespace enabled extended attributes.
When listing extended attributes of a file, only those are presented
to the user namespace that have a valid mapping. Besides that, names
of the extended attributes are adjusted to represent the mapping.
This means that if root is mapped to uid 1000 on the host, the
---
fs/xattr.c | 509 +++++++++++++++++++++++++++++++++++++++++++++--
security/commoncap.c | 36 +++-
security/selinux/hooks.c | 9 +-
3 files changed, 523 insertions(+), 31 deletions(-)
I am just going to quickly and publicly point out that as designed this
patch breaks evm inode metadata signing. As evm_config_xattrnames is not
updated.

While not completely insurmountable that seems like a strong limitation of
this design.

Eric
Stefan Berger
2017-07-15 21:30:01 UTC
Permalink
Post by Eric W. Biederman
Post by Stefan Berger
This patch enables security.capability in user namespaces but also
takes a more general approach to enabling extended attributes in user
namespaces.
The following rules describe the approach using security.foo as a
1a) Reading security.foo from a user namespace will read
being the mapping of root in that parent user namespace. An
exception is if root is mapped to uid 0 on the host, and in this case
we will read security.foo directly.
mapping of root to 1000.
parent namespace is tried to be read. This procedure is repeated up to
the init user namespace. This step only applies for reading of extended
attributes and provides the same behavior as older system where the
host's extended attributes applied to user namespaces.
can be read. The uid within the user namespace will be mapped to the
corresponding uid on the host and that uid will be used in the name of
the extended attribute.
mapping of root to 1000, size of at least 2.
of <uid> also being subject to checking for valid mappings.
3) No other security.foo* can be read.
The same rules for reading apply to writing and removing of user
namespace enabled extended attributes.
When listing extended attributes of a file, only those are presented
to the user namespace that have a valid mapping. Besides that, names
of the extended attributes are adjusted to represent the mapping.
This means that if root is mapped to uid 1000 on the host, the
---
fs/xattr.c | 509 +++++++++++++++++++++++++++++++++++++++++++++--
security/commoncap.c | 36 +++-
security/selinux/hooks.c | 9 +-
3 files changed, 523 insertions(+), 31 deletions(-)
I am just going to quickly and publicly point out that as designed this
patch breaks evm inode metadata signing. As evm_config_xattrnames is not
updated.
While not completely insurmountable that seems like a strong limitation of
this design.
EVM could be converted to get the list of xattrs and prefix-compare it
against the evm_config_xattrnames to do what it does now.

Stefan
Vivek Goyal
2017-07-17 19:00:03 UTC
Permalink
On Tue, Jul 11, 2017 at 11:05:11AM -0400, Stefan Berger wrote:

[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?

What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.

Looks like that's the intent of "!list" condition here. Just wanted to
make sure, hence asking.

BTW, I am testing this with overlayfs and trying to figure out if
switching of creds will create issues. Simple operations like listxattr
and getxattr and setxattr so far worked for me. And reason seems to be
that name transformation we are doing in top layer based on creds of
caller (and not based on creds of mounter).

Vivek
Stefan Berger
2017-07-17 21:00:01 UTC
Permalink
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of
the userid's the list can expand. A ***@uid=100 can become
***@uid=100000, if the mapping is set up so that uid 100 on the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k
buffer when calling from a userns. In this buffer where we gather the
xattr names and then walk them to determine the size that's needed for
the buffer by simulating the rewriting. It's not nice but I don't know
of any other solution.
Post by Vivek Goyal
Looks like that's the intent of "!list" condition here. Just wanted to
make sure, hence asking.
Thanks for asking. I thought I had this case covered, but obviously I
did not.
Post by Vivek Goyal
BTW, I am testing this with overlayfs and trying to figure out if
switching of creds will create issues. Simple operations like listxattr
and getxattr and setxattr so far worked for me. And reason seems to be
that name transformation we are doing in top layer based on creds of
caller (and not based on creds of mounter).
Vivek
Stefan
Vivek Goyal
2017-07-18 11:50:03 UTC
Permalink
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,

For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
xattr_list_userns_rewrite() was doing. But looks like this logic will not
kick in for the case of size==0 due to "!list" condition.

Also we could probably replace "!list" with "!size" wheverever required.
Its little easy to read and understand.

For the other case where some xattrs can get filtered out and we report
a buffer size bigger than actually needed, I am hoping that its acceptable
and none of the existing users are broken.

Thanks
Vivek
Stefan Berger
2017-07-18 12:10:01 UTC
Permalink
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are
written to so that the xattr_list_userns_rewrite() can actually rewrite
what the filesystem drivers returned. Not knowing exactly how big that
buffer should be, I allocate 65k for it. From what I read there is a 64k
limit on the vfs layer for xattrs, probably including xattr values. So
65k would for sure be enough also if each one of the xattr names becomes
bigger.

@@ -922,10 +947,20 @@ vfs_listxattr(struct dentry *dentry, char *list,
size_t size, bool rewrite)
{
struct inode *inode = d_inode(dentry);
ssize_t error;
+ bool getsize = false;

error = security_inode_listxattr(dentry);
if (error)
return error;
+
+ if (!size) {
+ if (current_user_ns() != &init_user_ns) {
+ size = 65 * 1024;
+ list = kmalloc(size, GFP_KERNEL);
+ }
+ getsize = true;
+ }
+
if (inode->i_op->listxattr && (inode->i_opflags & IOP_XATTR)) {
error = -EOPNOTSUPP;
error = inode->i_op->listxattr(dentry, list, size);
@@ -937,6 +972,9 @@ vfs_listxattr(struct dentry *dentry, char *list,
size_t size, bool rewrite)
if (error > 0 && rewrite)
error = xattr_list_userns_rewrite(list, error, size);

+ if (getsize)
+ kfree(list);
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_listxattr);


Stefan
Post by Vivek Goyal
xattr_list_userns_rewrite() was doing. But looks like this logic will not
kick in for the case of size==0 due to "!list" condition.
Also we could probably replace "!list" with "!size" wheverever required.
Its little easy to read and understand.
For the other case where some xattrs can get filtered out and we report
a buffer size bigger than actually needed, I am hoping that its acceptable
and none of the existing users are broken.
Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vivek Goyal
2017-07-18 12:40:02 UTC
Permalink
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are written
to so that the xattr_list_userns_rewrite() can actually rewrite what the
filesystem drivers returned.
I am probably missing something, but for the case of size==0, we don't
have to copy all xattrs. We just need to determine size. So we can walk
through each xattr, remap it and add to the size. I mean there should not
be a need to allocate this 65K buffer. Just enough space needed to be
able to store remapped xattr.

You are already doing it in xattr_parse_uid_from_kuid(). It returns the
buffer containing remapped xattr. So we should be able to just determine
the size and free the buffer. And do it for all the xattrs returned by
filesystem.

What am I missing?

Vivek
Not knowing exactly how big that buffer should
be, I allocate 65k for it. From what I read there is a 64k limit on the vfs
layer for xattrs, probably including xattr values. So 65k would for sure be
enough also if each one of the xattr names becomes bigger.
@@ -922,10 +947,20 @@ vfs_listxattr(struct dentry *dentry, char *list,
size_t size, bool rewrite)
{
struct inode *inode = d_inode(dentry);
ssize_t error;
+ bool getsize = false;
error = security_inode_listxattr(dentry);
if (error)
return error;
+
+ if (!size) {
+ if (current_user_ns() != &init_user_ns) {
+ size = 65 * 1024;
+ list = kmalloc(size, GFP_KERNEL);
+ }
+ getsize = true;
+ }
+
if (inode->i_op->listxattr && (inode->i_opflags & IOP_XATTR)) {
error = -EOPNOTSUPP;
error = inode->i_op->listxattr(dentry, list, size);
@@ -937,6 +972,9 @@ vfs_listxattr(struct dentry *dentry, char *list, size_t
size, bool rewrite)
if (error > 0 && rewrite)
error = xattr_list_userns_rewrite(list, error, size);
+ if (getsize)
+ kfree(list);
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_listxattr);
Stefan
Post by Vivek Goyal
xattr_list_userns_rewrite() was doing. But looks like this logic will not
kick in for the case of size==0 due to "!list" condition.
Also we could probably replace "!list" with "!size" wheverever required.
Its little easy to read and understand.
For the other case where some xattrs can get filtered out and we report
a buffer size bigger than actually needed, I am hoping that its acceptable
and none of the existing users are broken.
Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vivek Goyal
2017-07-18 12:40:02 UTC
Permalink
Post by Vivek Goyal
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are written
to so that the xattr_list_userns_rewrite() can actually rewrite what the
filesystem drivers returned.
I am probably missing something, but for the case of size==0, we don't
have to copy all xattrs. We just need to determine size. So we can walk
through each xattr, remap it and add to the size. I mean there should not
be a need to allocate this 65K buffer. Just enough space needed to be
able to store remapped xattr.
You are already doing it in xattr_parse_uid_from_kuid(). It returns the
buffer containing remapped xattr. So we should be able to just determine
the size and free the buffer. And do it for all the xattrs returned by
filesystem.
What am I missing?
Oh, I think I get it. If I don't pass a buffer to underlying driver, then
it will just return the size (and not actual list). So that's why you are
allocating that big buffer and getting the whole list internally, doing
mapping and returning size to user space. Hmm...

Vivek
Eric W. Biederman
2017-07-18 13:40:01 UTC
Permalink
Post by Vivek Goyal
Post by Vivek Goyal
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are written
to so that the xattr_list_userns_rewrite() can actually rewrite what the
filesystem drivers returned.
I am probably missing something, but for the case of size==0, we don't
have to copy all xattrs. We just need to determine size. So we can walk
through each xattr, remap it and add to the size. I mean there should not
be a need to allocate this 65K buffer. Just enough space needed to be
able to store remapped xattr.
You are already doing it in xattr_parse_uid_from_kuid(). It returns the
buffer containing remapped xattr. So we should be able to just determine
the size and free the buffer. And do it for all the xattrs returned by
filesystem.
What am I missing?
Oh, I think I get it. If I don't pass a buffer to underlying driver, then
it will just return the size (and not actual list). So that's why you are
allocating that big buffer and getting the whole list internally, doing
mapping and returning size to user space. Hmm...
A valid reason to be leary of storing attributs in the xattrs.

Eric
Stefan Berger
2017-07-18 13:30:03 UTC
Permalink
Post by Vivek Goyal
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are written
to so that the xattr_list_userns_rewrite() can actually rewrite what the
filesystem drivers returned.
I am probably missing something, but for the case of size==0, we don't
have to copy all xattrs. We just need to determine size. So we can walk
through each xattr, remap it and add to the size. I mean there should not
be a need to allocate this 65K buffer. Just enough space needed to be
able to store remapped xattr.
You are already doing it in xattr_parse_uid_from_kuid(). It returns the
buffer containing remapped xattr. So we should be able to just determine
the size and free the buffer. And do it for all the xattrs returned by
filesystem.
What am I missing?
The problem is that each filesystem has a function that collects the
xattr names. These functions only return the needed size if size==0 and
don't write anything into a buffer. If the buffer is empty or there is
no buffer, I have nothing to remap and calculate size for. So I pass a
buffer large enough to hold the xattr names to the filesystem functions
so I can then subsequently walk the xattrs and remap them. The remapping
only needs to be done in non-init_user_ns since there the uid parts
(@uid=1000) may need to be rewritten and most importantly, the size of
the needed buffer can increase, depending on how the uid mappings are.

I don't want to extend every filesystem's xattr name gathering function...


Stefan
Post by Vivek Goyal
Vivek
Not knowing exactly how big that buffer should
be, I allocate 65k for it. From what I read there is a 64k limit on the vfs
layer for xattrs, probably including xattr values. So 65k would for sure be
enough also if each one of the xattr names becomes bigger.
@@ -922,10 +947,20 @@ vfs_listxattr(struct dentry *dentry, char *list,
size_t size, bool rewrite)
{
struct inode *inode = d_inode(dentry);
ssize_t error;
+ bool getsize = false;
error = security_inode_listxattr(dentry);
if (error)
return error;
+
+ if (!size) {
+ if (current_user_ns() != &init_user_ns) {
+ size = 65 * 1024;
+ list = kmalloc(size, GFP_KERNEL);
+ }
+ getsize = true;
+ }
+
if (inode->i_op->listxattr && (inode->i_opflags & IOP_XATTR)) {
error = -EOPNOTSUPP;
error = inode->i_op->listxattr(dentry, list, size);
@@ -937,6 +972,9 @@ vfs_listxattr(struct dentry *dentry, char *list, size_t
size, bool rewrite)
if (error > 0 && rewrite)
error = xattr_list_userns_rewrite(list, error, size);
+ if (getsize)
+ kfree(list);
+
return error;
}
EXPORT_SYMBOL_GPL(vfs_listxattr);
Stefan
Post by Vivek Goyal
xattr_list_userns_rewrite() was doing. But looks like this logic will not
kick in for the case of size==0 due to "!list" condition.
Also we could probably replace "!list" with "!size" wheverever required.
Its little easy to read and understand.
For the other case where some xattrs can get filtered out and we report
a buffer size bigger than actually needed, I am hoping that its acceptable
and none of the existing users are broken.
Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
More majordomo info at http://vger.kernel.org/majordomo-info.html
Vivek Goyal
2017-07-18 15:00:03 UTC
Permalink
Post by Vivek Goyal
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are written
to so that the xattr_list_userns_rewrite() can actually rewrite what the
filesystem drivers returned.
I am probably missing something, but for the case of size==0, we don't
have to copy all xattrs. We just need to determine size. So we can walk
through each xattr, remap it and add to the size. I mean there should not
be a need to allocate this 65K buffer. Just enough space needed to be
able to store remapped xattr.
You are already doing it in xattr_parse_uid_from_kuid(). It returns the
buffer containing remapped xattr. So we should be able to just determine
the size and free the buffer. And do it for all the xattrs returned by
filesystem.
What am I missing?
The problem is that each filesystem has a function that collects the xattr
names. These functions only return the needed size if size==0 and don't
write anything into a buffer. If the buffer is empty or there is no buffer,
I have nothing to remap and calculate size for.
How about calling listxattr() twice. In the first call you will get the
size of buffer to allocate. Allocate that buffer and call ->listxattr()
again, this time passing that buffer? That way you will not have to
hardcode the size of buffer.

Vivek
Stefan Berger
2017-07-18 16:20:01 UTC
Permalink
Post by Vivek Goyal
Post by Vivek Goyal
Post by Vivek Goyal
Post by Stefan Berger
Post by Vivek Goyal
[..]
Post by Stefan Berger
+/*
+ * xattr_list_userns_rewrite - Rewrite list of xattr names for user namespaces
+ * or determine needed size for attribute list
+ * in case size == 0
+ *
+ * In a user namespace we do not present all extended attributes to the
+ * user. We filter out those that are in the list of userns supported xattr.
+ * for that uid in the current user namespace.
+ *
+ */
+static ssize_t
+xattr_list_userns_rewrite(char *list, ssize_t size, size_t list_maxlen)
+{
+ char *nlist = NULL;
+ size_t s_off, len, nlen;
+ ssize_t d_off;
+ char *name, *newname;
+
+ if (!list || size < 0 || current_user_ns() == &init_user_ns)
size will never be less than 0 here. Only caller calls this function only
if size is >0. So can we remove this?
Correct.
Post by Vivek Goyal
What about case of "!list". So if user space called listxattr(foo, NULL,
0), we want to return the size of buffer as if all the xattrs will be
returned to user space. But in practice we probably will filter out some
xattrs so actually returned string will be smaller than size reported
previously.
This case of size=0 is a problem in userns. Depending on the mapping of the
host becomes uid 100000 inside the container. So for now we only have
security.capability and the way I solved this is by allocating a 65k buffer
when calling from a userns. In this buffer where we gather the xattr names
and then walk them to determine the size that's needed for the buffer by
simulating the rewriting. It's not nice but I don't know of any other
solution.
Hi Stefan,
For the case of size==0, why don't we iterate through all the xattr,
filter them, remap them and then return the size to process in user
namespace. That should fix this? I thought that's what
For the size==0 we need a temp. buffer where the raw xattr names are written
to so that the xattr_list_userns_rewrite() can actually rewrite what the
filesystem drivers returned.
I am probably missing something, but for the case of size==0, we don't
have to copy all xattrs. We just need to determine size. So we can walk
through each xattr, remap it and add to the size. I mean there should not
be a need to allocate this 65K buffer. Just enough space needed to be
able to store remapped xattr.
You are already doing it in xattr_parse_uid_from_kuid(). It returns the
buffer containing remapped xattr. So we should be able to just determine
the size and free the buffer. And do it for all the xattrs returned by
filesystem.
What am I missing?
The problem is that each filesystem has a function that collects the xattr
names. These functions only return the needed size if size==0 and don't
write anything into a buffer. If the buffer is empty or there is no buffer,
I have nothing to remap and calculate size for.
How about calling listxattr() twice. In the first call you will get the
size of buffer to allocate. Allocate that buffer and call ->listxattr()
again, this time passing that buffer? That way you will not have to
hardcode the size of buffer.
Good idea. I modified the code to do this now. Thanks.

Stefan
Loading...