Merge branch 'for-4.6-ns' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup (5518f66b)

Documentation/cgroup-v2.txt

+147 −0

		@@ -47,6 +47,11 @@ CONTENTS
		5-3. IO
		5-3-1. IO Interface Files
		5-3-2. Writeback
		6. Namespace
		6-1. Basics
		6-2. The Root and Views
		6-3. Migration and setns(2)
		6-4. Interaction with Other Namespaces
		P. Information on Kernel Programming
		P-1. Filesystem Support for Writeback
		D. Deprecated v1 Core Features
		@@ -1114,6 +1119,148 @@ writeback as follows.
		vm.dirty[_background]_ratio.


		6. Namespace

		6-1. Basics

		cgroup namespace provides a mechanism to virtualize the view of the
		"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
		flag can be used with clone(2) and unshare(2) to create a new cgroup
		namespace. The process running inside the cgroup namespace will have
		its "/proc/$PID/cgroup" output restricted to cgroupns root. The
		cgroupns root is the cgroup of the process at the time of creation of
		the cgroup namespace.

		Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
		complete path of the cgroup of a process. In a container setup where
		a set of cgroups and namespaces are intended to isolate processes, the
		"/proc/$PID/cgroup" file may leak system-level information to the
		isolated processes. For example:

		# cat /proc/self/cgroup
		0::/batchjobs/container_id1

		The path '/batchjobs/container_id1' can be considered system data
		that is undesirable to expose to the isolated processes. A cgroup
		namespace can be used to restrict visibility of this path. For
		example, before creating a cgroup namespace, one would see:

		# ls -l /proc/self/ns/cgroup
		lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
		# cat /proc/self/cgroup
		0::/batchjobs/container_id1

		After unsharing a new namespace, the view changes.

		# ls -l /proc/self/ns/cgroup
		lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
		# cat /proc/self/cgroup
		0::/
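The lines shown by "/proc/$PID/cgroup" follow the form "hierarchy-ID:controllers:path". As a minimal userspace sketch (not part of this commit), the following C helper splits one such line into its three fields. The names `cgroup_entry` and `parse_cgroup_line` and the buffer sizes are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One parsed entry of /proc/<pid>/cgroup: "hierarchy-ID:controllers:path". */
struct cgroup_entry {
    int hierarchy_id;      /* 0 for the v2 unified hierarchy */
    char controllers[64];  /* empty for the v2 unified hierarchy */
    char path[256];        /* as seen from the reader's cgroup namespace */
};

/* Parse one line. Returns 0 on success, -1 on malformed or oversized input. */
int parse_cgroup_line(const char *line, struct cgroup_entry *out)
{
    const char *first, *second;
    char *end;
    long id;
    size_t clen, plen;

    assert(line != NULL && out != NULL);

    first = strchr(line, ':');
    if (!first || first == line)
        return -1;
    second = strchr(first + 1, ':');
    if (!second)
        return -1;

    /* The hierarchy ID must be a non-negative number ending at the first ':'. */
    id = strtol(line, &end, 10);
    if (end != first || id < 0)
        return -1;

    clen = (size_t)(second - first - 1);
    if (clen >= sizeof(out->controllers))
        return -1;

    /* The path runs up to the end of the line, without the newline. */
    plen = strcspn(second + 1, "\n");
    if (plen == 0 || plen >= sizeof(out->path))
        return -1;

    out->hierarchy_id = (int)id;
    memcpy(out->controllers, first + 1, clen);
    out->controllers[clen] = '\0';
    memcpy(out->path, second + 1, plen);
    out->path[plen] = '\0';
    return 0;
}
```

Feeding it the two outputs above would yield the path "/batchjobs/container_id1" before unsharing and "/" afterwards, with hierarchy ID 0 and an empty controller list in both cases.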

		When some thread from a multi-threaded process unshares its cgroup
		namespace, the new cgroupns gets applied to the entire process (all
		the threads). This is natural for the v2 hierarchy; however, for the
		legacy hierarchies, this may be unexpected.

		A cgroup namespace is alive as long as there are processes inside or
		mounts pinning it. When the last usage goes away, the cgroup
		namespace is destroyed. The cgroupns root and the actual cgroups
		remain.


		6-2. The Root and Views

		The 'cgroupns root' for a cgroup namespace is the cgroup in which the
		process calling unshare(2) is running. For example, if a process in
		/batchjobs/container_id1 cgroup calls unshare, cgroup
		/batchjobs/container_id1 becomes the cgroupns root. For the
		init_cgroup_ns, this is the real root ('/') cgroup.

		The cgroupns root cgroup does not change even if the namespace creator
		process later moves to a different cgroup.

		# ~/unshare -c # unshare cgroupns in some cgroup
		# cat /proc/self/cgroup
		0::/
		# mkdir sub_cgrp_1
		# echo 0 > sub_cgrp_1/cgroup.procs
		# cat /proc/self/cgroup
		0::/sub_cgrp_1

		Each process gets its namespace-specific view of "/proc/$PID/cgroup".

		Processes running inside the cgroup namespace will be able to see
		cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
		From within an unshared cgroupns:

		# sleep 100000 &
		[1] 7353
		# echo 7353 > sub_cgrp_1/cgroup.procs
		# cat /proc/7353/cgroup
		0::/sub_cgrp_1

		From the initial cgroup namespace, the real cgroup path will be
		visible:

		$ cat /proc/7353/cgroup
		0::/batchjobs/container_id1/sub_cgrp_1

		From a sibling cgroup namespace (that is, a namespace rooted at a
		different cgroup), the cgroup path relative to the reader's own cgroup
		namespace root will be shown. For instance, if the reading process's
		cgroup namespace root is at '/batchjobs/container_id2', then it will
		see

		# cat /proc/7353/cgroup
		0::/../container_id1/sub_cgrp_1

		Note that the relative path always starts with '/' to indicate that
		it is relative to the cgroup namespace root of the caller.
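The rule for namespace-relative paths can be modelled on plain strings. The sketch below is a userspace illustration only: the kernel works on kernfs nodes, not strings, and the name `ns_relative_path` is hypothetical. It assumes both inputs are absolute, normalized paths with no trailing slash, and follows the return convention of this commit's kernfs helper: the needed length is returned, and the buffer is left unwritten if it is too small.

```c
#include <assert.h>
#include <string.h>

/* Length of the longest common prefix of a and b that ends on a
 * path-component boundary. */
static size_t common_prefix_len(const char *a, const char *b)
{
    size_t i, boundary = 0;

    for (i = 0;; i++) {
        int a_end = (a[i] == '/' || a[i] == '\0');
        int b_end = (b[i] == '/' || b[i] == '\0');

        if (a_end && b_end)
            boundary = i;
        if (a[i] == '\0' || a[i] != b[i])
            break;
    }
    return boundary;
}

/* Write the path of @target as seen from a namespace rooted at @root.
 * Returns the length needed; if that is >= buflen, buf is not written. */
int ns_relative_path(const char *root, const char *target,
                     char *buf, size_t buflen)
{
    size_t common, ups = 0, need, i;
    const char *p;
    char *w = buf;

    assert(root[0] == '/' && target[0] == '/');

    /* Treat "/" as the empty component list. */
    if (strcmp(root, "/") == 0)
        root = "";
    if (strcmp(target, "/") == 0)
        target = "";

    common = common_prefix_len(root, target);

    /* Each component of root below the common ancestor costs one "/..". */
    for (p = root + common; *p; p++)
        if (*p == '/')
            ups++;

    need = 3 * ups + strlen(target + common);
    if (need == 0)
        need = 1; /* root == target prints as "/" */
    if (need >= buflen)
        return (int)need;

    for (i = 0; i < ups; i++) {
        memcpy(w, "/..", 3);
        w += 3;
    }
    strcpy(w, target + common);
    if (buf[0] == '\0')
        strcpy(buf, "/");
    return (int)need;
}
```

With root "/batchjobs/container_id1", the target "/batchjobs/container_id1/sub_cgrp_1" comes out as "/sub_cgrp_1", and the target "/batchjobs/container_id2" comes out as "/../container_id2".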


		6-3. Migration and setns(2)

		Processes inside a cgroup namespace can move into and out of the
		namespace root if they have proper access to external cgroups. For
		example, from inside a namespace with cgroupns root at
		/batchjobs/container_id1, and assuming that the global hierarchy is
		still accessible inside cgroupns:

		# cat /proc/7353/cgroup
		0::/sub_cgrp_1
		# echo 7353 > batchjobs/container_id2/cgroup.procs
		# cat /proc/7353/cgroup
		0::/../container_id2

		Note that this kind of setup is not encouraged. A task inside a cgroup
		namespace should only be exposed to its own cgroupns hierarchy.

		setns(2) to another cgroup namespace is allowed when:

		(a) the process has CAP_SYS_ADMIN against its current user namespace
		(b) the process has CAP_SYS_ADMIN against the target cgroup
		namespace's userns

		No implicit cgroup changes happen when attaching to another cgroup
		namespace. It is expected that someone moves the attaching process
		under the target cgroup namespace root.
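As a rough userspace sketch of the setns(2) step described above, the helper below opens another process's "/proc/$PID/ns/cgroup" link and attaches to it. The function name `join_cgroupns_of` is hypothetical, the fallback definition of CLONE_NEWCGROUP is only for older libc headers, and the call is expected to fail with EPERM unless the capability conditions (a) and (b) hold.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000 /* value from the kernel UAPI headers */
#endif

/* Attach the calling process to the cgroup namespace of @pid.
 * Returns 0 on success, -1 on failure with errno set. */
int join_cgroupns_of(pid_t pid)
{
    char path[64];
    int fd, ret, saved_errno;

    assert(pid > 0);

    snprintf(path, sizeof(path), "/proc/%ld/ns/cgroup", (long)pid);
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    ret = setns(fd, CLONE_NEWCGROUP);

    /* close() may overwrite errno, so preserve the setns() result. */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret;
}
```

Because attaching does not move the caller between cgroups, a real tool would typically follow a successful call by writing its PID into a cgroup.procs file under the target namespace root.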


		6-4. Interaction with Other Namespaces

		A namespace-specific cgroup hierarchy can be mounted by a process
		running inside a non-init cgroup namespace.

		# mount -t cgroup2 none $MOUNT_POINT

		This will mount the unified cgroup hierarchy with cgroupns root as the
		filesystem root. The process needs CAP_SYS_ADMIN against its user and
		mount namespaces.

		The virtualization of the /proc/self/cgroup file, combined with
		restricting the view of the cgroup hierarchy through a
		namespace-private cgroupfs mount, provides a properly isolated cgroup
		view inside the container.


		P. Information on Kernel Programming

		This section contains kernel programming information in the areas

fs/kernfs/dir.c

+160 −31

		@@ -44,28 +44,122 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
		return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
		}

		static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
		size_t buflen)
		/* kernfs_node_depth - compute depth from @from to @to */
		static size_t kernfs_depth(struct kernfs_node *from, struct kernfs_node *to)
		{
		char *p = buf + buflen;
		int len;
		size_t depth = 0;

		*--p = '\0';
		while (to->parent && to != from) {
		depth++;
		to = to->parent;
		}
		return depth;
		}

		do {
		len = strlen(kn->name);
		if (p - buf < len + 1) {
		static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
		struct kernfs_node *b)
		{
		size_t da, db;
		struct kernfs_root *ra = kernfs_root(a), *rb = kernfs_root(b);

		if (ra != rb)
		return NULL;

		da = kernfs_depth(ra->kn, a);
		db = kernfs_depth(rb->kn, b);

		while (da > db) {
		a = a->parent;
		da--;
		}
		while (db > da) {
		b = b->parent;
		db--;
		}

		/* worst case b and a will be the same at root */
		while (b != a) {
		b = b->parent;
		a = a->parent;
		}

		return a;
		}
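The depth-equalizing walk used by kernfs_common_ancestor() can be tried outside the kernel on a toy tree. The sketch below is an illustration under stated assumptions: `struct node` is a hypothetical stand-in for struct kernfs_node with only a parent link, and the "different tree" case is detected by reaching two distinct roots rather than by comparing kernfs roots up front.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct kernfs_node: only the parent link matters. */
struct node {
    const char *name;
    struct node *parent; /* NULL for the root */
};

/* Number of parent links between @n and the root of its tree. */
static size_t node_depth(const struct node *n)
{
    size_t depth = 0;

    while (n->parent) {
        depth++;
        n = n->parent;
    }
    return depth;
}

/* Deepest node that is an ancestor of (or equal to) both @a and @b,
 * or NULL if they live in different trees. */
struct node *common_ancestor(struct node *a, struct node *b)
{
    size_t da = node_depth(a), db = node_depth(b);

    /* Lift the deeper node until both sit at the same depth. */
    while (da > db) {
        a = a->parent;
        da--;
    }
    while (db > da) {
        b = b->parent;
        db--;
    }

    /* Climb in lockstep until the two walks meet. */
    while (a != b) {
        if (!a->parent)
            return NULL; /* two distinct roots */
        a = a->parent;
        b = b->parent;
    }
    return a;
}
```

For the tree /n1/n2/{n3/n4, n5}, the common ancestor of n4 and n5 is n2, matching scenario [2] in the kernel-doc comment of kernfs_path_from_node_locked().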

		/**
		* kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
		* where kn_from is treated as root of the path.
		* @kn_from: kernfs node which should be treated as root for the path
		* @kn_to: kernfs node to which path is needed
		* @buf: buffer to copy the path into
		* @buflen: size of @buf
		*
		* We need to handle a couple of scenarios here:
		* [1] when @kn_from is an ancestor of @kn_to at some level
		* kn_from: /n1/n2/n3
		* kn_to: /n1/n2/n3/n4/n5
		* result: /n4/n5
		*
		* [2] when @kn_from is on a different hierarchy and we need to find common
		* ancestor between @kn_from and @kn_to.
		* kn_from: /n1/n2/n3/n4
		* kn_to: /n1/n2/n5
		* result: /../../n5
		* OR
		* kn_from: /n1/n2/n3/n4/n5 [depth=5]
		* kn_to: /n1/n2/n3 [depth=3]
		* result: /../..
		*
		* return value: length of the string. If greater than buflen,
		* then contents of buf are undefined. On error, -1 is returned.
		*/
		static int kernfs_path_from_node_locked(struct kernfs_node *kn_to,
		struct kernfs_node *kn_from,
		char *buf, size_t buflen)
		{
		struct kernfs_node *kn, *common;
		const char parent_str[] = "/..";
		size_t depth_from, depth_to, len = 0, nlen = 0;
		char *p;
		int i;

		if (!kn_from)
		kn_from = kernfs_root(kn_to)->kn;

		if (kn_from == kn_to)
		return strlcpy(buf, "/", buflen);

		common = kernfs_common_ancestor(kn_from, kn_to);
		if (WARN_ON(!common))
		return -1;

		depth_to = kernfs_depth(common, kn_to);
		depth_from = kernfs_depth(common, kn_from);

		if (buf)
		buf[0] = '\0';
		p = NULL;
		break;

		for (i = 0; i < depth_from; i++)
		len += strlcpy(buf + len, parent_str,
		len < buflen ? buflen - len : 0);

		/* Calculate how many bytes we need for the rest */
		for (kn = kn_to; kn != common; kn = kn->parent)
		nlen += strlen(kn->name) + 1;

		if (len + nlen >= buflen)
		return len + nlen;

		p = buf + len + nlen;
		*p = '\0';
		for (kn = kn_to; kn != common; kn = kn->parent) {
		nlen = strlen(kn->name);
		p -= nlen;
		memcpy(p, kn->name, nlen);
		*(--p) = '/';
		}
		p -= len;
		memcpy(p, kn->name, len);
		*--p = '/';
		kn = kn->parent;
		} while (kn && kn->parent);

		return p;
		return len + nlen;
		}

		/**
		@@ -114,6 +208,34 @@ size_t kernfs_path_len(struct kernfs_node *kn)
		return len;
		}

		/**
		* kernfs_path_from_node - build path of node @to relative to @from.
		* @from: parent kernfs_node relative to which we need to build the path
		* @to: kernfs_node of interest
		* @buf: buffer to copy @to's path into
		* @buflen: size of @buf
		*
		* Builds @to's path relative to @from in @buf. @from and @to must
		* be on the same kernfs-root. If @from is not parent of @to, then a relative
		* path (which includes '..'s) as needed to reach from @from to @to is
		* returned.
		*
		* If @buf isn't long enough, the return value will be greater than @buflen
		* and @buf contents are undefined.
		*/
		int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
		char *buf, size_t buflen)
		{
		unsigned long flags;
		int ret;

		spin_lock_irqsave(&kernfs_rename_lock, flags);
		ret = kernfs_path_from_node_locked(to, from, buf, buflen);
		spin_unlock_irqrestore(&kernfs_rename_lock, flags);
		return ret;
		}
		EXPORT_SYMBOL_GPL(kernfs_path_from_node);

		/**
		* kernfs_path - build full path of a given node
		* @kn: kernfs_node of interest
		@@ -127,13 +249,12 @@ size_t kernfs_path_len(struct kernfs_node *kn)
		*/
		char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
		{
		unsigned long flags;
		char *p;
		int ret;

		spin_lock_irqsave(&kernfs_rename_lock, flags);
		p = kernfs_path_locked(kn, buf, buflen);
		spin_unlock_irqrestore(&kernfs_rename_lock, flags);
		return p;
		ret = kernfs_path_from_node(kn, NULL, buf, buflen);
		if (ret < 0 || ret >= buflen)
		return NULL;
		return buf;
		}
		EXPORT_SYMBOL_GPL(kernfs_path);

		@@ -164,17 +285,25 @@ void pr_cont_kernfs_name(struct kernfs_node *kn)
		void pr_cont_kernfs_path(struct kernfs_node *kn)
		{
		unsigned long flags;
		char *p;
		int sz;

		spin_lock_irqsave(&kernfs_rename_lock, flags);

		p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
		sz = kernfs_path_from_node_locked(kn, NULL, kernfs_pr_cont_buf,
		sizeof(kernfs_pr_cont_buf));
		if (p)
		pr_cont("%s", p);
		else
		pr_cont("<name too long>");
		if (sz < 0) {
		pr_cont("(error)");
		goto out;
		}

		if (sz >= sizeof(kernfs_pr_cont_buf)) {
		pr_cont("(name too long)");
		goto out;
		}

		pr_cont("%s", kernfs_pr_cont_buf);

		out:
		spin_unlock_irqrestore(&kernfs_rename_lock, flags);
		}

fs/kernfs/mount.c

+69 −0

		@@ -14,6 +14,7 @@
		#include <linux/magic.h>
		#include <linux/slab.h>
		#include <linux/pagemap.h>
		#include <linux/namei.h>

		#include "kernfs-internal.h"

		@@ -62,6 +63,74 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
		return NULL;
		}

		/*
		* find the next ancestor in the path down to @child, where @parent was the
		* ancestor whose descendant we want to find.
		*
		* Say the path is /a/b/c/d. @child is d, @parent is NULL. We return the root
		* node. If @parent is b, then we return the node for c.
		* Passing in d as @parent is not ok.
		*/
		static struct kernfs_node *find_next_ancestor(struct kernfs_node *child,
		struct kernfs_node *parent)
		{
		if (child == parent) {
		pr_crit_once("BUG in find_next_ancestor: called with parent == child");
		return NULL;
		}

		while (child->parent != parent) {
		if (!child->parent)
		return NULL;
		child = child->parent;
		}

		return child;
		}

		/**
		* kernfs_node_dentry - get a dentry for the given kernfs_node
		* @kn: kernfs_node for which a dentry is needed
		* @sb: the kernfs super_block
		*/
		struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
		struct super_block *sb)
		{
		struct dentry *dentry;
		struct kernfs_node *knparent = NULL;

		BUG_ON(sb->s_op != &kernfs_sops);

		dentry = dget(sb->s_root);

		/* Check if this is the root kernfs_node */
		if (!kn->parent)
		return dentry;

		knparent = find_next_ancestor(kn, NULL);
		if (WARN_ON(!knparent))
		return ERR_PTR(-EINVAL);

		do {
		struct dentry *dtmp;
		struct kernfs_node *kntmp;

		if (kn == knparent)
		return dentry;
		kntmp = find_next_ancestor(kn, knparent);
		if (WARN_ON(!kntmp))
		return ERR_PTR(-EINVAL);
		mutex_lock(&d_inode(dentry)->i_mutex);
		dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
		mutex_unlock(&d_inode(dentry)->i_mutex);
		dput(dentry);
		if (IS_ERR(dtmp))
		return dtmp;
		knparent = kntmp;
		dentry = dtmp;
		} while (true);
		}
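kernfs_node_dentry() resolves a node top-down by repeatedly asking find_next_ancestor() for the next component on the way to the target. The toy below mimics that walk on plain structs and builds a path string instead of looking up dentries. `struct node`, `next_toward` and `path_top_down` are hypothetical names for this sketch, and no locking or dentry handling is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct kernfs_node. */
struct node {
    const char *name;
    struct node *parent; /* NULL for the root */
};

/* Return the node on the path to @child whose parent is @parent.
 * With @parent == NULL this is the root. Returns NULL if @parent is
 * @child itself or is not an ancestor of @child. */
static struct node *next_toward(struct node *child, struct node *parent)
{
    if (child == parent)
        return NULL;

    while (child->parent != parent) {
        if (!child->parent)
            return NULL;
        child = child->parent;
    }
    return child;
}

/* Build the absolute path of @leaf by walking down from the root.
 * Returns the path length, or -1 if it does not fit in @buf. */
int path_top_down(struct node *leaf, char *buf, size_t buflen)
{
    struct node *cur = next_toward(leaf, NULL); /* the root */
    size_t len = 0;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    while (cur != leaf) {
        size_t n;

        cur = next_toward(leaf, cur);
        if (!cur)
            return -1;
        n = strlen(cur->name);
        if (len + 1 + n >= buflen)
            return -1;
        buf[len++] = '/';
        memcpy(buf + len, cur->name, n + 1); /* copies the NUL too */
        len += n;
    }

    if (len == 0) { /* @leaf is the root itself */
        if (buflen < 2)
            return -1;
        strcpy(buf, "/");
        len = 1;
    }
    return (int)len;
}
```

Each step rescans the parent chain from the leaf, so the walk costs time quadratic in the depth. That is simple and needs no extra storage, which seems a reasonable trade for the shallow hierarchies involved.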

		static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
		{
		struct kernfs_super_info *info = kernfs_info(sb);

fs/proc/namespaces.c

+3 −0

		@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
		&userns_operations,
		#endif
		&mntns_operations,
		#ifdef CONFIG_CGROUPS
		&cgroupns_operations,
		#endif
		};

		static const char *proc_ns_get_link(struct dentry *dentry,

include/linux/cgroup.h

+49 −0

		@@ -17,6 +17,11 @@
		#include <linux/seq_file.h>
		#include <linux/kernfs.h>
		#include <linux/jump_label.h>
		#include <linux/nsproxy.h>
		#include <linux/types.h>
		#include <linux/ns_common.h>
		#include <linux/nsproxy.h>
		#include <linux/user_namespace.h>

		#include <linux/cgroup-defs.h>

		@@ -611,4 +616,48 @@ static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}

		#endif /* CONFIG_CGROUP_DATA */

		struct cgroup_namespace {
		atomic_t count;
		struct ns_common ns;
		struct user_namespace *user_ns;
		struct css_set *root_cset;
		};

		extern struct cgroup_namespace init_cgroup_ns;

		#ifdef CONFIG_CGROUPS

		void free_cgroup_ns(struct cgroup_namespace *ns);

		struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
		struct user_namespace *user_ns,
		struct cgroup_namespace *old_ns);

		char *cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
		struct cgroup_namespace *ns);

		#else /* !CONFIG_CGROUPS */

		static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
		static inline struct cgroup_namespace *
		copy_cgroup_ns(unsigned long flags, struct user_namespace *user_ns,
		struct cgroup_namespace *old_ns)
		{
		return old_ns;
		}

		#endif /* !CONFIG_CGROUPS */

		static inline void get_cgroup_ns(struct cgroup_namespace *ns)
		{
		if (ns)
		atomic_inc(&ns->count);
		}

		static inline void put_cgroup_ns(struct cgroup_namespace *ns)
		{
		if (ns && atomic_dec_and_test(&ns->count))
		free_cgroup_ns(ns);
		}
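The get/put pair above is a conventional reference-counting pattern: the object is freed by whoever drops the last reference. A userspace analogue using C11 atomics might look like the sketch below. `struct toy_ns`, its helpers and the `toy_ns_freed` counter are hypothetical and exist only to make the lifetime observable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy counterpart of struct cgroup_namespace: just the reference count. */
struct toy_ns {
    atomic_int count;
};

static int toy_ns_freed; /* number of objects released so far */

/* Allocate an object holding one reference for the caller. */
struct toy_ns *toy_ns_alloc(void)
{
    struct toy_ns *ns = malloc(sizeof(*ns));

    if (ns)
        atomic_init(&ns->count, 1);
    return ns;
}

/* Take an additional reference; NULL is accepted and ignored. */
void toy_ns_get(struct toy_ns *ns)
{
    if (ns) {
        assert(atomic_load(&ns->count) > 0);
        atomic_fetch_add(&ns->count, 1);
    }
}

/* Drop a reference and free the object when the last one goes away. */
void toy_ns_put(struct toy_ns *ns)
{
    /* atomic_fetch_sub returns the previous value, so 1 means "last". */
    if (ns && atomic_fetch_sub(&ns->count, 1) == 1) {
        free(ns);
        toy_ns_freed++;
    }
}
```

Allocating, taking one extra reference, and then calling toy_ns_put() twice frees the object only on the second put, which parallels how the namespace stays alive while any process or mount still pins it.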

		#endif /* _LINUX_CGROUP_H */
