Merge branch 'shared-cgroup-storage' (36f72484) · Commits · 戴 / test

Documentation/bpf/index.rst

+9 −0

		@@ -48,6 +48,15 @@ Program types
		bpf_lsm


		Map types
		=========

		.. toctree::
		   :maxdepth: 1

		   map_cgroup_storage


		Testing and debugging BPF
		=========================

Documentation/bpf/map_cgroup_storage.rst

0 → 100644

+169 −0

		.. SPDX-License-Identifier: GPL-2.0-only
		.. Copyright (C) 2020 Google LLC.

		===========================
		BPF_MAP_TYPE_CGROUP_STORAGE
		===========================

		The ``BPF_MAP_TYPE_CGROUP_STORAGE`` map type represents a local fixed-size
		storage. It is only available with ``CONFIG_CGROUP_BPF``, and to programs that
		attach to cgroups; the programs are made available by the same Kconfig. The
		storage is identified by the cgroup the program is attached to.

		The map provides local storage at the cgroup that the BPF program is attached
		to. It offers faster and simpler access than a general-purpose hash table,
		which performs hash table lookups and requires the user to track live
		cgroups on their own.

		This document describes the usage and semantics of the
		``BPF_MAP_TYPE_CGROUP_STORAGE`` map type. Some of its behaviors were changed in
		Linux 5.9, and this document describes the differences.

		Usage
		=====

		The map uses a key of type either ``__u64 cgroup_inode_id`` or
		``struct bpf_cgroup_storage_key``, declared in ``linux/bpf.h``::

		    struct bpf_cgroup_storage_key {
		            __u64 cgroup_inode_id;
		            __u32 attach_type;
		    };

		``cgroup_inode_id`` is the inode id of the cgroup directory.
		``attach_type`` is the program's attach type.

		Linux 5.9 added support for type ``__u64 cgroup_inode_id`` as the key type.
		When this key type is used, all attach types of the particular cgroup and
		map share the same storage. Otherwise, if the type is
		``struct bpf_cgroup_storage_key``, programs of different attach types
		are isolated and see different storages.
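
The key-matching rule described above can be illustrated with a minimal userspace sketch. It is a model of the documented behavior, not the kernel's implementation; ``storage_key`` and ``storage_key_cmp`` are hypothetical names introduced only for this illustration, and the key size stands in for the map's declared key type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct bpf_cgroup_storage_key in linux/bpf.h. */
struct storage_key {
	uint64_t cgroup_inode_id;
	uint32_t attach_type;
};

/*
 * Hypothetical comparison helper (illustration only).
 * key_size == sizeof(uint64_t) models a map keyed by __u64, where storage
 * is shared across attach types; any other size models a map keyed by the
 * full struct, where attach types are isolated.
 * Returns 0 when both keys refer to the same storage.
 */
static int storage_key_cmp(size_t key_size, const struct storage_key *a,
			   const struct storage_key *b)
{
	if (a->cgroup_inode_id != b->cgroup_inode_id)
		return a->cgroup_inode_id < b->cgroup_inode_id ? -1 : 1;

	if (key_size == sizeof(uint64_t))
		return 0; /* shared map: attach type is ignored */

	if (a->attach_type != b->attach_type)
		return a->attach_type < b->attach_type ? -1 : 1;

	return 0;
}
```

With a ``__u64`` key size, two keys for the same cgroup but different attach types compare equal; with the struct key size they do not.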

		To access the storage in a program, use ``bpf_get_local_storage``::

		    void *bpf_get_local_storage(void *map, u64 flags)

		``flags`` is reserved for future use and must be 0.

		There is no implicit synchronization. Storages of ``BPF_MAP_TYPE_CGROUP_STORAGE``
		can be accessed by multiple programs across different CPUs, and users must
		take care of synchronization themselves. The BPF infrastructure provides
		``struct bpf_spin_lock`` to synchronize access to the storage. See
		``tools/testing/selftests/bpf/progs/test_spin_lock.c``.

		Examples
		========

		Usage with key type as ``struct bpf_cgroup_storage_key``::

		    #include <linux/bpf.h>
		    #include <bpf/bpf_helpers.h>

		    struct {
		            __uint(type, BPF_MAP_TYPE_CGROUP_STORAGE);
		            __type(key, struct bpf_cgroup_storage_key);
		            __type(value, __u32);
		    } cgroup_storage SEC(".maps");

		    int program(struct __sk_buff *skb)
		    {
		            /* storage of the cgroup this program is attached to */
		            __u32 *ptr = bpf_get_local_storage(&cgroup_storage, 0);
		            __sync_fetch_and_add(ptr, 1);

		            return 0;
		    }

		Userspace accessing the map declared above::

		    #include <bpf/bpf.h>
		    #include <bpf/libbpf.h>

		    __u32 map_lookup(struct bpf_map *map, __u64 cgrp, enum bpf_attach_type type)
		    {
		            struct bpf_cgroup_storage_key key = {
		                    .cgroup_inode_id = cgrp,
		                    .attach_type = type,
		            };
		            __u32 value = 0;
		            bpf_map_lookup_elem(bpf_map__fd(map), &key, &value);
		            // error checking omitted
		            return value;
		    }

		Alternatively, using just ``__u64 cgroup_inode_id`` as key type::

		    #include <linux/bpf.h>
		    #include <bpf/bpf_helpers.h>

		    struct {
		            __uint(type, BPF_MAP_TYPE_CGROUP_STORAGE);
		            __type(key, __u64);
		            __type(value, __u32);
		    } cgroup_storage SEC(".maps");

		    int program(struct __sk_buff *skb)
		    {
		            /* storage shared by all attach types on this cgroup */
		            __u32 *ptr = bpf_get_local_storage(&cgroup_storage, 0);
		            __sync_fetch_and_add(ptr, 1);

		            return 0;
		    }

		And userspace::

		    #include <bpf/bpf.h>
		    #include <bpf/libbpf.h>

		    __u32 map_lookup(struct bpf_map *map, __u64 cgrp, enum bpf_attach_type type)
		    {
		            __u32 value = 0;
		            bpf_map_lookup_elem(bpf_map__fd(map), &cgrp, &value);
		            // error checking omitted
		            return value;
		    }

		Semantics
		=========

		``BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE`` is a variant of this map type. The
		per-CPU variant has a separate memory region per CPU for each storage. The
		non-per-CPU variant has a single memory region per storage, shared by all CPUs.
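
To make the per-CPU layout concrete, here is a small userspace sketch that sums a ``__u32`` counter across per-CPU regions. It assumes the usual convention for per-CPU BPF maps, namely that a userspace lookup fills a buffer with one slot per possible CPU and that each slot is the value size rounded up to 8 bytes; that assumption and the helper names ``percpu_slot_size`` and ``percpu_sum_u32`` are not taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed slot stride: the value size rounded up to a multiple of 8. */
static size_t percpu_slot_size(size_t value_size)
{
	return (value_size + 7) & ~(size_t)7;
}

/*
 * Hypothetical helper: add up a __u32 counter stored once per CPU.
 * buf must hold ncpus slots of percpu_slot_size(sizeof(uint32_t)) bytes.
 */
static uint64_t percpu_sum_u32(const void *buf, int ncpus)
{
	const unsigned char *p = buf;
	size_t stride = percpu_slot_size(sizeof(uint32_t));
	uint64_t total = 0;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		uint32_t v;

		/* copy out of the slot to avoid unaligned access */
		memcpy(&v, p + (size_t)cpu * stride, sizeof(v));
		total += v;
	}
	return total;
}
```

The non-per-CPU variant needs no such aggregation, since every CPU reads and writes the same region.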

		Prior to Linux 5.9, the lifetime of a storage is precisely per-attachment, and
		for a single ``CGROUP_STORAGE`` map, there can be at most one program loaded
		that uses the map. A program may be attached to multiple cgroups or have
		multiple attach types, and each attach creates a fresh zeroed storage. The
		storage is freed upon detach.

		There is a one-to-one association between the map of each type (per-CPU and
		non-per-CPU) and the BPF program during load verification time. As a result,
		each map can only be used by one BPF program and each BPF program can only use
		one storage map of each type. Because a map can only be used by one BPF
		program, sharing this cgroup's storage with other BPF programs was
		impossible.

		Since Linux 5.9, storage can be shared by multiple programs. When a program is
		attached to a cgroup, the kernel creates a new storage only if the map does
		not already contain an entry for the cgroup and attach type pair; otherwise
		the existing storage is reused for the new attachment. If the map is
		attach-type shared, the attach type is simply ignored during comparison.
		Storage is freed only when either the map or the cgroup it is attached to is
		freed. Detaching does not directly free the storage, but it may cause the
		reference count of the map to reach zero, indirectly freeing all storage in the map.
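
The post-5.9 lifetime rules can be modeled in a few lines of plain C. This is a toy userspace sketch under stated assumptions, not kernel code: the fixed-size table and the names ``model_attach``, ``model_detach`` and ``model_cgroup_release`` are hypothetical, and the sketch models an attach-type isolated map only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_STORAGES 8

struct storage {
	uint64_t cgroup_id;
	uint32_t attach_type;
	uint32_t value;
	int in_use;
};

/* Toy stand-in for one cgroup storage map. */
struct storage_map {
	struct storage slots[MAX_STORAGES];
};

/* Attach: reuse the entry for (cgroup, attach type) or create a zeroed one. */
static struct storage *model_attach(struct storage_map *map,
				    uint64_t cgroup_id, uint32_t attach_type)
{
	struct storage *free_slot = NULL;

	for (int i = 0; i < MAX_STORAGES; i++) {
		struct storage *s = &map->slots[i];

		if (s->in_use && s->cgroup_id == cgroup_id &&
		    s->attach_type == attach_type)
			return s; /* existing storage is reused */
		if (!s->in_use && !free_slot)
			free_slot = s;
	}
	if (!free_slot)
		return NULL; /* toy table is full */

	free_slot->cgroup_id = cgroup_id;
	free_slot->attach_type = attach_type;
	free_slot->value = 0;
	free_slot->in_use = 1;
	return free_slot;
}

/* Detach: deliberately leaves the storage in place. */
static void model_detach(struct storage_map *map, uint64_t cgroup_id,
			 uint32_t attach_type)
{
	(void)map;
	(void)cgroup_id;
	(void)attach_type;
}

/* Cgroup release: every storage belonging to the cgroup goes away. */
static void model_cgroup_release(struct storage_map *map, uint64_t cgroup_id)
{
	for (int i = 0; i < MAX_STORAGES; i++)
		if (map->slots[i].in_use &&
		    map->slots[i].cgroup_id == cgroup_id)
			map->slots[i].in_use = 0;
}
```

In this model, a value written before a detach is still visible after a re-attach to the same cgroup and attach type, and only releasing the cgroup (or, not modeled here, freeing the map) yields a fresh zeroed storage.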

		The map is not associated with any BPF program, thus making sharing possible.
		However, the BPF program can still only associate with one map of each type
		(per-CPU and non-per-CPU). A BPF program cannot use more than one
		``BPF_MAP_TYPE_CGROUP_STORAGE`` or more than one
		``BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE``.

		In all versions, userspace may use the attach parameters (the cgroup and
		attach type pair) in ``struct bpf_cgroup_storage_key`` as the key to the BPF
		map APIs to read or update the storage for a given attachment. For
		attach-type shared storages in Linux 5.9 and later, only the first value in
		the struct, the cgroup inode id, is used during comparison, so userspace may
		just specify a ``__u64`` directly.

		The storage is bound at attach time. Even if the program is attached to a
		parent cgroup and triggers in a child, the storage still belongs to the parent.

		Userspace cannot create a new entry in the map or delete an existing entry.
		Program test runs always use a temporary storage.

include/linux/bpf-cgroup.h

+8 −4

		@@ -46,7 +46,8 @@ struct bpf_cgroup_storage {
		};
		struct bpf_cgroup_storage_map *map;
		struct bpf_cgroup_storage_key key;
		struct list_head list;
		struct list_head list_map;
		struct list_head list_cg;
		struct rb_node node;
		struct rcu_head rcu;
		};
		@@ -78,6 +79,9 @@ struct cgroup_bpf {
		struct list_head progs[MAX_BPF_ATTACH_TYPE];
		u32 flags[MAX_BPF_ATTACH_TYPE];

		/* list of cgroup shared storages */
		struct list_head storages;

		/* temp storage for effective prog array used by prog_attach/detach */
		struct bpf_prog_array *inactive;

		@@ -161,6 +165,9 @@ static inline void bpf_cgroup_storage_set(struct bpf_cgroup_storage
		this_cpu_write(bpf_cgroup_storage[stype], storage[stype]);
		}

		struct bpf_cgroup_storage *
		cgroup_storage_lookup(struct bpf_cgroup_storage_map *map,
		void *key, bool locked);
		struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
		enum bpf_cgroup_storage_type stype);
		void bpf_cgroup_storage_free(struct bpf_cgroup_storage *storage);
		@@ -169,7 +176,6 @@ void bpf_cgroup_storage_link(struct bpf_cgroup_storage *storage,
		enum bpf_attach_type type);
		void bpf_cgroup_storage_unlink(struct bpf_cgroup_storage *storage);
		int bpf_cgroup_storage_assign(struct bpf_prog_aux *aux, struct bpf_map *map);
		void bpf_cgroup_storage_release(struct bpf_prog_aux *aux, struct bpf_map *map);

		int bpf_percpu_cgroup_storage_copy(struct bpf_map *map, void *key, void *value);
		int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key,
		@@ -383,8 +389,6 @@ static inline void bpf_cgroup_storage_set(
		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE]) {}
		static inline int bpf_cgroup_storage_assign(struct bpf_prog_aux *aux,
		struct bpf_map *map) { return 0; }
		static inline void bpf_cgroup_storage_release(struct bpf_prog_aux *aux,
		struct bpf_map *map) {}
		static inline struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(
		struct bpf_prog *prog, enum bpf_cgroup_storage_type stype) { return NULL; }
		static inline void bpf_cgroup_storage_free(

kernel/bpf/cgroup.c

+39 −28

		@@ -37,17 +37,34 @@ static void bpf_cgroup_storages_free(struct bpf_cgroup_storage *storages[])
		}

		static int bpf_cgroup_storages_alloc(struct bpf_cgroup_storage *storages[],
		struct bpf_prog *prog)
		struct bpf_cgroup_storage *new_storages[],
		enum bpf_attach_type type,
		struct bpf_prog *prog,
		struct cgroup *cgrp)
		{
		enum bpf_cgroup_storage_type stype;
		struct bpf_cgroup_storage_key key;
		struct bpf_map *map;

		key.cgroup_inode_id = cgroup_id(cgrp);
		key.attach_type = type;

		for_each_cgroup_storage_type(stype) {
		map = prog->aux->cgroup_storage[stype];
		if (!map)
		continue;

		storages[stype] = cgroup_storage_lookup((void *)map, &key, false);
		if (storages[stype])
		continue;

		storages[stype] = bpf_cgroup_storage_alloc(prog, stype);
		if (IS_ERR(storages[stype])) {
		storages[stype] = NULL;
		bpf_cgroup_storages_free(storages);
		bpf_cgroup_storages_free(new_storages);
		return -ENOMEM;
		}

		new_storages[stype] = storages[stype];
		}

		return 0;
		@@ -72,14 +89,6 @@ static void bpf_cgroup_storages_link(struct bpf_cgroup_storage *storages[],
		bpf_cgroup_storage_link(storages[stype], cgrp, attach_type);
		}

		static void bpf_cgroup_storages_unlink(struct bpf_cgroup_storage *storages[])
		{
		enum bpf_cgroup_storage_type stype;

		for_each_cgroup_storage_type(stype)
		bpf_cgroup_storage_unlink(storages[stype]);
		}

		/* Called when bpf_cgroup_link is auto-detached from dying cgroup.
		* It drops cgroup and bpf_prog refcounts, and marks bpf_link as defunct. It
		* doesn't free link memory, which will eventually be done by bpf_link's
		@@ -101,22 +110,23 @@ static void cgroup_bpf_release(struct work_struct *work)
		struct cgroup *p, *cgrp = container_of(work, struct cgroup,
		bpf.release_work);
		struct bpf_prog_array *old_array;
		struct list_head *storages = &cgrp->bpf.storages;
		struct bpf_cgroup_storage *storage, *stmp;

		unsigned int type;

		mutex_lock(&cgroup_mutex);

		for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;
		struct bpf_prog_list *pl, *pltmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
		list_for_each_entry_safe(pl, pltmp, progs, node) {
		list_del(&pl->node);
		if (pl->prog)
		bpf_prog_put(pl->prog);
		if (pl->link)
		bpf_cgroup_link_auto_detach(pl->link);
		bpf_cgroup_storages_unlink(pl->storage);
		bpf_cgroup_storages_free(pl->storage);
		kfree(pl);
		static_branch_dec(&cgroup_bpf_enabled_key);
		}
		@@ -126,6 +136,11 @@ static void cgroup_bpf_release(struct work_struct *work)
		bpf_prog_array_free(old_array);
		}

		list_for_each_entry_safe(storage, stmp, storages, list_cg) {
		bpf_cgroup_storage_unlink(storage);
		bpf_cgroup_storage_free(storage);
		}

		mutex_unlock(&cgroup_mutex);

		for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		@@ -290,6 +305,8 @@ int cgroup_bpf_inherit(struct cgroup *cgrp)
		for (i = 0; i < NR; i++)
		INIT_LIST_HEAD(&cgrp->bpf.progs[i]);

		INIT_LIST_HEAD(&cgrp->bpf.storages);

		for (i = 0; i < NR; i++)
		if (compute_effective_progs(cgrp, i, &arrays[i]))
		goto cleanup;
		@@ -422,7 +439,7 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog *old_prog = NULL;
		struct bpf_cgroup_storage *storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
		struct bpf_cgroup_storage *old_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
		struct bpf_cgroup_storage *new_storage[MAX_BPF_CGROUP_STORAGE_TYPE] = {};
		struct bpf_prog_list *pl;
		int err;

		@@ -455,17 +472,16 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
		if (IS_ERR(pl))
		return PTR_ERR(pl);

		if (bpf_cgroup_storages_alloc(storage, prog ? : link->link.prog))
		if (bpf_cgroup_storages_alloc(storage, new_storage, type,
		prog ? : link->link.prog, cgrp))
		return -ENOMEM;

		if (pl) {
		old_prog = pl->prog;
		bpf_cgroup_storages_unlink(pl->storage);
		bpf_cgroup_storages_assign(old_storage, pl->storage);
		} else {
		pl = kmalloc(sizeof(*pl), GFP_KERNEL);
		if (!pl) {
		bpf_cgroup_storages_free(storage);
		bpf_cgroup_storages_free(new_storage);
		return -ENOMEM;
		}
		list_add_tail(&pl->node, progs);
		@@ -480,12 +496,11 @@ int __cgroup_bpf_attach(struct cgroup *cgrp,
		if (err)
		goto cleanup;

		bpf_cgroup_storages_free(old_storage);
		if (old_prog)
		bpf_prog_put(old_prog);
		else
		static_branch_inc(&cgroup_bpf_enabled_key);
		bpf_cgroup_storages_link(pl->storage, cgrp, type);
		bpf_cgroup_storages_link(new_storage, cgrp, type);
		return 0;

		cleanup:
		@@ -493,9 +508,7 @@ cleanup:
		pl->prog = old_prog;
		pl->link = NULL;
		}
		bpf_cgroup_storages_free(pl->storage);
		bpf_cgroup_storages_assign(pl->storage, old_storage);
		bpf_cgroup_storages_link(pl->storage, cgrp, type);
		bpf_cgroup_storages_free(new_storage);
		if (!old_prog) {
		list_del(&pl->node);
		kfree(pl);
		@@ -679,8 +692,6 @@ int __cgroup_bpf_detach(struct cgroup *cgrp, struct bpf_prog *prog,

		/* now can actually delete it from this cgroup list */
		list_del(&pl->node);
		bpf_cgroup_storages_unlink(pl->storage);
		bpf_cgroup_storages_free(pl->storage);
		kfree(pl);
		if (list_empty(progs))
		/* last program was detached, reset flags to zero */

kernel/bpf/core.c

+0 −12

		@@ -2097,24 +2097,12 @@ int bpf_prog_array_copy_info(struct bpf_prog_array *array,
		: 0;
		}

		static void bpf_free_cgroup_storage(struct bpf_prog_aux *aux)
		{
		enum bpf_cgroup_storage_type stype;

		for_each_cgroup_storage_type(stype) {
		if (!aux->cgroup_storage[stype])
		continue;
		bpf_cgroup_storage_release(aux, aux->cgroup_storage[stype]);
		}
		}

		void __bpf_free_used_maps(struct bpf_prog_aux *aux,
		struct bpf_map **used_maps, u32 len)
		{
		struct bpf_map *map;
		u32 i;

		bpf_free_cgroup_storage(aux);
		for (i = 0; i < len; i++) {
		map = used_maps[i];
		if (map->ops->map_poke_untrack)
