Merge tag 'for-linus-20190614' of git://git.kernel.dk/linux-block (7b103151)

Documentation/block/switching-sched.txt

+8 −10

		@@ -13,11 +13,9 @@ you can do so by typing:

		# mount none /sys -t sysfs

		As of the Linux 2.6.10 kernel, it is now possible to change the
		IO scheduler for a given block device on the fly (thus making it possible,
		for instance, to set the CFQ scheduler for the system default, but
		set a specific device to use the deadline or noop schedulers - which
		can improve that device's throughput).
		It is possible to change the IO scheduler for a given block device on
		the fly to select one of mq-deadline, none, bfq, or kyber schedulers -
		which can improve that device's throughput.

		To set a specific scheduler, simply do this:

		@@ -30,8 +28,8 @@ The list of defined schedulers can be found by simply doing
		a "cat /sys/block/DEV/queue/scheduler" - the list of valid names
		will be displayed, with the currently selected scheduler in brackets:

		# cat /sys/block/hda/queue/scheduler
		noop deadline [cfq]
		# echo deadline > /sys/block/hda/queue/scheduler
		# cat /sys/block/hda/queue/scheduler
		noop [deadline] cfq
		# cat /sys/block/sda/queue/scheduler
		[mq-deadline] kyber bfq none
		# echo none >/sys/block/sda/queue/scheduler
		# cat /sys/block/sda/queue/scheduler
		[none] mq-deadline kyber bfq
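
When scripting against the listing format shown above, the active scheduler is the name inside the brackets. A minimal sketch for extracting it follows. The helper name `active_scheduler` is hypothetical and not part of the kernel tree; it parses a string, so it can be tried without touching a real device.

```shell
#!/bin/sh
# Sketch only: active_scheduler is a hypothetical helper, not a kernel tool.
# It prints the bracketed (currently selected) name from a
# /sys/block/DEV/queue/scheduler listing passed as $1.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# The two listings from the updated documentation text:
active_scheduler '[mq-deadline] kyber bfq none'
active_scheduler '[none] mq-deadline kyber bfq'
```

On a live system the argument would be the file contents, for example `active_scheduler "$(cat /sys/block/sda/queue/scheduler)"`, assuming the device `sda` exists and sysfs is mounted.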

Documentation/cgroup-v1/blkio-controller.txt

+7 −89

		@@ -8,61 +8,13 @@ both at leaf nodes as well as at intermediate nodes in a storage hierarchy.
		Plan is to use the same cgroup based management interface for blkio controller
		and based on user options switch IO policies in the background.

		Currently two IO control policies are implemented. First one is proportional
		weight time based division of disk policy. It is implemented in CFQ. Hence
		this policy takes effect only on leaf nodes when CFQ is being used. The second
		one is throttling policy which can be used to specify upper IO rate limits
		on devices. This policy is implemented in generic block layer and can be
		used on leaf nodes as well as higher level logical devices like device mapper.
		One IO control policy is throttling policy which can be used to
		specify upper IO rate limits on devices. This policy is implemented in
		generic block layer and can be used on leaf nodes as well as higher
		level logical devices like device mapper.

		HOWTO
		=====
		Proportional Weight division of bandwidth
		-----------------------------------------
		You can do a very simple testing of running two dd threads in two different
		cgroups. Here is what you can do.

		- Enable Block IO controller
		CONFIG_BLK_CGROUP=y

		- Enable group scheduling in CFQ
		CONFIG_CFQ_GROUP_IOSCHED=y

		- Compile and boot into kernel and mount IO controller (blkio); see
		cgroups.txt, Why are cgroups needed?.

		mount -t tmpfs cgroup_root /sys/fs/cgroup
		mkdir /sys/fs/cgroup/blkio
		mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

		- Create two cgroups
		mkdir -p /sys/fs/cgroup/blkio/test1/ /sys/fs/cgroup/blkio/test2

		- Set weights of group test1 and test2
		echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
		echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight

		- Create two same size files (say 512MB each) on same disk (file1, file2) and
		launch two dd threads in different cgroup to read those files.

		sync
		echo 3 > /proc/sys/vm/drop_caches

		dd if=/mnt/sdb/zerofile1 of=/dev/null &
		echo $! > /sys/fs/cgroup/blkio/test1/tasks
		cat /sys/fs/cgroup/blkio/test1/tasks

		dd if=/mnt/sdb/zerofile2 of=/dev/null &
		echo $! > /sys/fs/cgroup/blkio/test2/tasks
		cat /sys/fs/cgroup/blkio/test2/tasks

		- At macro level, first dd should finish first. To get more precise data, keep
		on looking at (with the help of script), at blkio.disk_time and
		blkio.disk_sectors files of both test1 and test2 groups. This will tell how
		much disk time (in milliseconds), each group got and how many sectors each
		group dispatched to the disk. We provide fairness in terms of disk time, so
		ideally io.disk_time of cgroups should be in proportion to the weight.

		Throttling/Upper Limit policy
		-----------------------------
		- Enable Block IO controller
		@@ -94,7 +46,7 @@ Throttling/Upper Limit policy
		Hierarchical Cgroups
		====================

		Both CFQ and throttling implement hierarchy support; however,
		Throttling implements hierarchy support; however,
		throttling's hierarchy support is enabled iff "sane_behavior" is
		enabled from cgroup side, which currently is a development option and
		not publicly available.
		@@ -107,9 +59,8 @@ If somebody created a hierarchy like as follows.
		|
		test3

		CFQ by default and throttling with "sane_behavior" will handle the
		hierarchy correctly. For details on CFQ hierarchy support, refer to
		Documentation/block/cfq-iosched.txt. For throttling, all limits apply
		Throttling with "sane_behavior" will handle the
		hierarchy correctly. For throttling, all limits apply
		to the whole subtree while all statistics are local to the IOs
		directly generated by tasks in that cgroup.

		@@ -130,10 +81,6 @@ CONFIG_DEBUG_BLK_CGROUP
		- Debug help. Right now some additional stats file show up in cgroup
		if this option is enabled.

		CONFIG_CFQ_GROUP_IOSCHED
		- Enables group scheduling in CFQ. Currently only 1 level of group
		creation is allowed.

		CONFIG_BLK_DEV_THROTTLING
		- Enable block device throttling support in block layer.

		@@ -344,32 +291,3 @@ Common files among various policies
		- blkio.reset_stats
		- Writing an int to this file will result in resetting all the stats
		for that cgroup.

		CFQ sysfs tunable
		=================
		/sys/block/<disk>/queue/iosched/slice_idle
		------------------------------------------
		On a faster hardware CFQ can be slow, especially with sequential workload.
		This happens because CFQ idles on a single queue and single queue might not
		drive deeper request queue depths to keep the storage busy. In such scenarios
		one can try setting slice_idle=0 and that would switch CFQ to IOPS
		(IO operations per second) mode on NCQ supporting hardware.

		That means CFQ will not idle between cfq queues of a cfq group and hence be
		able to driver higher queue depth and achieve better throughput. That also
		means that cfq provides fairness among groups in terms of IOPS and not in
		terms of disk time.

		/sys/block/<disk>/queue/iosched/group_idle
		------------------------------------------
		If one disables idling on individual cfq queues and cfq service trees by
		setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
		on the group in an attempt to provide fairness among groups.

		By default group_idle is same as slice_idle and does not do anything if
		slice_idle is enabled.

		One can experience an overall throughput drop if you have created multiple
		groups and put applications in that group which are not driving enough
		IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
		on individual groups and throughput should improve.
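
With the CFQ material removed, throttling is the one policy this document still describes. A minimal sketch of building an upper-limit rule follows. It assumes the `<major>:<minor> <bytes_per_second>` rule format and the `blkio.throttle.read_bps_device` file, neither of which appears in the hunks above. The helper `throttle_rule`, the device numbers 8:16, and the cgroup path are illustrative only.

```shell
#!/bin/sh
# Sketch only: throttle_rule is a hypothetical helper.
# It formats a throttling rule as "<major>:<minor> <bytes_per_second>".
throttle_rule() {
    printf '%s:%s %s\n' "$1" "$2" "$3"
}

# A 1 MiB/s read limit for the (assumed) device 8:16.
rule=$(throttle_rule 8 16 $((1024 * 1024)))
echo "$rule"

# Applying the rule needs root and a mounted blkio hierarchy; the path
# below is an assumption, so the write is left commented out:
# echo "$rule" > /sys/fs/cgroup/blkio/test1/blkio.throttle.read_bps_device
```

Keeping the formatting separate from the write lets the rule be checked before anything is changed on the system.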

block/Kconfig

+1 −0

		@@ -73,6 +73,7 @@ config BLK_DEV_INTEGRITY

		config BLK_DEV_ZONED
		bool "Zoned block device support"
		select MQ_IOSCHED_DEADLINE
		---help---
		Block layer zoned block device support. This option enables
		support for ZAC/ZBC host-managed and host-aware zoned block devices.

block/blk-mq-debugfs.c

+34 −111

		@@ -821,38 +821,28 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
		{},
		};

		static bool debugfs_create_files(struct dentry *parent, void *data,
		static void debugfs_create_files(struct dentry *parent, void *data,
		const struct blk_mq_debugfs_attr *attr)
		{
		if (IS_ERR_OR_NULL(parent))
		return false;
		return;

		d_inode(parent)->i_private = data;

		for (; attr->name; attr++) {
		if (!debugfs_create_file(attr->name, attr->mode, parent,
		(void *)attr, &blk_mq_debugfs_fops))
		return false;
		}
		return true;
		for (; attr->name; attr++)
		debugfs_create_file(attr->name, attr->mode, parent,
		(void *)attr, &blk_mq_debugfs_fops);
		}

		int blk_mq_debugfs_register(struct request_queue *q)
		void blk_mq_debugfs_register(struct request_queue *q)
		{
		struct blk_mq_hw_ctx *hctx;
		int i;

		if (!blk_debugfs_root)
		return -ENOENT;

		q->debugfs_dir = debugfs_create_dir(kobject_name(q->kobj.parent),
		blk_debugfs_root);
		if (!q->debugfs_dir)
		return -ENOMEM;

		if (!debugfs_create_files(q->debugfs_dir, q,
		blk_mq_debugfs_queue_attrs))
		goto err;
		debugfs_create_files(q->debugfs_dir, q, blk_mq_debugfs_queue_attrs);

		/*
		* blk_mq_init_sched() attempted to do this already, but q->debugfs_dir
		@@ -864,11 +854,10 @@ int blk_mq_debugfs_register(struct request_queue *q)

		/* Similarly, blk_mq_init_hctx() couldn't do this previously. */
		queue_for_each_hw_ctx(q, hctx, i) {
		if (!hctx->debugfs_dir && blk_mq_debugfs_register_hctx(q, hctx))
		goto err;
		if (q->elevator && !hctx->sched_debugfs_dir &&
		blk_mq_debugfs_register_sched_hctx(q, hctx))
		goto err;
		if (!hctx->debugfs_dir)
		blk_mq_debugfs_register_hctx(q, hctx);
		if (q->elevator && !hctx->sched_debugfs_dir)
		blk_mq_debugfs_register_sched_hctx(q, hctx);
		}

		if (q->rq_qos) {
		@@ -879,12 +868,6 @@ int blk_mq_debugfs_register(struct request_queue *q)
		rqos = rqos->next;
		}
		}

		return 0;

		err:
		blk_mq_debugfs_unregister(q);
		return -ENOMEM;
		}

		void blk_mq_debugfs_unregister(struct request_queue *q)
		@@ -894,7 +877,7 @@ void blk_mq_debugfs_unregister(struct request_queue *q)
		q->debugfs_dir = NULL;
		}

		static int blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
		static void blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
		struct blk_mq_ctx *ctx)
		{
		struct dentry *ctx_dir;
		@@ -902,44 +885,24 @@ static int blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,

		snprintf(name, sizeof(name), "cpu%u", ctx->cpu);
		ctx_dir = debugfs_create_dir(name, hctx->debugfs_dir);
		if (!ctx_dir)
		return -ENOMEM;

		if (!debugfs_create_files(ctx_dir, ctx, blk_mq_debugfs_ctx_attrs))
		return -ENOMEM;

		return 0;
		debugfs_create_files(ctx_dir, ctx, blk_mq_debugfs_ctx_attrs);
		}

		int blk_mq_debugfs_register_hctx(struct request_queue *q,
		void blk_mq_debugfs_register_hctx(struct request_queue *q,
		struct blk_mq_hw_ctx *hctx)
		{
		struct blk_mq_ctx *ctx;
		char name[20];
		int i;

		if (!q->debugfs_dir)
		return -ENOENT;

		snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
		hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
		if (!hctx->debugfs_dir)
		return -ENOMEM;

		if (!debugfs_create_files(hctx->debugfs_dir, hctx,
		blk_mq_debugfs_hctx_attrs))
		goto err;
		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);

		hctx_for_each_ctx(hctx, ctx, i) {
		if (blk_mq_debugfs_register_ctx(hctx, ctx))
		goto err;
		}

		return 0;

		err:
		blk_mq_debugfs_unregister_hctx(hctx);
		return -ENOMEM;
		hctx_for_each_ctx(hctx, ctx, i)
		blk_mq_debugfs_register_ctx(hctx, ctx);
		}

		void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx)
		@@ -949,17 +912,13 @@ void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx)
		hctx->debugfs_dir = NULL;
		}

		int blk_mq_debugfs_register_hctxs(struct request_queue *q)
		void blk_mq_debugfs_register_hctxs(struct request_queue *q)
		{
		struct blk_mq_hw_ctx *hctx;
		int i;

		queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_debugfs_register_hctx(q, hctx))
		return -ENOMEM;
		}

		return 0;
		queue_for_each_hw_ctx(q, hctx, i)
		blk_mq_debugfs_register_hctx(q, hctx);
		}

		void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
		@@ -971,29 +930,16 @@ void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
		blk_mq_debugfs_unregister_hctx(hctx);
		}

		int blk_mq_debugfs_register_sched(struct request_queue *q)
		void blk_mq_debugfs_register_sched(struct request_queue *q)
		{
		struct elevator_type *e = q->elevator->type;

		if (!q->debugfs_dir)
		return -ENOENT;

		if (!e->queue_debugfs_attrs)
		return 0;
		return;

		q->sched_debugfs_dir = debugfs_create_dir("sched", q->debugfs_dir);
		if (!q->sched_debugfs_dir)
		return -ENOMEM;

		if (!debugfs_create_files(q->sched_debugfs_dir, q,
		e->queue_debugfs_attrs))
		goto err;

		return 0;

		err:
		blk_mq_debugfs_unregister_sched(q);
		return -ENOMEM;
		debugfs_create_files(q->sched_debugfs_dir, q, e->queue_debugfs_attrs);
		}

		void blk_mq_debugfs_unregister_sched(struct request_queue *q)
		@@ -1008,36 +954,22 @@ void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
		rqos->debugfs_dir = NULL;
		}

		int blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
		void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
		{
		struct request_queue *q = rqos->q;
		const char *dir_name = rq_qos_id_to_name(rqos->id);

		if (!q->debugfs_dir)
		return -ENOENT;

		if (rqos->debugfs_dir || !rqos->ops->debugfs_attrs)
		return 0;
		return;

		if (!q->rqos_debugfs_dir) {
		if (!q->rqos_debugfs_dir)
		q->rqos_debugfs_dir = debugfs_create_dir("rqos",
		q->debugfs_dir);
		if (!q->rqos_debugfs_dir)
		return -ENOMEM;
		}

		rqos->debugfs_dir = debugfs_create_dir(dir_name,
		rqos->q->rqos_debugfs_dir);
		if (!rqos->debugfs_dir)
		return -ENOMEM;

		if (!debugfs_create_files(rqos->debugfs_dir, rqos,
		rqos->ops->debugfs_attrs))
		goto err;
		return 0;
		err:
		blk_mq_debugfs_unregister_rqos(rqos);
		return -ENOMEM;
		debugfs_create_files(rqos->debugfs_dir, rqos, rqos->ops->debugfs_attrs);
		}

		void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
		@@ -1046,27 +978,18 @@ void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q)
		q->rqos_debugfs_dir = NULL;
		}

		int blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
		void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
		struct blk_mq_hw_ctx *hctx)
		{
		struct elevator_type *e = q->elevator->type;

		if (!hctx->debugfs_dir)
		return -ENOENT;

		if (!e->hctx_debugfs_attrs)
		return 0;
		return;

		hctx->sched_debugfs_dir = debugfs_create_dir("sched",
		hctx->debugfs_dir);
		if (!hctx->sched_debugfs_dir)
		return -ENOMEM;

		if (!debugfs_create_files(hctx->sched_debugfs_dir, hctx,
		e->hctx_debugfs_attrs))
		return -ENOMEM;

		return 0;
		debugfs_create_files(hctx->sched_debugfs_dir, hctx,
		e->hctx_debugfs_attrs);
		}

		void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
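
The registration functions above create directories named from the format strings "hctx%u" and "cpu%u" under the per-queue directory. A small sketch of the resulting path layout follows. It assumes debugfs is mounted at /sys/kernel/debug and that the block layer root directory is named `block`; neither is shown in this diff. The helper `debugfs_ctx_path` and the `DEBUGFS_ROOT` override are hypothetical.

```shell
#!/bin/sh
# Sketch only: debugfs_ctx_path is a hypothetical helper.
# It prints where a per-CPU context directory would be expected.
#   $1: disk name, $2: hardware queue number, $3: CPU number
# DEBUGFS_ROOT may override the assumed mount point /sys/kernel/debug.
debugfs_ctx_path() {
    printf '%s/block/%s/hctx%s/cpu%s\n' \
        "${DEBUGFS_ROOT:-/sys/kernel/debug}" "$1" "$2" "$3"
}

debugfs_ctx_path sda 0 0
```

Because the registration functions now return void, a missing directory is no longer reported to the caller, so a script that inspects these paths should check that they exist rather than assume they do.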

block/blk-mq-debugfs.h

+15 −21

		@@ -18,74 +18,68 @@ struct blk_mq_debugfs_attr {
		int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
		int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);

		int blk_mq_debugfs_register(struct request_queue *q);
		void blk_mq_debugfs_register(struct request_queue *q);
		void blk_mq_debugfs_unregister(struct request_queue *q);
		int blk_mq_debugfs_register_hctx(struct request_queue *q,
		void blk_mq_debugfs_register_hctx(struct request_queue *q,
		struct blk_mq_hw_ctx *hctx);
		void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx);
		int blk_mq_debugfs_register_hctxs(struct request_queue *q);
		void blk_mq_debugfs_register_hctxs(struct request_queue *q);
		void blk_mq_debugfs_unregister_hctxs(struct request_queue *q);

		int blk_mq_debugfs_register_sched(struct request_queue *q);
		void blk_mq_debugfs_register_sched(struct request_queue *q);
		void blk_mq_debugfs_unregister_sched(struct request_queue *q);
		int blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
		void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
		struct blk_mq_hw_ctx *hctx);
		void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx);

		int blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
		void blk_mq_debugfs_register_rqos(struct rq_qos *rqos);
		void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos);
		void blk_mq_debugfs_unregister_queue_rqos(struct request_queue *q);
		#else
		static inline int blk_mq_debugfs_register(struct request_queue *q)
		static inline void blk_mq_debugfs_register(struct request_queue *q)
		{
		return 0;
		}

		static inline void blk_mq_debugfs_unregister(struct request_queue *q)
		{
		}

		static inline int blk_mq_debugfs_register_hctx(struct request_queue *q,
		static inline void blk_mq_debugfs_register_hctx(struct request_queue *q,
		struct blk_mq_hw_ctx *hctx)
		{
		return 0;
		}

		static inline void blk_mq_debugfs_unregister_hctx(struct blk_mq_hw_ctx *hctx)
		{
		}

		static inline int blk_mq_debugfs_register_hctxs(struct request_queue *q)
		static inline void blk_mq_debugfs_register_hctxs(struct request_queue *q)
		{
		return 0;
		}

		static inline void blk_mq_debugfs_unregister_hctxs(struct request_queue *q)
		{
		}

		static inline int blk_mq_debugfs_register_sched(struct request_queue *q)
		static inline void blk_mq_debugfs_register_sched(struct request_queue *q)
		{
		return 0;
		}

		static inline void blk_mq_debugfs_unregister_sched(struct request_queue *q)
		{
		}

		static inline int blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
		static inline void blk_mq_debugfs_register_sched_hctx(struct request_queue *q,
		struct blk_mq_hw_ctx *hctx)
		{
		return 0;
		}

		static inline void blk_mq_debugfs_unregister_sched_hctx(struct blk_mq_hw_ctx *hctx)
		{
		}

		static inline int blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
		static inline void blk_mq_debugfs_register_rqos(struct rq_qos *rqos)
		{
		return 0;
		}

		static inline void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
