Commit 7ad67ca5 authored by Linus Torvalds

Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue)
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
parents 5260c2b8 9c7eddf1
+97 −0
@@ -1469,6 +1469,103 @@ IO Interface Files
	  8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
	  8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

  io.cost.qos
	A read-write nested-keyed file which exists only on the root
	cgroup.

	This file configures the Quality of Service of the IO cost
	model based controller (CONFIG_BLK_CGROUP_IOCOST) which
	currently implements "io.weight" proportional control.  Lines
	are keyed by $MAJ:$MIN device numbers and not ordered.  The
	line for a given device is populated on the first write for
	the device on "io.cost.qos" or "io.cost.model".  The following
	nested keys are defined.

	  ======	=====================================
	  enable	Weight-based control enable
	  ctrl		"auto" or "user"
	  rpct		Read latency percentile    [0, 100]
	  rlat		Read latency threshold
	  wpct		Write latency percentile   [0, 100]
	  wlat		Write latency threshold
	  min		Minimum scaling percentage [1, 10000]
	  max		Maximum scaling percentage [1, 10000]
	  ======	=====================================

	The controller is disabled by default and can be enabled by
	setting "enable" to 1.  "rpct" and "wpct" parameters default
	to zero and the controller uses internal device saturation
	state to adjust the overall IO rate between "min" and "max".

	When a better control quality is needed, latency QoS
	parameters can be configured.  For example::

	  8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0

	shows that on sdb, the controller is enabled, will consider
	the device saturated if the 95th percentile of read completion
	latencies is above 75ms or write 150ms, and adjust the overall
	IO issue rate between 50% and 150% accordingly.

	The lower the saturation point, the better the latency QoS at
	the cost of aggregate bandwidth.  The narrower the allowed
	adjustment range between "min" and "max", the more conformant
	to the cost model the IO behavior.  Note that the IO issue
	base rate may be far off from 100% and setting "min" and "max"
	blindly can lead to a significant loss of device capacity or
	control quality.  "min" and "max" are useful for regulating
	devices which show wide temporary behavior changes - e.g. an
	SSD which accepts writes at the line speed for a while and
	then completely stalls for multiple seconds.

	When "ctrl" is "auto", the parameters are controlled by the
	kernel and may change automatically.  Setting "ctrl" to "user"
	or setting any of the percentile and latency parameters puts
	it into "user" mode and disables the automatic changes.  The
	automatic mode can be restored by setting "ctrl" to "auto".
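
The nested-keyed format described above can be read mechanically. The following is a minimal sketch, not part of the commit, that splits the example line into its device key and parameters and restates them in the terms the text uses. The function names are hypothetical, and the conversion of "rlat"/"wlat" to milliseconds assumes the values are in microseconds, which is inferred only from the text pairing 75000 with 75ms.

```python
# Example line taken verbatim from the documentation text above.
EXAMPLE_LINE = (
    "8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 "
    "wpct=95.00 wlat=150000 min=50.00 max=150.0"
)


def parse_nested_keyed_line(line):
    """Split a '$MAJ:$MIN key=value ...' line into (device, {key: value})."""
    device, *pairs = line.split()
    params = dict(pair.split("=", 1) for pair in pairs)
    return device, params


def describe_qos(line):
    """Restate one io.cost.qos line in the terms used by the description.

    Assumption: rlat/wlat are microseconds (75000 is described as 75ms).
    """
    device, p = parse_nested_keyed_line(line)
    return {
        "device": device,
        "enabled": p["enable"] == "1",
        "ctrl": p["ctrl"],
        # (percentile, latency threshold in milliseconds)
        "read_saturation": (float(p["rpct"]), int(p["rlat"]) / 1000.0),
        "write_saturation": (float(p["wpct"]), int(p["wlat"]) / 1000.0),
        # (min, max) scaling percentages for the overall IO issue rate
        "scaling_range_pct": (float(p["min"]), float(p["max"])),
    }


if __name__ == "__main__":
    for key, value in describe_qos(EXAMPLE_LINE).items():
        print(f"{key}: {value}")
```

Writing such a line back would mean writing it to the "io.cost.qos" file of the root cgroup. Where that file lives depends on where cgroup2 is mounted on the system, so the sketch deliberately leaves the path out.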

  io.cost.model
	A read-write nested-keyed file which exists only on the root
	cgroup.

	This file configures the cost model of the IO cost model based
	controller (CONFIG_BLK_CGROUP_IOCOST) which currently
	implements "io.weight" proportional control.  Lines are keyed
	by $MAJ:$MIN device numbers and not ordered.  The line for a
	given device is populated on the first write for the device on
	"io.cost.qos" or "io.cost.model".  The following nested keys
	are defined.

	  =====		================================
	  ctrl		"auto" or "user"
	  model		The cost model in use - "linear"
	  =====		================================

	When "ctrl" is "auto", the kernel may change all parameters
	dynamically.  When "ctrl" is set to "user" or any other
	parameters are written to, "ctrl" becomes "user" and the
	automatic changes are disabled.

	When "model" is "linear", the following model parameters are
	defined.

	  =============	========================================
	  [r|w]bps	The maximum sequential IO throughput
	  [r|w]seqiops	The maximum 4k sequential IOs per second
	  [r|w]randiops	The maximum 4k random IOs per second
	  =============	========================================

	From the above, the builtin linear model determines the base
	costs of a sequential and random IO and the cost coefficient
	for the IO size.  While simple, this model can cover most
	common device classes acceptably.

	The IO cost model isn't expected to be accurate in an absolute
	sense and is scaled to the device behavior dynamically.

	If needed, tools/cgroup/iocost_coef_gen.py can be used to
	generate device-specific coefficients.
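
To make the idea of "base costs of a sequential and random IO and the cost coefficient for the IO size" concrete, here is a rough sketch of how a linear model could be derived from the three parameters. It is an illustration under stated assumptions, not the kernel's actual arithmetic: costs are expressed as seconds of device time, the per-4k size coefficient is taken from "bps", and the base costs are whatever remains of a 4k IO at the stated IOPS. All function names are hypothetical.

```python
PAGE_SIZE = 4096  # the IOPS parameters above are stated for 4k IOs


def linear_coefficients(bps, seqiops, randiops):
    """Derive (page_cost, seq_base, rand_base) in seconds of device time.

    Assumption: a 4k IO at the maximum rate costs 1/iops seconds in total,
    of which PAGE_SIZE/bps is attributed to transferring the data.
    """
    page_cost = PAGE_SIZE / bps
    seq_base = max(1.0 / seqiops - page_cost, 0.0)
    rand_base = max(1.0 / randiops - page_cost, 0.0)
    return page_cost, seq_base, rand_base


def io_cost(nbytes, sequential, coefs):
    """Cost of one IO: a base cost plus a size-proportional term."""
    page_cost, seq_base, rand_base = coefs
    base = seq_base if sequential else rand_base
    return base + (nbytes / PAGE_SIZE) * page_cost


if __name__ == "__main__":
    # Made-up device figures, chosen only to keep the numbers readable.
    coefs = linear_coefficients(bps=4096000, seqiops=500, randiops=100)
    print("4k sequential:", io_cost(4096, True, coefs))
    print("4k random:    ", io_cost(4096, False, coefs))
    print("64k sequential:", io_cost(65536, True, coefs))
```

With this construction a 4k sequential IO costs exactly 1/seqiops and a 4k random IO exactly 1/randiops, while larger IOs grow linearly with size. That matches the description's point that the model is simple and only needs to be accurate in relative terms.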

  io.weight
	A read-write flat-keyed file which exists on non-root cgroups.
	The default is "default 100".
+0 −6
@@ -1201,12 +1201,6 @@
			See comment before function elanfreq_setup() in
			arch/x86/kernel/cpu/cpufreq/elanfreq.c.

	elevator=	[IOSCHED]
			Format: { "mq-deadline" | "kyber" | "bfq" }
			See Documentation/block/deadline-iosched.rst,
			Documentation/block/kyber-iosched.rst and
			Documentation/block/bfq-iosched.rst for details.

	elfcorehdr=[size[KMG]@]offset[KMG] [IA64,PPC,SH,X86,S390]
			Specifies physical address of start of kernel core
			image elf header and optionally the size. Generally
+3 −5
@@ -274,9 +274,7 @@ To reduce its OS jitter, do any of the following:
		(based on an earlier one from Gilad Ben-Yossef) that
		reduces or even eliminates vmstat overhead for some
		workloads at https://lkml.org/lkml/2013/9/4/379.
	e.	Boot with "elevator=noop" to avoid workqueue use by
		the block layer.
	f.	If running on high-end powerpc servers, build with
	e.	If running on high-end powerpc servers, build with
		CONFIG_PPC_RTAS_DAEMON=n.  This prevents the RTAS
		daemon from running on each CPU every second or so.
		(This will require editing Kconfig files and will defeat
@@ -284,12 +282,12 @@ To reduce its OS jitter, do any of the following:
		due to the rtas_event_scan() function.
		WARNING:  Please check your CPU specifications to
		make sure that this is safe on your particular system.
	g.	If running on Cell Processor, build your kernel with
	f.	If running on Cell Processor, build your kernel with
		CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from
		spu_gov_work().
		WARNING:  Please check your CPU specifications to
		make sure that this is safe on your particular system.
	h.	If running on PowerMAC, build your kernel with
	g.	If running on PowerMAC, build your kernel with
		CONFIG_PMAC_RACKMETER=n to disable the CPU-meter,
		avoiding OS jitter from rackmeter_do_timer().

+18 −15
.. SPDX-License-Identifier: GPL-2.0

========================
Null block device driver
========================

1. Overview
===========
Overview
========

The null block device (/dev/nullb*) is used for benchmarking the various
The null block device (``/dev/nullb*``) is used for benchmarking the various
block-layer implementations. It emulates a block device of X gigabytes in size.
The following instances are possible:

  Single-queue block-layer

    - Request-based.
    - Single submission queue per device.
    - Implements IO scheduling algorithms (CFQ, Deadline, noop).
It does not execute any read/write operation, just mark them as complete in
the request queue. The following instances are possible:

  Multi-queue block-layer

@@ -27,15 +24,15 @@ The following instances are possible:

All of them have a completion queue for each core in the system.

2. Module parameters applicable for all instances
=================================================
Module parameters
=================

queue_mode=[0-2]: Default: 2-Multi-queue
  Selects which block-layer the module should instantiate with.

  =  ============
  0  Bio-based
  1  Single-queue
  1  Single-queue (deprecated)
  2  Multi-queue
  =  ============

@@ -67,7 +64,7 @@ irqmode=[0-2]: Default: 1-Soft-irq
completion_nsec=[ns]: Default: 10,000ns
  Combined with irqmode=2 (timer). The time each completion event must wait.

submit_queues=[1..nr_cpus]:
submit_queues=[1..nr_cpus]: Default: 1
  The number of submission queues attached to the device driver. If unset, it
  defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
  parameter is 1.
@@ -75,9 +72,11 @@ submit_queues=[1..nr_cpus]:
hw_queue_depth=[0..qdepth]: Default: 64
  The hardware queue depth of the device.

III: Multi-queue specific parameters
Multi-queue specific parameters
-------------------------------

use_per_node_hctx=[0/1]: Default: 0
  Number of hardware context queues.

  =  =====================================================================
  0  The number of submit queues are set to the value of the submit_queues
@@ -87,6 +86,7 @@ use_per_node_hctx=[0/1]: Default: 0
  =  =====================================================================

no_sched=[0/1]: Default: 0
  Enable/disable the io scheduler.

  =  ======================================
  0  nullb* use default blk-mq io scheduler
@@ -94,6 +94,7 @@ no_sched=[0/1]: Default: 0
  =  ======================================

blocking=[0/1]: Default: 0
  Blocking behavior of the request queue.

  =  ===============================================================
  0  Register as a non-blocking blk-mq driver device.
@@ -103,6 +104,7 @@ blocking=[0/1]: Default: 0
  =  ===============================================================

shared_tags=[0/1]: Default: 0
  Sharing tags between devices.

  =  ================================================================
  0  Tag set is not shared.
@@ -111,6 +113,7 @@ shared_tags=[0/1]: Default: 0
  =  ================================================================

zoned=[0/1]: Default: 0
  Device is a random-access or a zoned block device.

  =  ======================================================================
  0  Block device is exposed as a random-access block device.
+0 −4
@@ -2,10 +2,6 @@
Switching Scheduler
===================

To choose IO schedulers at boot time, use the argument 'elevator=deadline'.
'noop' and 'cfq' (the default) are also available. IO schedulers are assigned
globally at boot time only presently.

Each io queue has a set of io scheduler tunables associated with it. These
tunables control how the io scheduler works. You can find these entries
in::