Commit 58ef57b1 authored by Ingo Molnar

Merge branch 'core/rcu' into sched/core, to pick up dependency



We are going to rely on the loosening of RCU callback semantics,
introduced by this commit:

  806f04e9: ("rcu: Allow for smp_call_function() running callbacks from idle")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
parents 498bdcdb 806f04e9
+16 −45
@@ -1943,56 +1943,27 @@ invoked from a CPU-hotplug notifier.
Scheduler and RCU
~~~~~~~~~~~~~~~~~

RCU depends on the scheduler, and the scheduler uses RCU to protect some
of its data structures. The preemptible-RCU ``rcu_read_unlock()``
implementation must therefore be written carefully to avoid deadlocks
involving the scheduler's runqueue and priority-inheritance locks. In
particular, ``rcu_read_unlock()`` must tolerate an interrupt where the
interrupt handler invokes both ``rcu_read_lock()`` and
``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()``
to use negative nesting levels to avoid destructive recursion via the
interrupt handler's use of RCU.
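
A minimal sketch of the scenario (the handler and data-structure names are
hypothetical)::

	static irqreturn_t my_irq_handler(int irq, void *dev_id)
	{
		struct my_data *p;

		/* May run even while the interrupted task is inside rcu_read_unlock(). */
		rcu_read_lock();
		p = rcu_dereference(my_gp);
		if (p)
			my_handle(p);
		rcu_read_unlock();
		return IRQ_HANDLED;
	}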

This scheduler-RCU requirement came as a `complete
surprise <https://lwn.net/Articles/453002/>`__.

RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
accumulation by these kthreads. This requirement was no surprise, but
RCU's violation of it when running context-switch-heavy workloads on
kernels built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
RCU has made good progress towards meeting this requirement, even for
context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
room for further improvement.

It is forbidden to hold any of the scheduler's runqueue or
priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless
interrupts have been disabled across the entire RCU read-side critical
section, that is, up to and including the matching ``rcu_read_lock()``.
Violating this restriction can result in deadlocks involving these
scheduler spinlocks. There was hope that this restriction might be
lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started
deferring the reporting of the resulting RCU-preempt quiescent state
until the end of the corresponding interrupts-disabled region.
Unfortunately, timely reporting of the corresponding quiescent state to
expedited grace periods requires a call to ``raise_softirq()``, which
can acquire these scheduler spinlocks. In addition, real-time systems
using RCU priority boosting need this restriction to remain in effect
because deferred quiescent-state reporting would also defer deboosting,
which in turn would degrade real-time latencies.
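
Under this rule, holding one of these locks across ``rcu_read_unlock()``
requires a structure along the following lines (the lock shown is a
hypothetical ``raw_spinlock_t`` standing in for a scheduler runqueue or
priority-inheritance lock)::

	unsigned long flags;

	local_irq_save(flags);			/* IRQs off before rcu_read_lock(). */
	rcu_read_lock();
	/* ... RCU-protected accesses ... */
	raw_spin_lock(&my_pi_lock);		/* Hypothetical scheduler/PI lock. */
	rcu_read_unlock();			/* Legal only because IRQs remain off. */
	raw_spin_unlock(&my_pi_lock);
	local_irq_restore(flags);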

In theory, if a given RCU read-side critical section could be guaranteed
to be less than one second in duration, holding a scheduler spinlock
across that critical section's ``rcu_read_unlock()`` would require only
that preemption be disabled across the entire RCU read-side critical
section, not interrupts. Unfortunately, given the possibility of vCPU
preemption, long-running interrupts, and so on, it is not possible in
practice to guarantee that a given RCU read-side critical section will
complete in less than one second. Therefore, as noted above, if
scheduler spinlocks are held across a given call to
``rcu_read_unlock()``, interrupts must be disabled across the entire RCU
read-side critical section.
There is no longer any prohibition against holding any of
the scheduler's runqueue or priority-inheritance spinlocks across an
``rcu_read_unlock()``, even if interrupts and preemption were enabled
somewhere within the corresponding RCU read-side critical section.
Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
with preemption enabled, acquire one of the scheduler locks, and hold
that lock across the matching ``rcu_read_unlock()``.
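
A minimal sketch of this now-permitted pattern (the lock is again a
hypothetical stand-in for a scheduler runqueue or priority-inheritance
lock)::

	unsigned long flags;

	rcu_read_lock();			/* Preemption and IRQs may be enabled. */
	/* ... RCU-protected accesses ... */
	raw_spin_lock_irqsave(&my_pi_lock, flags);
	rcu_read_unlock();			/* Now legal while holding the lock. */
	raw_spin_unlock_irqrestore(&my_pi_lock, flags);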

Similarly, the RCU flavor consolidation has removed the need for negative
nesting.  The fact that interrupt-disabled regions of code act as RCU
read-side critical sections implicitly avoids earlier issues that used
to result in destructive recursion via an interrupt handler's use of RCU.
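
As an illustration, an interrupts-disabled region can now stand on its own
as an RCU reader (only the RCU and irq primitives below are real; the other
names are hypothetical)::

	unsigned long flags;
	struct my_data *p;

	local_irq_save(flags);		/* Acts as an RCU read-side critical section. */
	p = rcu_dereference_check(my_gp, irqs_disabled());
	if (p)
		my_handle(p);
	local_irq_restore(flags);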

Tracing and RCU
~~~~~~~~~~~~~~~
+19 −0
@@ -4210,12 +4210,24 @@
			Duration of CPU stall (s) to test RCU CPU stall
			warnings, zero to disable.

	rcutorture.stall_cpu_block= [KNL]
			Sleep while stalling if set.  This will result
			in warnings from preemptible RCU in addition
			to any other stall-related activity.

	rcutorture.stall_cpu_holdoff= [KNL]
			Time to wait (s) after boot before inducing stall.

	rcutorture.stall_cpu_irqsoff= [KNL]
			Disable interrupts while stalling if set.

	rcutorture.stall_gp_kthread= [KNL]
			Duration (s) of forced sleep within RCU
			grace-period kthread to test RCU CPU stall
			warnings, zero to disable.  If both stall_cpu
			and stall_gp_kthread are specified, the
			kthread is starved first, then the CPU.

	rcutorture.stat_interval= [KNL]
			Time (s) between statistics printk()s.
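
	For illustration, a hypothetical test run that forces a 30-second
	blocking CPU stall starting 60 seconds after boot might combine the
	stall options above on the kernel command line:

		rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.stall_cpu_holdoff=60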

@@ -4286,6 +4298,13 @@
			only normal grace-period primitives.  No effect
			on CONFIG_TINY_RCU kernels.

	rcupdate.rcu_task_ipi_delay= [KNL]
			Set time in jiffies during which RCU tasks will
			avoid sending IPIs, starting with the beginning
			of a given grace period.  Setting a large
			number avoids disturbing real-time workloads,
			but lengthens grace periods.
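
			As a purely illustrative example, booting with
			rcupdate.rcu_task_ipi_delay=500 suppresses these
			IPIs for the first 500 jiffies of each RCU-tasks
			grace period.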

	rcupdate.rcu_task_stall_timeout= [KNL]
			Set timeout in jiffies for RCU task stall warning
			messages.  Disable with a value less than or equal
+0 −8
@@ -229,14 +229,6 @@ Adding support for it is easy: just define the macro in asm/ftrace.h and
pass the return address pointer as the 'retp' argument to
ftrace_push_return_trace().

HAVE_FTRACE_NMI_ENTER
---------------------

If you can't trace NMI functions, then skip this option.

<details to be filled>


HAVE_SYSCALL_TRACEPOINTS
------------------------

+59 −19
@@ -32,29 +32,69 @@ u64 smp_irq_stat_cpu(unsigned int cpu);

struct nmi_ctx {
	u64 hcr;
	unsigned int cnt;
};

DECLARE_PER_CPU(struct nmi_ctx, nmi_contexts);

#define arch_nmi_enter()						\
do {									\
		if (is_kernel_in_hyp_mode()) {					\
			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
			nmi_ctx->hcr = read_sysreg(hcr_el2);			\
			if (!(nmi_ctx->hcr & HCR_TGE)) {			\
				write_sysreg(nmi_ctx->hcr | HCR_TGE, hcr_el2);	\
				isb();						\
	struct nmi_ctx *___ctx;						\
	u64 ___hcr;							\
									\
	if (!is_kernel_in_hyp_mode())					\
		break;							\
									\
	___ctx = this_cpu_ptr(&nmi_contexts);				\
	if (___ctx->cnt) {						\
		___ctx->cnt++;						\
		break;							\
	}								\
									\
	___hcr = read_sysreg(hcr_el2);					\
	if (!(___hcr & HCR_TGE)) {					\
		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
		isb();							\
	}								\
	/*								\
	 * Make sure the sysreg write is performed before ___ctx->cnt	\
	 * is set to 1. NMIs that see cnt == 1 will rely on us.		\
	 */								\
	barrier();							\
	___ctx->cnt = 1;                                                \
	/*								\
	 * Make sure ___ctx->cnt is set before we save ___hcr. We	\
	 * don't want ___ctx->hcr to be overwritten.			\
	 */								\
	barrier();							\
	___ctx->hcr = ___hcr;						\
} while (0)

#define arch_nmi_exit()							\
do {									\
		if (is_kernel_in_hyp_mode()) {					\
			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
			if (!(nmi_ctx->hcr & HCR_TGE))				\
				write_sysreg(nmi_ctx->hcr, hcr_el2);		\
		}								\
	struct nmi_ctx *___ctx;						\
	u64 ___hcr;							\
									\
	if (!is_kernel_in_hyp_mode())					\
		break;							\
									\
	___ctx = this_cpu_ptr(&nmi_contexts);				\
	___hcr = ___ctx->hcr;						\
	/*								\
	 * Make sure we read ___ctx->hcr before we release		\
	 * ___ctx->cnt as it makes ___ctx->hcr updatable again.		\
	 */								\
	barrier();							\
	___ctx->cnt--;							\
	/*								\
	 * Make sure ___ctx->cnt release is visible before we		\
	 * restore the sysreg. Otherwise a new NMI occurring		\
	 * right after write_sysreg() can be fooled and think		\
	 * we secured things for it.					\
	 */								\
	barrier();							\
	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
		write_sysreg(___hcr, hcr_el2);				\
} while (0)

static inline void ack_bad_irq(unsigned int irq)
+2 −12
@@ -251,21 +251,11 @@ asmlinkage __kprobes notrace unsigned long
__sdei_handler(struct pt_regs *regs, struct sdei_registered_event *arg)
{
	unsigned long ret;
	bool do_nmi_exit = false;

	/*
	 * nmi_enter() deals with printk() re-entrance and use of RCU when
	 * RCU believed this CPU was idle. Because critical events can
	 * interrupt normal events, we may already be in_nmi().
	 */
	if (!in_nmi()) {
	nmi_enter();
		do_nmi_exit = true;
	}

	ret = _sdei_handler(regs, arg);

	if (do_nmi_exit)
	nmi_exit();

	return ret;