Commit 58ef57b1 authored by Ingo Molnar

Merge branch 'core/rcu' into sched/core, to pick up dependency



We are going to rely on the loosening of RCU callback semantics,
introduced by this commit:

  806f04e9: ("rcu: Allow for smp_call_function() running callbacks from idle")

Signed-off-by: Ingo Molnar <mingo@kernel.org>
parents 498bdcdb 806f04e9
+16 −45
@@ -1943,56 +1943,27 @@ invoked from a CPU-hotplug notifier.
Scheduler and RCU
~~~~~~~~~~~~~~~~~

RCU depends on the scheduler, and the scheduler uses RCU to protect some
of its data structures. The preemptible-RCU ``rcu_read_unlock()``
implementation must therefore be written carefully to avoid deadlocks
involving the scheduler's runqueue and priority-inheritance locks. In
particular, ``rcu_read_unlock()`` must tolerate an interrupt where the
interrupt handler invokes both ``rcu_read_lock()`` and
``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()``
to use negative nesting levels to avoid destructive recursion via the
interrupt handler's use of RCU.
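
A minimal sketch of the scenario (the handler and data-structure names are
hypothetical)::

	static irqreturn_t my_irq_handler(int irq, void *dev_id)
	{
		struct my_data *p;

		/* May run even while the interrupted task is inside rcu_read_unlock(). */
		rcu_read_lock();
		p = rcu_dereference(my_gp);
		if (p)
			my_handle(p);
		rcu_read_unlock();
		return IRQ_HANDLED;
	}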

This scheduler-RCU requirement came as a `complete
surprise <https://lwn.net/Articles/453002/>`__.

RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
accumulation by these kthreads. This requirement was no surprise, but
RCU's violation of it when running context-switch-heavy workloads on
kernels built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
[PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
RCU has made good progress towards meeting this requirement, even for
context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
room for further improvement.

It is forbidden to hold any of the scheduler's runqueue or
priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless
interrupts have been disabled across the entire RCU read-side critical
section, that is, up to and including the matching ``rcu_read_lock()``.
Violating this restriction can result in deadlocks involving these
scheduler spinlocks. There was hope that this restriction might be
lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started
deferring the reporting of the resulting RCU-preempt quiescent state
until the end of the corresponding interrupts-disabled region.
Unfortunately, timely reporting of the corresponding quiescent state to
expedited grace periods requires a call to ``raise_softirq()``, which
can acquire these scheduler spinlocks. In addition, real-time systems
using RCU priority boosting need this restriction to remain in effect
because deferred quiescent-state reporting would also defer deboosting,
which in turn would degrade real-time latencies.
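
Under this rule, holding one of these locks across ``rcu_read_unlock()``
requires a structure along the following lines (the lock shown is a
hypothetical ``raw_spinlock_t`` standing in for a scheduler runqueue or
priority-inheritance lock)::

	unsigned long flags;

	local_irq_save(flags);			/* IRQs off before rcu_read_lock(). */
	rcu_read_lock();
	/* ... RCU-protected accesses ... */
	raw_spin_lock(&my_pi_lock);		/* Hypothetical scheduler/PI lock. */
	rcu_read_unlock();			/* Legal only because IRQs remain off. */
	raw_spin_unlock(&my_pi_lock);
	local_irq_restore(flags);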

In theory, if a given RCU read-side critical section could be guaranteed
to be less than one second in duration, holding a scheduler spinlock
across that critical section's ``rcu_read_unlock()`` would require only
that preemption be disabled across the entire RCU read-side critical
section, not interrupts. Unfortunately, given the possibility of vCPU
preemption, long-running interrupts, and so on, it is not possible in
practice to guarantee that a given RCU read-side critical section will
complete in less than one second. Therefore, as noted above, if
scheduler spinlocks are held across a given call to
``rcu_read_unlock()``, interrupts must be disabled across the entire RCU
read-side critical section.
There is no longer any prohibition against holding any of
the scheduler's runqueue or priority-inheritance spinlocks across an
``rcu_read_unlock()``, even if interrupts and preemption were enabled
somewhere within the corresponding RCU read-side critical section.
Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
with preemption enabled, acquire one of the scheduler locks, and hold
that lock across the matching ``rcu_read_unlock()``.
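
A minimal sketch of this now-permitted pattern (the lock is again a
hypothetical stand-in for a scheduler runqueue or priority-inheritance
lock)::

	unsigned long flags;

	rcu_read_lock();			/* Preemption and IRQs may be enabled. */
	/* ... RCU-protected accesses ... */
	raw_spin_lock_irqsave(&my_pi_lock, flags);
	rcu_read_unlock();			/* Now legal while holding the lock. */
	raw_spin_unlock_irqrestore(&my_pi_lock, flags);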

Similarly, the RCU flavor consolidation has removed the need for negative
nesting.  The fact that interrupt-disabled regions of code act as RCU
read-side critical sections implicitly avoids earlier issues that used
to result in destructive recursion via an interrupt handler's use of RCU.
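
As an illustration, an interrupts-disabled region can now stand on its own
as an RCU reader (only the RCU and irq primitives below are real; the other
names are hypothetical)::

	unsigned long flags;
	struct my_data *p;

	local_irq_save(flags);		/* Acts as an RCU read-side critical section. */
	p = rcu_dereference_check(my_gp, irqs_disabled());
	if (p)
		my_handle(p);
	local_irq_restore(flags);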

Tracing and RCU
~~~~~~~~~~~~~~~
+19 −0
@@ -4210,12 +4210,24 @@
			Duration of CPU stall (s) to test RCU CPU stall
			warnings, zero to disable.

	rcutorture.stall_cpu_block= [KNL]
			Sleep while stalling if set.  This will result
			in warnings from preemptible RCU in addition
			to any other stall-related activity.

	rcutorture.stall_cpu_holdoff= [KNL]
			Time to wait (s) after boot before inducing stall.

	rcutorture.stall_cpu_irqsoff= [KNL]
			Disable interrupts while stalling if set.

	rcutorture.stall_gp_kthread= [KNL]
			Duration (s) of forced sleep within RCU
			grace-period kthread to test RCU CPU stall
			warnings, zero to disable.  If both stall_cpu
			and stall_gp_kthread are specified, the
			kthread is starved first, then the CPU.

	rcutorture.stat_interval= [KNL]
			Time (s) between statistics printk()s.
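
	For illustration, a hypothetical test run that forces a 30-second
	blocking CPU stall starting 60 seconds after boot might combine the
	stall options above on the kernel command line:

		rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.stall_cpu_holdoff=60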

@@ -4286,6 +4298,13 @@
			only normal grace-period primitives.  No effect
			on CONFIG_TINY_RCU kernels.

	rcupdate.rcu_task_ipi_delay= [KNL]
			Set time in jiffies during which RCU tasks will
			avoid sending IPIs, starting with the beginning
			of a given grace period.  Setting a large
			number avoids disturbing real-time workloads,
			but lengthens grace periods.
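
			As a purely illustrative example, booting with
			rcupdate.rcu_task_ipi_delay=500 suppresses these
			IPIs for the first 500 jiffies of each RCU-tasks
			grace period.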

	rcupdate.rcu_task_stall_timeout= [KNL]
			Set timeout in jiffies for RCU task stall warning
			messages.  Disable with a value less than or equal
+0 −8
@@ -229,14 +229,6 @@ Adding support for it is easy: just define the macro in asm/ftrace.h and
pass the return address pointer as the 'retp' argument to
ftrace_push_return_trace().

HAVE_FTRACE_NMI_ENTER
---------------------

If you can't trace NMI functions, then skip this option.

<details to be filled>


HAVE_SYSCALL_TRACEPOINTS
------------------------

+59 −19
@@ -32,29 +32,69 @@ u64 smp_irq_stat_cpu(unsigned int cpu);

struct nmi_ctx {
	u64 hcr;
	unsigned int cnt;
};

DECLARE_PER_CPU(struct nmi_ctx, nmi_contexts);

#define arch_nmi_enter()						\
do {									\
		if (is_kernel_in_hyp_mode()) {					\
			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
			nmi_ctx->hcr = read_sysreg(hcr_el2);			\
			if (!(nmi_ctx->hcr & HCR_TGE)) {			\
				write_sysreg(nmi_ctx->hcr | HCR_TGE, hcr_el2);	\
				isb();						\
	struct nmi_ctx *___ctx;						\
	u64 ___hcr;							\
									\
	if (!is_kernel_in_hyp_mode())					\
		break;							\
									\
	___ctx = this_cpu_ptr(&nmi_contexts);				\
	if (___ctx->cnt) {						\
		___ctx->cnt++;						\
		break;							\
	}								\
									\
	___hcr = read_sysreg(hcr_el2);					\
	if (!(___hcr & HCR_TGE)) {					\
		write_sysreg(___hcr | HCR_TGE, hcr_el2);		\
		isb();							\
	}								\
	/*								\
	 * Make sure the sysreg write is performed before ___ctx->cnt	\
	 * is set to 1. NMIs that see cnt == 1 will rely on us.		\
	 */								\
	barrier();							\
	___ctx->cnt = 1;                                                \
	/*								\
	 * Make sure ___ctx->cnt is set before we save ___hcr. We	\
	 * don't want ___ctx->hcr to be overwritten.			\
	 */								\
	barrier();							\
	___ctx->hcr = ___hcr;						\
} while (0)

#define arch_nmi_exit()							\
do {									\
		if (is_kernel_in_hyp_mode()) {					\
			struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);	\
			if (!(nmi_ctx->hcr & HCR_TGE))				\
				write_sysreg(nmi_ctx->hcr, hcr_el2);		\
		}								\
	struct nmi_ctx *___ctx;						\
	u64 ___hcr;							\
									\
	if (!is_kernel_in_hyp_mode())					\
		break;							\
									\
	___ctx = this_cpu_ptr(&nmi_contexts);				\
	___hcr = ___ctx->hcr;						\
	/*								\
	 * Make sure we read ___ctx->hcr before we release		\
	 * ___ctx->cnt as it makes ___ctx->hcr updatable again.		\
	 */								\
	barrier();							\
	___ctx->cnt--;							\
	/*								\
	 * Make sure ___ctx->cnt release is visible before we		\
	 * restore the sysreg. Otherwise a new NMI occurring		\
	 * right after write_sysreg() can be fooled and think		\
	 * we secured things for it.					\
	 */								\
	barrier();							\
	if (!___ctx->cnt && !(___hcr & HCR_TGE))			\
		write_sysreg(___hcr, hcr_el2);				\
} while (0)

static inline void ack_bad_irq(unsigned int irq)
+2 −12
@@ -251,21 +251,11 @@ asmlinkage __kprobes notrace unsigned long
__sdei_handler(struct pt_regs *regs, struct sdei_registered_event *arg)
{
	unsigned long ret;
	bool do_nmi_exit = false;

	/*
	 * nmi_enter() deals with printk() re-entrance and use of RCU when
	 * RCU believed this CPU was idle. Because critical events can
	 * interrupt normal events, we may already be in_nmi().
	 */
	if (!in_nmi()) {
	nmi_enter();
		do_nmi_exit = true;
	}

	ret = _sdei_handler(regs, arg);

	if (do_nmi_exit)
	nmi_exit();

	return ret;