Commit cee1352f authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RCU updates from Ingo Molnar:
 "The biggest change in this cycle is the conclusion of the big
  'simplify RCU to two primary flavors' consolidation work - i.e.
  there's a single RCU flavor for any kernel variant (PREEMPT and
  !PREEMPT):

    - Consolidate the RCU-bh, RCU-preempt, and RCU-sched flavors into a
      single flavor similar to RCU-sched in !PREEMPT kernels and into a
      single flavor similar to RCU-preempt (but also waiting on
      preempt-disabled sequences of code) in PREEMPT kernels.

      This branch also includes a refactoring of
      rcu_{nmi,irq}_{enter,exit}() from Byungchul Park.

    - Now that there is only one RCU flavor in any given running kernel,
      the many "rsp" pointers are no longer required, and this cleanup
      series removes them.

    - This branch carries out additional cleanups made possible by the
      RCU flavor consolidation, including inlining now-trivial
      functions, updating comments and definitions, and removing
      now-unneeded rcutorture scenarios.

    - Now that there is only one flavor of RCU in any running kernel,
      there is also only on rcu_data structure per CPU. This means that
      the rcu_dynticks structure can be merged into the rcu_data
      structure, a task taken on by this branch. This branch also
      contains a -rt-related fix from Mike Galbraith.

  There were also other updates:

    - Documentation updates, including some good-eye catches from Joel
      Fernandes.

    - SRCU updates, most notably changes enabling call_srcu() to be
      invoked very early in the boot sequence.

    - Torture-test updates, including some preliminary work towards
      making rcutorture better able to find problems that result in
      insufficient grace-period forward progress.

    - Initial changes to RCU to better promote forward progress of grace
      periods, including fixing a bug found by Marius Hillenbrand and
      David Woodhouse, with the fix suggested by Peter Zijlstra"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  srcu: Make early-boot call_srcu() reuse workqueue lists
  rcutorture: Test early boot call_srcu()
  srcu: Make call_srcu() available during very early boot
  rcu: Convert rcu_state.ofl_lock to raw_spinlock_t
  rcu: Remove obsolete ->dynticks_fqs and ->cond_resched_completed
  rcu: Switch ->dynticks to rcu_data structure, remove rcu_dynticks
  rcu: Switch dyntick nesting counters to rcu_data structure
  rcu: Switch urgent quiescent-state requests to rcu_data structure
  rcu: Switch lazy counts to rcu_data structure
  rcu: Switch last accelerate/advance to rcu_data structure
  rcu: Switch ->tick_nohz_enabled_snap to rcu_data structure
  rcu: Merge rcu_dynticks structure into rcu_data structure
  rcu: Remove unused rcu_dynticks_snap() from Tiny RCU
  rcu: Convert "1UL << x" to "BIT(x)"
  rcu: Avoid resched_cpu() when rescheduling the current CPU
  rcu: More aggressively enlist scheduler aid for nohz_full CPUs
  rcu: Compute jiffies_till_sched_qs from other kernel parameters
  rcu: Provide functions for determining if call_rcu() has been invoked
  rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure
  rcu: Motivate Tiny RCU forward progress
  ...
parents e2b623fb d0346559
Loading
Loading
Loading
Loading
+12 −19
Original line number Diff line number Diff line
@@ -1227,9 +1227,11 @@ to overflow the counter, this approach corrects the
CPU enters the idle loop from process context.

</p><p>The <tt>-&gt;dynticks</tt> field counts the corresponding
CPU's transitions to and from dyntick-idle mode, so that this counter
has an even value when the CPU is in dyntick-idle mode and an odd
value otherwise.
CPU's transitions to and from either dyntick-idle or user mode, so
that this counter has an even value when the CPU is in dyntick-idle
mode or user mode and an odd value otherwise. The transitions to/from
user mode need to be counted for user mode adaptive-ticks support
(see timers/NO_HZ.txt).

</p><p>The <tt>-&gt;rcu_need_heavy_qs</tt> field is used
to record the fact that the RCU core code would really like to
@@ -1372,8 +1374,7 @@ that is, if the CPU is currently idle.
Accessor Functions</a></h3>

<p>The following listing shows the
<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt>,
<tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt> and
<tt>rcu_for_each_leaf_node()</tt> function and macros:

<pre>
@@ -1386,13 +1387,9 @@ Accessor Functions</a></h3>
  7   for ((rnp) = &amp;(rsp)-&gt;node[0]; \
  8        (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
  9
 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
 11   for ((rnp) = &amp;(rsp)-&gt;node[0]; \
 12        (rnp) &lt; (rsp)-&gt;level[NUM_RCU_LVLS - 1]; (rnp)++)
 13
 14 #define rcu_for_each_leaf_node(rsp, rnp) \
 15   for ((rnp) = (rsp)-&gt;level[NUM_RCU_LVLS - 1]; \
 16        (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
 10 #define rcu_for_each_leaf_node(rsp, rnp) \
 11   for ((rnp) = (rsp)-&gt;level[NUM_RCU_LVLS - 1]; \
 12        (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
</pre>

<p>The <tt>rcu_get_root()</tt> simply returns a pointer to the
@@ -1405,10 +1402,7 @@ macro takes advantage of the layout of the <tt>rcu_node</tt>
structures in the <tt>rcu_state</tt> structure's
<tt>-&gt;node[]</tt> array, performing a breadth-first traversal by
simply traversing the array in order.
The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
similarly, but traverses only the first part of the array, thus excluding
the leaf <tt>rcu_node</tt> structures.
Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
the last part of the array, thus traversing only the leaf
<tt>rcu_node</tt> structures.

@@ -1416,15 +1410,14 @@ the last part of the array, thus traversing only the leaf
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
	What does
	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
	contains only a single node?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In the single-node case,
	<tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
	and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
	<tt>rcu_for_each_leaf_node()</tt> traverses the single node.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>
+4 −5
Original line number Diff line number Diff line
@@ -12,10 +12,9 @@ high efficiency and minimal disturbance, expedited grace periods accept
lower efficiency and significant disturbance to attain shorter latencies.

<p>
There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched),
but only two flavors of expedited grace periods because the RCU-bh
expedited grace period maps onto the RCU-sched expedited grace period.
Each of the remaining two implementations is covered in its own section.
There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier
third RCU-bh flavor having been implemented in terms of the other two.
Each of the two implementations is covered in its own section.

<ol>
<li>	<a href="#Expedited Grace Period Design">
@@ -158,7 +157,7 @@ whether or not the current CPU is in an RCU read-side critical section.
The best that <tt>sync_sched_exp_handler()</tt> can do is to check
for idle, on the off-chance that the CPU went idle while the IPI
was in flight.
If the CPU is idle, then tt>sync_sched_exp_handler()</tt> reports
If the CPU is idle, then <tt>sync_sched_exp_handler()</tt> reports
the quiescent state.

<p>
+72 −142
Original line number Diff line number Diff line
@@ -1306,8 +1306,6 @@ doing so would degrade real-time response.

<p>
This non-requirement appeared with preemptible RCU.
If you need a grace period that waits on non-preemptible code regions, use
<a href="#Sched Flavor">RCU-sched</a>.

<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>

@@ -2165,14 +2163,9 @@ however, this is not a panacea because there would be severe restrictions
on what operations those callbacks could invoke.

<p>
Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
(<a href="#Bottom-Half Flavor">discussed below</a>),
<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>,
Perhaps surprisingly, <tt>synchronize_rcu()</tt> and
<tt>synchronize_rcu_expedited()</tt>,
<tt>synchronize_rcu_bh_expedited()</tt>, and
<tt>synchronize_sched_expedited()</tt>
will all operate normally
will operate normally
during very early boot, the reason being that there is only one CPU
and preemption is disabled.
This means that the call <tt>synchronize_rcu()</tt> (or friends)
@@ -2269,12 +2262,23 @@ Thankfully, RCU update-side primitives, including
The name notwithstanding, some Linux-kernel architectures
can have nested NMIs, which RCU must handle correctly.
Andy Lutomirski
<a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a>
<a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a>
with this requirement;
he also kindly surprised me with
<a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a>
<a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a>
that meets this requirement.

<p>
Furthermore, NMI handlers can be interrupted by what appear to RCU
to be normal interrupts.
One way that this can happen is for code that directly invokes
<tt>rcu_irq_enter()</tt> and </tt>rcu_irq_exit()</tt> to be called
from an NMI handler.
This astonishing fact of life prompted the current code structure,
which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt>
and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>.
And yes, I also learned of this requirement the hard way.

<h3><a name="Loadable Modules">Loadable Modules</a></h3>

<p>
@@ -2394,30 +2398,9 @@ when invoked from a CPU-hotplug notifier.
<p>
RCU depends on the scheduler, and the scheduler uses RCU to
protect some of its data structures.
This means the scheduler is forbidden from acquiring
the runqueue locks and the priority-inheritance locks
in the middle of an outermost RCU read-side critical section unless either
(1)&nbsp;it releases them before exiting that same
RCU read-side critical section, or
(2)&nbsp;interrupts are disabled across
that entire RCU read-side critical section.
This same prohibition also applies (recursively!) to any lock that is acquired
while holding any lock to which this prohibition applies.
Adhering to this rule prevents preemptible RCU from invoking
<tt>rcu_read_unlock_special()</tt> while either runqueue or
priority-inheritance locks are held, thus avoiding deadlock.

<p>
Prior to v4.4, it was only necessary to disable preemption across
RCU read-side critical sections that acquired scheduler locks.
In v4.4, expedited grace periods started using IPIs, and these
IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath.
Therefore, this expedited-grace-period change required disabling of
interrupts, not just preemption.

<p>
For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt>
implementation must be written carefully to avoid similar deadlocks.
The preemptible-RCU <tt>rcu_read_unlock()</tt>
implementation must therefore be written carefully to avoid deadlocks
involving the scheduler's runqueue and priority-inheritance locks.
In particular, <tt>rcu_read_unlock()</tt> must tolerate an
interrupt where the interrupt handler invokes both
<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
@@ -2426,7 +2409,7 @@ negative nesting levels to avoid destructive recursion via
interrupt handler's use of RCU.

<p>
This pair of mutual scheduler-RCU requirements came as a
This scheduler-RCU requirement came as a
<a href="https://lwn.net/Articles/453002/">complete surprise</a>.

<p>
@@ -2437,9 +2420,28 @@ when running context-switch-heavy workloads when built with
<tt>CONFIG_NO_HZ_FULL=y</tt>
<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>.
RCU has made good progress towards meeting this requirement, even
for context-switch-have <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
but there is room for further improvement.

<p>
In the past, it was forbidden to disable interrupts across an
<tt>rcu_read_unlock()</tt> unless that interrupt-disabled region
of code also included the matching <tt>rcu_read_lock()</tt>.
Violating this restriction could result in deadlocks involving the
scheduler's runqueue and priority-inheritance spinlocks.
This restriction was lifted when interrupt-disabled calls to
<tt>rcu_read_unlock()</tt> started deferring the reporting of
the resulting RCU-preempt quiescent state until the end of that
interrupts-disabled region.
This deferred reporting means that the scheduler's runqueue and
priority-inheritance locks cannot be held while reporting an RCU-preempt
quiescent state, which lifts the earlier restriction, at least from
a deadlock perspective.
Unfortunately, real-time systems using RCU priority boosting may
need this restriction to remain in effect because deferred
quiescent-state reporting also defers deboosting, which in turn
degrades real-time latencies.

<h3><a name="Tracing and RCU">Tracing and RCU</a></h3>

<p>
@@ -2850,15 +2852,22 @@ The other four flavors are listed below, with requirements for each
described in a separate section.

<ol>
<li>	<a href="#Bottom-Half Flavor">Bottom-Half Flavor</a>
<li>	<a href="#Sched Flavor">Sched Flavor</a>
<li>	<a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a>
<li>	<a href="#Sched Flavor">Sched Flavor (Historical)</a>
<li>	<a href="#Sleepable RCU">Sleepable RCU</a>
<li>	<a href="#Tasks RCU">Tasks RCU</a>
<li>	<a href="#Waiting for Multiple Grace Periods">
	Waiting for Multiple Grace Periods</a>
</ol>

<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3>
<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3>

<p>
The RCU-bh flavor of RCU has since been expressed in terms of
the other RCU flavors as part of a consolidation of the three
flavors into a single flavor.
The read-side API remains, and continues to disable softirq and to
be accounted for by lockdep.
Much of the material in this section is therefore strictly historical
in nature.

<p>
The softirq-disable (AKA &ldquo;bottom-half&rdquo;,
@@ -2918,8 +2927,20 @@ includes
<tt>call_rcu_bh()</tt>,
<tt>rcu_barrier_bh()</tt>, and
<tt>rcu_read_lock_bh_held()</tt>.
However, the update-side APIs are now simple wrappers for other RCU
flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt
otherwise.

<h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3>

<h3><a name="Sched Flavor">Sched Flavor</a></h3>
<p>
The RCU-sched flavor of RCU has since been expressed in terms of
the other RCU flavors as part of a consolidation of the three
flavors into a single flavor.
The read-side API remains, and continues to disable preemption and to
be accounted for by lockdep.
Much of the material in this section is therefore strictly historical
in nature.

<p>
Before preemptible RCU, waiting for an RCU grace period had the
@@ -3139,94 +3160,14 @@ The tasks-RCU API is quite compact, consisting only of
<tt>call_rcu_tasks()</tt>,
<tt>synchronize_rcu_tasks()</tt>, and
<tt>rcu_barrier_tasks()</tt>.

<h3><a name="Waiting for Multiple Grace Periods">
Waiting for Multiple Grace Periods</a></h3>

<p>
Perhaps you have an RCU protected data structure that is accessed from
RCU read-side critical sections, from softirq handlers, and from
hardware interrupt handlers.
That is three flavors of RCU, the normal flavor, the bottom-half flavor,
and the sched flavor.
How to wait for a compound grace period?

<p>
The best approach is usually to &ldquo;just say no!&rdquo; and
insert <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>
around each RCU read-side critical section, regardless of what
environment it happens to be in.
But suppose that some of the RCU read-side critical sections are
on extremely hot code paths, and that use of <tt>CONFIG_PREEMPT=n</tt>
is not a viable option, so that <tt>rcu_read_lock()</tt> and
<tt>rcu_read_unlock()</tt> are not free.
What then?

<p>
You <i>could</i> wait on all three grace periods in succession, as follows:

<blockquote>
<pre>
 1 synchronize_rcu();
 2 synchronize_rcu_bh();
 3 synchronize_sched();
</pre>
</blockquote>

<p>
This works, but triples the update-side latency penalty.
In cases where this is not acceptable, <tt>synchronize_rcu_mult()</tt>
may be used to wait on all three flavors of grace period concurrently:

<blockquote>
<pre>
 1 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched);
</pre>
</blockquote>

<p>
But what if it is necessary to also wait on SRCU?
This can be done as follows:

<blockquote>
<pre>
 1 static void call_my_srcu(struct rcu_head *head,
 2        void (*func)(struct rcu_head *head))
 3 {
 4   call_srcu(&amp;my_srcu, head, func);
 5 }
 6
 7 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched, call_my_srcu);
</pre>
</blockquote>

<p>
If you needed to wait on multiple different flavors of SRCU
(but why???), you would need to create a wrapper function resembling
<tt>call_my_srcu()</tt> for each SRCU flavor.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	But what if I need to wait for multiple RCU flavors, but I also need
	the grace periods to be expedited?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	If you are using expedited grace periods, there should be less penalty
	for waiting on them in succession.
	But if that is nevertheless a problem, you can use workqueues
	or multiple kthreads to wait on the various expedited grace
	periods concurrently.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<p>
Again, it is usually better to adjust the RCU read-side critical sections
to use a single flavor of RCU, but when this is not feasible, you can use
<tt>synchronize_rcu_mult()</tt>.
In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted,
so these APIs map to
<tt>call_rcu()</tt>,
<tt>synchronize_rcu()</tt>, and
<tt>rcu_barrier()</tt>, respectively.
In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted,
and these three APIs are therefore implemented by separate functions
that check for voluntary context switches.

<h2><a name="Possible Future Changes">Possible Future Changes</a></h2>

@@ -3237,12 +3178,6 @@ If this becomes a serious problem, it will be necessary to rework the
grace-period state machine so as to avoid the need for the additional
latency.

<p>
Expedited grace periods scan the CPUs, so their latency and overhead
increases with increasing numbers of CPUs.
If this becomes a serious problem on large systems, it will be necessary
to do some redesign to avoid this scalability problem.

<p>
RCU disables CPU hotplug in a few places, perhaps most notably in the
<tt>rcu_barrier()</tt> operations.
@@ -3287,11 +3222,6 @@ Please note that arrangements that require RCU to remap CPU numbers will
require extremely good demonstration of need and full exploration of
alternatives.

<p>
There is an embarrassingly large number of flavors of RCU, and this
number has been increasing over time.
Perhaps it will be possible to combine some at some future date.

<p>
RCU's various kthreads are reasonably recent additions.
It is quite likely that adjustments will be required to more gracefully
+5 −8
Original line number Diff line number Diff line
@@ -16,12 +16,9 @@ o A CPU looping in an RCU read-side critical section.

o	A CPU looping with interrupts disabled.

o	A CPU looping with preemption disabled.  This condition can
	result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
	stalls.
o	A CPU looping with preemption disabled.

o	A CPU looping with bottom halves disabled.  This condition can
	result in RCU-sched and RCU-bh stalls.
o	A CPU looping with bottom halves disabled.

o	For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the kernel
	without invoking schedule().  If the looping in the kernel is
@@ -87,9 +84,9 @@ o A hardware failure. This is quite unlikely, but has occurred
	This resulted in a series of RCU CPU stall warnings, eventually
	leading the realization that the CPU had failed.

The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
warning.  Note that SRCU does -not- have CPU stall warnings.  Please note
that RCU only detects CPU stalls when there is a grace period in progress.
The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning.
Note that SRCU does -not- have CPU stall warnings.  Please note that
RCU only detects CPU stalls when there is a grace period in progress.
No grace period, no CPU stall warnings.

To diagnose the cause of the stall, inspect the stack traces.
+2 −1
Original line number Diff line number Diff line
@@ -934,7 +934,8 @@ c. Do you need to treat NMI handlers, hardirq handlers,
d.	Do you need RCU grace periods to complete even in the face
	of softirq monopolization of one or more of the CPUs?  For
	example, is your code subject to network-based denial-of-service
	attacks?  If so, you need RCU-bh.
	attacks?  If so, you should disable softirq across your readers,
	for example, by using rcu_read_lock_bh().

e.	Is your workload too update-intensive for normal use of
	RCU, but inappropriate for other synchronization mechanisms?
Loading