Commit dd4e5d61 authored by Linus Torvalds
Pull mmiowb removal from Will Deacon:
 "Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

  Remove mmiowb() from the kernel memory barrier API and instead, for
  architectures that need it, hide the barrier inside spin_unlock() when
  MMIO has been performed inside the critical section.

  The only relatively recent changes have been addressing review
  comments on the documentation, which is in a much better shape thanks
  to the efforts of Ben and Ingo.

  I was initially planning to split this into two pull requests so that
  you could run the coccinelle script yourself, however it's been plain
  sailing in linux-next so I've just included the whole lot here to keep
  things simple"
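
The approach described in the pull message, where the barrier is issued from spin_unlock() only if MMIO was performed inside the critical section, can be pictured with a small user-space model. This is a rough sketch rather than the kernel's implementation: every name here (`model_mmiowb_state`, `model_spin_lock()`, `model_writel()`, `model_spin_unlock()`, `model_mmiowb()`) is hypothetical, a single struct stands in for per-CPU state, and the "barrier" is only a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping; one instance models the state of one CPU. */
struct model_mmiowb_state {
	uint16_t nesting_count;    /* number of spinlocks currently held */
	bool     mmiowb_pending;   /* MMIO write seen while a lock was held */
	unsigned barriers_issued;  /* how many stand-in barriers have run */
};

/* Stand-in for the architecture's real MMIO write barrier. */
static void model_mmiowb(struct model_mmiowb_state *ms)
{
	ms->barriers_issued++;
}

/* Taking a lock only records that a critical section is open. */
static void model_spin_lock(struct model_mmiowb_state *ms)
{
	ms->nesting_count++;
}

/* An MMIO write inside a critical section marks a barrier as pending. */
static void model_writel(struct model_mmiowb_state *ms, uint32_t val,
			 volatile uint32_t *reg)
{
	*reg = val;
	if (ms->nesting_count)
		ms->mmiowb_pending = true;
}

/* Releasing a lock issues the barrier only if MMIO actually happened. */
static void model_spin_unlock(struct model_mmiowb_state *ms)
{
	assert(ms->nesting_count > 0);
	if (ms->mmiowb_pending) {
		ms->mmiowb_pending = false;
		model_mmiowb(ms);
	}
	ms->nesting_count--;
}
```

In this model, a lock/unlock pair with no MMIO write in between never reaches `model_mmiowb()`, which is the cost saving the pull message is aiming for; drivers no longer have to remember the barrier themselves.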

* tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (23 commits)
  docs/memory-barriers.txt: Update I/O section to be clearer about CPU vs thread
  docs/memory-barriers.txt: Fix style, spacing and grammar in I/O section
  arch: Remove dummy mmiowb() definitions from arch code
  net/ethernet/silan/sc92031: Remove stale comment about mmiowb()
  i40iw: Redefine i40iw_mmiowb() to do nothing
  scsi/qla1280: Remove stale comment about mmiowb()
  drivers: Remove explicit invocations of mmiowb()
  drivers: Remove useless trailing comments from mmiowb() invocations
  Documentation: Kill all references to mmiowb()
  riscv/mmiowb: Hook up mmwiob() implementation to asm-generic code
  powerpc/mmiowb: Hook up mmwiob() implementation to asm-generic code
  ia64/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  mips/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  sh/mmiowb: Add unconditional mmiowb() to arch_spin_unlock()
  m68k/io: Remove useless definition of mmiowb()
  nds32/io: Remove useless definition of mmiowb()
  x86/io: Remove useless definition of mmiowb()
  arm64/io: Remove useless definition of mmiowb()
  ARM/io: Remove useless definition of mmiowb()
  mmiowb: Hook up mmiowb helpers to spinlocks and generic I/O accessors
  ...
parents 14be4c61 9726840d
+0 −45
@@ -103,51 +103,6 @@ continuing execution::
        ha->flags.ints_enabled = 0;
    }

In addition to write posting, on some large multiprocessing systems
(e.g. SGI Challenge, Origin and Altix machines) posted writes won't be
strongly ordered coming from different CPUs. Thus it's important to
properly protect parts of your driver that do memory-mapped writes with
locks and use the :c:func:`mmiowb()` to make sure they arrive in the
order intended. Issuing a regular readX() will also ensure write ordering,
but should only be used when the 
driver has to be sure that the write has actually arrived at the device
(not that it's simply ordered with respect to other writes), since a
full readX() is a relatively expensive operation.

Generally, one should use :c:func:`mmiowb()` prior to releasing a spinlock
that protects regions using :c:func:`writeb()` or similar functions that
aren't surrounded by readb() calls, which will ensure ordering
and flushing. The following pseudocode illustrates what might occur if
write ordering isn't guaranteed via :c:func:`mmiowb()` or one of the
readX() functions::

    CPU A:  spin_lock_irqsave(&dev_lock, flags)
    CPU A:  ...
    CPU A:  writel(newval, ring_ptr);
    CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
            ...
    CPU B:  spin_lock_irqsave(&dev_lock, flags)
    CPU B:  writel(newval2, ring_ptr);
    CPU B:  ...
    CPU B:  spin_unlock_irqrestore(&dev_lock, flags)

In the case above, newval2 could be written to ring_ptr before newval.
Fixing it is easy though::

    CPU A:  spin_lock_irqsave(&dev_lock, flags)
    CPU A:  ...
    CPU A:  writel(newval, ring_ptr);
    CPU A:  mmiowb(); /* ensure no other writes beat us to the device */
    CPU A:  spin_unlock_irqrestore(&dev_lock, flags)
            ...
    CPU B:  spin_lock_irqsave(&dev_lock, flags)
    CPU B:  writel(newval2, ring_ptr);
    CPU B:  ...
    CPU B:  mmiowb();
    CPU B:  spin_unlock_irqrestore(&dev_lock, flags)

See tg3.c for a real world example of how to use :c:func:`mmiowb()`

PCI ordering rules also guarantee that PIO read responses arrive after any
outstanding DMA writes from that bus, since for some devices the result of
a readb() call may signal to the driver that a DMA transaction is
+0 −4
@@ -132,10 +132,6 @@ precludes passing these pages to userspace.
P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
However, as the memory is not cache coherent, if access ever needs to
be protected by a spinlock then :c:func:`mmiowb()` must be used before
unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
Documentation/memory-barriers.txt)


P2P DMA Support Library
+100 −149
@@ -1937,21 +1937,6 @@ There are some more advanced barrier functions:
     information on consistent memory.


MMIO WRITE BARRIER
------------------

The Linux kernel also has a special barrier for use with memory-mapped I/O
writes:

	mmiowb();

This is a variation on the mandatory write barrier that causes writes to weakly
ordered I/O regions to be partially ordered.  Its effects may go beyond the
CPU->Hardware interface and actually affect the hardware at some level.

See the subsection "Acquires vs I/O accesses" for more information.


===============================
IMPLICIT KERNEL MEMORY BARRIERS
===============================
@@ -2317,75 +2302,6 @@ But it won't see any of:
	*E, *F or *G following RELEASE Q



ACQUIRES VS I/O ACCESSES
------------------------

Under certain circumstances (especially involving NUMA), I/O accesses within
two spinlocked sections on two different CPUs may be seen as interleaved by the
PCI bridge, because the PCI bridge does not necessarily participate in the
cache-coherence protocol, and is therefore incapable of issuing the required
read memory barriers.

For example:

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	writel(1, DATA);
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					writel(5, DATA);
					spin_unlock(Q);

may be seen by the PCI bridge as follows:

	STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5

which would probably cause the hardware to malfunction.


What is necessary here is to intervene with an mmiowb() before dropping the
spinlock, for example:

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	writel(1, DATA);
	mmiowb();
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					writel(5, DATA);
					mmiowb();
					spin_unlock(Q);

this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
before either of the stores issued on CPU 2.


Furthermore, following a store by a load from the same device obviates the need
for the mmiowb(), because the load forces the store to complete before the load
is performed:

	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	a = readl(DATA);
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					b = readl(DATA);
					spin_unlock(Q);


See Documentation/driver-api/device-io.rst for more information.


=================================
WHERE ARE MEMORY BARRIERS NEEDED?
=================================
@@ -2532,16 +2448,9 @@ the device to malfunction.
Inside of the Linux kernel, I/O should be done through the appropriate accessor
routines - such as inb() or writel() - which know how to make such accesses
appropriately sequential.  While this, for the most part, renders the explicit
use of memory barriers unnecessary, there are a couple of situations where they
might be needed:

 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
     so for _all_ general drivers locks should be used and mmiowb() must be
     issued prior to unlocking the critical section.

 (2) If the accessor functions are used to refer to an I/O memory window with
     relaxed memory access properties, then _mandatory_ memory barriers are
     required to enforce ordering.
use of memory barriers unnecessary, if the accessor functions are used to refer
to an I/O memory window with relaxed memory access properties, then _mandatory_
memory barriers are required to enforce ordering.

See Documentation/driver-api/device-io.rst for more information.

@@ -2586,8 +2495,7 @@ explicit barriers are used.

Normally this won't be a problem because the I/O accesses done inside such
sections will include synchronous load operations on strictly ordered I/O
registers that form implicit I/O barriers.  If this isn't sufficient then an
mmiowb() may need to be used explicitly.
registers that form implicit I/O barriers.


A similar situation may occur between an interrupt routine and two routines
@@ -2599,72 +2507,115 @@ likely, then interrupt-disabling locks should be used to guarantee ordering.
KERNEL I/O BARRIER EFFECTS
==========================

When accessing I/O memory, drivers should use the appropriate accessor
functions:

 (*) inX(), outX():
Interfacing with peripherals via I/O accesses is deeply architecture and device
specific. Therefore, drivers which are inherently non-portable may rely on
specific behaviours of their target systems in order to achieve synchronization
in the most lightweight manner possible. For drivers intending to be portable
between multiple architectures and bus implementations, the kernel offers a
series of accessor functions that provide various degrees of ordering
guarantees:

     These are intended to talk to I/O space rather than memory space, but
     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
     do indeed have special I/O space access cycles and instructions, but many
     CPUs don't have such a concept.

     The PCI bus, amongst others, defines an I/O space concept which - on such
     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
     space.  However, it may also be mapped as a virtual I/O space in the CPU's
     memory map, particularly on those CPUs that don't support alternate I/O
     spaces.

     Accesses to this space may be fully synchronous (as on i386), but
     intermediary bridges (such as the PCI host bridge) may not fully honour
     that.

     They are guaranteed to be fully ordered with respect to each other.
 (*) readX(), writeX():

     They are not guaranteed to be fully ordered with respect to other types of
     memory and I/O operation.
	The readX() and writeX() MMIO accessors take a pointer to the
	peripheral being accessed as an __iomem * parameter. For pointers
	mapped with the default I/O attributes (e.g. those returned by
	ioremap()), the ordering guarantees are as follows:

	1. All readX() and writeX() accesses to the same peripheral are ordered
	   with respect to each other. This ensures that MMIO register accesses
	   by the same CPU thread to a particular device will arrive in program
	   order.

	2. A writeX() issued by a CPU thread holding a spinlock is ordered
	   before a writeX() to the same peripheral from another CPU thread
	   issued after a later acquisition of the same spinlock. This ensures
	   that MMIO register writes to a particular device issued while holding
	   a spinlock will arrive in an order consistent with acquisitions of
	   the lock.

	3. A writeX() by a CPU thread to the peripheral will first wait for the
	   completion of all prior writes to memory either issued by, or
	   propagated to, the same thread. This ensures that writes by the CPU
	   to an outbound DMA buffer allocated by dma_alloc_coherent() will be
	   visible to a DMA engine when the CPU writes to its MMIO control
	   register to trigger the transfer.

	4. A readX() by a CPU thread from the peripheral will complete before
	   any subsequent reads from memory by the same thread can begin. This
	   ensures that reads by the CPU from an incoming DMA buffer allocated
	   by dma_alloc_coherent() will not see stale data after reading from
	   the DMA engine's MMIO status register to establish that the DMA
	   transfer has completed.

	5. A readX() by a CPU thread from the peripheral will complete before
	   any subsequent delay() loop can begin execution on the same thread.
	   This ensures that two MMIO register writes by the CPU to a peripheral
	   will arrive at least 1us apart if the first write is immediately read
	   back with readX() and udelay(1) is called prior to the second
	   writeX():

		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
		readl(DEVICE_REGISTER_0);
		udelay(1);
		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

	The ordering properties of __iomem pointers obtained with non-default
	attributes (e.g. those returned by ioremap_wc()) are specific to the
	underlying architecture and therefore the guarantees listed above cannot
	generally be relied upon for accesses to these types of mappings.

 (*) readX_relaxed(), writeX_relaxed():

 (*) readX(), writeX():
	These are similar to readX() and writeX(), but provide weaker memory
	ordering guarantees. Specifically, they do not guarantee ordering with
	respect to locking, normal memory accesses or delay() loops (i.e.
	bullets 2-5 above) but they are still guaranteed to be ordered with
	respect to other accesses from the same CPU thread to the same
	peripheral when operating on __iomem pointers mapped with the default
	I/O attributes.

     Whether these are guaranteed to be fully ordered and uncombined with
     respect to each other on the issuing CPU depends on the characteristics
     defined for the memory window through which they're accessing.  On later
     i386 architecture machines, for example, this is controlled by way of the
     MTRR registers.
 (*) readsX(), writesX():

     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
     provided they're not accessing a prefetchable device.
	The readsX() and writesX() MMIO accessors are designed for accessing
	register-based, memory-mapped FIFOs residing on peripherals that are not
	capable of performing DMA. Consequently, they provide only the ordering
	guarantees of readX_relaxed() and writeX_relaxed(), as documented above.

     However, intermediary hardware (such as a PCI bridge) may indulge in
     deferral if it so wishes; to flush a store, a load from the same location
     is preferred[*], but a load from the same device or from configuration
     space should suffice for PCI.
 (*) inX(), outX():

     [*] NOTE! attempting to load from the same location as was written to may
	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
	 example.
	The inX() and outX() accessors are intended to access legacy port-mapped
	I/O peripherals, which may require special instructions on some
	architectures (notably x86). The port number of the peripheral being
	accessed is passed as an argument.

     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
     force stores to be ordered.
	Since many CPU architectures ultimately access these peripherals via an
	internal virtual memory mapping, the portable ordering guarantees
	provided by inX() and outX() are the same as those provided by readX()
	and writeX() respectively when accessing a mapping with the default I/O
	attributes.

     Please refer to the PCI specification for more information on interactions
     between PCI transactions.
	Device drivers may expect outX() to emit a non-posted write transaction
	that waits for a completion response from the I/O peripheral before
	returning. This is not guaranteed by all architectures and is therefore
	not part of the portable ordering semantics.

 (*) readX_relaxed(), writeX_relaxed()
 (*) insX(), outsX():

     These are similar to readX() and writeX(), but provide weaker memory
     ordering guarantees.  Specifically, they do not guarantee ordering with
     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
     ordering with respect to LOCK or UNLOCK operations.  If the latter is
     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
     the same peripheral are guaranteed to be ordered with respect to each
     other.
	As above, the insX() and outsX() accessors provide the same ordering
	guarantees as readsX() and writesX() respectively when accessing a
	mapping with the default I/O attributes.

 (*) ioreadX(), iowriteX()
 (*) ioreadX(), iowriteX():

	These will perform appropriately for the type of access they're actually
	doing, be it inX()/outX() or readX()/writeX().

With the exception of the string accessors (insX(), outsX(), readsX() and
writesX()), all of the above assume that the underlying peripheral is
little-endian and will therefore perform byte-swapping operations on big-endian
architectures.
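
The little-endian assumption in the paragraph above can be illustrated with a user-space sketch. The helpers `model_writel_le()` and `model_readl_le()` are hypothetical names, and a plain byte array stands in for a device register. The kernel accessors do not work byte by byte like this; the sketch only shows the byte layout a little-endian peripheral would observe, whatever the endianness of the host CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a 32-bit MMIO write: store val so that the
 * least significant byte lands at the lowest address, as a little-endian
 * peripheral expects. On a big-endian host this amounts to a byte swap.
 */
static void model_writel_le(uint32_t val, uint8_t *reg)
{
	assert(reg != NULL);
	reg[0] = (uint8_t)(val & 0xff);
	reg[1] = (uint8_t)((val >> 8) & 0xff);
	reg[2] = (uint8_t)((val >> 16) & 0xff);
	reg[3] = (uint8_t)((val >> 24) & 0xff);
}

/* Hypothetical stand-in for a 32-bit MMIO read of a little-endian register. */
static uint32_t model_readl_le(const uint8_t *reg)
{
	assert(reg != NULL);
	return (uint32_t)reg[0] |
	       ((uint32_t)reg[1] << 8) |
	       ((uint32_t)reg[2] << 16) |
	       ((uint32_t)reg[3] << 24);
}
```

Because the helpers are written in terms of shifts rather than memory reinterpretation, they give the same register layout on little-endian and big-endian hosts, which is the portable behaviour the text describes for the non-string accessors.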


========================================
ASSUMED MINIMUM EXECUTION ORDERING MODEL
+1 −0
@@ -9,6 +9,7 @@ generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += mmiowb.h
generic-y += preempt.h
generic-y += sections.h
generic-y += trace_clock.h
+0 −2
@@ -513,8 +513,6 @@ extern inline void writeq(u64 b, volatile void __iomem *addr)
#define writel_relaxed(b, addr)	__raw_writel(b, addr)
#define writeq_relaxed(b, addr)	__raw_writeq(b, addr)

#define mmiowb()

/*
 * String version of IO memory access ops:
 */