Merge tag 'arm64-mmiowb' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux (dd4e5d61)

Documentation/driver-api/device-io.rst

+0 −45

Original line number	Diff line number	Diff line
		@@ -103,51 +103,6 @@ continuing execution::
		ha->flags.ints_enabled = 0;
		}

		In addition to write posting, on some large multiprocessing systems
		(e.g. SGI Challenge, Origin and Altix machines) posted writes won't be
		strongly ordered coming from different CPUs. Thus it's important to
		properly protect parts of your driver that do memory-mapped writes with
		locks and use the :c:func:`mmiowb()` to make sure they arrive in the
		order intended. Issuing a regular readX() will also ensure write ordering,
		but should only be used when the
		driver has to be sure that the write has actually arrived at the device
		(not that it's simply ordered with respect to other writes), since a
		full readX() is a relatively expensive operation.

		Generally, one should use :c:func:`mmiowb()` prior to releasing a spinlock
		that protects regions using :c:func:`writeb()` or similar functions that
		aren't surrounded by readb() calls, which will ensure ordering
		and flushing. The following pseudocode illustrates what might occur if
		write ordering isn't guaranteed via :c:func:`mmiowb()` or one of the
		readX() functions::

		CPU A: spin_lock_irqsave(&dev_lock, flags)
		CPU A: ...
		CPU A: writel(newval, ring_ptr);
		CPU A: spin_unlock_irqrestore(&dev_lock, flags)
		...
		CPU B: spin_lock_irqsave(&dev_lock, flags)
		CPU B: writel(newval2, ring_ptr);
		CPU B: ...
		CPU B: spin_unlock_irqrestore(&dev_lock, flags)

		In the case above, newval2 could be written to ring_ptr before newval.
		Fixing it is easy though::

		CPU A: spin_lock_irqsave(&dev_lock, flags)
		CPU A: ...
		CPU A: writel(newval, ring_ptr);
		CPU A: mmiowb(); /* ensure no other writes beat us to the device */
		CPU A: spin_unlock_irqrestore(&dev_lock, flags)
		...
		CPU B: spin_lock_irqsave(&dev_lock, flags)
		CPU B: writel(newval2, ring_ptr);
		CPU B: ...
		CPU B: mmiowb();
		CPU B: spin_unlock_irqrestore(&dev_lock, flags)

		See tg3.c for a real world example of how to use :c:func:`mmiowb()`

		PCI ordering rules also guarantee that PIO read responses arrive after any
		outstanding DMA writes from that bus, since for some devices the result of
		a readb() call may signal to the driver that a DMA transaction is

Documentation/driver-api/pci/p2pdma.rst

+0 −4

		@@ -132,10 +132,6 @@ precludes passing these pages to userspace.
		P2P memory is also technically IO memory but should never have any side
		effects behind it. Thus, the order of loads and stores should not be important
		and ioreadX(), iowriteX() and friends should not be necessary.
		However, as the memory is not cache coherent, if access ever needs to
		be protected by a spinlock then :c:func:`mmiowb()` must be used before
		unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
		Documentation/memory-barriers.txt)


		P2P DMA Support Library

Documentation/memory-barriers.txt

+100 −149

		@@ -1937,21 +1937,6 @@ There are some more advanced barrier functions:
		information on consistent memory.


		MMIO WRITE BARRIER
		------------------

		The Linux kernel also has a special barrier for use with memory-mapped I/O
		writes:

		mmiowb();

		This is a variation on the mandatory write barrier that causes writes to weakly
		ordered I/O regions to be partially ordered. Its effects may go beyond the
		CPU->Hardware interface and actually affect the hardware at some level.

		See the subsection "Acquires vs I/O accesses" for more information.


		===============================
		IMPLICIT KERNEL MEMORY BARRIERS
		===============================
		@@ -2317,75 +2302,6 @@ But it won't see any of:
		E, F or *G following RELEASE Q



		ACQUIRES VS I/O ACCESSES
		------------------------

		Under certain circumstances (especially involving NUMA), I/O accesses within
		two spinlocked sections on two different CPUs may be seen as interleaved by the
		PCI bridge, because the PCI bridge does not necessarily participate in the
		cache-coherence protocol, and is therefore incapable of issuing the required
		read memory barriers.

		For example:

		CPU 1                           CPU 2
		=============================== ===============================
		spin_lock(Q)
		writel(0, ADDR)
		writel(1, DATA);
		spin_unlock(Q);
		                                spin_lock(Q);
		                                writel(4, ADDR);
		                                writel(5, DATA);
		                                spin_unlock(Q);

		may be seen by the PCI bridge as follows:

		STORE ADDR = 0, STORE ADDR = 4, STORE DATA = 1, STORE DATA = 5

		which would probably cause the hardware to malfunction.


		What is necessary here is to intervene with an mmiowb() before dropping the
		spinlock, for example:

		CPU 1                           CPU 2
		=============================== ===============================
		spin_lock(Q)
		writel(0, ADDR)
		writel(1, DATA);
		mmiowb();
		spin_unlock(Q);
		                                spin_lock(Q);
		                                writel(4, ADDR);
		                                writel(5, DATA);
		                                mmiowb();
		                                spin_unlock(Q);

		this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
		before either of the stores issued on CPU 2.


		Furthermore, following a store by a load from the same device obviates the need
		for the mmiowb(), because the load forces the store to complete before the load
		is performed:

		CPU 1                           CPU 2
		=============================== ===============================
		spin_lock(Q)
		writel(0, ADDR)
		a = readl(DATA);
		spin_unlock(Q);
		                                spin_lock(Q);
		                                writel(4, ADDR);
		                                b = readl(DATA);
		                                spin_unlock(Q);


		See Documentation/driver-api/device-io.rst for more information.


		=================================
		WHERE ARE MEMORY BARRIERS NEEDED?
		=================================
		@@ -2532,16 +2448,9 @@ the device to malfunction.
		Inside of the Linux kernel, I/O should be done through the appropriate accessor
		routines - such as inb() or writel() - which know how to make such accesses
		appropriately sequential. While this, for the most part, renders the explicit
		use of memory barriers unnecessary, there are a couple of situations where they
		might be needed:

		(1) On some systems, I/O stores are not strongly ordered across all CPUs, and
		so for _all_ general drivers locks should be used and mmiowb() must be
		issued prior to unlocking the critical section.

		(2) If the accessor functions are used to refer to an I/O memory window with
		relaxed memory access properties, then _mandatory_ memory barriers are
		required to enforce ordering.
		use of memory barriers unnecessary, if the accessor functions are used to refer
		to an I/O memory window with relaxed memory access properties, then _mandatory_
		memory barriers are required to enforce ordering.

		See Documentation/driver-api/device-io.rst for more information.

		@@ -2586,8 +2495,7 @@ explicit barriers are used.

		Normally this won't be a problem because the I/O accesses done inside such
		sections will include synchronous load operations on strictly ordered I/O
		registers that form implicit I/O barriers. If this isn't sufficient then an
		mmiowb() may need to be used explicitly.
		registers that form implicit I/O barriers.


		A similar situation may occur between an interrupt routine and two routines
		@@ -2599,72 +2507,115 @@ likely, then interrupt-disabling locks should be used to guarantee ordering.
		KERNEL I/O BARRIER EFFECTS
		==========================

		When accessing I/O memory, drivers should use the appropriate accessor
		functions:

		(*) inX(), outX():
		Interfacing with peripherals via I/O accesses is deeply architecture and device
		specific. Therefore, drivers which are inherently non-portable may rely on
		specific behaviours of their target systems in order to achieve synchronization
		in the most lightweight manner possible. For drivers intending to be portable
		between multiple architectures and bus implementations, the kernel offers a
		series of accessor functions that provide various degrees of ordering
		guarantees:

		These are intended to talk to I/O space rather than memory space, but
		that's primarily a CPU-specific concept. The i386 and x86_64 processors
		do indeed have special I/O space access cycles and instructions, but many
		CPUs don't have such a concept.

		The PCI bus, amongst others, defines an I/O space concept which - on such
		CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
		space. However, it may also be mapped as a virtual I/O space in the CPU's
		memory map, particularly on those CPUs that don't support alternate I/O
		spaces.

		Accesses to this space may be fully synchronous (as on i386), but
		intermediary bridges (such as the PCI host bridge) may not fully honour
		that.

		They are guaranteed to be fully ordered with respect to each other.
		(*) readX(), writeX():

		They are not guaranteed to be fully ordered with respect to other types of
		memory and I/O operation.
		The readX() and writeX() MMIO accessors take a pointer to the
		peripheral being accessed as an __iomem * parameter. For pointers
		mapped with the default I/O attributes (e.g. those returned by
		ioremap()), the ordering guarantees are as follows:

		1. All readX() and writeX() accesses to the same peripheral are ordered
		with respect to each other. This ensures that MMIO register accesses
		by the same CPU thread to a particular device will arrive in program
		order.

		2. A writeX() issued by a CPU thread holding a spinlock is ordered
		before a writeX() to the same peripheral from another CPU thread
		issued after a later acquisition of the same spinlock. This ensures
		that MMIO register writes to a particular device issued while holding
		a spinlock will arrive in an order consistent with acquisitions of
		the lock.

		3. A writeX() by a CPU thread to the peripheral will first wait for the
		completion of all prior writes to memory either issued by, or
		propagated to, the same thread. This ensures that writes by the CPU
		to an outbound DMA buffer allocated by dma_alloc_coherent() will be
		visible to a DMA engine when the CPU writes to its MMIO control
		register to trigger the transfer.

		4. A readX() by a CPU thread from the peripheral will complete before
		any subsequent reads from memory by the same thread can begin. This
		ensures that reads by the CPU from an incoming DMA buffer allocated
		by dma_alloc_coherent() will not see stale data after reading from
		the DMA engine's MMIO status register to establish that the DMA
		transfer has completed.

		5. A readX() by a CPU thread from the peripheral will complete before
		any subsequent delay() loop can begin execution on the same thread.
		This ensures that two MMIO register writes by the CPU to a peripheral
		will arrive at least 1us apart if the first write is immediately read
		back with readX() and udelay(1) is called prior to the second
		writeX():

		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
		readl(DEVICE_REGISTER_0);
		udelay(1);
		writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

		The ordering properties of __iomem pointers obtained with non-default
		attributes (e.g. those returned by ioremap_wc()) are specific to the
		underlying architecture and therefore the guarantees listed above cannot
		generally be relied upon for accesses to these types of mappings.
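Guarantees 3 and 4 above describe the usual DMA handshake: fill a buffer, write a control register to start the transfer, poll a status register, then read the result. The following is a minimal user-space sketch of that sequence, not driver code. `struct fake_dev`, `stub_writel()`, `stub_readl()` and `run_transfer()` are hypothetical stand-ins invented for illustration; a real driver would call writel() and readl() on an ioremap() mapping and use buffers from dma_alloc_coherent(). The stub runs on a single thread, so it only models the required program order and cannot demonstrate the hardware ordering itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register layout of the simulated device. */
#define REG_DOORBELL 0
#define REG_STATUS   1
#define STATUS_DONE  0x1u

struct fake_dev {
	uint32_t regs[2];
	const uint32_t *tx_buf;	/* outbound buffer the "device" reads */
	uint32_t *rx_buf;	/* inbound buffer the "device" writes */
	size_t len;		/* buffer length in 32-bit words */
};

static struct fake_dev *bound_dev;	/* device behind the fake MMIO window */

/*
 * Stand-in for writel(). A doorbell write makes the simulated device copy
 * the outbound buffer as it looks at that instant, then flag completion.
 */
static void stub_writel(uint32_t val, volatile uint32_t *addr)
{
	struct fake_dev *d = bound_dev;

	*addr = val;
	if (d && addr == &d->regs[REG_DOORBELL]) {
		memcpy(d->rx_buf, d->tx_buf, d->len * sizeof(uint32_t));
		d->regs[REG_STATUS] = STATUS_DONE;
	}
}

/* Stand-in for readl(). */
static uint32_t stub_readl(const volatile uint32_t *addr)
{
	return *addr;
}

/* Runs one simulated transfer and returns the last word received. */
static uint32_t run_transfer(struct fake_dev *d, uint32_t *tx, uint32_t *rx,
			     size_t len, uint32_t seed)
{
	size_t i;

	assert(len > 0);
	bound_dev = d;
	d->tx_buf = tx;
	d->rx_buf = rx;
	d->len = len;
	d->regs[REG_STATUS] = 0;

	/* Plain memory writes to the outbound buffer come first. */
	for (i = 0; i < len; i++)
		tx[i] = seed + (uint32_t)i;

	/* Guarantee 3: the device must observe the buffer contents above. */
	stub_writel(1, &d->regs[REG_DOORBELL]);

	/* Guarantee 4: memory reads after this poll must not be stale. */
	while (!(stub_readl(&d->regs[REG_STATUS]) & STATUS_DONE))
		;

	return rx[len - 1];
}
```

If the buffer fill were moved after the doorbell write, the simulated device would copy stale data, which is the failure that guarantee 3 rules out on real hardware.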

		(*) readX_relaxed(), writeX_relaxed():

		(*) readX(), writeX():
		These are similar to readX() and writeX(), but provide weaker memory
		ordering guarantees. Specifically, they do not guarantee ordering with
		respect to locking, normal memory accesses or delay() loops (i.e.
		bullets 2-5 above) but they are still guaranteed to be ordered with
		respect to other accesses from the same CPU thread to the same
		peripheral when operating on __iomem pointers mapped with the default
		I/O attributes.

		Whether these are guaranteed to be fully ordered and uncombined with
		respect to each other on the issuing CPU depends on the characteristics
		defined for the memory window through which they're accessing. On later
		i386 architecture machines, for example, this is controlled by way of the
		MTRR registers.
		(*) readsX(), writesX():

		Ordinarily, these will be guaranteed to be fully ordered and uncombined,
		provided they're not accessing a prefetchable device.
		The readsX() and writesX() MMIO accessors are designed for accessing
		register-based, memory-mapped FIFOs residing on peripherals that are not
		capable of performing DMA. Consequently, they provide only the ordering
		guarantees of readX_relaxed() and writeX_relaxed(), as documented above.

		However, intermediary hardware (such as a PCI bridge) may indulge in
		deferral if it so wishes; to flush a store, a load from the same location
		is preferred[*], but a load from the same device or from configuration
		space should suffice for PCI.
		(*) inX(), outX():

		[*] NOTE! attempting to load from the same location as was written to may
		cause a malfunction - consider the 16550 Rx/Tx serial registers for
		example.
		The inX() and outX() accessors are intended to access legacy port-mapped
		I/O peripherals, which may require special instructions on some
		architectures (notably x86). The port number of the peripheral being
		accessed is passed as an argument.

		Used with prefetchable I/O memory, an mmiowb() barrier may be required to
		force stores to be ordered.
		Since many CPU architectures ultimately access these peripherals via an
		internal virtual memory mapping, the portable ordering guarantees
		provided by inX() and outX() are the same as those provided by readX()
		and writeX() respectively when accessing a mapping with the default I/O
		attributes.

		Please refer to the PCI specification for more information on interactions
		between PCI transactions.
		Device drivers may expect outX() to emit a non-posted write transaction
		that waits for a completion response from the I/O peripheral before
		returning. This is not guaranteed by all architectures and is therefore
		not part of the portable ordering semantics.

		(*) readX_relaxed(), writeX_relaxed()
		(*) insX(), outsX():

		These are similar to readX() and writeX(), but provide weaker memory
		ordering guarantees. Specifically, they do not guarantee ordering with
		respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
		ordering with respect to LOCK or UNLOCK operations. If the latter is
		required, an mmiowb() barrier can be used. Note that relaxed accesses to
		the same peripheral are guaranteed to be ordered with respect to each
		other.
		As above, the insX() and outsX() accessors provide the same ordering
		guarantees as readsX() and writesX() respectively when accessing a
		mapping with the default I/O attributes.

		(*) ioreadX(), iowriteX()
		(*) ioreadX(), iowriteX():

		These will perform appropriately for the type of access they're actually
		doing, be it inX()/outX() or readX()/writeX().
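The dispatch that ioreadX() and iowriteX() perform can be pictured with a small user-space sketch. Everything here is an assumption made for illustration: `struct io_handle`, the `mock_*()` accessors, `sketch_iowrite32()` and `sketch_ioread32()` are hypothetical names, and the explicit tag is only one way to tell the two kinds of access apart. It is not how the kernel encodes the cookie returned by its mapping functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tagged handle distinguishing port I/O from MMIO. */
enum io_kind { IO_KIND_PORT, IO_KIND_MMIO };

struct io_handle {
	enum io_kind kind;
	uint16_t port;			/* valid when kind == IO_KIND_PORT */
	volatile uint32_t *mmio;	/* valid when kind == IO_KIND_MMIO */
};

static uint32_t fake_port_space[256];	/* stands in for port I/O space */

/* Stand-ins for outl()/inl() and writel()/readl(). */
static void mock_outl(uint32_t val, uint16_t port)
{
	fake_port_space[port % 256] = val;
}

static uint32_t mock_inl(uint16_t port)
{
	return fake_port_space[port % 256];
}

static void mock_writel(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

static uint32_t mock_readl(const volatile uint32_t *addr)
{
	return *addr;
}

/* One entry point that picks the access type from the handle. */
static void sketch_iowrite32(uint32_t val, const struct io_handle *h)
{
	assert(h);
	if (h->kind == IO_KIND_PORT)
		mock_outl(val, h->port);
	else
		mock_writel(val, h->mmio);
}

static uint32_t sketch_ioread32(const struct io_handle *h)
{
	assert(h);
	if (h->kind == IO_KIND_PORT)
		return mock_inl(h->port);
	return mock_readl(h->mmio);
}
```

The caller uses the same two functions regardless of how the device is attached, which is the convenience the paragraph above describes.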

		With the exception of the string accessors (insX(), outsX(), readsX() and
		writesX()), all of the above assume that the underlying peripheral is
		little-endian and will therefore perform byte-swapping operations on big-endian
		architectures.
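The little-endian assumption can be made concrete with a short user-space sketch. `sketch_le32_load()` and `sketch_le32_store()` are hypothetical helpers written for this note, not kernel functions. They show which value a 32-bit accessor is expected to produce for a given byte layout in the device, whatever the byte order of the CPU running the code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Interpret four device bytes as a little-endian 32-bit value. Building
 * the result from shifts makes it independent of host byte order, so a
 * big-endian host effectively performs the byte swap described above.
 */
static uint32_t sketch_le32_load(const uint8_t *p)
{
	assert(p);
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/* Lay a 32-bit value out in little-endian byte order. */
static void sketch_le32_store(uint8_t *p, uint32_t v)
{
	assert(p);
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)((v >> 24) & 0xff);
}
```

The string accessors are the exception because they move a byte stream to or from a FIFO, where swapping would corrupt the data order.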


		========================================
		ASSUMED MINIMUM EXECUTION ORDERING MODEL

arch/alpha/include/asm/Kbuild

+1 −0

		@@ -9,6 +9,7 @@ generic-y += irq_work.h
		generic-y += kvm_para.h
		generic-y += mcs_spinlock.h
		generic-y += mm-arch-hooks.h
		generic-y += mmiowb.h
		generic-y += preempt.h
		generic-y += sections.h
		generic-y += trace_clock.h

arch/alpha/include/asm/io.h

+0 −2

		@@ -513,8 +513,6 @@ extern inline void writeq(u64 b, volatile void __iomem *addr)
		#define writel_relaxed(b, addr) __raw_writel(b, addr)
		#define writeq_relaxed(b, addr) __raw_writeq(b, addr)

		#define mmiowb()

		/*
		* String version of IO memory access ops:
		*/
