Commit ac12cf85 authored by Will Deacon

Merge branches 'for-next/52-bit-kva', 'for-next/cpu-topology', 'for-next/error-injection', 'for-next/perf', 'for-next/psci-cpuidle', 'for-next/rng', 'for-next/smpboot', 'for-next/tbi' and 'for-next/tlbi' into for-next/core

* for-next/52-bit-kva: (25 commits)
  Support for 52-bit virtual addressing in kernel space

* for-next/cpu-topology: (9 commits)
  Move CPU topology parsing into core code and add support for ACPI 6.3

* for-next/error-injection: (2 commits)
  Support for function error injection via kprobes

* for-next/perf: (8 commits)
  Support for i.MX8 DDR PMU and proper SMMUv3 group validation

* for-next/psci-cpuidle: (7 commits)
  Move PSCI idle code into a new CPUidle driver

* for-next/rng: (4 commits)
  Support for 'rng-seed' property being passed in the devicetree

* for-next/smpboot: (3 commits)
  Reduce fragility of secondary CPU bringup in debug configurations

* for-next/tbi: (10 commits)
  Introduce new syscall ABI with relaxed requirements for pointer tags

* for-next/tlbi: (6 commits)
  Handle spurious page faults arising from kernel space
+52 −0
=====================================================
Freescale i.MX8 DDR Performance Monitoring Unit (PMU)
=====================================================

There are no performance counters inside the DRAM controller, so performance
signals are brought out to the edge of the controller, where a set of 4 x 32-bit
counters is implemented. The counters are controlled by the CSV modes programmed
in the counter control register, which cause a large number of PERF signals to be
generated.

The value counted by each counter is selected via the config registers; there
is one register per counter. Counter 0 is special in that it always counts
"time". When it expires, it locks itself and the other counters, and an
interrupt is raised. If any other counter overflows, it continues counting and
no interrupt is raised.

The "format" directory describes the format of the config (event ID) and config1
(AXI filtering) fields of the perf_event_attr structure; see
/sys/bus/event_source/devices/imx8_ddr0/format/. The "events" directory describes
the event types supported by the hardware that can be used with the perf tool; see
/sys/bus/event_source/devices/imx8_ddr0/events/.
  e.g.::
        perf stat -a -e imx8_ddr0/cycles/ cmd
        perf stat -a -e imx8_ddr0/read/,imx8_ddr0/write/ cmd

AXI filtering is only used by CSV modes 0x41 (axid-read) and 0x42 (axid-write)
to count read or write transactions that match the filter setting. The filter
setting varies between DRAM controller implementations, which are distinguished
by quirks in the driver.

* With DDR_CAP_AXI_ID_FILTER quirk.
  The filter is defined by two configuration parts:
  --AXI_ID defines the AxID matching value.
  --AXI_MASKING defines which bits of AxID are meaningful for the matching.
        0: corresponding bit is masked.
        1: corresponding bit is not masked, i.e. used to do the matching.

  AXI_ID and AXI_MASKING are mapped onto the DPCR1 register in the performance
  counter. The perf counter is incremented when the non-masked bits of AxID
  match the corresponding AXI_ID bits, i.e. if
          (AxID & AXI_MASKING) == (AXI_ID & AXI_MASKING)

  This filter cannot match different AXI IDs for the axid-read and axid-write
  events at the same time, as the filter is shared between counters.
  e.g.::
        perf stat -a -e imx8_ddr0/axid-read,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd
        perf stat -a -e imx8_ddr0/axid-write,axi_mask=0xMMMM,axi_id=0xDDDD/ cmd

  NOTE: axi_mask is inverted in userspace (i.e. set bits are bits to mask), and
  the driver inverts it back automatically, so that the user can just specify
  axi_id to monitor a specific ID, rather than having to specify axi_mask as well.
  e.g. to monitor ARID=0x12::
        perf stat -a -e imx8_ddr0/axid-read,axi_id=0x12/ cmd
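
The matching rule and the userspace mask inversion can be sketched in C. This is an illustration only, not the driver's code: the function names are hypothetical, and the 16-bit field width is an assumption based on the four-hex-digit placeholders in the examples.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field width: 16 bits (hypothetical, for illustration only). */
#define AXI_FIELD_MASK 0xffffu

/*
 * Convert the userspace axi_mask (set bits are ignored) into an
 * AXI_MASKING value (set bits take part in the comparison).
 */
uint16_t axi_masking_from_user(uint16_t axi_mask)
{
	return (uint16_t)(~axi_mask & AXI_FIELD_MASK);
}

/* True when every non-masked bit of AxID equals the same bit of AXI_ID. */
bool axid_matches(uint16_t axid, uint16_t axi_id, uint16_t axi_masking)
{
	return (axid & axi_masking) == (axi_id & axi_masking);
}
```

With the default axi_mask of 0, every bit is compared, so only an exact ID match counts; setting a bit in axi_mask makes that bit a "don't care".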
+1 −0
@@ -16,6 +16,7 @@ ARM64 Architecture
    pointer-authentication
    silicon-errata
    sve
    tagged-address-abi
    tagged-pointers

.. only::  subproject and html
+27 −0
#!/bin/sh

# Print out the KASAN_SHADOW_OFFSETS required to place the KASAN SHADOW
# start address at the mid-point of the kernel VA space

print_kasan_offset () {
	printf "%02d\t" $1
	printf "0x%08x00000000\n" $(( (0xffffffff & (-1 << ($1 - 1 - 32))) \
			+ (1 << ($1 - 32 - $2)) \
			- (1 << (64 - 32 - $2)) ))
}

echo KASAN_SHADOW_SCALE_SHIFT = 3
printf "VABITS\tKASAN_SHADOW_OFFSET\n"
print_kasan_offset 48 3
print_kasan_offset 47 3
print_kasan_offset 42 3
print_kasan_offset 39 3
print_kasan_offset 36 3
echo
echo KASAN_SHADOW_SCALE_SHIFT = 4
printf "VABITS\tKASAN_SHADOW_OFFSET\n"
print_kasan_offset 48 4
print_kasan_offset 47 4
print_kasan_offset 42 4
print_kasan_offset 39 4
print_kasan_offset 36 4
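
For readers who prefer C, the arithmetic in print_kasan_offset can be mirrored as below. This is a sketch, not part of the commit: the function name is hypothetical, and like the script it assumes the VA size is larger than 32 plus the scale shift.

```c
#include <stdint.h>

/*
 * Mirror of the shell arithmetic: compute the upper 32 bits of the
 * offset, then shift them into place. The lower 32 bits are zero.
 */
uint64_t kasan_shadow_offset(unsigned int va_bits, unsigned int scale_shift)
{
	uint64_t hi = (0xffffffffULL & (~0ULL << (va_bits - 1 - 32)))
		    + (1ULL << (va_bits - 32 - scale_shift))
		    - (1ULL << (64 - 32 - scale_shift));

	return hi << 32;
}
```

For example, 48-bit VAs with a scale shift of 3 give 0xdfffa00000000000.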
+95 −28
@@ -14,6 +14,10 @@ with the 4KB page configuration, allowing 39-bit (512GB) or 48-bit
64KB pages, only 2 levels of translation tables, allowing 42-bit (4TB)
virtual address, are used but the memory layout is the same.

ARMv8.2 adds optional support for Large Virtual Address space. This is
only available when running with a 64KB page size and expands the
number of descriptors in the first level of translation.

User addresses have bits 63:48 set to 0 while the kernel addresses have
the same bits set to 1. TTBRx selection is given by bit 63 of the
virtual address. The swapper_pg_dir contains only kernel (global)
@@ -22,40 +26,43 @@ The swapper_pg_dir address is written to TTBR1 and never written to
TTBR0.


AArch64 Linux memory layout with 4KB pages + 3 levels::

  Start			End			Size		Use
  -----------------------------------------------------------------------
  0000000000000000	0000007fffffffff	 512GB		user
  ffffff8000000000	ffffffffffffffff	 512GB		kernel


AArch64 Linux memory layout with 4KB pages + 4 levels::
AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::

  Start			End			Size		Use
  -----------------------------------------------------------------------
  0000000000000000	0000ffffffffffff	 256TB		user
  ffff000000000000	ffffffffffffffff	 256TB		kernel


AArch64 Linux memory layout with 64KB pages + 2 levels::
  ffff000000000000	ffff7fffffffffff	 128TB		kernel logical memory map
  ffff800000000000	ffff9fffffffffff	  32TB		kasan shadow region
  ffffa00000000000	ffffa00007ffffff	 128MB		bpf jit region
  ffffa00008000000	ffffa0000fffffff	 128MB		modules
  ffffa00010000000	fffffdffbffeffff	 ~93TB		vmalloc
  fffffdffbfff0000	fffffdfffe5f8fff	~998MB		[guard region]
  fffffdfffe5f9000	fffffdfffe9fffff	4124KB		fixed mappings
  fffffdfffea00000	fffffdfffebfffff	   2MB		[guard region]
  fffffdfffec00000	fffffdffffbfffff	  16MB		PCI I/O space
  fffffdffffc00000	fffffdffffdfffff	   2MB		[guard region]
  fffffdffffe00000	ffffffffffdfffff	   2TB		vmemmap
  ffffffffffe00000	ffffffffffffffff	   2MB		[guard region]


AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support)::

  Start			End			Size		Use
  -----------------------------------------------------------------------
  0000000000000000	000003ffffffffff	   4TB		user
  fffffc0000000000	ffffffffffffffff	   4TB		kernel


AArch64 Linux memory layout with 64KB pages + 3 levels::

  Start			End			Size		Use
  -----------------------------------------------------------------------
  0000000000000000	0000ffffffffffff	 256TB		user
  ffff000000000000	ffffffffffffffff	 256TB		kernel


For details of the virtual kernel memory layout please see the kernel
booting log.
  0000000000000000	000fffffffffffff	   4PB		user
  fff0000000000000	fff7ffffffffffff	   2PB		kernel logical memory map
  fff8000000000000	fffd9fffffffffff	1440TB		[gap]
  fffda00000000000	ffff9fffffffffff	 512TB		kasan shadow region
  ffffa00000000000	ffffa00007ffffff	 128MB		bpf jit region
  ffffa00008000000	ffffa0000fffffff	 128MB		modules
  ffffa00010000000	fffff81ffffeffff	 ~88TB		vmalloc
  fffff81fffff0000	fffffc1ffe58ffff	  ~3TB		[guard region]
  fffffc1ffe590000	fffffc1ffe9fffff	4544KB		fixed mappings
  fffffc1ffea00000	fffffc1ffebfffff	   2MB		[guard region]
  fffffc1ffec00000	fffffc1fffbfffff	  16MB		PCI I/O space
  fffffc1fffc00000	fffffc1fffdfffff	   2MB		[guard region]
  fffffc1fffe00000	ffffffffffdfffff	3968GB		vmemmap
  ffffffffffe00000	ffffffffffffffff	   2MB		[guard region]


Translation table lookup with 4KB pages::
@@ -83,7 +90,8 @@ Translation table lookup with 64KB pages::
   |                 |    |               |            [15:0]  in-page offset
   |                 |    |               +----------> [28:16] L3 index
   |                 |    +--------------------------> [41:29] L2 index
   |                 +-------------------------------> [47:42] L1 index
   |                 +-------------------------------> [47:42] L1 index (48-bit)
   |                                                   [51:42] L1 index (52-bit)
   +-------------------------------------------------> [63] TTBR0/1
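
The bit fields in the 64KB-page diagram can be extracted as in the sketch below. The helper names are hypothetical; the shifts and widths are read directly off the diagram, with the L1 index widening from 6 to 10 bits for 52-bit VAs.

```c
#include <stdint.h>

/* Bits [15:0]: offset within a 64KB page. */
unsigned int va_page_offset(uint64_t va)
{
	return va & 0xffff;
}

/* Bits [28:16]: 13-bit L3 index. */
unsigned int va_l3_index(uint64_t va)
{
	return (va >> 16) & 0x1fff;
}

/* Bits [41:29]: 13-bit L2 index. */
unsigned int va_l2_index(uint64_t va)
{
	return (va >> 29) & 0x1fff;
}

/* Bits [47:42] for 48-bit VAs, bits [51:42] for 52-bit VAs. */
unsigned int va_l1_index(uint64_t va, unsigned int va_bits)
{
	return (va >> 42) & ((1u << (va_bits - 42)) - 1);
}
```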


@@ -96,3 +104,62 @@ ARM64_HARDEN_EL2_VECTORS is selected for particular CPUs.

When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.

52-bit VA support in the kernel
-------------------------------
If the ARMv8.2-LVA optional feature is present, and we are running
with a 64KB page size, then it is possible to use 52 bits of address
space for both userspace and kernel addresses. However, any kernel
binary that supports 52-bit must also be able to fall back to 48-bit
at early boot time if the hardware feature is not present.

This fallback mechanism requires the kernel .text to be in the
higher addresses, so that its location is invariant to 48/52-bit VAs.
Because the kasan shadow is a fraction of the entire kernel VA space,
the end of the kasan shadow must also be in the higher half of the
kernel VA space for both 48/52-bit. (When switching from 48-bit to
52-bit, the end of the kasan shadow is invariant and dependent on ~0UL,
whilst the start address "grows" towards the lower addresses.)
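
The fixed end and moving start of the kasan shadow can be checked against the layout tables with a small sketch. The names are hypothetical. The end address is taken from the tables (one past ffff9fffffffffff), and the shadow is assumed to cover the kernel VA space at a scale of 1 << scale_shift.

```c
#include <stdint.h>

/* One past the last kasan shadow byte, as listed in the layout tables. */
#define EXAMPLE_KASAN_SHADOW_END 0xffffa00000000000ULL

/* The shadow is 1/(1 << scale_shift) of a (1 << va_bits) byte VA space. */
uint64_t kasan_shadow_start(unsigned int va_bits, unsigned int scale_shift)
{
	return EXAMPLE_KASAN_SHADOW_END - (1ULL << (va_bits - scale_shift));
}
```

With a scale shift of 3 this yields ffff800000000000 for 48-bit and fffda00000000000 for 52-bit, matching the start addresses in the tables.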

In order to optimise phys_to_virt and virt_to_phys, PAGE_OFFSET
is kept constant at 0xFFF0000000000000 (corresponding to 52-bit),
which obviates the need for an extra variable read. The physvirt
offset and vmemmap offsets are computed at early boot to enable
this logic.

As a single binary will need to support both 48-bit and 52-bit VA
spaces, the VMEMMAP must be sized large enough for 52-bit VAs and
also must be sized large enough to accommodate a fixed PAGE_OFFSET.

Most code in the kernel should not need to consider VA_BITS. For
code that does need to know the VA size, the variables are
defined as follows:

VA_BITS		constant	the *maximum* VA space size

VA_BITS_MIN	constant	the *minimum* VA space size

vabits_actual	variable	the *actual* VA space size


Maximum and minimum sizes can be useful to ensure that buffers are
sized large enough or that addresses are positioned close enough for
the "worst" case.
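
As a rough illustration of the "worst case" reasoning, the sketch below computes the lowest kernel address and the size of the kernel VA space for a given number of VA bits. The names and the EXAMPLE_ constants are hypothetical stand-ins for a 52-bit capable build. They are not the kernel's definitions.

```c
#include <stdint.h>

#define EXAMPLE_VA_BITS     52 /* maximum VA size in this sketch */
#define EXAMPLE_VA_BITS_MIN 48 /* minimum VA size, used on fallback */

/* Lowest kernel address: the top (64 - va_bits) bits are all ones. */
uint64_t kernel_va_start(unsigned int va_bits)
{
	return ~0ULL << va_bits;
}

/* Size in bytes of the kernel half of the address space. */
uint64_t kernel_va_size(unsigned int va_bits)
{
	return 1ULL << va_bits;
}
```

Sizing a region with EXAMPLE_VA_BITS covers both configurations, while kernel_va_start(EXAMPLE_VA_BITS_MIN) gives the lowest address that is valid in either case.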

52-bit userspace VAs
--------------------
To maintain compatibility with software that relies on the ARMv8.0
VA space maximum size of 48 bits, the kernel will, by default,
return virtual addresses to userspace from a 48-bit range.

Software can "opt in" to receiving VAs from a 52-bit space by
specifying an mmap hint parameter that is larger than 48 bits.
For example:
    maybe_high_address = mmap(~0UL, size, prot, flags,...);

It is also possible to build a debug kernel that returns addresses
from a 52-bit space by enabling the following kernel config options:
   CONFIG_EXPERT=y && CONFIG_ARM64_FORCE_52BIT=y

Note that this option is only intended for debugging applications
and should not be used in production.
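
A slightly fuller sketch of the opt-in is shown below. It assumes a Linux userspace with anonymous mappings available, and both helper names are hypothetical. Whether the returned address actually lies above the 48-bit range depends on the hardware and kernel configuration, so the caller should check rather than assume.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Request an anonymous mapping, hinting at the top of the address space. */
void *map_with_high_hint(size_t size)
{
	return mmap((void *)~0UL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* True if the address does not fit in 48 bits. */
bool is_above_48bit(const void *addr)
{
	return ((uint64_t)(uintptr_t)addr >> 48) != 0;
}
```

A caller would compare the result of map_with_high_hint() against MAP_FAILED first, then use is_above_48bit() to see which range the kernel chose.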
+156 −0
==========================
AArch64 TAGGED ADDRESS ABI
==========================

Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
         Catalin Marinas <catalin.marinas@arm.com>

Date: 21 August 2019

This document describes the usage and semantics of the Tagged Address
ABI on AArch64 Linux.

1. Introduction
---------------

On AArch64 the ``TCR_EL1.TBI0`` bit is set by default, allowing
userspace (EL0) to perform memory accesses through 64-bit pointers with
a non-zero top byte. This document describes the relaxation of the
syscall ABI that allows userspace to pass certain tagged pointers to
kernel syscalls.

2. AArch64 Tagged Address ABI
-----------------------------

From the kernel syscall interface perspective and for the purposes of
this document, a "valid tagged pointer" is a pointer with a potentially
non-zero top-byte that references an address in the user process address
space obtained in one of the following ways:

- ``mmap()`` syscall where either:

  - flags have the ``MAP_ANONYMOUS`` bit set or
  - the file descriptor refers to a regular file (including those
    returned by ``memfd_create()``) or ``/dev/zero``

- ``brk()`` syscall (i.e. the heap area between the initial location of
  the program break at process creation and its current location).

- any memory mapped by the kernel in the address space of the process
  during creation and with the same restrictions as for ``mmap()`` above
  (e.g. data, bss, stack).

The AArch64 Tagged Address ABI has two stages of relaxation depending
on how the user addresses are used by the kernel:

1. User addresses not accessed by the kernel but used for address space
   management (e.g. ``mmap()``, ``mprotect()``, ``madvise()``). The use
   of valid tagged pointers in this context is always allowed.

2. User addresses accessed by the kernel (e.g. ``write()``). This ABI
   relaxation is disabled by default and the application thread needs to
   explicitly enable it via ``prctl()`` as follows:

   - ``PR_SET_TAGGED_ADDR_CTRL``: enable or disable the AArch64 Tagged
     Address ABI for the calling thread.

     The ``(unsigned int) arg2`` argument is a bit mask describing the
     control mode used:

     - ``PR_TAGGED_ADDR_ENABLE``: enable AArch64 Tagged Address ABI.
       Default status is disabled.

     Arguments ``arg3``, ``arg4``, and ``arg5`` must be 0.

   - ``PR_GET_TAGGED_ADDR_CTRL``: get the status of the AArch64 Tagged
     Address ABI for the calling thread.

     Arguments ``arg2``, ``arg3``, ``arg4``, and ``arg5`` must be 0.

   The ABI properties described above are thread-scoped, inherited on
   clone() and fork() and cleared on exec().

   Calling ``prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0)``
   returns ``-EINVAL`` if the AArch64 Tagged Address ABI is globally
   disabled by ``sysctl abi.tagged_addr_disabled=1``. The default
   ``sysctl abi.tagged_addr_disabled`` configuration is 0.
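
Querying the current state with ``PR_GET_TAGGED_ADDR_CTRL`` might look like the sketch below. The helper names are hypothetical. The fallback value 56 for ``PR_GET_TAGGED_ADDR_CTRL`` is taken from the Linux uapi ``prctl.h`` header and is only used when the system headers do not already define it.

```c
#include <stdbool.h>
#include <sys/prctl.h>

#ifndef PR_GET_TAGGED_ADDR_CTRL
#define PR_GET_TAGGED_ADDR_CTRL	56
#endif
#ifndef PR_TAGGED_ADDR_ENABLE
#define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
#endif

/* Return the control bit mask for the calling thread, or -1 on error. */
int tagged_addr_ctrl_get(void)
{
	return prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
}

/* Interpret a value returned by tagged_addr_ctrl_get(). */
bool tagged_addr_abi_enabled(int ctrl)
{
	return ctrl >= 0 && (ctrl & PR_TAGGED_ADDR_ENABLE) != 0;
}
```

An error return (for example on a kernel or architecture without this ABI) is treated as "not enabled".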

When the AArch64 Tagged Address ABI is enabled for a thread, the
following behaviours are guaranteed:

- All syscalls except the cases mentioned in section 3 can accept any
  valid tagged pointer.

- The syscall behaviour is undefined for invalid tagged pointers: it may
  result in an error code being returned, a (fatal) signal being raised,
  or other modes of failure.

- The syscall behaviour for a valid tagged pointer is the same as for
  the corresponding untagged pointer.
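
The relationship between a tagged pointer and its untagged counterpart can be sketched with a few helpers. The names are hypothetical. The tag is assumed to occupy the top byte (bits 63:56) of a 64-bit user pointer, as in the usage example in section 4.

```c
#include <stdint.h>

#define EXAMPLE_TAG_SHIFT 56

/* Replace the top byte of ptr with tag. */
void *ptr_set_tag(void *ptr, uint8_t tag)
{
	uint64_t addr = (uint64_t)(uintptr_t)ptr;

	addr &= ~(0xffULL << EXAMPLE_TAG_SHIFT);
	addr |= (uint64_t)tag << EXAMPLE_TAG_SHIFT;
	return (void *)(uintptr_t)addr;
}

/* Extract the top byte of ptr. */
uint8_t ptr_get_tag(const void *ptr)
{
	return (uint8_t)((uint64_t)(uintptr_t)ptr >> EXAMPLE_TAG_SHIFT);
}

/* The corresponding untagged user pointer has a zero top byte. */
void *ptr_untag(void *ptr)
{
	return ptr_set_tag(ptr, 0);
}
```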


A definition of the meaning of tagged pointers on AArch64 can be found
in Documentation/arm64/tagged-pointers.rst.

3. AArch64 Tagged Address ABI Exceptions
-----------------------------------------

The following system call parameters must be untagged regardless of the
ABI relaxation:

- ``prctl()`` other than pointers to user data either passed directly or
  indirectly as arguments to be accessed by the kernel.

- ``ioctl()`` other than pointers to user data either passed directly or
  indirectly as arguments to be accessed by the kernel.

- ``shmat()`` and ``shmdt()``.

Any attempt to use non-zero tagged pointers may result in an error code
being returned, a (fatal) signal being raised, or other modes of
failure.

4. Example of correct usage
---------------------------
.. code-block:: c

   #include <stdlib.h>
   #include <string.h>
   #include <unistd.h>
   #include <sys/mman.h>
   #include <sys/prctl.h>
   
   #define PR_SET_TAGGED_ADDR_CTRL	55
   #define PR_TAGGED_ADDR_ENABLE	(1UL << 0)
   
   #define TAG_SHIFT		56
   
   int main(void)
   {
   	int tbi_enabled = 0;
   	unsigned long tag = 0;
   	char *ptr;
   
   	/* check/enable the tagged address ABI */
   	if (!prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0))
   		tbi_enabled = 1;
   
   	/* memory allocation */
   	ptr = mmap(NULL, sysconf(_SC_PAGE_SIZE), PROT_READ | PROT_WRITE,
   		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   	if (ptr == MAP_FAILED)
   		return 1;
   
   	/* set a non-zero tag if the ABI is available */
   	if (tbi_enabled)
   		tag = rand() & 0xff;
   	ptr = (char *)((unsigned long)ptr | (tag << TAG_SHIFT));
   
   	/* memory access to a tagged address */
   	strcpy(ptr, "tagged pointer\n");
   
   	/* syscall with a tagged pointer */
   	write(1, ptr, strlen(ptr));
   
   	return 0;
   }