Commit 0ef0fd35 authored by Linus Torvalds
Pull KVM updates from Paolo Bonzini:
 "ARM:
   - support for SVE and Pointer Authentication in guests
   - PMU improvements

  POWER:
   - support for direct access to the POWER9 XIVE interrupt controller
   - memory and performance optimizations

  x86:
   - support for accessing memory not backed by struct page
   - fixes and refactoring

  Generic:
   - dirty page tracking improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
  kvm: fix compilation on aarch64
  Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
  kvm: x86: Fix L1TF mitigation for shadow MMU
  KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
  KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
  KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
  KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
  kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
  tests: kvm: Add tests for KVM_SET_NESTED_STATE
  KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
  tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
  tests: kvm: Add tests to .gitignore
  KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
  KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
  KVM: Fix the bitmap range to copy during clear dirty
  KVM: arm64: Fix ptrauth ID register masking logic
  KVM: x86: use direct accessors for RIP and RSP
  KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
  KVM: x86: Omit caching logic for always-available GPRs
  kvm, x86: Properly check whether a pfn is an MMIO or not
  ...
parents 4489da71 c011d23b
+85 −0
Perf Event Attributes
=====================

Author: Andrew Murray <andrew.murray@arm.com>
Date: 2019-03-06

exclude_user
------------

This attribute excludes userspace.

Userspace always runs at EL0 and thus this attribute will exclude EL0.


exclude_kernel
--------------

This attribute excludes the kernel.

The kernel runs at EL2 with VHE and EL1 without. Guest kernels always run
at EL1.

For the host this attribute will exclude EL1 and additionally EL2 on a VHE
system.

For the guest this attribute will exclude EL1. Please note that EL2 is
never counted within a guest.


exclude_hv
----------

This attribute excludes the hypervisor.

For a VHE host this attribute is ignored as we consider the host kernel to
be the hypervisor.

For a non-VHE host this attribute will exclude EL2 as we consider the
hypervisor to be any code that runs at EL2 which is predominantly used for
guest/host transitions.

For the guest this attribute has no effect. Please note that EL2 is
never counted within a guest.


exclude_host / exclude_guest
----------------------------

These attributes exclude the KVM host and guest, respectively.

The KVM host may run at EL0 (userspace), EL1 (non-VHE kernel) and EL2 (VHE
kernel or non-VHE hypervisor).

The KVM guest may run at EL0 (userspace) and EL1 (kernel).

Due to the overlapping exception levels between host and guests we cannot
exclusively rely on the PMU's hardware exception filtering - therefore we
must enable/disable counting on the entry and exit to the guest. This is
performed differently on VHE and non-VHE systems.

For non-VHE systems we exclude EL2 for exclude_host - upon entering and
exiting the guest we disable/enable the event as appropriate based on the
exclude_host and exclude_guest attributes.

For VHE systems we exclude EL1 for exclude_guest and exclude both EL0 and EL2
for exclude_host. Upon entering and exiting the guest we modify the event
to include/exclude EL0 as appropriate based on the exclude_host and
exclude_guest attributes.

The statements above also apply when these attributes are used within a
non-VHE guest; however, please note that EL2 is never counted within a guest.


Accuracy
--------

On non-VHE hosts we enable/disable counters on the entry/exit of host/guest
transition at EL2 - however there is a period of time between
enabling/disabling the counters and entering/exiting the guest. We are
able to eliminate counters counting host events on the boundaries of guest
entry/exit when counting guest events by filtering out EL2 for
exclude_host. However when using !exclude_hv there is a small blackout
window at the guest entry/exit where host events are not captured.

On VHE systems there are no blackout windows.
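
The per-attribute rules above can be condensed into a small model. The
sketch below is only a transcription of this text into C, not the kernel's
filter code: the EL0/EL1/EL2 mask names, struct excl_attrs and
host_counted_els() are hypothetical names introduced for illustration. It
returns the exception levels that remain counted while the host is running.

```c
#include <assert.h>
#include <stdbool.h>

/* Exception levels as a bitmask (illustrative names only). */
#define EL0 (1u << 0)
#define EL1 (1u << 1)
#define EL2 (1u << 2)

/* Hypothetical mirror of the perf_event_attr exclude_* bits. */
struct excl_attrs {
	bool user, kernel, hv, host, guest;
};

/*
 * Mask of exception levels counted while the host runs, following the
 * rules described in the text (a model, not the kernel implementation).
 */
static unsigned int host_counted_els(const struct excl_attrs *a, bool vhe)
{
	unsigned int els = EL0 | EL1 | EL2;

	assert(a);

	if (a->user)
		els &= ~EL0;
	if (a->kernel)
		els &= ~EL1;

	if (vhe) {
		/* The host kernel runs at EL2; exclude_hv is ignored. */
		if (a->kernel)
			els &= ~EL2;
		if (a->guest)
			els &= ~EL1;
		if (a->host)
			els &= ~(EL0 | EL2);
	} else {
		/* EL2 is the hypervisor; exclude_host also drops it. */
		if (a->hv || a->host)
			els &= ~EL2;
	}
	return els;
}
```

The model covers only the static filter. The enabling and disabling (or EL0
toggling on VHE) performed at guest entry and exit is not represented.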
+18 −4
@@ -87,7 +87,21 @@ used to get and set the keys for a thread.
Virtualization
--------------

Pointer authentication is not currently supported in KVM guests. KVM
will mask the feature bits from ID_AA64ISAR1_EL1, and attempted use of
the feature will result in an UNDEFINED exception being injected into
the guest.
Pointer authentication is enabled in a KVM guest when each virtual cpu is
initialised by passing the flags KVM_ARM_VCPU_PTRAUTH_[ADDRESS/GENERIC] and
requesting that these two separate cpu features be enabled. The current KVM
guest implementation works by enabling both features together, so both
of these userspace flags are checked before enabling pointer authentication.
Keeping the userspace flags separate means that no userspace ABI change
is needed if support is added in the future to allow these two features
to be enabled independently of one another.

Because the Arm architecture specifies that the Pointer Authentication
feature is implemented along with the VHE feature, the KVM arm64 ptrauth
code relies on VHE mode being present.

Additionally, when these vcpu feature flags are not set, KVM will
filter out the Pointer Authentication system key registers from the
KVM_GET/SET_REG_* ioctls and mask those features from the cpufeature ID
register. Any attempt to use the Pointer Authentication instructions will
result in an UNDEFINED exception being injected into the guest.
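
Userspace can check the "both or neither" rule before calling
KVM_ARM_VCPU_INIT. The following is a minimal sketch: the helper names are
hypothetical, and the bit numbers 5 and 6 are assumptions that real code
should replace with KVM_ARM_VCPU_PTRAUTH_ADDRESS and
KVM_ARM_VCPU_PTRAUTH_GENERIC from the arm64 <asm/kvm.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit numbers; real code should use the names from <asm/kvm.h>. */
#define VCPU_PTRAUTH_ADDRESS 5
#define VCPU_PTRAUTH_GENERIC 6

/* Test one feature bit in the first word of a features[] array. */
static bool feature_requested(uint32_t features0, unsigned int bit)
{
	assert(bit < 32);
	return (features0 >> bit) & 1u;
}

/* True if both ptrauth features are requested, or neither is. */
static bool ptrauth_request_consistent(uint32_t features0)
{
	return feature_requested(features0, VCPU_PTRAUTH_ADDRESS) ==
	       feature_requested(features0, VCPU_PTRAUTH_GENERIC);
}
```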
+200 −25
@@ -69,23 +69,6 @@ by and on behalf of the VM's process may not be freed/unaccounted when
the VM is shut down.


It is important to note that although VM ioctls may only be issued from
the process that created the VM, a VM's lifecycle is associated with its
file descriptor, not its creator (process).  In other words, the VM and
its resources, *including the associated address space*, are not freed
until the last reference to the VM's file descriptor has been released.
For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
not be freed until both the parent (original) process and its child have
put their references to the VM's file descriptor.

Because a VM's resources are not freed until the last reference to its
file descriptor is released, creating additional references to a VM
via fork(), dup(), etc... without careful consideration is strongly
discouraged and may have unwanted side effects, e.g. memory allocated
by and on behalf of the VM's process may not be freed/unaccounted when
the VM is shut down.


3. Extensions
-------------

@@ -347,7 +330,7 @@ They must be less than the value that KVM_CHECK_EXTENSION returns for
the KVM_CAP_MULTI_ADDRESS_SPACE capability.

The bits in the dirty bitmap are cleared before the ioctl returns, unless
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is enabled.  For more information,
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled.  For more information,
see the description of the capability.

4.9 KVM_SET_MEMORY_ALIAS
@@ -1117,9 +1100,8 @@ struct kvm_userspace_memory_region {
This ioctl allows the user to create, modify or delete a guest physical
memory slot.  Bits 0-15 of "slot" specify the slot id and this value
should be less than the maximum number of user memory slots supported per
VM.  The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS,
if this capability is supported by the architecture.  Slots may not
overlap in guest physical address space.
VM.  The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS.
Slots may not overlap in guest physical address space.

If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot"
specifies the address space which is being modified.  They must be
@@ -1901,6 +1883,12 @@ Architectures: all
Type: vcpu ioctl
Parameters: struct kvm_one_reg (in)
Returns: 0 on success, negative value on failure
Errors:
  ENOENT:   no such register
  EINVAL:   invalid register ID, or no such register
  EPERM:    (arm64) register access not allowed before vcpu finalization
(These error codes are indicative only: do not rely on a specific error
code being returned in a specific situation.)

struct kvm_one_reg {
       __u64 id;
@@ -1985,6 +1973,7 @@ registers, find a list below:
  PPC   | KVM_REG_PPC_TLB3PS            | 32
  PPC   | KVM_REG_PPC_EPTCFG            | 32
  PPC   | KVM_REG_PPC_ICP_STATE         | 64
  PPC   | KVM_REG_PPC_VP_STATE          | 128
  PPC   | KVM_REG_PPC_TB_OFFSET         | 64
  PPC   | KVM_REG_PPC_SPMC1             | 32
  PPC   | KVM_REG_PPC_SPMC2             | 32
@@ -2137,6 +2126,37 @@ contains elements ranging from 32 to 128 bits. The index is a 32bit
value in the kvm_regs structure seen as a 32bit array.
  0x60x0 0000 0010 <index into the kvm_regs struct:16>

Specifically:
    Encoding            Register  Bits  kvm_regs member
----------------------------------------------------------------
  0x6030 0000 0010 0000 X0          64  regs.regs[0]
  0x6030 0000 0010 0002 X1          64  regs.regs[1]
    ...
  0x6030 0000 0010 003c X30         64  regs.regs[30]
  0x6030 0000 0010 003e SP          64  regs.sp
  0x6030 0000 0010 0040 PC          64  regs.pc
  0x6030 0000 0010 0042 PSTATE      64  regs.pstate
  0x6030 0000 0010 0044 SP_EL1      64  sp_el1
  0x6030 0000 0010 0046 ELR_EL1     64  elr_el1
  0x6030 0000 0010 0048 SPSR_EL1    64  spsr[KVM_SPSR_EL1] (alias SPSR_SVC)
  0x6030 0000 0010 004a SPSR_ABT    64  spsr[KVM_SPSR_ABT]
  0x6030 0000 0010 004c SPSR_UND    64  spsr[KVM_SPSR_UND]
  0x6030 0000 0010 004e SPSR_IRQ    64  spsr[KVM_SPSR_IRQ]
  0x6030 0000 0010 0050 SPSR_FIQ    64  spsr[KVM_SPSR_FIQ]
  0x6040 0000 0010 0054 V0         128  fp_regs.vregs[0]    (*)
  0x6040 0000 0010 0058 V1         128  fp_regs.vregs[1]    (*)
    ...
  0x6040 0000 0010 00d0 V31        128  fp_regs.vregs[31]   (*)
  0x6020 0000 0010 00d4 FPSR        32  fp_regs.fpsr
  0x6020 0000 0010 00d5 FPCR        32  fp_regs.fpcr

(*) These encodings are not accepted for SVE-enabled vcpus.  See
    KVM_ARM_VCPU_INIT.

    The equivalent register content can be accessed via bits [127:0] of
    the corresponding SVE Zn registers instead for vcpus that have SVE
    enabled (see below).
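
As an illustration of how the encodings in the table are formed, the sketch
below builds a core register ID from a size field and a byte offset into the
register structure, using the fact that the index counts 32-bit words. The
constants are written out from the bit patterns in the table; the macro and
function names are local to this example (real code would normally use the
KVM_REG_* definitions from the uapi headers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants written out from the bit patterns in the table above. */
#define REG_ARM64     0x6000000000000000ULL
#define REG_SIZE_U32  0x0020000000000000ULL
#define REG_SIZE_U64  0x0030000000000000ULL
#define REG_SIZE_U128 0x0040000000000000ULL
#define REG_ARM_CORE  0x0000000000100000ULL

/*
 * Build a core register ID. byte_offset is the offset of the member
 * within the kvm_regs structure; the ID carries it in 32-bit units.
 */
static uint64_t core_reg_id(uint64_t size, size_t byte_offset)
{
	assert(byte_offset % 4 == 0);
	return REG_ARM64 | size | REG_ARM_CORE | (uint64_t)(byte_offset / 4);
}
```

For example, X1 lives 8 bytes into the structure, so its index is 2 and its
ID is 0x6030 0000 0010 0002, matching the table.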

arm64 CCSIDR registers are demultiplexed by CSSELR value:
  0x6020 0000 0011 00 <csselr:8>

@@ -2146,6 +2166,64 @@ arm64 system registers have the following id bit patterns:
arm64 firmware pseudo-registers have the following bit pattern:
  0x6030 0000 0014 <regno:16>

arm64 SVE registers have the following bit patterns:
  0x6080 0000 0015 00 <n:5> <slice:5>   Zn bits[2048*slice + 2047 : 2048*slice]
  0x6050 0000 0015 04 <n:4> <slice:5>   Pn bits[256*slice + 255 : 256*slice]
  0x6050 0000 0015 060 <slice:5>        FFR bits[256*slice + 255 : 256*slice]
  0x6060 0000 0015 ffff                 KVM_REG_ARM64_SVE_VLS pseudo-register

Access to register IDs where 2048 * slice >= 128 * max_vq will fail with
ENOENT.  max_vq is the vcpu's maximum supported vector length in 128-bit
quadwords: see (**) below.
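
The SVE bit patterns above can be composed as follows. This is a sketch
derived only from the patterns listed here: the macro and function names are
made up for the example, and real code would normally use
KVM_REG_ARM64_SVE_ZREG(), KVM_REG_ARM64_SVE_PREG() and KVM_REG_ARM64_SVE_FFR
from the uapi headers instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constants written out from the bit patterns listed above. */
#define REG_ARM64      0x6000000000000000ULL
#define REG_SIZE_U256  0x0050000000000000ULL
#define REG_SIZE_U2048 0x0080000000000000ULL
#define REG_ARM64_SVE  0x0000000000150000ULL

/* Zn: 2048-bit slices, <n:5> <slice:5> in the low bits. */
static uint64_t sve_zreg_id(unsigned int n, unsigned int slice)
{
	assert(n < 32 && slice < 32);
	return REG_ARM64 | REG_SIZE_U2048 | REG_ARM64_SVE | (n << 5) | slice;
}

/* Pn: 256-bit slices, base 0x0400, <n:4> <slice:5> in the low bits. */
static uint64_t sve_preg_id(unsigned int n, unsigned int slice)
{
	assert(n < 16 && slice < 32);
	return REG_ARM64 | REG_SIZE_U256 | REG_ARM64_SVE | 0x400 |
	       (n << 5) | slice;
}

/* FFR: 256-bit slices, base 0x0600, <slice:5> in the low bits. */
static uint64_t sve_ffr_id(unsigned int slice)
{
	assert(slice < 32);
	return REG_ARM64 | REG_SIZE_U256 | REG_ARM64_SVE | 0x600 | slice;
}

/* A slice is accessible only if 2048 * slice < 128 * max_vq. */
static bool sve_slice_visible(unsigned int slice, unsigned int max_vq)
{
	return 2048u * slice < 128u * max_vq;
}
```

With max_vq at most 16 (a 2048-bit vector), only slice 0 satisfies the
visibility condition.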

These registers are only accessible on vcpus for which SVE is enabled.
See KVM_ARM_VCPU_INIT for details.

In addition, except for KVM_REG_ARM64_SVE_VLS, these registers are not
accessible until the vcpu's SVE configuration has been finalized
using KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE).  See KVM_ARM_VCPU_INIT
and KVM_ARM_VCPU_FINALIZE for more information about this procedure.

KVM_REG_ARM64_SVE_VLS is a pseudo-register that allows the set of vector
lengths supported by the vcpu to be discovered and configured by
userspace.  When transferred to or from user memory via KVM_GET_ONE_REG
or KVM_SET_ONE_REG, the value of this register is of type
__u64[KVM_ARM64_SVE_VLS_WORDS], and encodes the set of vector lengths as
follows:

__u64 vector_lengths[KVM_ARM64_SVE_VLS_WORDS];

if (vq >= SVE_VQ_MIN && vq <= SVE_VQ_MAX &&
    ((vector_lengths[(vq - KVM_ARM64_SVE_VQ_MIN) / 64] >>
		((vq - KVM_ARM64_SVE_VQ_MIN) % 64)) & 1))
	/* Vector length vq * 16 bytes supported */
else
	/* Vector length vq * 16 bytes not supported */

(**) The maximum value vq for which the above condition is true is
max_vq.  This is the maximum vector length available to the guest on
this vcpu, and determines which register slices are visible through
this ioctl interface.

(See Documentation/arm64/sve.txt for an explanation of the "vq"
nomenclature.)
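
The membership test and the definition of max_vq can be written as two small
helpers. This is a sketch only: the values SVE_VQ_MIN = 1 and
SVE_VQ_MAX = 512 are assumptions standing in for the uapi header definitions
(with KVM_ARM64_SVE_VQ_MIN taken to equal SVE_VQ_MIN), and the function names
are local to this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; real code should take these from the uapi headers. */
#define SVE_VQ_MIN 1
#define SVE_VQ_MAX 512
#define VLS_WORDS  ((SVE_VQ_MAX - SVE_VQ_MIN) / 64 + 1)

/* True if vector length vq * 16 bytes is in the set. */
static bool sve_vq_supported(const uint64_t vls[VLS_WORDS], unsigned int vq)
{
	unsigned int bit;

	assert(vls);
	if (vq < SVE_VQ_MIN || vq > SVE_VQ_MAX)
		return false;
	bit = vq - SVE_VQ_MIN;
	return (vls[bit / 64] >> (bit % 64)) & 1;
}

/* Largest vq in the set, or 0 if the set is empty. */
static unsigned int sve_max_vq(const uint64_t vls[VLS_WORDS])
{
	unsigned int vq;

	for (vq = SVE_VQ_MAX; vq >= SVE_VQ_MIN; vq--)
		if (sve_vq_supported(vls, vq))
			return vq;
	return 0;
}
```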

KVM_REG_ARM64_SVE_VLS is only accessible after KVM_ARM_VCPU_INIT.
KVM_ARM_VCPU_INIT initialises it to the best set of vector lengths that
the host supports.

Userspace may subsequently modify it if desired until the vcpu's SVE
configuration is finalized using KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE).

Apart from simply removing all vector lengths from the host set that
exceed some value, support for arbitrarily chosen sets of vector lengths
is hardware-dependent and may not be available.  Attempting to configure
an invalid set of vector lengths via KVM_SET_ONE_REG will fail with
EINVAL.

After the vcpu's SVE configuration is finalized, further attempts to
write this register will fail with EPERM.


MIPS registers are mapped using the lower 32 bits.  The upper 16 of that is
the register group type:
@@ -2198,6 +2276,12 @@ Architectures: all
Type: vcpu ioctl
Parameters: struct kvm_one_reg (in and out)
Returns: 0 on success, negative value on failure
Errors include:
  ENOENT:   no such register
  EINVAL:   invalid register ID, or no such register
  EPERM:    (arm64) register access not allowed before vcpu finalization
(These error codes are indicative only: do not rely on a specific error
code being returned in a specific situation.)

This ioctl allows userspace to read the value of a single register implemented
in a vcpu. The register to read is indicated by the "id" field of the
@@ -2690,6 +2774,49 @@ Possible features:
	- KVM_ARM_VCPU_PMU_V3: Emulate PMUv3 for the CPU.
	  Depends on KVM_CAP_ARM_PMU_V3.

	- KVM_ARM_VCPU_PTRAUTH_ADDRESS: Enables Address Pointer authentication
	  for arm64 only.
	  Depends on KVM_CAP_ARM_PTRAUTH_ADDRESS.
	  If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
	  both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
	  KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
	  requested.

	- KVM_ARM_VCPU_PTRAUTH_GENERIC: Enables Generic Pointer authentication
	  for arm64 only.
	  Depends on KVM_CAP_ARM_PTRAUTH_GENERIC.
	  If KVM_CAP_ARM_PTRAUTH_ADDRESS and KVM_CAP_ARM_PTRAUTH_GENERIC are
	  both present, then both KVM_ARM_VCPU_PTRAUTH_ADDRESS and
	  KVM_ARM_VCPU_PTRAUTH_GENERIC must be requested or neither must be
	  requested.

	- KVM_ARM_VCPU_SVE: Enables SVE for the CPU (arm64 only).
	  Depends on KVM_CAP_ARM_SVE.
	  Requires KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):

	   * After KVM_ARM_VCPU_INIT:

	      - KVM_REG_ARM64_SVE_VLS may be read using KVM_GET_ONE_REG: the
	        initial value of this pseudo-register indicates the best set of
	        vector lengths possible for a vcpu on this host.

	   * Before KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):

	      - KVM_RUN and KVM_GET_REG_LIST are not available;

	      - KVM_GET_ONE_REG and KVM_SET_ONE_REG cannot be used to access
	        the scalable architectural SVE registers
	        KVM_REG_ARM64_SVE_ZREG(), KVM_REG_ARM64_SVE_PREG() or
	        KVM_REG_ARM64_SVE_FFR;

	      - KVM_REG_ARM64_SVE_VLS may optionally be written using
	        KVM_SET_ONE_REG, to modify the set of vector lengths available
	        for the vcpu.

	   * After KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE):

	      - the KVM_REG_ARM64_SVE_VLS pseudo-register is immutable, and can
	        no longer be written using KVM_SET_ONE_REG.

4.83 KVM_ARM_PREFERRED_TARGET

@@ -3809,7 +3936,7 @@ to I/O ports.

4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)

Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
Architectures: x86, arm, arm64, mips
Type: vm ioctl
Parameters: struct kvm_dirty_log (in)
@@ -3842,10 +3969,10 @@ the address space for which you want to return the dirty bitmap.
They must be less than the value that KVM_CHECK_EXTENSION returns for
the KVM_CAP_MULTI_ADDRESS_SPACE capability.

This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
This ioctl is mostly useful when KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
is enabled; for more information, see the description of the capability.
However, it can always be used as long as KVM_CHECK_EXTENSION confirms
that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT is present.
that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.

4.118 KVM_GET_SUPPORTED_HV_CPUID

@@ -3904,6 +4031,40 @@ number of valid entries in the 'entries' array, which is then filled.
'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
userspace should not expect to get any particular value there.

4.119 KVM_ARM_VCPU_FINALIZE

Architectures: arm, arm64
Type: vcpu ioctl
Parameters: int feature (in)
Returns: 0 on success, -1 on error
Errors:
  EPERM:     feature not enabled, needs configuration, or already finalized
  EINVAL:    feature unknown or not present

Recognised values for feature:
  arm64      KVM_ARM_VCPU_SVE (requires KVM_CAP_ARM_SVE)

Finalizes the configuration of the specified vcpu feature.

The vcpu must already have been initialised, enabling the affected feature, by
means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in
features[].

For affected vcpu features, this is a mandatory step that must be performed
before the vcpu is fully usable.

Between KVM_ARM_VCPU_INIT and KVM_ARM_VCPU_FINALIZE, the feature may be
configured by use of ioctls such as KVM_SET_ONE_REG.  The exact configuration
that should be performed and how to do it are feature-dependent.

Other calls that depend on a particular feature being finalized, such as
KVM_RUN, KVM_GET_REG_LIST, KVM_GET_ONE_REG and KVM_SET_ONE_REG, will fail with
-EPERM unless the feature has already been finalized by means of a
KVM_ARM_VCPU_FINALIZE call.

See KVM_ARM_VCPU_INIT for details of vcpu features that require finalization
using this ioctl.
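
The ordering rules for SVE can be summarised as a small state model. The
sketch below does not call into KVM at all: struct vcpu_model and the
model_*() functions are hypothetical and only mirror the error behaviour
described in this section and in KVM_ARM_VCPU_INIT.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one vcpu's SVE configuration state. */
struct vcpu_model {
	bool sve_enabled;   /* KVM_ARM_VCPU_SVE was requested at init */
	bool sve_finalized; /* KVM_ARM_VCPU_FINALIZE has succeeded */
};

/* Models KVM_ARM_VCPU_FINALIZE(KVM_ARM_VCPU_SVE). */
static int model_finalize_sve(struct vcpu_model *v)
{
	assert(v);
	if (!v->sve_enabled || v->sve_finalized)
		return -EPERM;
	v->sve_finalized = true;
	return 0;
}

/* Models calls such as KVM_RUN that need a finalized feature. */
static int model_run(const struct vcpu_model *v)
{
	assert(v);
	if (v->sve_enabled && !v->sve_finalized)
		return -EPERM;
	return 0;
}

/* Models a KVM_SET_ONE_REG write to KVM_REG_ARM64_SVE_VLS. */
static int model_set_vls(const struct vcpu_model *v)
{
	assert(v && v->sve_enabled);
	return v->sve_finalized ? -EPERM : 0;
}
```

In this model, the expected sequence for an SVE vcpu is init, an optional
vector length write, finalize, and only then run.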

5. The kvm_run structure
------------------------

@@ -4505,6 +4666,15 @@ struct kvm_sync_regs {
        struct kvm_vcpu_events events;
};

6.75 KVM_CAP_PPC_IRQ_XIVE

Architectures: ppc
Target: vcpu
Parameters: args[0] is the XIVE device fd
            args[1] is the XIVE CPU number (server ID) for this vcpu

This capability connects the vcpu to an in-kernel XIVE device.

7. Capabilities that can be enabled on VMs
------------------------------------------

@@ -4798,7 +4968,7 @@ and injected exceptions.
* For the new DR6 bits, note that bit 16 is set iff the #DB exception
  will clear DR6.RTM.

7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2

Architectures: x86, arm, arm64, mips
Parameters: args[0] whether feature should be enabled or not
@@ -4821,6 +4991,11 @@ while userspace can see false reports of dirty pages. Manual reprotection
helps reduce this time, improving guest performance and reducing the
number of dirty log false positives.

KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
it hard or impossible to use it correctly.  The availability of
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.

8. Other capabilities.
----------------------
+2 −1
@@ -141,7 +141,8 @@ struct kvm_s390_vm_cpu_subfunc {
       u8 pcc[16];           # valid with Message-Security-Assist-Extension 4
       u8 ppno[16];          # valid with Message-Security-Assist-Extension 5
       u8 kma[16];           # valid with Message-Security-Assist-Extension 8
       u8 reserved[1808];    # reserved for future instructions
       u8 kdsa[16];          # valid with Message-Security-Assist-Extension 9
       u8 reserved[1792];    # reserved for future instructions
};

Parameters: address of a buffer to load the subfunction blocks from.
+197 −0
POWER9 eXternal Interrupt Virtualization Engine (XIVE Gen1)
==========================================================

Device types supported:
  KVM_DEV_TYPE_XIVE     POWER9 XIVE Interrupt Controller generation 1

This device acts as a VM interrupt controller. It provides the KVM
interface to configure the interrupt sources of a VM in the underlying
POWER9 XIVE interrupt controller.

Only one XIVE instance may be instantiated. A guest XIVE device
requires a POWER9 host and the guest OS should have support for the
XIVE native exploitation interrupt mode. If not, it should run using
the legacy interrupt mode, referred to as XICS (POWER7/8).

* Device Mappings

  The KVM device exposes different MMIO ranges of the XIVE HW which
  are required for interrupt management. These are exposed to the
  guest in VMAs populated with a custom VM fault handler.

  1. Thread Interrupt Management Area (TIMA)

  Each thread has an associated Thread Interrupt Management context
  composed of a set of registers. These registers let the thread
  handle priority management and interrupt acknowledgment. The most
  important are :

      - Interrupt Pending Buffer     (IPB)
      - Current Processor Priority   (CPPR)
      - Notification Source Register (NSR)

  They are exposed to software in four different pages, each offering
  a view with a different privilege level. The first page is for the
  physical thread context and the second for the hypervisor. Only the
  third (operating system) and the fourth (user level) are exposed to
  the guest.

  2. Event State Buffer (ESB)

  Each source is associated with an Event State Buffer (ESB) with
  a pair of even/odd pages which provide commands to manage the
  source: for instance, to trigger it, to EOI it, or to turn it off.

  3. Device pass-through

  When a device is passed through to the guest, the source
  interrupts are from a different HW controller (PHB4) and the ESB
  pages exposed to the guest should accommodate this change.

  The passthru_irq helpers, kvmppc_xive_set_mapped() and
  kvmppc_xive_clr_mapped() are called when the device HW irqs are
  mapped into or unmapped from the guest IRQ number space. The KVM
  device extends these helpers to clear the ESB pages of the guest IRQ
  number being mapped and then lets the VM fault handler repopulate.
  The handler will insert the ESB page corresponding to the HW
  interrupt of the device being passed-through or the initial IPI ESB
  page if the device has been removed.

  The ESB remapping is fully transparent to the guest and the OS
  device driver. All handling is done within VFIO and the above
  helpers in KVM-PPC.

* Groups:

  1. KVM_DEV_XIVE_GRP_CTRL
  Provides global controls on the device
  Attributes:
    1.1 KVM_DEV_XIVE_RESET (write only)
    Resets the interrupt controller configuration for sources and event
    queues. To be used by kexec and kdump.
    Errors: none

    1.2 KVM_DEV_XIVE_EQ_SYNC (write only)
    Sync all the sources and queues and mark the EQ pages dirty. This
    is to make sure that a consistent memory state is captured when
    migrating the VM.
    Errors: none

  2. KVM_DEV_XIVE_GRP_SOURCE (write only)
  Initializes a new source in the XIVE device and masks it.
  Attributes:
    Interrupt source number  (64-bit)
  The kvm_device_attr.addr points to a __u64 value:
  bits:     | 63   ....  2 |   1   |   0
  values:   |    unused    | level | type
  - type:  0:MSI 1:LSI
  - level: assertion level in case of an LSI.
  Errors:
    -E2BIG:  Interrupt source number is out of range
    -ENOMEM: Could not create a new source block
    -EFAULT: Invalid user pointer for attr->addr.
    -ENXIO:  Could not allocate underlying HW interrupt

  3. KVM_DEV_XIVE_GRP_SOURCE_CONFIG (write only)
  Configures source targeting
  Attributes:
    Interrupt source number  (64-bit)
  The kvm_device_attr.addr points to a __u64 value:
  bits:     | 63   ....  33 |  32  | 31 .. 3 |  2 .. 0
  values:   |    eisn       | mask |  server | priority
  - priority: 0-7 interrupt priority level
  - server: CPU number chosen to handle the interrupt
  - mask: mask flag (unused)
  - eisn: Effective Interrupt Source Number
  Errors:
    -ENOENT: Unknown source number
    -EINVAL: Not initialized source number
    -EINVAL: Invalid priority
    -EINVAL: Invalid CPU number.
    -EFAULT: Invalid user pointer for attr->addr.
    -ENXIO:  CPU event queues not configured or configuration of the
             underlying HW interrupt failed
    -EBUSY:  No CPU available to serve interrupt

  4. KVM_DEV_XIVE_GRP_EQ_CONFIG (read-write)
  Configures an event queue of a CPU
  Attributes:
    EQ descriptor identifier (64-bit)
  The EQ descriptor identifier is a tuple (server, priority) :
  bits:     | 63   ....  32 | 31 .. 3 |  2 .. 0
  values:   |    unused     |  server | priority
  The kvm_device_attr.addr points to :
    struct kvm_ppc_xive_eq {
	__u32 flags;
	__u32 qshift;
	__u64 qaddr;
	__u32 qtoggle;
	__u32 qindex;
	__u8  pad[40];
    };
  - flags: queue flags
    KVM_XIVE_EQ_ALWAYS_NOTIFY (required)
	forces notification without using the coalescing mechanism
	provided by the XIVE END ESBs.
  - qshift: queue size (power of 2)
  - qaddr: real address of queue
  - qtoggle: current queue toggle bit
  - qindex: current queue index
  - pad: reserved for future use
  Errors:
    -ENOENT: Invalid CPU number
    -EINVAL: Invalid priority
    -EINVAL: Invalid flags
    -EINVAL: Invalid queue size
    -EINVAL: Invalid queue address
    -EFAULT: Invalid user pointer for attr->addr.
    -EIO:    Configuration of the underlying HW failed

  5. KVM_DEV_XIVE_GRP_SOURCE_SYNC (write only)
  Synchronize the source to flush event notifications
  Attributes:
    Interrupt source number  (64-bit)
  Errors:
    -ENOENT: Unknown source number
    -EINVAL: Not initialized source number
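
  The 64-bit values used by the groups above can be packed with a few
  helpers. This is a sketch based solely on the bit layouts given in this
  document: the function names and struct xive_eq_model are hypothetical,
  and real code would use struct kvm_ppc_xive_eq and the constants from the
  uapi headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* KVM_DEV_XIVE_GRP_SOURCE value: bit 0 = type (1 = LSI), bit 1 = level. */
static uint64_t xive_source_value(bool lsi, bool asserted)
{
	/* Choice made here: the level bit is left clear for an MSI. */
	return lsi ? (1ULL | ((uint64_t)asserted << 1)) : 0;
}

/* KVM_DEV_XIVE_GRP_SOURCE_CONFIG value: eisn | mask | server | priority. */
static uint64_t xive_source_config(unsigned int priority, uint32_t server,
				   bool masked, uint32_t eisn)
{
	assert(priority <= 7);        /* bits 2..0 */
	assert(server < (1u << 29));  /* bits 31..3 */
	assert(eisn < (1u << 31));    /* bits 63..33 */
	return ((uint64_t)eisn << 33) | ((uint64_t)masked << 32) |
	       ((uint64_t)server << 3) | priority;
}

/* KVM_DEV_XIVE_GRP_EQ_CONFIG attribute: the (server, priority) tuple. */
static uint64_t xive_eq_id(uint32_t server, unsigned int priority)
{
	assert(priority <= 7);
	assert(server < (1u << 29));
	return ((uint64_t)server << 3) | priority;
}

/* Local mirror of the struct kvm_ppc_xive_eq layout shown above. */
struct xive_eq_model {
	uint32_t flags;
	uint32_t qshift;
	uint64_t qaddr;
	uint32_t qtoggle;
	uint32_t qindex;
	uint8_t  pad[40];
};
```

  On common ABIs the mirrored structure occupies 64 bytes, with qaddr at
  byte offset 8.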

* VCPU state

  The XIVE IC maintains VP interrupt state in an internal structure
  called the NVT. When a VP is not dispatched on a HW processor
  thread, this structure can be updated by HW if the VP is the target
  of an event notification.

  It is important for migration to capture the cached IPB from the NVT
  as it synthesizes the priorities of the pending interrupts. We
  capture a bit more to report debug information.

  KVM_REG_PPC_VP_STATE (2 * 64bits)
  bits:     |  63  ....  32  |  31  ....  0  |
  values:   |   TIMA word0   |   TIMA word1  |
  bits:     | 127       ..........       64  |
  values:   |            unused              |

* Migration:

  Saving the state of a VM using the XIVE native exploitation mode
  should follow a specific sequence. When the VM is stopped :

  1. Mask all sources (PQ=01) to stop the flow of events.

  2. Sync the XIVE device with the KVM control KVM_DEV_XIVE_EQ_SYNC to
  flush any in-flight event notification and to stabilize the EQs. At
  this stage, the EQ pages are marked dirty to make sure they are
  transferred in the migration sequence.

  3. Capture the state of the source targeting, the EQs configuration
  and the state of thread interrupt context registers.

  Restore is similar :

  1. Restore the EQ configuration, as targeting depends on it.
  2. Restore targeting
  3. Restore the thread interrupt contexts
  4. Restore the source states
  5. Let the vCPU run