Commit 6a447b0e authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull KVM updates from Paolo Bonzini:
 "Much x86 work was pushed out to 5.12, but ARM more than made up for it.

  ARM:
   - PSCI relay at EL2 when "protected KVM" is enabled
   - New exception injection code
   - Simplification of AArch32 system register handling
   - Fix PMU accesses when no PMU is enabled
   - Expose CSV3 on non-Meltdown hosts
   - Cache hierarchy discovery fixes
   - PV steal-time cleanups
   - Allow function pointers at EL2
   - Various host EL2 entry cleanups
   - Simplification of the EL2 vector allocation

  s390:
   - memcg accouting for s390 specific parts of kvm and gmap
   - selftest for diag318
   - new kvm_stat for when async_pf falls back to sync

  x86:
   - Tracepoints for the new pagetable code from 5.10
   - Catch VFIO and KVM irqfd events before userspace
   - Reporting dirty pages to userspace with a ring buffer
   - SEV-ES host support
   - Nested VMX support for wait-for-SIPI activity state
   - New feature flag (AVX512 FP16)
   - New system ioctl to report Hyper-V-compatible paravirtualization features

  Generic:
   - Selftest improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
  KVM: SVM: fix 32-bit compilation
  KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
  KVM: SVM: Provide support to launch and run an SEV-ES guest
  KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
  KVM: SVM: Provide support for SEV-ES vCPU loading
  KVM: SVM: Provide support for SEV-ES vCPU creation/loading
  KVM: SVM: Update ASID allocation to support SEV-ES guests
  KVM: SVM: Set the encryption mask for the SVM host save area
  KVM: SVM: Add NMI support for an SEV-ES guest
  KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
  KVM: SVM: Do not report support for SMM for an SEV-ES guest
  KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
  KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
  KVM: SVM: Add support for EFER write traps for an SEV-ES guest
  KVM: SVM: Support string IO operations for an SEV-ES guest
  KVM: SVM: Support MMIO for an SEV-ES guest
  KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
  KVM: SVM: Create trace events for VMGEXIT processing
  ...
parents f4a2f786 d45f89f7
Loading
Loading
Loading
Loading
+10 −0
Original line number Diff line number Diff line
@@ -2254,6 +2254,16 @@
			for all guests.
			Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.

	kvm-arm.mode=
			[KVM,ARM] Select one of KVM/arm64's modes of operation.

			protected: nVHE-based mode with support for guests whose
				   state is kept private from the host.
				   Not valid if the kernel is running in EL2.

			Defaults to VHE/nVHE based on hardware support and
			the value of CONFIG_ARM64_VHE.

	kvm-arm.vgic_v3_group0_trap=
			[KVM,ARM] Trap guest accesses to GICv3 group-0
			system registers
+1 −1
Original line number Diff line number Diff line
@@ -97,7 +97,7 @@ hypervisor maps kernel pages in EL2 at a fixed (and potentially
random) offset from the linear mapping. See the kern_hyp_va macro and
kvm_update_va_mask function for more details. MMIO devices such as
GICv2 gets mapped next to the HYP idmap page, as do vectors when
ARM64_HARDEN_EL2_VECTORS is selected for particular CPUs.
ARM64_SPECTRE_V3A is enabled for particular CPUs.

When using KVM with the Virtualization Host Extensions, no additional
mappings are created, since the host kernel runs directly in EL2.
+111 −5
Original line number Diff line number Diff line
@@ -262,6 +262,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
memory region.  This ioctl returns the size of that region.  See the
KVM_RUN documentation for details.

Besides the size of the KVM_RUN communication region, other areas of
the VCPU file descriptor can be mmap-ed, including:

- if KVM_CAP_COALESCED_MMIO is available, a page at
  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
  KVM_CAP_COALESCED_MMIO is not documented yet.

- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
  KVM_CAP_DIRTY_LOG_RING, see section 8.3.


4.6 KVM_SET_MEMORY_REGION
-------------------------
@@ -4455,9 +4467,9 @@ that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is present.
4.118 KVM_GET_SUPPORTED_HV_CPUID
--------------------------------

:Capability: KVM_CAP_HYPERV_CPUID
:Capability: KVM_CAP_HYPERV_CPUID (vcpu), KVM_CAP_SYS_HYPERV_CPUID (system)
:Architectures: x86
:Type: vcpu ioctl
:Type: system ioctl, vcpu ioctl
:Parameters: struct kvm_cpuid2 (in/out)
:Returns: 0 on success, -1 on error

@@ -4502,9 +4514,6 @@ Currently, the following list of CPUID leaves are returned:
 - HYPERV_CPUID_SYNDBG_INTERFACE
 - HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES

HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).

Userspace invokes KVM_GET_SUPPORTED_HV_CPUID by passing a kvm_cpuid2 structure
with the 'nent' field indicating the number of entries in the variable-size
array 'entries'.  If the number of entries is too low to describe all Hyper-V
@@ -4515,6 +4524,15 @@ number of valid entries in the 'entries' array, which is then filled.
'index' and 'flags' fields in 'struct kvm_cpuid_entry2' are currently reserved,
userspace should not expect to get any particular value there.

Note, vcpu version of KVM_GET_SUPPORTED_HV_CPUID is currently deprecated. Unlike
system ioctl which exposes all supported feature bits unconditionally, vcpu
version has the following quirks:
- HYPERV_CPUID_NESTED_FEATURES leaf and HV_X64_ENLIGHTENED_VMCS_RECOMMENDED
  feature bit are only exposed when Enlightened VMCS was previously enabled
  on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).
- HV_STIMER_DIRECT_MODE_AVAILABLE bit is only exposed with in-kernel LAPIC.
  (presumes KVM_CREATE_IRQCHIP has already been called).

4.119 KVM_ARM_VCPU_FINALIZE
---------------------------

@@ -6390,3 +6408,91 @@ When enabled, KVM will disable paravirtual features provided to the
guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.


8.29 KVM_CAP_DIRTY_LOG_RING
---------------------------

:Architectures: x86
:Parameters: args[0] - size of the dirty log ring

KVM is capable of tracking dirty memory using ring buffers that are
mmaped into userspace; there is one dirty ring per vcpu.

The dirty ring is available to userspace as an array of
``struct kvm_dirty_gfn``.  Each dirty entry it's defined as::

  struct kvm_dirty_gfn {
          __u32 flags;
          __u32 slot; /* as_id | slot_id */
          __u64 offset;
  };

The following values are defined for the flags field to define the
current state of the entry::

  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
  #define KVM_DIRTY_GFN_F_MASK            0x3

Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
ioctl to enable this capability for the new guest and set the size of
the rings.  Enabling the capability is only allowed before creating any
vCPU, and the size of the ring must be a power of two.  The larger the
ring buffer, the less likely the ring is full and the VM is forced to
exit to userspace. The optimal size depends on the workload, but it is
recommended that it be at least 64 KiB (4096 entries).

Just like for dirty page bitmaps, the buffer tracks writes to
all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
with the flag set, userspace can start harvesting dirty pages from the
ring buffer.

An entry in the ring buffer can be unused (flag bits ``00``),
dirty (flag bits ``01``) or harvested (flag bits ``1X``).  The
state machine for the entry is as follows::

          dirtied         harvested        reset
     00 -----------> 01 -------------> 1X -------+
      ^                                          |
      |                                          |
      +------------------------------------------+

To harvest the dirty pages, userspace accesses the mmaped ring buffer
to read the dirty GFNs.  If the flags has the DIRTY bit set (at this stage
the RESET bit must be cleared), then it means this GFN is a dirty GFN.
The userspace should harvest this GFN and mark the flags from state
``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
to show that this GFN is harvested and waiting for a reset), and move
on to the next GFN.  The userspace should continue to do this until the
flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
all the dirty GFNs that were available.

It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.

After processing one or more entries in the ring buffer, userspace
calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
it, so that the kernel will reprotect those collected GFNs.
Therefore, the ioctl must be called *before* reading the content of
the dirty pages.

The dirty ring can get full.  When it happens, the KVM_RUN of the
vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.

The dirty ring interface has a major difference comparing to the
KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
userspace, it's still possible that the kernel has not yet flushed the
processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
vmexit ensures that all dirty GFNs are flushed to the dirty rings.

NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
machine will switch to ring-buffer dirty page tracking and further
KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.
+2 −2
Original line number Diff line number Diff line
@@ -19,8 +19,8 @@ Two new SMCCC compatible hypercalls are defined:

These are only available in the SMC64/HVC64 calling convention as
paravirtualized time is not available to 32 bit Arm guests. The existence of
the PV_FEATURES hypercall should be probed using the SMCCC 1.1 ARCH_FEATURES
mechanism before calling it.
the PV_TIME_FEATURES hypercall should be probed using the SMCCC 1.1
ARCH_FEATURES mechanism before calling it.

PV_TIME_FEATURES
    ============= ========    ==========
+3 −2
Original line number Diff line number Diff line
@@ -19,7 +19,7 @@
#define ARM64_HAS_VIRT_HOST_EXTN		11
#define ARM64_WORKAROUND_CAVIUM_27456		12
#define ARM64_HAS_32BIT_EL0			13
#define ARM64_HARDEN_EL2_VECTORS		14
#define ARM64_SPECTRE_V3A			14
#define ARM64_HAS_CNP				15
#define ARM64_HAS_NO_FPSIMD			16
#define ARM64_WORKAROUND_REPEAT_TLBI		17
@@ -65,7 +65,8 @@
#define ARM64_MTE				57
#define ARM64_WORKAROUND_1508412		58
#define ARM64_HAS_LDAPR				59
#define ARM64_KVM_PROTECTED_MODE		60

#define ARM64_NCAPS				60
#define ARM64_NCAPS				61

#endif /* __ASM_CPUCAPS_H */
Loading