Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm (f9a705ad)

Documentation/virt/kvm/api.rst

+207 −9

		@@ -4498,11 +4498,14 @@ Currently, the following list of CPUID leaves are returned:
		- HYPERV_CPUID_ENLIGHTMENT_INFO
		- HYPERV_CPUID_IMPLEMENT_LIMITS
		- HYPERV_CPUID_NESTED_FEATURES
		- HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS
		- HYPERV_CPUID_SYNDBG_INTERFACE
		- HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES

		HYPERV_CPUID_NESTED_FEATURES leaf is only exposed when Enlightened VMCS was
		enabled on the corresponding vCPU (KVM_CAP_HYPERV_ENLIGHTENED_VMCS).

		Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
		Userspace invokes KVM_GET_SUPPORTED_HV_CPUID by passing a kvm_cpuid2 structure
		with the 'nent' field indicating the number of entries in the variable-size
		array 'entries'. If the number of entries is too low to describe all Hyper-V
		feature leaves, an error (E2BIG) is returned. If the number is more or equal
		@@ -4704,6 +4707,106 @@ KVM_PV_VM_VERIFY
		Verify the integrity of the unpacked image. Only if this succeeds,
		KVM is allowed to start protected VCPUs.

		4.126 KVM_X86_SET_MSR_FILTER
		----------------------------

		:Capability: KVM_X86_SET_MSR_FILTER
		:Architectures: x86
		:Type: vm ioctl
		:Parameters: struct kvm_msr_filter
		:Returns: 0 on success, < 0 on error

		::

		struct kvm_msr_filter_range {
		#define KVM_MSR_FILTER_READ (1 << 0)
		#define KVM_MSR_FILTER_WRITE (1 << 1)
		__u32 flags;
		__u32 nmsrs; /* number of msrs in bitmap */
		__u32 base; /* MSR index the bitmap starts at */
		__u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */
		};

		#define KVM_MSR_FILTER_MAX_RANGES 16
		struct kvm_msr_filter {
		#define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
		#define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0)
		__u32 flags;
		struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
		};

		flags values for ``struct kvm_msr_filter_range``:

		``KVM_MSR_FILTER_READ``

		Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
		indicates that a read should immediately fail, while a 1 indicates that
		a read for a particular MSR should be handled regardless of the default
		filter action.

		``KVM_MSR_FILTER_WRITE``

		Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
		indicates that a write should immediately fail, while a 1 indicates that
		a write for a particular MSR should be handled regardless of the default
		filter action.

		``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``

		Filter both read and write accesses to MSRs using the given bitmap. A 0
		in the bitmap indicates that both reads and writes should immediately fail,
		while a 1 indicates that reads and writes for a particular MSR are not
		filtered by this range.

		flags values for ``struct kvm_msr_filter``:

		``KVM_MSR_FILTER_DEFAULT_ALLOW``

		If no filter range matches an MSR index that is getting accessed, KVM will
		fall back to allowing access to the MSR.

		``KVM_MSR_FILTER_DEFAULT_DENY``

		If no filter range matches an MSR index that is getting accessed, KVM will
		fall back to rejecting access to the MSR. In this mode, all MSRs that should
		be processed by KVM need to explicitly be marked as allowed in the bitmaps.

		This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
		specify whether a certain MSR access should be explicitly filtered for or not.

		If this ioctl has never been invoked, MSR accesses are not guarded and the
		default KVM in-kernel emulation behavior is fully preserved.

		Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
		filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
		an error.

		As soon as the filtering is in place, every MSR access is processed through
		the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
		x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
		and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
		register.

		If a bit is within one of the defined ranges, read and write accesses are
		guarded by the bitmap's value for the MSR index if the kind of access
		is included in the ``struct kvm_msr_filter_range`` flags. If no range
		covers this particular access, the behavior is determined by the flags
		field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
		and ``KVM_MSR_FILTER_DEFAULT_DENY``.

		Each bitmap range specifies a range of MSRs to potentially allow access on.
		The range goes from MSR index [base .. base+nmsrs). The flags field
		indicates whether reads, writes or both reads and writes are filtered
		by setting a 1 bit in the bitmap for the corresponding MSR index.

		If an MSR access is not permitted through the filtering, it generates a
		#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
		allows user space to deflect and potentially handle various MSR accesses
		into user space.

		If a vCPU is in running state while this ioctl is invoked, the vCPU may
		experience inconsistent filtering behavior on MSR accesses.
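
The lookup rules above can be sketched as a small user-space model. This is not KVM's implementation. The `model_`/`MODEL_` names are hypothetical mirrors of the structures quoted above. Two things are assumptions: the bit for MSR `base + n` is taken to be bit `n % 8` of byte `n / 8`, and a range is taken to cover exactly `nmsrs` indices starting at `base`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirrors of the flags documented above. */
#define MODEL_FILTER_READ   (1u << 0)
#define MODEL_FILTER_WRITE  (1u << 1)
#define MODEL_DEFAULT_ALLOW (0u << 0)
#define MODEL_DEFAULT_DENY  (1u << 0)
#define MODEL_MAX_RANGES    16

struct model_range {
    uint32_t flags;        /* MODEL_FILTER_READ and/or MODEL_FILTER_WRITE */
    uint32_t nmsrs;        /* number of MSRs described by the bitmap */
    uint32_t base;         /* MSR index the bitmap starts at */
    const uint8_t *bitmap; /* 1 = allow, 0 = deny */
};

struct model_filter {
    uint32_t flags;        /* MODEL_DEFAULT_ALLOW or MODEL_DEFAULT_DENY */
    struct model_range ranges[MODEL_MAX_RANGES];
};

/* Decide whether one access (read or write) to MSR 'index' is allowed. */
bool model_msr_allowed(const struct model_filter *f, uint32_t index,
                       uint32_t access)
{
    /* x2APIC MSRs are never subject to the filter. */
    if (index >= 0x800 && index <= 0x8ff)
        return true;

    for (size_t i = 0; i < MODEL_MAX_RANGES; i++) {
        const struct model_range *r = &f->ranges[i];

        /* Skip unused ranges and ranges that do not cover this access kind. */
        if (r->nmsrs == 0 || !(r->flags & access))
            continue;
        if (index < r->base || index - r->base >= r->nmsrs)
            continue;

        /* The first matching range decides (assumed bit layout, see above). */
        uint32_t bit = index - r->base;
        return (r->bitmap[bit / 8] >> (bit % 8)) & 1;
    }

    /* No range matched: fall back to the default action. */
    return !(f->flags & MODEL_DEFAULT_DENY);
}
```

With a default-deny filter and a single read range at base 0x10 whose bitmap byte is 0x05, reads of 0x10 and 0x12 are allowed, a read of 0x11 is denied, and any write to 0x10 falls through to the default and is denied.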


		5. The kvm_run structure
		========================
		@@ -4869,9 +4972,8 @@ to the byte array.

		.. note::

		For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
		KVM_EXIT_EPR the corresponding

		For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR,
		KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
		operations are complete (and guest state is consistent) only after userspace
		has re-entered the kernel with KVM_RUN. The kernel side will first finish
		incomplete operations and then check for pending signals. Userspace
		@@ -5163,6 +5265,44 @@ Note that KVM does not skip the faulting instruction as it does for
		KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
		if it decides to decode and emulate the instruction.

		::

		/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
		struct {
		__u8 error; /* user -> kernel */
		__u8 pad[7];
		__u32 reason; /* kernel -> user */
		__u32 index; /* kernel -> user */
		__u64 data; /* kernel <-> user */
		} msr;

		Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
		enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
		will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
		exit for writes.

		The "reason" field specifies why the MSR trap occurred. User space will only
		receive MSR exit traps when a particular reason was requested through
		ENABLE_CAP. Currently valid exit reasons are:

		KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
		KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
		KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER

		For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
		wants to read. To respond to this request with a successful read, user space
		writes the respective data into the "data" field and must continue guest
		execution to ensure the read data is transferred into guest register state.

		If the RDMSR request was unsuccessful, user space indicates that with a "1" in
		the "error" field. This will inject a #GP into the guest when the VCPU is
		executed again.

		For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
		wants to write. Once finished processing the event, user space must continue
		vCPU execution. If the MSR write was unsuccessful, user space also sets the
		"error" field to "1".

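As a rough illustration of the protocol described above, the following sketch shows how a VMM might fill in the reply fields before re-entering the guest. It uses a local mirror of the `msr` member rather than `struct kvm_run`. The MSR index `MODEL_MSR_SCRATCH` and all `model_` names are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Local mirror of the "msr" member shown above. */
struct model_msr_exit {
    uint8_t  error;   /* user -> kernel */
    uint8_t  pad[7];
    uint32_t reason;  /* kernel -> user */
    uint32_t index;   /* kernel -> user */
    uint64_t data;    /* kernel <-> user */
};

/* Hypothetical MSR that this example VMM chooses to emulate itself. */
#define MODEL_MSR_SCRATCH 0xC0001234u

static uint64_t model_scratch_value;

/* Reply to a read exit: supply data, or set error to request a #GP. */
void model_handle_rdmsr(struct model_msr_exit *m)
{
    if (m->index == MODEL_MSR_SCRATCH) {
        m->data = model_scratch_value;
        m->error = 0;
    } else {
        m->error = 1;
    }
}

/* Reply to a write exit: consume data, or set error to request a #GP. */
void model_handle_wrmsr(struct model_msr_exit *m)
{
    if (m->index == MODEL_MSR_SCRATCH) {
        model_scratch_value = m->data;
        m->error = 0;
    } else {
        m->error = 1;
    }
}
```

In either case the reply only takes effect once user space re-enters the kernel with KVM_RUN.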
		::

		/* Fix the size of the union. */
		@@ -5852,6 +5992,28 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
		the maximum halt time to be specified on a per-VM basis, effectively overriding
		the module parameter for the target VM.

		7.21 KVM_CAP_X86_USER_SPACE_MSR
		-------------------------------

		:Architectures: x86
		:Target: VM
		:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
		:Returns: 0 on success; -1 on error

		This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
		into user space.

		When a guest requests to read or write an MSR, KVM may not implement all MSRs
		that are relevant to a respective system. It also does not differentiate by
		CPU type.

		To allow more fine grained control over MSR handling, user space may enable
		this capability. With it enabled, MSR accesses that match the mask specified in
		args[0] and trigger a #GP event inside the guest by KVM will instead trigger
		KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
		can then handle to implement model specific MSR handling and/or user notifications
		to inform a user that an MSR was not handled.
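
The role of the args[0] mask can be sketched as a simple bit test. The `MODEL_REASON_*` names are hypothetical stand-ins for the KVM_MSR_EXIT_REASON_* bits. Their bit positions are chosen for illustration and are not taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; bit positions are illustrative assumptions. */
#define MODEL_REASON_INVAL   (1u << 0)
#define MODEL_REASON_UNKNOWN (1u << 1)
#define MODEL_REASON_FILTER  (1u << 2)

/*
 * True if a failing MSR access with the given reason would be reported to
 * user space instead of raising a #GP in the guest, given the mask that
 * was passed in args[0] when the capability was enabled.
 */
bool model_exit_to_user(uint64_t enabled_mask, uint32_t reason)
{
    return (enabled_mask & reason) != 0;
}
```

A VMM that only wants to see filtered and unknown MSRs would pass a mask with those two bits set, and accesses failing for any other reason would still produce a #GP.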

		8. Other capabilities.
		======================

		@@ -6193,3 +6355,39 @@ distribution...)

		If this capability is available, then the CPNC and CPVC can be synchronized
		between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318).

		8.26 KVM_CAP_X86_USER_SPACE_MSR
		-------------------------------

		:Architectures: x86

		This capability indicates that KVM supports deflection of MSR reads and
		writes to user space. It can be enabled on a VM level. If enabled, MSR
		accesses that would usually trigger a #GP by KVM into the guest will
		instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
		KVM_EXIT_X86_WRMSR exit notifications.

		8.25 KVM_X86_SET_MSR_FILTER
		---------------------------

		:Architectures: x86

		This capability indicates that KVM supports that accesses to user defined MSRs
		may be rejected. With this capability exposed, KVM exports new VM ioctl
		KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
		ranges that KVM should reject access to.

		In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
		trap and emulate MSRs that are outside of the scope of KVM as well as
		limit the attack surface on KVM's MSR emulation code.


		8.26 KVM_CAP_ENFORCE_PV_CPUID
		-----------------------------

		Architectures: x86

		When enabled, KVM will disable paravirtual features provided to the
		guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
		(0x40000001). Otherwise, a guest may use the paravirtual features
		regardless of what has actually been exposed through the CPUID leaf.

Documentation/virt/kvm/cpuid.rst

+44 −44

		@@ -38,9 +38,9 @@ returns::

		where ``flag`` is defined as below:

		================================= =========== ================================
		================================== =========== ================================
		flag value meaning
		================================= =========== ================================
		================================== =========== ================================
		KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs
		0x11 and 0x12

		@@ -62,7 +62,7 @@ KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt
		handler can be enabled by
		writing to msr 0x4b564d04

		KVM_FEATURE_PV_UNHAULT 7 guest checks this feature bit
		KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit
		before enabling paravirtualized
		spinlock support

		@@ -76,7 +76,7 @@ KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT

		KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit
		before enabling paravirtualized
		sebd IPIs
		send IPIs

		KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can
		be disabled by writing
		@@ -92,10 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit
		async pf acknowledgment msr
		0x4b564d07.

		KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24 host will warn if no guest-side
		per-cpu warps are expeced in
		KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
		per-cpu warps are expected in
		kvmclock
		================================= =========== ================================
		================================== =========== ================================

		::

Documentation/virt/kvm/devices/vcpu.rst

+53 −4

		@@ -25,8 +25,10 @@ Returns:

		======= ========================================================
		-EBUSY The PMU overflow interrupt is already set
		-ENXIO The overflow interrupt not set when attempting to get it
		-ENODEV PMUv3 not supported
		-EFAULT Error reading interrupt number
		-ENXIO PMUv3 not supported or the overflow interrupt not set
		when attempting to get it
		-ENODEV KVM_ARM_VCPU_PMU_V3 feature missing from VCPU
		-EINVAL Invalid PMU overflow interrupt number supplied or
		trying to set the IRQ number without using an in-kernel
		irqchip.
		@@ -45,9 +47,10 @@ all vcpus, while as an SPI it must be a separate number per vcpu.
		Returns:

		======= ======================================================
		-EEXIST Interrupt number already used
		-ENODEV PMUv3 not supported or GIC not initialized
		-ENXIO PMUv3 not properly configured or in-kernel irqchip not
		configured as required prior to calling this attribute
		-ENXIO PMUv3 not supported, missing VCPU feature or interrupt
		number not set
		-EBUSY PMUv3 already initialized
		======= ======================================================

		@@ -55,6 +58,52 @@ Request the initialization of the PMUv3. If using the PMUv3 with an in-kernel
		virtual GIC implementation, this must be done after initializing the in-kernel
		irqchip.

		1.3 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_FILTER
		-----------------------------------------

		:Parameters: in kvm_device_attr.addr the address for a PMU event filter is a
		pointer to a struct kvm_pmu_event_filter

		:Returns:

		======= ======================================================
		-ENODEV PMUv3 not supported or GIC not initialized
		-ENXIO PMUv3 not properly configured or in-kernel irqchip not
		configured as required prior to calling this attribute
		-EBUSY PMUv3 already initialized
		-EINVAL Invalid filter range
		======= ======================================================

		Request the installation of a PMU event filter described as follows::

		struct kvm_pmu_event_filter {
		__u16 base_event;
		__u16 nevents;

		#define KVM_PMU_EVENT_ALLOW 0
		#define KVM_PMU_EVENT_DENY 1

		__u8 action;
		__u8 pad[3];
		};

		A filter range is defined as the range [@base_event, @base_event + @nevents),
		together with an @action (KVM_PMU_EVENT_ALLOW or KVM_PMU_EVENT_DENY). The
		first registered range defines the global policy (global ALLOW if the first
		@action is DENY, global DENY if the first @action is ALLOW). Multiple ranges
		can be programmed, and must fit within the event space defined by the PMU
		architecture (10 bits on ARMv8.0, 16 bits from ARMv8.1 onwards).

		Note: "Cancelling" a filter by registering the opposite action for the same
		range doesn't change the default action. For example, installing an ALLOW
		filter for event range [0:10) as the first filter and then applying a DENY
		action for the same range will leave the whole range as disabled.

		Restrictions: Event 0 (SW_INCR) is never filtered, as it doesn't count a
		hardware event. Filtering event 0x1E (CHAIN) has no effect either, as it
		isn't strictly speaking an event. Filtering the cycle counter is possible
		using event 0x11 (CPU_CYCLES).
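
The "first range defines the global policy" rule can be modelled in user space as follows. This is a sketch of the documented semantics, not the kernel's code. All `model_`/`MODEL_` names are hypothetical, and the 16-bit event space is assumed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PMU_ALLOW 0
#define MODEL_PMU_DENY  1
#define MODEL_NR_EVENTS (1u << 16) /* assumed 16-bit event space */

struct model_pmu_filter {
    bool installed;                   /* false until the first range is added */
    uint8_t allowed[MODEL_NR_EVENTS]; /* 1 = event may be counted */
};

/* Returns 0 on success, -1 for an invalid range or action. */
int model_pmu_add_range(struct model_pmu_filter *f, uint16_t base,
                        uint16_t nevents, uint8_t action)
{
    if ((uint32_t)base + nevents > MODEL_NR_EVENTS ||
        (action != MODEL_PMU_ALLOW && action != MODEL_PMU_DENY))
        return -1;

    /* The first range fixes the default: the opposite of its action. */
    if (!f->installed) {
        memset(f->allowed, action == MODEL_PMU_ALLOW ? 0 : 1,
               sizeof(f->allowed));
        f->installed = true;
    }

    /* Apply the action to [base, base + nevents). */
    memset(&f->allowed[base], action == MODEL_PMU_ALLOW ? 1 : 0, nevents);
    return 0;
}

bool model_pmu_event_allowed(const struct model_pmu_filter *f, uint16_t event)
{
    /* Event 0 (SW_INCR) is never filtered; no filter means no restriction. */
    if (event == 0 || !f->installed)
        return true;
    return f->allowed[event] != 0;
}
```

Replaying the example from the note, an ALLOW for [0:10) followed by a DENY for [0:10) leaves events 1 to 9 disabled, and every event outside the range stays disabled because the first ALLOW set the default to DENY.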


		2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
		=================================

arch/arm64/include/asm/assembler.h

+19 −10

		@@ -218,6 +218,23 @@ lr .req x30 // link register
		str \src, [\tmp, :lo12:\sym]
		.endm

		/*
		* @dst: destination register
		*/
		#if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__)
		.macro this_cpu_offset, dst
		mrs \dst, tpidr_el2
		.endm
		#else
		.macro this_cpu_offset, dst
		alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
		mrs \dst, tpidr_el1
		alternative_else
		mrs \dst, tpidr_el2
		alternative_endif
		.endm
		#endif

		/*
		* @dst: Result of per_cpu(sym, smp_processor_id()) (can be SP)
		* @sym: The name of the per-cpu variable
		@@ -226,11 +243,7 @@ lr .req x30 // link register
		.macro adr_this_cpu, dst, sym, tmp
		adrp \tmp, \sym
		add \dst, \tmp, #:lo12:\sym
		alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
		mrs \tmp, tpidr_el1
		alternative_else
		mrs \tmp, tpidr_el2
		alternative_endif
		this_cpu_offset \tmp
		add \dst, \dst, \tmp
		.endm

		@@ -241,11 +254,7 @@ alternative_endif
		*/
		.macro ldr_this_cpu dst, sym, tmp
		adr_l \dst, \sym
		alternative_if_not ARM64_HAS_VIRT_HOST_EXTN
		mrs \tmp, tpidr_el1
		alternative_else
		mrs \tmp, tpidr_el2
		alternative_endif
		this_cpu_offset \tmp
		ldr \dst, [\dst, \tmp]
		.endm

arch/arm64/include/asm/hyp_image.h

0 → 100644

+36 −0

		/* SPDX-License-Identifier: GPL-2.0 */
		/*
		* Copyright (C) 2020 Google LLC.
		* Written by David Brazdil <dbrazdil@google.com>
		*/

		#ifndef __ARM64_HYP_IMAGE_H__
		#define __ARM64_HYP_IMAGE_H__

		/*
		* KVM nVHE code has its own symbol namespace prefixed with __kvm_nvhe_,
		* to separate it from the kernel proper.
		*/
		#define kvm_nvhe_sym(sym) __kvm_nvhe_##sym

		#ifdef LINKER_SCRIPT

		/*
		* KVM nVHE ELF section names are prefixed with .hyp, to separate them
		* from the kernel proper.
		*/
		#define HYP_SECTION_NAME(NAME) .hyp##NAME

		/* Defines an ELF hyp section from input section @NAME and its subsections. */
		#define HYP_SECTION(NAME) \
		HYP_SECTION_NAME(NAME) : { *(NAME NAME##.*) }

		/*
		* Defines a linker script alias of a kernel-proper symbol referenced by
		* KVM nVHE hyp code.
		*/
		#define KVM_NVHE_ALIAS(sym) kvm_nvhe_sym(sym) = sym;

		#endif /* LINKER_SCRIPT */

		#endif /* __ARM64_HYP_IMAGE_H__ */
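
To see what the `kvm_nvhe_sym()` prefixing produces, a small standalone sketch can stringify the pasted token. The symbol `foo` and the `MODEL_` helpers are hypothetical and are not part of the header.

```c
/* Same definition as in the header above. */
#define kvm_nvhe_sym(sym) __kvm_nvhe_##sym

/*
 * Two-level stringification so that the argument is macro-expanded
 * before it is turned into a string. Helpers for this sketch only.
 */
#define MODEL_XSTR(x) #x
#define MODEL_STR(x)  MODEL_XSTR(x)

/* Returns the name the preprocessor generates for the symbol 'foo'. */
const char *model_nvhe_symbol_name(void)
{
    return MODEL_STR(kvm_nvhe_sym(foo));
}
```

Here `model_nvhe_symbol_name()` returns the string `__kvm_nvhe_foo`, which is the name that `KVM_NVHE_ALIAS(foo)` would alias to the kernel-proper `foo` in the linker script.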
