Reuse "-cpu,tsc-frequency=" to get user wanted tsc frequency and call VM
scope VM_SET_TSC_KHZ to set the tsc frequency of TD before KVM_TDX_INIT_VM.
Besides, sanity check the tsc frequency to be in the legal range and
legal granularity (required by TDX module).
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Validate TD attributes with tdx_caps that fixed-0 bits must be zero and
fixed-1 bits must be set.
Besides, sanity check the attribute bits that have not been supported by
QEMU yet. e.g., debug bit, it will be allowed in the future when debug
TD support lands in QEMU.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
For QEMU VMs, PKS is configured via CPUID_7_0_ECX_PKS and PMU is
configured by x86cpu->enable_pmu. Reuse the existing configuration
interface for TDX VMs.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
For TDX KVM use case, Linux guest is the most major one. It requires
sept_ve_disable set. Make it default for the main use case. For other use
case, it can be enabled/disabled via qemu command line.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Bit 28 of TD attribute, named SEPT_VE_DISABLE. When set to 1, it disables
EPT violation conversion to #VE on guest TD access of PENDING pages.
Some guest OS (e.g., Linux TD guest) may require this bit as 1.
Otherwise refuse to boot.
Add sept-ve-disable property for tdx-guest object, for user to configure
this bit.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Invoke KVM_TDX_INIT in kvm_arch_pre_create_vcpu() that KVM_TDX_INIT
configures global TD state, e.g. the canonical CPUID config, and must
be executed prior to creating vCPUs.
Use kvm_x86_arch_cpuid() to setup the CPUID settings for TDX VM and
tie x86cpu->enable_pmu with TD's attributes.
Note, this doesn't address the fact that QEMU may change the CPUID
configuration when creating vCPUs, i.e. punts on refactoring QEMU to
provide a stable CPUID config prior to kvm_arch_init().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Introduce kvm_arch_pre_create_vcpu(), to perform arch-dependent
work prior to create any vcpu. This is for i386 TDX because it needs
call TDX_INIT_VM before creating any vcpu.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Some bits in TD attributes have corresponding CPUID feature bits. Reflect
the fixed0/1 restriction on TD attributes to their corresponding CPUID
bits in tdx_cpuid_lookup[] as well.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
KVMM requires userspace to pass XFAM configuration via CPUID leaves 0xDs.
Convert tdx_caps->xfam_fixed0/1 into corresponding
tdx_cpuid_lookup[].tdx_fixed0/1 field of CPUID leaves 0xD. Thus the
requirement can applied naturally.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
tdx_cpuid_lookup[].tdx_fixed0/1 is the QEMU maintained data which
reflects TDX restrictions regrading how some CPUID is virtualized by
TDX. It's retrieved from TDX spec. However, TDX may change some fixed
fields to configurable in the future. Update
tdx_cpuid.lookup[].tdx_fixed0/1 fields by removing the bits that
reported from TDX module as configurable. This can adapt with the
updated TDX (module) automatically.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
According to Chapter "CPUID Virtualization" in TDX module spec, CPUID
bits of TD can be classified into 6 types:
------------------------------------------------------------------------
1 | As configured | configurable by VMM, independent of native value;
------------------------------------------------------------------------
2 | As configured | configurable by VMM if the bit is supported natively
(if native) | Otherwise it equals as native(0).
------------------------------------------------------------------------
3 | Fixed | fixed to 0/1
------------------------------------------------------------------------
4 | Native | reflect the native value
------------------------------------------------------------------------
5 | Calculated | calculated by TDX module.
------------------------------------------------------------------------
6 | Inducing #VE | get #VE exception
------------------------------------------------------------------------
Note:
1. All the configurable XFAM related features and TD attributes related
features fall into type #2. And fixed0/1 bits of XFAM and TD
attributes fall into type #3.
2. For CPUID leaves not listed in "CPUID virtualization Overview" table
in TDX module spec, TDX module injects #VE to TDs when those are
queried. For this case, TDs can request CPUID emulation from VMM via
TDVMCALL and the values are fully controlled by VMM.
Due to TDX module has its own virtualization policy on CPUID bits, it leads
to what reported via KVM_GET_SUPPORTED_CPUID diverges from the supported
CPUID bits for TDs. In order to keep a consistent CPUID configuration
between VMM and TDs. Adjust supported CPUID for TDs based on TDX
restrictions.
Currently only focus on the CPUID leaves recognized by QEMU's
feature_word_info[] that are indexed by a FeatureWord.
Introduce a TDX CPUID lookup table, which maintains 1 entry for each
FeatureWord. Each entry has below fields:
- tdx_fixed0/1: The bits that are fixed as 0/1;
- vmm_fixup: The bits that are configurable from the view of TDX module.
But they requires emulation of VMM when they are configured
as enabled. For those, they are not supported if VMM doesn't
report them as supported. So they need be fixed up by
checking if VMM supports them.
- inducing_ve: TD gets #VE when querying this CPUID leaf. The result is
totally configurable by VMM.
- supported_on_ve: It's valid only when @inducing_ve is true. It represents
the maximum feature set supported that be emulated
for TDs.
By applying TDX CPUID lookup table and TDX capabilities reported from
TDX module, the supported CPUID for TDs can be obtained from following
steps:
- get the base of VMM supported feature set;
- if the leaf is not a FeatureWord just return VMM's value without
modification;
- if the leaf is an inducing_ve type, applying supported_on_ve mask and
return;
- include all native bits, it covers type #2, #4, and parts of type #1.
(it also includes some unsupported bits. The following step will
correct it.)
- apply fixed0/1 to it (it covers #3, and rectifies the previous step);
- add configurable bits (it covers the other part of type #1);
- fix the ones in vmm_fixup;
- filter the one has valid .supported field;
(Calculated type is ignored since it's determined at runtime).
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
It will need special handling for TDX VMs all around the QEMU.
Introduce is_tdx_vm() helper to query if it's a TDX VM.
Cache tdx_guest object thus no need to cast from ms->cgs every time.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
KVM provides TDX capabilities via sub command KVM_TDX_CAPABILITIES of
IOCTL(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing
TDX context. It will be used to validate user's setting later.
Since there is no interface reporting how many cpuid configs contains in
KVM_TDX_CAPABILITIES, QEMU chooses to try starting with a known number
and abort when it exceeds KVM_MAX_CPUID_ENTRIES.
Besides, introduce the interfaces to invoke TDX "ioctls" at different
scope (KVM, VM and VCPU) in preparation.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Introduce tdx_kvm_init() and invoke it in kvm_confidential_guest_init()
if it's a TDX VM. More initialization will be added later.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Introduce a separate function kvm_confidential_guest_init() for SEV (and
future TDX).
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
TDX VM requires VM type KVM_X86_TDX_VM to be passed to
kvm_ioctl(KVM_CREATE_VM). Hence implement mc->kvm_type() for i386
architecture.
If tdx-guest object is specified to confidential-guest-support, like,
qemu -machine ...,confidential-guest-support=tdx0 \
-object tdx-guest,id=tdx0,...
it parses VM type as KVM_X86_TDX_VM. Otherwise, it's KVM_X86_DEFAULT_VM.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Introduce tdx-guest object which implements the interface of
CONFIDENTIAL_GUEST_SUPPORT, and will be used to create TDX VMs (TDs) by
qemu -machine ...,confidential-guest-support=tdx0 \
-object tdx-guset,id=tdx0
It has only one property 'attributes' with fixed value 0 and not
configurable so far.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Pull in recent TDX updates, which are not backwards compatible.
It's just to make this series runnable. It will be updated by script
scripts/update-linux-headers.sh
once TDX support is upstreamed in linux kernel
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
KVM_EXIT_MEMORY_FAULT happens. It indicates userspace needs to do
the memory conversion on the RAMBlock to turn the memory into desired
attribute, i.e., private/shared.
Note, KVM_MEMORY_MEMORY_FAULT makes sense only when the RAMBlock has
restricted memory backend.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
This new routine adds support for memory conversion between
shared/private memory for memfd based private ram_block.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
This new object is similar to existing 'memory-backend-memfd' object
which provides host memory with linux memfd_create API but differs in
that this new object will have two memfds, one is for share memory which
can be mmaped, the other is for private memory which is restricted memfd
and can not be accessed from userspace.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Add restricted memory support to RAMBlock so we can have both normal
memory and restricted memory in one RAMBlock.
The restricted part is represented by the restricted_fd.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Pull in restricted_mem APIs.
It's just to make this series runnable. It will be updated by script
scripts/update-linux-headers.sh
once restricted_mem support is upstreamed in linux kernel.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Decrease array index cpuid_i when CPUID leaf 1F is skipped, otherwise it
will get an all zero'ed CPUID entry with leaf 0 and subleaf 0. It
conflicts with correct leaf 0.
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
The upper 16 bits of kvm_userspace_memory_region::slot are
address space id. Parse it separately in trace_kvm_set_user_memory().
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
pull for 7.2-rc4
# -----BEGIN PGP SIGNATURE-----
#
# iLMEAAEIAB0WIQS4/x2g0v3LLaCcbCxAov/yOSY+3wUCY4nPggAKCRBAov/yOSY+
# 36cRA/9JFWuDT0TDhu0g1x0ktvpV+1GBPzkEXR2CVhDf2bly1ka2cLEtPUpiSE8E
# Osw9cEBR3qX+LyO3gA0GySUr9jsc/yRqD38OL8HGZTCmZ/qCnHJSXvy+6a0LWYQq
# ZIrFat7UjiTTeErkSQ6C4bUIl6YoUUSP0X2XxO6YF5j4uhGyqA==
# =sVrx
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 02 Dec 2022 05:12:18 EST
# gpg: using RSA key B8FF1DA0D2FDCB2DA09C6C2C40A2FFF239263EDF
# gpg: Good signature from "Song Gao <m17746591750@163.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: B8FF 1DA0 D2FD CB2D A09C 6C2C 40A2 FFF2 3926 3EDF
* tag 'pull-loongarch-20221202' of https://gitlab.com/gaosong/qemu:
hw/loongarch/virt: Add cfi01 pflash device
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
virtio: regression fix
Fixes regression with migration and vsock, as fixing that
exposes some known issues in vhost user cleanup, this attempts
to fix those as well. More work on vhost user is needed :)
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# -----BEGIN PGP SIGNATURE-----
#
# iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAmOIWaEPHG1zdEByZWRo
# YXQuY29tAAoJECgfDbjSjVRp+RQH/2PVAjD/GA3zF5F3Z07vH51c55T6tluZ85c3
# 4u66SSkF5JR1hATCujYCtrt9V0mnqhmhhm4gJH5xcsynFjjyIXd2dDrTFRpCtRgn
# icXOmYCc9pCu8XsluJnWvY/5r/KEDxqmGVE8Kyhz551QjvsBkezhI9x9vhJZJLCn
# Xn1XQ/3jpUcQLwasu8AxZb0IDW8WdCtonbke6xIyMzOYGR2bnRdXlDXVVG1zJ/SZ
# eS3HUad71VekhfzWq0fx8yEJnfvbes9vo007y8rOGdHOcMneWGAie52W1dOBhclh
# Zt56zID55t1USEwlPxkZSj7UXNbVl7Uz/XU5ElN0yTesttP4Iq0=
# =ZkaX
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 01 Dec 2022 02:37:05 EST
# gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg: issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* tag 'for_upstream' of https://git.kernel.org/pub/scm/virt/kvm/mst/qemu:
include/hw: VM state takes precedence in virtio_device_should_start
hw/virtio: generalise CHR_EVENT_CLOSED handling
hw/virtio: add started_vu status field to vhost-user-gpio
vhost: enable vrings in vhost_dev_start() for vhost-user devices
tests/qtests: override "force-legacy" for gpio virtio-mmio tests
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When running the migration test compiled with Clang from Fedora 37
and sanitizers enabled, there is an error complaining about unlink():
../tests/qtest/migration-test.c:1072:12: runtime error: null pointer
passed as argument 1, which is declared to never be null
/usr/include/unistd.h:858:48: note: nonnull attribute specified here
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior
../tests/qtest/migration-test.c:1072:12 in
(test program exited with status code 1)
TAP parsing error: Too few tests run (expected 33, got 20)
The data->clientcert and data->clientkey pointers can indeed be unset
in some tests, so we have to check them before calling unlink() with
those.
While we're at it, I also noticed that the code is only freeing
some but not all of the allocated strings in this function, and
indeed, valgrind is also complaining about memory leaks here.
So let's call g_free() on all allocated strings to avoid leaking
memory here.
Message-Id: <20221125083054.117504-1-thuth@redhat.com>
Tested-by: Bin Meng <bmeng@tinylab.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
MMX state is saved/restored by FSAVE/FRSTOR so the instructions are
not illegal opcodes even if CR4.OSFXSR=0. Make sure that validate_vex
takes into account the prefix and only checks HF_OSFXSR_MASK in the
presence of an SSE instruction.
Fixes: 20581aadec ("target/i386: validate VEX prefixes via the instructions' exception classes", 2022-10-18)
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1350
Reported-by: Helge Konetzka (@hejko on gitlab.com)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a potential use-after-free by removing the bottom half and enqueuing
the completion directly.
Fixes: 796d20681d ("hw/nvme: reimplement the copy command to allow aio cancellation")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
When the DSM operation is cancelled asynchronously, we set iocb->ret to
-ECANCELED. However, the callback function only checks the return value
of the completed aio, which may have completed succesfully prior to the
cancellation and thus the callback ends up continuing the dsm operation
instead of bailing out. Fix this.
Secondly, fix a potential use-after-free by removing the bottom half and
enqueuing the completion directly.
Fixes: d7d1474fd8 ("hw/nvme: reimplement dsm to allow cancellation")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
If the zone reset operation is cancelled but the block unmap operation
completes normally, the callback will continue resetting the next zone
since it neglects to check iocb->ret which will have been set to
-ECANCELED. Make sure that this is checked and bail out if an error is
present.
Secondly, fix a potential use-after-free by removing the bottom half and
enqueuing the completion directly.
Fixes: 63d96e4ffd ("hw/nvme: reimplement zone reset to allow cancellation")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Make sure that iocb->aiocb is NULL'ed when cancelling.
Fix a potential use-after-free by removing the bottom half and enqueuing
the completion directly.
Fixes: 38f4ac65ac ("hw/nvme: reimplement flush to allow cancellation")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
There are several bugs in the async cancel code for the Format command.
Firstly, cancelling a format operation neglects to set iocb->ret as well
as clearing the iocb->aiocb after cancelling the underlying aiocb which
causes the aio callback to ignore the cancellation. Trivial fix.
Secondly, and worse, because the request is queued up for posting to the
CQ in a bottom half, if the cancellation is due to the submission queue
being deleted (which calls blk_aio_cancel), the req structure is
deallocated in nvme_del_sq prior to the bottom half being schedulued.
Fix this by simply removing the bottom half, there is no reason to defer
it anyway.
Fixes: 3bcf26d3d6 ("hw/nvme: reimplement format nvm to allow cancellation")
Reported-by: Jonathan Derrick <jonathan.derrick@linux.dev>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>