Pre-binding allows a user TD to launch before its MigTD is launched. For
example, a MigTD can be launched later, on demand, right before live
migration is initiated. This avoids the resource consumption of running
MigTDs when live migration of user TDs isn't needed.
To support pre-binding, a SHA384 hash of the service TD's measurable
attributes (i.e. SERVTD_INFO_HASH as defined in the TDX 1.5 spec) must
be provided via the "migtd-hash=" QEMU command line option. The hash
identifies the MigTD that can be bound later.
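A hypothetical invocation could look as below (the exact tdx-guest
object properties depend on the QEMU build; the hash value is a
placeholder):

  qemu-system-x86_64 \
      -object tdx-guest,id=tdx0,migtd-hash=<sha384-hex> \
      -machine q35,confidential-guest-support=tdx0 ...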
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
TDVF_SECTION_TYPE_PAYLOAD is defined in the TDVF spec as a piece of
memory to be added by the VMM via TDH.MEM.ADD. As documented in the
TDVF spec, it is used by TD-Shim to load the service TD core (e.g. the
MigTD executable). Add the section type to allow QEMU to pass the
section memory to KVM via the KVM_TDX_INIT_MEM_REGION command, which
adds the pages to the TD via TDH.MEM.ADD.
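As a sketch, the new type slots into the existing TDVF section-type
definitions (values per the TDVF spec; how QEMU's hw/i386 headers
spell these is assumed here):

  #define TDVF_SECTION_TYPE_BFV       0
  #define TDVF_SECTION_TYPE_CFV       1
  #define TDVF_SECTION_TYPE_TD_HOB    2
  #define TDVF_SECTION_TYPE_TEMP_MEM  3
  #define TDVF_SECTION_TYPE_PERM_MEM  4
  #define TDVF_SECTION_TYPE_PAYLOAD   5  /* new: service TD core, e.g. MigTD */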
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Limit the get/put vCPU context functions below to DEBUG TDs only:
kvm_getput_regs()
kvm_get_xcrs()
kvm_get_sregs()
kvm_get_sregs2()
kvm_get_debugregs()
kvm_arch_tdx_debug_get_registers()
Otherwise, a migratable TD guest would also try to get/put vCPU
context via the DEBUG-only path, which won't work because the DEBUG
and MIGRATE capabilities are mutually exclusive.
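A minimal sketch of the guard at the top of each listed function
(is_tdx_vm() and tdx_debug_enabled() are assumed helper names):

  if (is_tdx_vm() && !tdx_debug_enabled()) {
      /* DEBUG and MIGRATE are mutually exclusive; skip the DEBUG-only path */
      return 0;
  }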
Signed-off-by: Yao Yuan <yuan.yao@intel.com>
Modify the MWAIT/TSX enabling rules for TD guests as below:
1. Add the HLE/RTM/MONITOR CPUID flags to the tdx_fixed1
   configuration. Adding the flags is not strictly necessary per the
   current design logic; it is done to state explicitly that this
   patch enables TSX/MWAIT support for TD guests.
2. Clear the HLE/RTM flags if the platform cannot support them for
   various reasons, e.g. some HW bugs.
3. Clear the MONITOR/MWAIT flag if the user doesn't add the
   -overcommit cpu-pm=on option when launching the TD guest. Per the
   current TDX configuration, the flag in CPUID is always enabled (as
   long as the TDX module supports it) even if the option is omitted.
   This isn't aligned with the legacy guest case, so add the check to
   enforce the same usage (see the sketch below).
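A sketch of rules 2 and 3 (platform_tsx_supported() is a hypothetical
predicate; enable_cpu_pm and the CPUID bit macros are QEMU's):

  CPUX86State *env = &cpu->env;

  /* Rule 2: clear HLE/RTM when the platform can't support TSX */
  if (!platform_tsx_supported()) {
      env->features[FEAT_7_0_EBX] &= ~(CPUID_7_0_EBX_HLE | CPUID_7_0_EBX_RTM);
  }

  /* Rule 3: expose MONITOR/MWAIT only with "-overcommit cpu-pm=on" */
  if (!enable_cpu_pm) {
      env->features[FEAT_1_ECX] &= ~CPUID_EXT_MONITOR;
  }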
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
Once the hypervisor exits to userspace to convert the page from private
to shared or vice versa, notify the state change via RamDiscardListener
so that the DMA mappings can be updated accordingly.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
RamDiscardManager was first introduced for some special RAM memory
regions (managed by virtio-mem). The interface coordinates which
parts of specific memory regions are currently populated to be used by
the VM, notifying listeners after parts are discarded and before parts
get populated.
It is required by VFIO because VFIO tries to register the whole region
with the kernel and pin all pages, which is incompatible with discarding
of RAM. With RamDiscardManager, VFIO can require proper coordination to
only map the currently populated parts, to hinder parts that are
expected to remain discarded from silently getting populated and
consuming memory.
This interface also applies to the private memfd backend, which can
represent two types of memory (shared and private). Shared memory is
accessible by the host VMM and can be mapped in IOMMU page tables.
Private memory is inaccessible and can't be DMA-mapped. So the memory
regions backed by private memfd can coordinate to only map the shared
parts, and update the mapping when notified about a page attribute
conversion between shared and private.
Introduce the RamDiscardManager interface in the private memfd backend
and utilize the existing VFIORamDiscardListener (registered in
vfio_register_ram_discard_listener()). Users can call the
priv_memfd_backend_state_change() notifier to DMA-map the shared parts
and unmap the private parts (a sketch follows the notes below).
Some notes on the current implementation:
- VFIO has registered the population RAM discard listener, which DMA
  maps in the minimum granularity (i.e. block_size in virtio-mem and
  private memfd). The block_size of private memfd is set to the host
  page size because a conversion can be as small as one page.
- VFIO currently does the map/unmap in the minimal granularity each
  time, which avoids partial unmap requests from the TD guest.
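A sketch of the state-change path (the backend type and list field are
assumptions; the RamDiscardListener callbacks are QEMU's, and the
intersect helper mirrors the one used by virtio-mem):

  int priv_memfd_backend_state_change(PrivMemfdBackend *pm, uint64_t offset,
                                      uint64_t size, bool to_private)
  {
      RamDiscardListener *rdl;
      int ret = 0;

      QLIST_FOREACH(rdl, &pm->rdl_list, next) {
          MemoryRegionSection tmp = *rdl->section;

          if (!memory_region_section_intersect_range(&tmp, offset, size)) {
              continue;
          }
          if (to_private) {
              rdl->notify_discard(rdl, &tmp);        /* VFIO DMA-unmaps */
          } else {
              ret = rdl->notify_populate(rdl, &tmp); /* VFIO DMA-maps */
              if (ret) {
                  break;
              }
          }
      }
      return ret;
  }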
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
The HOB list and TempMem should be accepted (by reserving their
ranges), while PermMem shouldn't have its range accepted.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
When it is the last round of guest live migration, the guest is paused
and the migration state transits to RUN_STATE_FINISH_MIGRATE. There is
no need to call into KVM to clear the dirty log and write-protect the
guest memory.
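A minimal sketch of the check (where exactly the guard sits in the
dirty-log path is assumed; runstate_check() is QEMU's):

  if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
      /* guest is paused; skip clearing the dirty log / write-protect */
      return;
  }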
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
cgs_mig_init calls the vendor specific init function based on the VM
type to initialize the cgs_mig callbacks.
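A sketch of the dispatch (is_tdx_vm() and tdx_mig_init() are assumed
names; only the TDX case exists in this series):

  void cgs_mig_init(void)
  {
      if (is_tdx_vm()) {
          tdx_mig_init(&cgs_mig);   /* fill in the cgs_mig callbacks */
      }
  }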
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Group the MSRs (e.g. MSR_KVM_STEAL_TIME) that QEMU/KVM emulates for TD
guests into kvm_put_msrs_common, which can be put from QEMU for both TD
guests and legacy VM guests. The rest of the MSRs are grouped into
kvm_put_msrs_vm, which puts the MSRs for legacy VMs only.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Some TD MSRs (e.g. MSR_IA32_TSCDEADLINE) are emulated by QEMU/KVM.
Group those MSRs into kvm_get_msrs_common, and allow QEMU to get them
for both the legacy VM guest and TD guest cases. Group the rest of the
MSRs, which are not emulated for TD guests, into kvm_get_msrs_vm; QEMU
is not allowed to get those MSRs for the TD guest case.
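The resulting shape, sketched (kvm_get_msrs_common/kvm_get_msrs_vm are
the names above; is_tdx_vm() is an assumed predicate, and the real
return-value plumbing is omitted):

  static int kvm_get_msrs(X86CPU *cpu)
  {
      int ret;

      /* MSRs emulated by QEMU/KVM for every guest type, TDs included */
      ret = kvm_get_msrs_common(cpu);
      if (ret < 0 || is_tdx_vm()) {
          return ret;
      }

      /* MSRs that only exist for legacy (non-TD) VMs */
      return kvm_get_msrs_vm(cpu);
  }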
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
TD guests don't have nested virtualization support so far, so don't
get/set the nested related states for the TD guest case.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The KVM special registers (e.g. segment registers, control registers)
are not emulated by QEMU/KVM for the TD guests, so don't get/set them
for the TD guest case.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The xcr registers are not emulated by QEMU/KVM for TD guests, so no need
to get/set them for the TD guest case.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The xsave states are not emulated by QEMU/KVM for TD guests, so no need
to get/set them for the TD guest case.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The GP registers are emulated by KVM when a TD executes a hypercall
(i.e. TDG.VP.VMCALL). Live migration could pause a TD while it is in
the KVM-side execution of the hypercall; in this case, the GP registers
need to be migrated to the destination before resuming the guest.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
During live migration, the destination TD is initialized via importing
states from the source. So some TD-scope initialization work (e.g.
TDH.MNG.INIT) shouldn't be performed on the destination TD, and some
initialization work should be performed only after the TD states are
imported. Add a flag to tell KVM to do the post-initialization of the
TD after the migration is done.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
During TD live migration, the destination TD is created to receive state
from the source side. The TDX module will set the TD state to finalized
after all the TD states are imported from the source. So skip sending
the command to KVM to finalize the TD.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
After the cgs migration ends successfully, set the cgs->ready flag to
indicate the confidential guest is ready.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The destination TD in live migration isn't ready until the migration is
done. Skip the verification of cgs->ready during the destination TD
creation.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
During the pre-copy phase, a guest page may be converted from private
to shared or from shared to private. The destination migration thread
checks whether the received page is private or shared via the
RAM_SAVE_FLAG_CGS_STATE flag. The flag is compared with the related bit
in the cgs_bmap to see if there is a conversion from shared to private
or from private to shared. If they differ, update the cgs_bmap bit and
call kvm_convert_memory to convert the page, as sketched below.
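A sketch of the check on the destination (the cgs_bmap field and page
indexing are assumptions; kvm_convert_memory() is from the TDX
patches):

  bool is_private = !!(flags & RAM_SAVE_FLAG_CGS_STATE);
  bool was_private = test_bit(page_nr, block->cgs_bmap);

  if (is_private != was_private) {
      /* track the new attribute, then flip the page in KVM */
      if (is_private) {
          set_bit(page_nr, block->cgs_bmap);
      } else {
          clear_bit(page_nr, block->cgs_bmap);
      }
      kvm_convert_memory(block->offset + offset, TARGET_PAGE_SIZE,
                         is_private);
  }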
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
When a page is received from the source, the RAM_SAVE_FLAG_CGS_STATE
flag indicates whether the received page is a private page. Guest
private memory isn't mapped by QEMU, so don't look up the host virtual
address in this case.
COLO isn't currently supported with private pages, so skip the related
COLO handling for the private page case.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The cgs epoch data is identified by the RAM_SAVE_FLAG_CGS_EPOCH flag.
When the flag is received, the epoch data is passed directly to the
vendor specific implementation via the cgs load API.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Rename file_ram_alloc to ram_block_map_fd, as the function is
essentially doing the mapping of the fd-based memory for the RAM block,
and break the long function into 3 pieces:
- ram_block_map_check_and_truncate: check if the size and alignment are
  valid for a mapping, and truncate the file if needed. The aligned
  size is returned;
- ram_block_get_map_flags: generate the flags for the mapping based on
  the flags that have been set on the RAM block and the readonly flag
  from the caller;
- qemu_ram_map: finally map the fd-based memory.
This facilitates further exposing ram_block_map_fd for more usages,
e.g. mapping an additional fd for a RAM block.
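After the split, the function reduces to a sketch like the below
(signatures are assumptions):

  static void *ram_block_map_fd(RAMBlock *block, int fd, bool readonly,
                                Error **errp)
  {
      /* 1. validate size/alignment and truncate the file if needed */
      int64_t size = ram_block_map_check_and_truncate(block, fd, errp);
      if (size < 0) {
          return NULL;
      }

      /* 2. derive the mmap flags from the block's flags and readonly */
      uint32_t flags = ram_block_get_map_flags(block, readonly);

      /* 3. do the actual fd-based mapping */
      return qemu_ram_map(block, size, flags, fd);
  }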
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Load the private page data into the shared memory before sending the
KVM_TDX_MIG_IMPORT_MEM command to the tdx-mig driver to import the
data to the guest.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_loadvm_state is assigned to the cgs migration framework's
loadvm_state API to load TD private states on the destination. The type
of the states can be obtained from the MB_TYPE field (offset 6
according to the TDX module v1.5 spec) of the MBMD header. Based on the
type of the states, the API sends the corresponding command to the
tdx-mig driver. The migration bundle data (i.e. the mbmd header and the
state data) is loaded into the shared memory (the mbmd header into the
mbmd section, and the state data into the section pointed to by
buf_list).
The flags field of TdxMigHdr indicates whether this is part of a chain
of states: if TDX_MIG_F_CONTINUE is set, the API continues with the
next migration bundle to load the related states, as sketched below.
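The load loop, sketched (the helper names and the MB_TYPE constants are
assumptions; import commands other than KVM_TDX_MIG_IMPORT_MEM are
assumed to mirror the export side):

  int ret, cmd;

  do {
      /* read TdxMigHdr, then mbmd + state data into shared memory */
      tdx_mig_load_hdr(f, &hdr);
      tdx_mig_load_bundle(f, stream, &hdr);

      switch (tdx_mbmd_type(stream->mbmd)) {   /* MB_TYPE, offset 6 */
      case TDX_MB_TYPE_MEM:
          cmd = KVM_TDX_MIG_IMPORT_MEM;
          break;
      case TDX_MB_TYPE_STATE_TD:
          cmd = KVM_TDX_MIG_IMPORT_STATE_TD;
          break;
      default:
          return -EINVAL;
      }
      ret = tdx_mig_stream_ioctl(stream, cmd, 0, NULL);
  } while (!ret && (hdr.flags & TDX_MIG_F_CONTINUE));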
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
When the live migration of a confidential guest is cancelled, the ram
save needs to be cancelled with some vendor specific work required
(e.g. restoring some private pages' state). There are 3 cases:
- If the cancel request is received before ram saving starts (i.e.
  cgs_epochs is 0), nothing needs to be done;
- If the cancel request is received during the first round of ram
  saving, invoke the vendor specific handling via the cgs_mig API with
  the gfn of the last page that has been saved; and
- If the cancel request is received after the first round of ram
  saving, it is likely that all the pages have been saved. The gfn of
  the last page is set to the end of the guest ram.
Clearing KVM's bitmap write-protects the private pages in chunks (e.g.
256K pages by default), so end_gfn needs to be aligned to the chunk
boundary.
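With QEMU's ROUND_UP, the alignment is a one-liner (pages_per_chunk,
e.g. 256K pages, is the assumed chunk size):

  /* align the last saved gfn up to the next write-protect chunk */
  end_gfn = ROUND_UP(last_saved_gfn + 1, pages_per_chunk);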
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_savevm_state_ram_cancel is assigned to the cgs migration
framework's savevm_state_ram_cancel and is invoked when the user
cancels the live migration process after it is initiated. The gfn_end
parameter passed in by the migration flow gives the gfn of the last
guest physical page that has been exported.
The API sends the KVM_TDX_MIG_EXPORT_ABORT command to the tdx-mig
driver to abort the migration. The driver requests the TDX module to
restore the states for the pages that have been exported. If no page
has been exported, the API just returns without sending the command
to the driver.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_cleanup is assigned to the cgs migration framework's
savevm_state_cleanup and loadvm_state_cleanup to be invoked on both
source and destination side to do the TDX migration cleanup (e.g. unmap
the shared memory) when migration is done or cancelled.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_savevm_state_end is assigned to the cgs migration framework's
API, cgs_mig_savevm_state_end, to export and send the non-iterable
states after the guest enters downtime. Those states include:
- TD-scope mutable states: the KVM_TDX_MIG_EXPORT_STATE_TD command is
  added for the API to request the tdx-mig driver to load the TD states
  into the shared memory;
- states of all the vCPUs: the KVM_TDX_MIG_EXPORT_STATE_VP command is
  added for the API to request the tdx-mig driver to load one vCPU's
  states into the shared memory. The command can be sent to the driver
  multiple times to load the states of multiple vCPUs (starting from
  vCPU 0);
- start token: the token is a TDX migration concept which indicates the
  end of the pre-copy migration. Since post-copy isn't supported yet,
  no migration data is sent after the start token.
The above states are sent in a chain by using the TDX_MIG_F_CONTINUE
flag in the TdxMigHdr, so the destination side continues to load
states until the flag is cleared, as sketched below.
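The chain from the source side, sketched (tdx_mig_send_state() is a
hypothetical wrapper; reusing the track command for the start token is
an assumption):

  /* TD-scope mutable states; more bundles follow */
  tdx_mig_send_state(f, KVM_TDX_MIG_EXPORT_STATE_TD, TDX_MIG_F_CONTINUE);

  /* per-vCPU states, starting from vCPU 0; more bundles follow */
  CPU_FOREACH(cpu) {
      tdx_mig_send_state(f, KVM_TDX_MIG_EXPORT_STATE_VP, TDX_MIG_F_CONTINUE);
  }

  /* start token: the last bundle, so TDX_MIG_F_CONTINUE is cleared */
  tdx_mig_send_state(f, KVM_TDX_MIG_EXPORT_TRACK, 0);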
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_savevm_state_downtime is assigned to the cgs migration
framework's cgs_mig_savevm_state_downtime to request the TDX module to
enter migration downtime. KVM_TDX_MIG_EXPORT_PAUSE is a migration
command added for the API to send the request to the tdx-mig driver.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_savevm_state_ram is assigned to the cgs migration framework
API, cgs_mig_savevm_state_ram, to export private pages from the TDX
module and send them to the destination. The interface supports passing
a list of pages to export, but the migration flow currently only
exports and sends one page at a time. The API sends the
KVM_TDX_MIG_EXPORT_MEM command to the tdx-mig driver, and the exported
data is loaded into the shared memory.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_savevm_state_ram_start_epoch is assigned to the cgs migration
framework's API, cgs_mig_savevm_state_ram_start_epoch, to send a TDX
migration epoch. The migration epoch is used by TDX to enforce that a
guest physical page is migrated only once within an epoch (i.e. in one
memory save iteration).
The epoch is obtained by sending a KVM_TDX_MIG_EXPORT_TRACK command to
the tdx-mig driver, with the TDX_MIG_EXPORT_TRACK_F_IN_ORDER_DONE flag
cleared to indicate this is a regular migration epoch exported at the
beginning of each iteration. The driver then loads the epoch data into
the shared memory.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_savevm_state_start is assigned to the framework's API,
cgs_mig_savevm_state_start, to send the TD-scope immutable states to
the destination side. The immutable states are required by the TDX
migration architecture to be the very first (i.e. kickoff) states
imported into the destination TD. This is done by sending a migration
command (i.e. KVM_TDX_MIG_EXPORT_STATE_IMMUTABLE) to the driver. The
driver loads the immutable states into the shared memory, with
export_num indicating the number of bytes stored in the migration
buffer list.
A TDX migration specific header (i.e. TdxMigHdr, sketched below) is
added before the exported migration bundle (i.e. mbmd + states in
buf_list) to tell the destination the number of buffers (4KB each) used
by the buffer list. The flags field isn't used for the immutable
states, and is reserved for the migration of other states.
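A sketch of the header prepended to each migration bundle (field names
are assumptions based on the description above):

  typedef struct TdxMigHdr {
      uint16_t flags;         /* e.g. TDX_MIG_F_CONTINUE */
      uint16_t buf_list_num;  /* number of 4KB buffers used by buf_list */
  } TdxMigHdr;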
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_stream_ioctl is added for the migration flow to send a migration
command to the driver. The ioctl is performed on the kvm_device's fd
returned at the setup step.
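A sketch of the ioctl wrapper (struct kvm_tdx_cmd follows the TDX KVM
UAPI; using KVM_MEMORY_ENCRYPT_OP as the request on the device fd is an
assumption about the tdx-mig UAPI):

  static int tdx_mig_stream_ioctl(TdxMigStream *stream, int cmd_id,
                                  uint32_t flags, void *data)
  {
      struct kvm_tdx_cmd cmd = {
          .id = cmd_id,
          .flags = flags,
          .data = (uintptr_t)data,
      };

      /* issued on the kvm_device fd created at setup */
      return ioctl(stream->fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }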
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_setup is assigned to the cgs migration framework's
savevm_state_setup API (invoked on the source side) and
loadvm_state_setup API (invoked on the destination side). The setup
work includes:
- create a kvm_device from the tdx-mig driver in KVM. The device fd is
  returned for later communication with the device;
- negotiate with the driver the size of the memory to map. This
  includes:
  -- KVM_SET_DEVICE_ATTR: sets the configurable attr of the device
     (only the migration buffer size currently) in KVM. The migration
     flow currently finds and sends dirty pages one by one, so the
     migration buffer size set to the driver is 4KB (TARGET_PAGE_SIZE);
  -- KVM_GET_DEVICE_ATTR: gets the negotiated kvm_device attrs. This
     obtains from KVM the sizes of the 4 parts of the shared memory
     (i.e. the mbmd buffer size, migration buffer size, mac list buffer
     size, and gpa list buffer size);
- map the 4 parts of the shared memory (see the sketch below).
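The setup sequence, sketched (the device type id, the attr ids, and the
mmap offsets are assumptions; kvm_create_device() and the device-attr
ioctls are standard KVM APIs):

  /* 1. create the tdx-mig kvm_device; keep its fd for later ioctls */
  stream->fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_TDX_MIG_STREAM,
                                 false);

  /* 2a. set the migration buffer size: one page, since dirty pages are
   *     currently sent one by one */
  uint64_t buf_size = TARGET_PAGE_SIZE;
  struct kvm_device_attr attr = {
      .attr = TDX_MIG_STREAM_ATTR_BUF_SIZE,
      .addr = (uintptr_t)&buf_size,
  };
  ioctl(stream->fd, KVM_SET_DEVICE_ATTR, &attr);

  /* 2b. read back the negotiated sizes of the 4 shared-memory parts */
  uint64_t mbmd_size;
  attr.attr = TDX_MIG_STREAM_ATTR_MBMD_SIZE;
  attr.addr = (uintptr_t)&mbmd_size;
  ioctl(stream->fd, KVM_GET_DEVICE_ATTR, &attr);

  /* 3. mmap each part from the device fd (likewise for buf_list,
   *    mac_list and gpa_list) */
  stream->mbmd = mmap(NULL, mbmd_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, stream->fd, 0);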
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Migration stream is a TDX concept, analogous to a "channel" in the TDX
module for migration data export and import. The default migration flow
(i.e. with multifd disabled) supports 1 channel to migrate data, so the
current TDX migration implements only 1 migration stream.
In KVM, the migration stream is backed by a piece of memory shared with
QEMU to map, so the QEMU-side TdxMigStream includes pointers that will
be set to the mapped memory.
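The QEMU-side container, sketched (field names are assumptions matching
the 4 shared-memory parts negotiated at setup):

  typedef struct TdxMigStream {
      int fd;           /* tdx-mig kvm_device fd */
      void *mbmd;       /* migration bundle metadata */
      void *buf_list;   /* migration data buffers */
      void *mac_list;   /* per-page MACs */
      void *gpa_list;   /* gpa list for the migrated pages */
  } TdxMigStream;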
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
tdx_mig_is_ready is invoked from the cgs migration framework to check
if the migration flow is ready to proceed. This requires that the TDX
pre-migration step has completed successfully.
With tdx_mig_is_ready being the first TDX function added to the cgs
migration framework APIs, create cgs-tdx.c to hold all the upcoming TDX
migration specific APIs to be added to the framework.
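A minimal sketch (tdx_premig_is_done() is a hypothetical predicate for
the pre-migration result):

  static bool tdx_mig_is_ready(void)
  {
      /* the MigTD-driven pre-migration step must have succeeded */
      return tdx_premig_is_done();
  }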
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
cgs_mig_loadvm_state_cleanup is invoked on the destination side when
migration is done, to invoke the vendor specific cleanup function to do
the cleanup work.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
The confidential guest migration starts with a kickoff message, which
may include vendor specific initial private states (e.g. TDX's TD-scope
immutable states) to load. At the end, the non-iterable private states
(e.g. vCPU states) need to be loaded.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>