Commit Graph

100060 Commits

Author SHA1 Message Date
dad3f53fda scsi-generic: replace logical block count of response of READ CAPACITY
Git-commit: 0000000000000000000000000000000000000000
References: [SUSE-JIRA] (SLE-20965)

When using SCSI passthrough, the following scenario leaves qemu unaware
of a capacity change on the remote scsi target:
1. online resize the scsi target.
2. issue 'rescan-scsi-bus.sh -s ...' in host.
3. issue 'rescan-scsi-bus.sh -s ...' in vm.

In the above scenario I experienced errors while accessing the
additional disk space in the vm. I think the reasonable sequence of
operations is:
1. online resize the scsi target.
2. issue 'rescan-scsi-bus.sh -s ...' in host.
3. issue 'block_resize' via qmp to notify qemu.
4. issue 'rescan-scsi-bus.sh -s ...' in vm.

The errors disappear once I notify qemu by block_resize via qmp.

So this patch replaces the number of logical blocks in the READ CAPACITY
response from the scsi target with qemu's bs->total_sectors. If the user
in the vm wants to access the additional disk space, the administrator of
the host must notify qemu after resizing the scsi target.
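
A minimal sketch of the substitution described above, written in Python
rather than QEMU's C. It assumes the 8-byte READ CAPACITY(10) layout
(big-endian last LBA followed by block length) and that total_sectors is
counted in 512-byte units; the function name is hypothetical and this is
not the code from the patch.

```python
import struct

# Assumption: total_sectors is expressed in 512-byte units.
BDRV_SECTOR_SIZE = 512


def patch_read_capacity_10(response: bytes, total_sectors: int) -> bytes:
    """Return a copy of READ CAPACITY(10) data whose last LBA is derived
    from total_sectors instead of the value reported by the target."""
    if len(response) < 8:
        raise ValueError("READ CAPACITY(10) data is 8 bytes long")
    # Bytes 0-3: last logical block address, bytes 4-7: block length.
    _target_last_lba, block_len = struct.unpack(">II", response[:8])
    if block_len == 0:
        raise ValueError("block length must be non-zero")
    nb_blocks = total_sectors * BDRV_SECTOR_SIZE // block_len
    # The field holds the address of the last block, capped at 32 bits.
    last_lba = min(max(nb_blocks - 1, 0), 0xFFFFFFFF)
    return struct.pack(">II", last_lba, block_len) + response[8:]
```

With a target that already reports 8 GiB and a total_sectors value that
still corresponds to 4 GiB, the guest would keep seeing 4 GiB until
block_resize updates the value.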

A bonus is that libvirt's domblkinfo and the guest report consistent
capacity information even when block_resize has not been issued in qemu.
E.g:
...
    <disk type='block' device='lun'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/sdc' index='1'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
...

Before:
1. online resize the scsi target.
2. host:~  # rescan-scsi-bus.sh -s /dev/sdc
3. guest:~ # rescan-scsi-bus.sh -s /dev/sda
4. host:~  # virsh domblkinfo --domain $DOMAIN --human --device sda
Capacity:       4.000 GiB
Allocation:     0.000 B
Physical:       8.000 GiB

5. guest:~ # lsblk /dev/sda
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda      8:0    0   8G  0 disk
└─sda1   8:1    0   2G  0 part

After:
1. online resize the scsi target.
2. host:~  # rescan-scsi-bus.sh -s /dev/sdc
3. guest:~ # rescan-scsi-bus.sh -s /dev/sda
4. host:~  # virsh domblkinfo --domain $DOMAIN --human --device sda
Capacity:       4.000 GiB
Allocation:     0.000 B
Physical:       8.000 GiB

5. guest:~ # lsblk /dev/sda
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda      8:0    0   4G  0 disk
└─sda1   8:1    0   2G  0 part

Signed-off-by: Lin Ma <lma@suse.com>
2023-12-06 10:44:55 +01:00
Bruce Rogers
ce4a21692d xen_disk: Add suse specific flush disable handling and map to QEMU equiv
Add code to read the suse specific suse-diskcache-disable-flush flag out
of xenstore, and set the equivalent flag within QEMU.

Patch taken from Xen's patch queue, Olaf Hering being the original author.
[bsc#879425]

[BR: minor edits to pass qemu's checkpatch script]
[BR: With qdevification of xen-block, code has changed significantly]
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Olaf Hering <olaf@aepfle.de>
2023-12-06 10:44:55 +01:00
Andreas Färber
3825434f12 Raise soft address space limit to hard limit
For SLES we want users to be able to use large memory configurations
with KVM without fiddling with ulimit -Sv.
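
As a rough illustration of what raising the soft limit means, here is a
sketch using Python's resource module. It is not the patch itself, which
is C code in QEMU; the function name is hypothetical.

```python
import resource


def raise_soft_as_limit_to_hard():
    """Set the soft RLIMIT_AS (what `ulimit -Sv` shows) to the hard limit
    and return the resulting (soft, hard) pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the
        # hard limit, so no extra privileges are needed here.
        resource.setrlimit(resource.RLIMIT_AS, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_AS)
```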

Signed-off-by: Andreas Färber <afaerber@suse.de>
[BR: add include for sys/resource.h]
Signed-off-by: Bruce Rogers <brogers@suse.com>
2023-12-06 10:44:55 +01:00
Bruce Rogers
d1b2091899 qemu-bridge-helper: reduce security profile
References: boo#988279

Change from using glib alloc and free routines to those
from libc. Also perform safety measure of dropping privs
to user if configured no-caps.

Signed-off-by: Bruce Rogers <brogers@suse.com>
[AF: Rebased for v2.7.0-rc2]
Signed-off-by: Andreas Färber <afaerber@suse.de>
2023-12-06 10:44:55 +01:00
Alexander Graf
bae107c3fc Make char muxer more robust wrt small FIFOs
Virtio-Console can only process one character at a time. Using it on S390
gave me strange "lags" where I got the character I pressed before when
pressing one. So I typed in "abc" and only received "a", then pressed "d"
but the guest received "b" and so on.

While the stdio driver calls a poll function that just processes on its
queue in case virtio-console can't take multiple characters at once, the
muxer does not have such callbacks, so it can't empty its queue.

To work around that limitation, I introduced a new timer that only gets
active when the guest can not receive any more characters. In that case
it polls again after a while to check if the guest is now receiving input.

This patch fixes input when using -nographic on s390 for me.

[AF: Rebased for v2.7.0-rc2]
[BR: minor edits to pass qemu's checkpatch script]
Signed-off-by: Bruce Rogers <brogers@suse.com>
2023-12-06 10:44:55 +01:00
Alexander Graf
54669a99cd PPC: KVM: Disable mmu notifier check
When using hugetlbfs (which is required for HV mode KVM on 970), we
check for MMU notifiers that on 970 can not be implemented properly.

So disable the check for mmu notifiers on PowerPC guests, making
KVM guests work there, even if possibly racy in some odd circumstances.

Signed-off-by: Bruce Rogers <brogers@suse.com>
2023-12-06 10:44:54 +01:00
Alexander Graf
b8f20f22b3 linux-user: lseek: explicitly cast non-set offsets to signed
When doing lseek, SEEK_SET indicates that the offset is an unsigned variable.
Other seek types have parameters that can be negative.

When converting from 32bit to 64bit parameters, we need to take this into
account and enable SEEK_END and SEEK_CUR to be negative, while SEEK_SET stays
absolute positioned which we need to maintain as unsigned.
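
The widening rule can be sketched as follows. This is an illustration in
Python with a hypothetical helper name, assuming a 32-bit guest offset
being widened to 64 bits; it is not the linux-user code.

```python
SEEK_SET, SEEK_CUR, SEEK_END = 0, 1, 2


def widen_lseek_offset(raw32: int, whence: int) -> int:
    """Widen a 32-bit guest offset: SEEK_SET stays unsigned (absolute),
    SEEK_CUR and SEEK_END are sign-extended so they can be negative."""
    raw32 &= 0xFFFFFFFF
    if whence == SEEK_SET:
        return raw32
    # Sign-extend: a set top bit means a negative relative offset.
    return raw32 - (1 << 32) if raw32 & 0x80000000 else raw32
```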

Signed-off-by: Alexander Graf <agraf@suse.de>
2023-12-06 10:44:54 +01:00
Alexander Graf
b6fad87c88 linux-user: use target_ulong
Linux syscalls pass pointers or data length or other information of that sort
to the kernel. This is all stuff you don't want to have sign extended.
Otherwise, a size parameter widened into a 64-bit host variable can be
sign-extended into a negative number, breaking lseek for example.

Pass syscall arguments as ulong always.

Signed-off-by: Alexander Graf <agraf@suse.de>
[JRZ: changes from linux-user/qemu.h were moved to linux-user/user-internals.h]
Signed-off-by: Jose R Ziviani <jziviani@suse.de>
[DF: Forward port, i.e., use ulong for do_prctl too]
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-06 10:44:54 +01:00
Alexander Graf
a26e18306b linux-user: Fake /proc/cpuinfo
Fedora 17 for ARM reads /proc/cpuinfo and fails if it doesn't contain
ARM related contents. This patch implements a quick hack to expose real
/proc/cpuinfo data taken from a real world machine.

The real fix would be to generate at least the flags automatically based
on the selected CPU. Please do not submit this patch upstream until this
has happened.

Signed-off-by: Alexander Graf <agraf@suse.de>
[AF: Rebased for v1.6 and v1.7]
Signed-off-by: Andreas Färber <afaerber@suse.de>
[DF: Restructured it a bit, to make ARM look like other arch-es]
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-06 10:44:53 +01:00
Andreas Färber
b8f900de8a qemu-binfmt-conf: Modify default path
Change QEMU_PATH from /usr/local/bin to /usr/bin prefix.

Signed-off-by: Andreas Färber <afaerber@suse.de>
2023-12-06 10:44:53 +01:00
Bruce Rogers
aa17838f4b Revert "roms/efirom, tests/uefi-test-tools: update edk2's own submodules first"
This reverts commit ec87b5daca.

No need. In our build system submodules are checked out.

Signed-off-by: Bruce Rogers <brogers@suse.com>
[DF: Rebased on top of 6.2.0]
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-06 10:44:53 +01:00
Bruce Rogers
b3889651c0 roms/Makefile: add --cross-file to qboot meson setup for aarch64
We add a --cross-file reference so that we can do cross compilation
of qboot from an aarch64 build.

Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-06 10:44:52 +01:00
Bruce Rogers
4416a893bf roms: change cross compiler naming to be suse specific
Signed-off-by: Bruce Rogers <brogers@suse.com>
2023-12-06 10:44:52 +01:00
Bruce Rogers
d57bc5e264 roms/Makefile: pass a packaging timestamp to subpackages with date info
References: bsc#1011213

Certain rom subpackages built from qemu git-submodules call the date
program to include date information in the packaged binaries. This
causes repeated builds of the package to differ, where the only
real difference is that the build timestamp has changed. To promote
reproducible builds and avoid customers being prompted to update
packages needlessly, we'll use the timestamp of the VERSION file as the
packaging timestamp for all packages that build in a timestamp for
whatever reason.

Signed-off-by: Bruce Rogers <brogers@suse.com>
2023-12-06 10:44:51 +01:00
addf977986 [openSUSE] Adapt the package file to this new, special version
This includes:
- base QEMU version is 7.2.0 (not 7.1)
- update the spec files, so that they work with QEMU 7.2
- disable xen
- build only for x86_64
- refresh seabios version

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-06 10:42:38 +01:00
0f0a481d7f [openSUSE] Update submodule references
Make sure we use the branches of the submodule repositories that have
our downstream patches applied.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-05 08:23:09 +01:00
fcffc18e6b [openSUSE] Import (from 7.1.0-sle15sp5) downstream package files
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
2023-12-05 08:22:53 +01:00
Chao Gao
177b3f0f25 util/aio: Defer disabling poll mode as long as possible
When we measure FIO read performance (cache=writethrough, bs=4k,
iodepth=64) in VMs, ~80K/s notifications (e.g., EPT_MISCONFIG) are observed
from guest to qemu.

It turns out those frequent notifications are caused by interference from
worker threads. Worker threads queue bottom halves after completing IO
requests.  Pending bottom halves may cause either aio_compute_timeout()
to zero the timeout and pass it to try_poll_mode(), or
run_poll_handlers() to return no progress after noticing pending
aio_notify() events. Both cause run_poll_handlers() to call
poll_set_started(false) to disable poll mode. However, in both cases, as
the timeout is already zeroed, the event loop (i.e., aio_poll()) just
processes bottom halves and then starts the next event loop iteration.
So, disabling poll mode has no value but leads to unnecessary
notifications from the guest.

To minimize unnecessary notifications from guest, defer disabling poll
mode to when the event loop is about to be blocked.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Message-id: 20220710120849.63086-1-chao.gao@intel.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2023-11-28 17:14:58 +02:00
Xiaocheng Dong
505ef4fc71 kvm/tdx: Set vTPM to enabled when vTPM is initialized successfully
The VMM sets vTPM to enabled when the vTPM is initialized successfully.
Otherwise, TDVMCALL_SERVICE.Query reports that vTPM is unsupported and
OVMF will not try to initialize vTPM anymore.

Signed-off-by: Xiaocheng Dong <xiaocheng.dong@intel.com>
2023-11-28 17:14:58 +02:00
Mauro Matteo Cascella
1a7e5e4ac0 ui/vnc-clipboard: fix infinite loop in inflate_buffer (CVE-2023-3255)
A wrong exit condition may lead to an infinite loop when inflating a
valid zlib buffer containing some extra bytes in the `inflate_buffer`
function. The bug only occurs post-authentication. Return the buffer
immediately if the end of the compressed data has been reached
(Z_STREAM_END).
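
The exit condition can be illustrated with Python's zlib module, where
the `eof` attribute of a decompression object plays the role of
Z_STREAM_END. This is only a sketch of the idea of returning as soon as
the stream ends, even if input bytes remain; it is not the QEMU function.

```python
import zlib


def inflate_buffer(data: bytes, chunk: int = 64) -> bytes:
    """Inflate a zlib stream, ignoring any bytes after its end."""
    d = zlib.decompressobj()
    out = bytearray()
    pos = 0
    while pos < len(data):
        out += d.decompress(data[pos:pos + chunk])
        pos += chunk
        if d.eof:
            # End of compressed data reached: return immediately instead
            # of looping on leftover input that will never be consumed.
            return bytes(out)
    raise ValueError("truncated zlib stream")
```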

Fixes: CVE-2023-3255
Fixes: 0bf41cab ("ui/vnc: clipboard support")
Reported-by: Kevin Denis <kevin.denis@synacktiv.com>
Signed-off-by: Mauro Matteo Cascella <mcascell@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Tested-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-ID: <20230704084210.101822-1-mcascell@redhat.com>
2023-11-28 17:14:58 +02:00
Daniel P. Berrangé
79b83c46a3 io: remove io watch if TLS channel is closed during handshake
The TLS handshake may take some time to complete, during which time an
I/O watch might be registered with the main loop. If the owner of the
I/O channel invokes qio_channel_close() while the handshake is waiting
to continue the I/O watch must be removed. Failing to remove it will
later trigger the completion callback which the owner is not expecting
to receive. In the case of the VNC server, this results in a SEGV as
vnc_disconnect_start() tries to shutdown a client connection that is
already gone / NULL.

CVE-2023-3354
Reported-by: jiangyegen <jiangyegen@huawei.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2023-11-28 17:14:58 +02:00
zhenwei pi
a9e07364f4 virtio-crypto: verify src&dst buffer length for sym request
For symmetric algorithms, the length of the ciphertext must be the same
as that of the plaintext.
The missing verification of src_len and dst_len in
virtio_crypto_sym_op_helper() may lead to a buffer overflow or an
information leak.

This patch was originally written by Yiming Tao for QEMU-SECURITY;
resend it (with a few changes to the error message) to qemu-devel.

Fixes: CVE-2023-3180
Fixes: 04b9b37edda("virtio-crypto: add data queue processing handler")
Cc: Gonglei <arei.gonglei@huawei.com>
Cc: Mauro Matteo Cascella <mcascell@redhat.com>
Cc: Yiming Tao <taoym@zju.edu.cn>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Message-Id: <20230803024314.29962-2-pizhenwei@bytedance.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 9d38a84347)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2023-11-28 17:14:58 +02:00
Alexander Bulekov
3bd558cca6 memory: prevent dma-reentracy issues
Add a flag to the DeviceState that is set while a device is engaged in
PIO/MMIO/DMA.
This flag is set/checked prior to calling a device's MemoryRegion
handlers, and set when device code initiates DMA.  The purpose of this
flag is to prevent two types of DMA-based reentrancy issues:

1.) mmio -> dma -> mmio case
2.) bh -> dma write -> mmio case

These issues have led to problems such as stack-exhaustion and
use-after-frees.
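
The guard idea can be sketched like this. The class, the flag name and
the helper are illustrative stand-ins written in Python, assuming a
single per-device boolean; they are not QEMU's data structures.

```python
class Device:
    """Stand-in for a device with a per-device reentrancy flag."""

    def __init__(self):
        self.engaged_in_io = False


def guarded_access(dev, handler):
    """Run handler unless the device is already inside one of its own
    handlers; return False when the access is refused."""
    if dev.engaged_in_io:
        return False  # e.g. the mmio -> dma -> mmio case
    dev.engaged_in_io = True
    try:
        handler()
    finally:
        dev.engaged_in_io = False
    return True
```

A handler that triggers another guarded access to the same device gets
False back instead of recursing.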

Summary of the problem from Peter Maydell:
https://lore.kernel.org/qemu-devel/CAFEAcA_23vc7hE3iaM-JVA6W38LK4hJoWae5KcknhPRD5fPBZA@mail.gmail.com

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/62
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/540
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/541
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/556
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/557
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/827
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1282
Resolves: CVE-2023-0330

Signed-off-by: Alexander Bulekov <alxndr@bu.edu>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20230427211013.2994127-2-alxndr@bu.edu>
[thuth: Replace warn_report() with warn_report_once()]
Signed-off-by: Thomas Huth <thuth@redhat.com>
2023-11-28 17:14:58 +02:00
Christian Schoenebeck
dcb2c79d00 9pfs: prevent opening special files (CVE-2023-2861)
The 9p protocol does not specifically define how server shall behave when
client tries to open a special file, however from security POV it does
make sense for 9p server to prohibit opening any special file on host side
in general. A sane Linux 9p client for instance would never attempt to
open a special file on host side, it would always handle those exclusively
on its guest side. A malicious client however could potentially escape
from the exported 9p tree by creating and opening a device file on host
side.
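
One way to picture such a check is to open the path and then refuse
anything that is neither a regular file nor a directory. The sketch
below is Python with a hypothetical helper name; the ENXIO error code
and the use of O_NONBLOCK (to avoid blocking on FIFOs) are assumptions
of this example, not a statement of what the patch does.

```python
import errno
import os
import stat


def open_regular_or_dir(path: str, flags: int = os.O_RDONLY) -> int:
    """Open path and return the fd, rejecting device nodes, FIFOs and
    sockets."""
    fd = os.open(path, flags | os.O_NOFOLLOW | os.O_NONBLOCK)
    try:
        mode = os.fstat(fd).st_mode
        if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
            raise OSError(errno.ENXIO, "refusing special file", path)
    except BaseException:
        # Never leak the descriptor of a rejected file.
        os.close(fd)
        raise
    return fd
```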

With QEMU this could only be exploited in the following unsafe setups:

  - Running QEMU binary as root AND 9p 'local' fs driver AND 'passthrough'
    security model.

or

  - Using 9p 'proxy' fs driver (which is running its helper daemon as
    root).

These setups were already discouraged for safety reasons before,
however for obvious reasons we are now tightening behaviour on this.

Fixes: CVE-2023-2861
Reported-by: Yanwu Shen <ywsPlz@gmail.com>
Reported-by: Jietao Xiao <shawtao1125@gmail.com>
Reported-by: Jinku Li <jkli@xidian.edu.cn>
Reported-by: Wenbo Shen <shenwenbo@zju.edu.cn>
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Michael Tokarev <mjt@tls.msk.ru>
Message-Id: <E1q6w7r-0000Q0-NM@lizzy.crudebyte.com>
(cherry picked from commit f6b0de53fb)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
(Mjt: drop adding qemu_fstat wrapper for 7.2 where wrappers aren't used)
2023-11-28 17:14:58 +02:00
Binbin Wu
e44d43d8c4 memory: Convert only low memory to private at memory slot creation for TD
Converting all memory to private at memory slot creation has a boot
performance impact for TDs with large memory. Only convert low memory
(0~2GB) to private, to avoid an exit to userspace for each 4K page,
because TDVF will accept about 2GB of low memory during boot.
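
A sketch of the range selection, assuming the boundary is exactly 2 GiB
and that a slot is described by a start address and a size. The helper
name is hypothetical and the real code may handle boundaries
differently.

```python
LOW_MEM_LIMIT = 2 << 30  # assumed 2 GiB boundary


def private_range_at_slot_creation(start: int, size: int):
    """Return the (start, size) part of a slot below 2 GiB that would be
    converted to private up front, or None if the slot lies above it."""
    end = min(start + size, LOW_MEM_LIMIT)
    if end <= start:
        return None
    return (start, end - start)
```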

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:58 +02:00
Wei Wang
6fbf65fc33 memory: init TD guest ram with SHARED attr when migration is enabled
Fix BLR-932

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
2023-11-28 17:14:58 +02:00
Lei Wang
1227330e7e migration: cleanup using tdx_mig.nr_streams to avoid segfault
Cleaning up migration streams before they are set up causes a segfault.
To avoid this, use tdx_mig.nr_streams rather than nr_channels.

Opportunistically remove the unnecessary parameter "nr_channels" in the
following 2 callback functions in cgs_mig:

    cgs_mig.savevm_state_cleanup
    cgs_mig.loadvm_state_cleanup

This fixes LFE-9211 and LFE-9481.

Signed-off-by: Lei Wang <lei4.wang@intel.com>
2023-11-28 17:14:58 +02:00
Xiaocheng Dong
675b07c6f9 i386/tdx: Fix Hyper-Threading status in TDX guest to false when smp=1
Co-authored-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Xiaocheng Dong <xiaocheng.dong@intel.com>
2023-11-28 17:14:57 +02:00
Qian Wen
20a1248b75 i386: Replace SPR-TDX CPU model with upstream one
Since the SPR CPU model patch has been merged upstream, remove the
internal version and use the upstream one instead.

 - Internal CPU model:
   "target/i386: add SPR-TDX CPU model"
   66f1b579fa73aa3e00737e91776ca1f2567ca1a4

 - Upstreamed CPU model:
   "i386: Add new CPU model SapphireRapids"
   7eb061b06e

Signed-off-by: Qian Wen <qian.wen@intel.com>
2023-11-28 17:14:57 +02:00
Qian Wen
2b7c2e66dd i386/tdx: Update tdx_cpuid_lookup[]
Add several fixed0/fixed1 type CPUIDs to TDX CPUID lookup table according
to the CPUID Virtualization in TDX Module v1.5 ABI Specification.

Regarding the added CPUIDs, the difference between the public spec of
TDX 1.0 and 1.5 is shown as follows:
                                          TDX 1.0               TDX 1.5
 **CPUID.0x01:ECX.DTES64[2]               native                fixed1**
 **CPUID.0x01:ECX.DSCPL[4]                native                fixed1**
   CPUID.0x07:EDX.SGX-KEYS[1]             fixed0                fixed0
 **CPUID.0x07:EDX.L1D_FLUSH[28]           native                fixed1**
   CPUID.0x12.0x0:EAX[31:0]               fixed0                fixed0
   CPUID.0x12.0x0:EBX[31:0]               fixed0                fixed0
   CPUID.0x12.0x1:EAX[31:0]               fixed0                fixed0

Although the virtualization of three CPUID bits differs between TDX 1.0
and 1.5, this does not break compatibility, since
update_tdx_cpuid_lookup_by_tdx_caps() uses TDSYSINFO_STRUCT.cpuid to
drop unsupported bits.
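
To make the table concrete, here is a sketch of how fixed bits could be
applied to a native CPUID register value. It assumes "fixed0" means the
bit always reads 0 and "fixed1" means it always reads 1; the helper is
hypothetical and the real lookup table has more fields.

```python
def apply_fixed_bits(native: int, fixed0: int, fixed1: int) -> int:
    """Force fixed1 bits to 1 and fixed0 bits to 0 in a 32-bit value."""
    return ((native | fixed1) & ~fixed0) & 0xFFFFFFFF


# Bit positions taken from the table above.
ECX_01_FIXED1 = (1 << 2) | (1 << 4)   # DTES64, DSCPL
EDX_07_FIXED0 = 1 << 1                # SGX-KEYS
EDX_07_FIXED1 = 1 << 28               # L1D_FLUSH
```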

Signed-off-by: Qian Wen <qian.wen@intel.com>
2023-11-28 17:14:57 +02:00
Binbin Wu
6e53fc839c softmmu/physmem: Make ram_block_convert_range() 2MB compaitble
Make the alignment check against 4KB instead of the ramblock's page
size, so that the function also works with 2MB pages.

Also, cgs_bmap should always be calculated using a 4KB page size instead
of the ramblock's page size.
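
The arithmetic can be sketched as follows, assuming one bitmap bit per
4KB page. The constant and function names are hypothetical.

```python
CGS_PAGE_SIZE = 4096  # the bitmap granularity is always 4KB


def cgs_bitmap_range(offset: int, length: int):
    """Check 4KB alignment and return (first_bit, nr_bits) for the
    range, independent of the ramblock's own page size."""
    if offset % CGS_PAGE_SIZE or length % CGS_PAGE_SIZE:
        raise ValueError("range is not 4KB aligned")
    return offset // CGS_PAGE_SIZE, length // CGS_PAGE_SIZE
```

A single 2MB page at offset 2MB therefore covers 512 bits starting at
bit 512, and a lone 4KB range inside a 2MB-page ramblock is still
accepted.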

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Binbin Wu
1ea55257a0 MemoryBackendMemfdPrivate: Make the mount point path optional
Make the mount point path optional for MemoryBackendMemfdPrivate so
that users do not need to specify the path to create a TD without 2MB
page support.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Isaku Yamahata
a9e2064e49 confidential guest support, KVM/TDX: Disable pv clock for guest TD
KVM TDX guest doesn't allow KVM clock.  Although guest TD doesn't use a KVM
clock, qemu creates it by default with i386 KVM enabled.  When guest TD
crashes and KVM_RUN returns -EIO, the following message is shown.
  KVM_GET_CLOCK failed: Input/output error
The message confuses the user and misleads the debug process.  Don't create
KVM_CLOCK when confidential computing is enabled, and it has a property to
disable the pv clock.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Isaku Yamahata
6e8132e3ab memfd_restricted: Allow any memory backend for HostMemoryBackend
For large page support, it's convenient to be able to use any backend as
shared memory for memory-backend-memfd-private.  Add shmemdev property to
specify memory backend.

Example:
-object memory-backend-memfd,id=ramhuge,size=6144M,hugetlb=on,hugetlbsize=2M
-object memory-backend-memfd-private,id=ram1,size=6144M,hugetlb=on,hugetlbsize=2M,shmemdev=ramhuge
-machine memory-backend=ram1

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Isaku Yamahata
cf015bf13b hostmem: Make HostMemoryBackend::mr pointer
Make HostMemoryBackend::mr a pointer and add the pointee to
HostMemoryBackend. Initialize mr in
host_memory_backend_memory_complete(). This allows the restricted memfd
memory backend to use any host memory backend.

No functional change, just add one pointer as indirection.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Isaku Yamahata
3172e362ec TDX: Make memory type private explicitly at the memory slot creation
By default (due to the recent UPM change), restricted memory attribute is
shared.  Convert the memory region from shared to private at the memory
slot creation time.

Without this patch
- Secure-EPT violation on private area
- KVM_MEMORY_FAULT EXIT (kvm->qemu)
- qemu converts the 4K page from shared to private
- Resume VCPU execution
- Secure-EPT violation again
- KVM resolves EPT Violation
This also prevents huge page because page conversion is done at 4K
granularity.  Although it's possible to merge 4K private mapping into
2M large page, it slows guest boot.

With this patch
- After memory slot creation, convert the region from shared to private
- Secure-EPT violation on private area.
- KVM resolves EPT Violation

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Isaku Yamahata
c159594a2f memfd_restricted: Allow memfd_restricted to specify path for mount point
Property for tmpfs can be utilized.  It's needed for large page support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Isaku Yamahata
70b6bc1f21 memfd_restricted: Update for memfd_restricted
memfd has a patch to specify mountpoint for properties of tmpfs.
This patch catches up memfd_restricted for the change.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com
2023-11-28 17:14:57 +02:00
Isaku Yamahata
e2fd42edbc uitl: Add open_tree syscall stub
open_tree() Linux system call is needed to support memfd_restricted().  Add
stub for it.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
2023-11-28 17:14:57 +02:00
Feng, Jialei
5da2d79b71 linux-user: sync syscall_64.tbl
Signed-off-by: Feng, Jialei <jialei.feng@intel.com>
2023-11-28 17:14:57 +02:00
Xiaocheng Dong
c68f6c95ac *** HACK *** TDX vmcall service use saved APIC ID
The current guest kernel registers the IRQ to one vCPU and calls
VMCALL.<SETUPEVENTNOTIFYINTERRUPT> to save the interrupt vector and APIC
ID; VMCALL.<Service> and VMCALL.<GETQUOTE> reuse the same IRQ. But QEMU
notifies the vCPU that calls VMCALL.<Service>.

This patch is a workaround to notify the saved vCPU APIC ID. There is a
potential risk for current vTPM TD interrupt implementation for
multiple vCPUs.

Signed-off-by: Xiaocheng Dong <xiaocheng.dong@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
391c541664 kvm/tdx: Fix use after free when resize the socket recv buffer
The size is used after it was freed, which leads to an
incorrect buffer size, a later memory write overflow,
and a QEMU runtime crash.

Fixes: 9a5b26f590 ("kvm/tdx: Introduce SocketRecvBuffer")
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
680b9f336b kvm/tdx: Fix write overflow when parse user id from vTPM's instance management QMP
The user id is treated as a raw byte stream rather than a
C-style string, so the terminating null byte is not needed.
Use memcpy() instead of strcpy(), which appends a terminating
null byte and caused the write overflow.

Fixes: 9cf69dadac ("kvm/tdx: QMP protocol for create/destroy vTPM instance")
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
f77734f669 kvm/tdx: Fix incorrect copy size of vTPM client's user_id
The copy size was fixed to sizeof(vms->vtpm_userid), which
caused a buffer overread when user_id is a plain text
string. Fix this by copying only as many bytes as the
plain text length.

Fixes: 13c835fa89 ("kvm/tdx: Introduce vTPM client/server vmcall service")
Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
b5cafa9dc4 kvm/tdx: Fix potential nested locks in vTPM Server
Disconnection can happen with or without the lock
held, so adjust the locking behavior based on the
locking state.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
cffa142905 kvm/tdx: return DEVICE_ERROR for all request of vTPM client when disconnected
Make sure the service call returns to guest for disconnected event.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
b9e71c8209 kvm/tdx: Refactor: Common helper tdx_vtpm_client_check_pending_request()
Factor out tdx_vtpm_client_check_pending_request() as a common
helper to check for a pending request; it can be used to return
an error state in the disconnected case.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yuan Yao
096e51bfcd kvm/tdx: Fix size checking in SocketRecvBuffer
Correctly handle trans_protocal types that have no payload
part, only a head part.

Fixes: 619ad4e6f4 ("kvm/tdx: Introduce SocketRecvBuffer")

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yao Yuan
a43b17b361 kmv/tdx: Introduce CONNECTING/CONNECTED/DISCONNECTED vTPM client state
These states are used to manage the STREAM socket communication,
make sure the QIOChannelSocket is freed properly and return
DEVICE_ERROR to user for disconnected case.

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00
Yao Yuan
a038e1f049 kvm/tdx: Intdoduce ZOMBIE Session state
The ZOMBIE state is used for a session that has been removed
from the client_session set but has not yet been destroyed
by the vTPM server.

The ZOMBIE session will be freed by the peer disconnection
event. A common case for a ZOMBIE session is: >= 2 TDs
connecting to the same vTPM server with the same user id. In
this case the later one to connect kicks out the existing
earlier one, and the earlier one's session is marked ZOMBIE.
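
The lifecycle described above can be sketched as a small state machine.
Everything here (class names, the ACTIVE state, the dict keyed by user
id) is an assumption made for illustration in Python and does not
mirror the actual vTPM server code.

```python
from enum import Enum


class SessionState(Enum):
    ACTIVE = 1
    ZOMBIE = 2


class Session:
    def __init__(self, user_id):
        self.user_id = user_id
        self.state = SessionState.ACTIVE


class VtpmServer:
    def __init__(self):
        self.client_session = {}  # user id -> active session
        self.zombies = []         # removed from the set, not yet freed

    def connect(self, user_id):
        # A later connection with the same user id kicks out the
        # earlier one, which becomes a ZOMBIE.
        old = self.client_session.pop(user_id, None)
        if old is not None:
            old.state = SessionState.ZOMBIE
            self.zombies.append(old)
        session = Session(user_id)
        self.client_session[user_id] = session
        return session

    def peer_disconnected(self, session):
        # The peer disconnection event is what finally frees a ZOMBIE.
        if session.state is SessionState.ZOMBIE:
            self.zombies.remove(session)
        elif self.client_session.get(session.user_id) is session:
            del self.client_session[session.user_id]
```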

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
2023-11-28 17:14:57 +02:00