Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
functions. These functions handle the pre-copy and stop-and-copy phases.
In the _SAVING|_RUNNING device state, i.e. the pre-copy phase:
- read pending_bytes. If pending_bytes > 0, go through the steps below.
- read data_offset - this tells the kernel driver to write data to the
  staging buffer.
- read data_size - the amount of data in bytes written by the vendor driver
  into the migration region.
- read data_size bytes of data from data_offset in the migration region.
- Write the data packet to the file stream as below:
  {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
   VFIO_MIG_FLAG_END_OF_STATE}
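As an illustration (not the literal patch), a single data packet could be
produced roughly as below; vfio_mig_read() and the VFIOMigration fields are
placeholder names, and the END_OF_STATE delimiter is left to the caller so
that the same helper also serves the stop-and-copy loop sketched further
below:

    /* Sketch only: helper and field names approximate this series. */
    static int vfio_save_data_packet(QEMUFile *f, VFIODevice *vbasedev)
    {
        VFIOMigration *migration = vbasedev->migration;
        uint64_t data_offset, data_size;
        uint8_t *buf;
        int ret;

        /* Reading data_offset tells the vendor driver to stage the data. */
        ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
                            migration->region.fd_offset +
                            offsetof(struct vfio_device_migration_info,
                                     data_offset));
        if (ret < 0) {
            return ret;
        }

        ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
                            migration->region.fd_offset +
                            offsetof(struct vfio_device_migration_info,
                                     data_size));
        if (ret < 0) {
            return ret;
        }

        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
        qemu_put_be64(f, data_size);

        if (data_size) {
            buf = g_try_malloc(data_size);
            if (!buf) {
                return -ENOMEM;
            }
            ret = vfio_mig_read(vbasedev, buf, data_size,
                                migration->region.fd_offset + data_offset);
            qemu_put_buffer(f, buf, data_size);
            g_free(buf);
            if (ret < 0) {
                return ret;
            }
        }

        /* Caller appends VFIO_MIG_FLAG_END_OF_STATE: after each packet in
         * pre-copy, once at the end of the stop-and-copy loop. */
        return qemu_file_get_error(f);
    }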
In the _SAVING device state, i.e. the stop-and-copy phase:
a. read the config space of the device and save it to the migration file
   stream. This doesn't need to come from the vendor driver; any other
   special config state from the driver can be saved as data in the
   following iterations.
b. read pending_bytes. If pending_bytes > 0, go through the steps below.
c. read data_offset - this tells the kernel driver to write data to the
   staging buffer.
d. read data_size - the amount of data in bytes written by the vendor driver
   into the migration region.
e. read data_size bytes of data from data_offset in the migration region.
f. Write the data packet as below:
   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
g. iterate through steps b to f while (pending_bytes > 0)
h. Write {VFIO_MIG_FLAG_END_OF_STATE}
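A rough sketch of the loop over steps b to h, reusing the packet helper from
the pre-copy sketch above; vfio_update_pending() is a placeholder for reading
pending_bytes from the migration region:

    /* Sketch only: placeholder helper names. */
    static int vfio_save_device_data(QEMUFile *f, VFIODevice *vbasedev)
    {
        uint64_t pending_bytes;
        int ret;

        for (;;) {
            ret = vfio_update_pending(vbasedev, &pending_bytes);  /* step b */
            if (ret < 0) {
                return ret;
            }
            if (!pending_bytes) {
                break;                                            /* step g */
            }
            ret = vfio_save_data_packet(f, vbasedev);             /* c to f */
            if (ret < 0) {
                return ret;
            }
        }

        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);             /* step h */
        return qemu_file_get_error(f);
    }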
When the data region is mapped, it is the user's responsibility to read
data_size bytes of data from data_offset before moving to the next step.
Added a fix suggested by Artem Polyakov to reset pending_bytes in
vfio_save_iterate().
Added a fix suggested by Zhi Wang to add 0 as the data size in the migration
stream and add the END_OF_STATE delimiter to indicate that the phase is
complete.
Suggested-by: Artem Polyakov <artemp@nvidia.com>
Suggested-by: Zhi Wang <zhi.wang.linux@gmail.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Define flags to be used as delimiters in the migration stream for VFIO
devices.
Added .save_setup and .save_cleanup functions. Map and unmap the migration
region from these functions at the source during the saving or pre-copy
phase.
Set the VFIO device state depending on the VM's state. During live migration,
the VM is running when .save_setup is called, so the _SAVING | _RUNNING state
is set for the VFIO device. During save-restore, the VM is paused, so the
_SAVING state is set for the VFIO device.
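For illustration, the state selection in .save_setup could look like the
sketch below; runstate_is_running() is the existing QEMU helper, while
vfio_migration_set_device_state() stands in for whatever setter this series
adds:

    /* Sketch only: the device_state setter is a placeholder name. */
    static int vfio_save_setup_sketch(QEMUFile *f, void *opaque)
    {
        VFIODevice *vbasedev = opaque;
        uint32_t state = VFIO_DEVICE_STATE_SAVING;

        if (runstate_is_running()) {
            /* live migration: the device keeps running during pre-copy */
            state |= VFIO_DEVICE_STATE_RUNNING;
        }
        /* savevm / paused VM: _SAVING alone, stop-and-copy only */

        return vfio_migration_set_device_state(vbasedev, state);
    }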
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added a migration state change notifier to get notified on migration state
changes. These states are translated to the VFIO device state and conveyed to
the vendor driver.
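The notifier, registered with add_migration_state_change_notifier(), might
look roughly like this; the states handled and the setter name are
illustrative:

    /* Sketch only: structure of the notifier, not the exact patch. */
    static void vfio_migration_state_notifier(Notifier *notifier, void *data)
    {
        MigrationState *s = data;
        VFIOMigration *migration = container_of(notifier, VFIOMigration,
                                                migration_state);
        VFIODevice *vbasedev = migration->vbasedev;

        switch (s->state) {
        case MIGRATION_STATUS_CANCELLING:
        case MIGRATION_STATUS_CANCELLED:
        case MIGRATION_STATUS_FAILED:
            /* migration stopped: put the device back to _RUNNING */
            vfio_migration_set_device_state(vbasedev,
                                            VFIO_DEVICE_STATE_RUNNING);
            break;
        default:
            break;
        }
    }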
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VM state change handler is called on a change in the VM's state. Based on
the VM state, the VFIO device state should be changed.
Added read/write helper functions for the migration region.
Added a function to set device_state.
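A sketch of the handler registered with qemu_add_vm_change_state_handler();
the mask/value convention of the setter is an assumption, not taken from the
patch:

    /* Sketch only: new device_state = (old_state & mask) | value. */
    static void vfio_vmstate_change(void *opaque, int running, RunState state)
    {
        VFIODevice *vbasedev = opaque;
        uint32_t mask, value;

        if (running) {
            /* VM resumed: set _RUNNING, leave the other bits untouched */
            mask = ~0u;
            value = VFIO_DEVICE_STATE_RUNNING;
        } else {
            /* VM stopped: clear _RUNNING, keep _SAVING/_RESUMING as-is */
            mask = ~VFIO_DEVICE_STATE_RUNNING;
            value = 0;
        }

        vfio_migration_set_state(vbasedev, mask, value);
    }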
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[aw: lx -> HWADDR_PRIx, remove redundant parens]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Whether the VFIO device supports migration or not is decided based on the
migration region query. If the migration region query and the migration
region initialization both succeed, migration is supported; otherwise
migration is blocked.
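A sketch of how the decision could be wired up at realize time;
vfio_migration_init() and the blocker handling are placeholders around the
existing vfio_get_dev_region_info() and migrate_add_blocker() APIs:

    /* Sketch only: block migration unless region query and init succeed. */
    static int vfio_migration_probe_sketch(VFIODevice *vbasedev, Error **errp)
    {
        struct vfio_region_info *info = NULL;
        Error *blocker = NULL;
        int ret;

        ret = vfio_get_dev_region_info(vbasedev,
                                       VFIO_REGION_TYPE_MIGRATION,
                                       VFIO_REGION_SUBTYPE_MIGRATION, &info);
        if (!ret) {
            ret = vfio_migration_init(vbasedev, info);  /* placeholder */
        }
        g_free(info);

        if (!ret) {
            return 0;                     /* migration supported */
        }

        /* query or init failed: add a migration blocker for this device */
        error_setg(&blocker, "VFIO device doesn't support migration");
        return migrate_add_blocker(blocker, errp);
    }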
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added functions to save and restore PCI device specific data, specifically
the config space of the PCI device.
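As a hedged illustration of the save side only, a couple of config registers
written to the stream; the real patch covers more state (interrupts, BARs,
MSI/MSI-X), and the function name is approximate:

    /* Sketch only: a minimal subset of config space saved to the stream. */
    static void vfio_pci_save_config_sketch(VFIOPCIDevice *vdev, QEMUFile *f)
    {
        PCIDevice *pdev = &vdev->pdev;
        int bar;

        qemu_put_be16(f, pci_default_read_config(pdev, PCI_COMMAND, 2));
        for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
            qemu_put_be32(f, pci_default_read_config(pdev,
                                                     PCI_BASE_ADDRESS_0 +
                                                     bar * 4, 4));
        }
    }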
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This function will be used for the migration region.
The migration region is mmapped when migration starts and is unmapped when
migration completes.
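For illustration, unmapping could follow the shape of the existing
vfio_region_exit() teardown, minus freeing the region itself; field names
match VFIORegion, the function name is a placeholder:

    /* Sketch only: drop the mmap'd windows, keep the region usable. */
    static void vfio_region_unmap_sketch(VFIORegion *region)
    {
        int i;

        if (!region->mem) {
            return;
        }

        for (i = 0; i < region->nr_mmaps; i++) {
            if (region->mmaps[i].mmap) {
                memory_region_del_subregion(region->mem,
                                            &region->mmaps[i].mem);
                munmap(region->mmaps[i].mmap, region->mmaps[i].size);
                object_unparent(OBJECT(&region->mmaps[i].mem));
                region->mmaps[i].mmap = NULL;
            }
        }
    }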
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
pc,pci,vhost,virtio: misc fixes
Just a bunch of bugfixes all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Fri 30 Oct 2020 12:44:31 GMT
# gpg: using RSA key 5D09FD0871C8F85B94CA8A0D281F0DB8D28D5469
# gpg: issuer "mst@redhat.com"
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream:
intel_iommu: Fix two misuse of "0x%u" prints
virtio: skip guest index check on device load
vhost-blk: set features before setting inflight feature
pci: Disallow improper BAR registration for type 1
pci: Change error_report to assert(3)
pci: advertise a page aligned ATS
pc: Implement -no-hpet as sugar for -machine hpet=on
vhost: Don't special case vq->used_phys in vhost_get_log_size()
pci: Assert irqnum is between 0 and bus->nirqs in pci_bus_change_irq_level
hw/pci: Extract pci_bus_change_irq_level() from pci_change_irq_level()
hw/virtio/vhost-vdpa: Fix Coverity CID 1432864
acpi/crs: Support ranges > 32b for hosts
acpi/crs: Prevent bad ranges for host bridges
vhost-vsock: set vhostfd to non-blocking mode
vhost-vdpa: negotiate VIRTIO_NET_F_STATUS with driver
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
qemu-sparc queue
# gpg: Signature made Wed 28 Oct 2020 08:17:23 GMT
# gpg: using RSA key CC621AB98E82200D915CC9C45BC2C56FAE0F321F
# gpg: issuer "mark.cave-ayland@ilande.co.uk"
# gpg: Good signature from "Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>" [full]
# Primary key fingerprint: CC62 1AB9 8E82 200D 915C C9C4 5BC2 C56F AE0F 321F
* remotes/mcayland/tags/qemu-sparc-20201028:
hw/pci-host/sabre: Simplify code initializing variable once
hw/pci-host/sabre: Remove superfluous address range check
hw/pci-host/sabre: Update documentation link
sabre: increase number of PCI bus IRQs from 32 to 64
hw/display/tcx: Allow 64-bit accesses to framebuffer stippler and blitter
sabre: don't call sysbus_mmio_map() in sabre_realize()
sparc32-ledma: don't reference nd_table directly within the device
sparc32-espdma: use object_initialize_child() for esp child object
sparc32-ledma: use object_initialize_child() for lance child object
sparc32-dma: use object_initialize_child() for espdma and ledma child objects
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
QEMU must be careful when loading device state off migration streams to
prevent a malicious source from exploiting the emulator. Overdoing these
checks has the side effect of allowing a guest to "pin itself" in cloud
environments by messing with state which is entirely in its control.
Similarly to what f3081539 achieved in usb_device_post_load(), this
commit removes such a check from virtio_load(). Worth noting, the result
of a load without this check is the same as if a guest enables a VQ with
invalid indexes to begin with. That is, the virtual device is set in a
broken state (by the datapath handler) and must be reset.
Signed-off-by: Felipe Franciosi <felipe@nutanix.com>
Message-Id: <20201028134643.110698-1-felipe@nutanix.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Prevent future developers working on root complexes, root ports, or
bridges that also wish to implement a BAR for those from shooting
themselves in the foot. PCI type 1 headers only support 2 base address
registers. When this mistake is made, the device is incorrect and it is
difficult to figure out what is wrong. With this change, it is immediate
and obvious what has gone wrong.
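The guard boils down to something like the sketch below; the exact assertion
added to pci_register_bar() may be worded differently:

    /* Sketch only: reject BAR numbers a type 1 header cannot hold. */
    static void pci_check_bar_for_hdr_type(PCIDevice *d, int region_num)
    {
        uint8_t hdr_type =
            d->config[PCI_HEADER_TYPE] & ~PCI_HEADER_TYPE_MULTI_FUNCTION;

        /* Type 1 (bridge) headers only define BAR0 and BAR1 */
        assert(hdr_type != PCI_HEADER_TYPE_BRIDGE || region_num < 2);
    }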
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Message-Id: <20201015181411.89104-2-ben.widawsky@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Asserts are used for developer bugs. As registering a BAR of the wrong
size is not something that should be possible for a user to achieve,
this is a developer bug.
While here, use the more obvious helper function.
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Message-Id: <20201015181411.89104-1-ben.widawsky@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
After Linux kernel commit 61363c1474b1 ("iommu/vt-d: Enable ATS only
if the device uses page aligned address."), ATS will be only enabled
if device advertises a page aligned request.
Unfortunately, vhost-net is the only user, and we didn't advertise the
aligned request capability in the past since both the vhost IOTLB and
address_space_get_iotlb_entry() can support non page aligned requests.
Though it's not clear whether the above kernel commit makes sense, let's
advertise a page aligned ATS here to make the vhost device IOTLB work
with the Intel IOMMU again.
Note that in the future we may extend pcie_ats_init() to accept
parameters like queue depth and page alignment.
Cc: qemu-stable@nongnu.org
Signed-off-by: Jason Wang <jasowang@redhat.com>
Message-Id: <20200909081731.24688-1-jasowang@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The first loop in vhost_get_log_size() computes the size of the dirty log
bitmap so that it can track changes in the entire guest memory, in
terms of GPA.
When not using a vIOMMU, the address of the vring's used structure,
vq->used_phys, is a GPA. It is thus already covered by the first loop.
When using a vIOMMU, vq->used_phys is a GIOVA that will be translated
to an HVA when the vhost backend needs to update the used structure. It
will log the corresponding GPAs into the bitmap but it certainly won't
log the GIOVA.
So in any case, vq->used_phys shouldn't be explicitly used to size the
bitmap. Drop the second loop.
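What remains is just the first loop, along the lines of the sketch below
(mirroring the existing code in hw/virtio/vhost.c):

    /* Sketch only: size the log from guest memory regions (GPAs) alone. */
    static uint64_t vhost_get_log_size_sketch(struct vhost_dev *dev)
    {
        uint64_t log_size = 0;
        int i;

        for (i = 0; i < dev->mem->nregions; ++i) {
            struct vhost_memory_region *reg = dev->mem->regions + i;
            uint64_t last = range_get_last(reg->guest_phys_addr,
                                           reg->memory_size);

            log_size = MAX(log_size, last / VHOST_LOG_CHUNK + 1);
        }
        return log_size;
    }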
This fixes a crash of the source when migrating a guest using in-kernel
vhost-net and iommu_platform=on on POWER, because DMA regions are put
over 0x800000000000000ULL. The resulting insanely huge log size causes
g_malloc0() to abort.
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1879349
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160208823418.29027.15172801181796272300.stgit@bahia.lan>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Fix uninitialized value issues reported by Coverity:
Field 'msg.reserved' is uninitialized when calling write().
Fixes: a5bd05800f ("vhost-vdpa: batch updating IOTLB mappings")
Reported-by: Coverity (CID 1432864: UNINIT)
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20201028154004.776760-1-philmd@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
According to the PCIe 5.0 spec, Type 1 header space Base Address Registers
are defined by 7.5.1.2.1 Base Address Registers (same as Type 0). The
_CRS region should allow for the same range (up to 64b). Prior to this
change, any host bridge utilizing more than 32b for the BAR would have
the address truncated, likely leading to conflicts when the operating
system reads the _CRS object.
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Message-Id: <20201026193924.985014-2-ben.widawsky@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Prevent _CRS resources from being quietly chopped off and instead throw an
assertion. _CRS is used by host bridges to declare regions of io and/or
memory that they consume. On some (all?) platforms the host bridge
doesn't have PCI header space, so they need some way to convey the
information.
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Message-Id: <20201026193924.985014-1-ben.widawsky@intel.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
The vhost IOTLB API uses read()/write() to exchange IOTLB messages with
the kernel module.
The QEMU implementation expects a non-blocking fd; indeed, commit
c471ad0e9b ("vhost_net: device IOTLB support") set it for vhost-net.
Without this patch, if we enable the IOMMU for the vhost-vsock device,
QEMU can hang when exchanging IOTLB messages.
As commit 894022e616 ("net: check if the file descriptor is valid
before using it") did for tap, let's use qemu_try_set_nonblock()
when the fd is provided by the user.
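The change amounts to something like the sketch below, using the existing
qemu_try_set_nonblock() helper; the wrapper and its error message are
illustrative:

    /* Sketch only: make a user-provided vhostfd non-blocking up front. */
    static int vhost_vsock_prepare_fd(int vhostfd, Error **errp)
    {
        int ret = qemu_try_set_nonblock(vhostfd);

        if (ret < 0) {
            error_setg_errno(errp, -ret,
                             "vhost-vsock: unable to set non-blocking mode");
        }
        return ret;
    }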
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20201029144849.70958-1-sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
ppc patch queue 2020-10-28
Here's the next pull request for ppc and spapr related patches, which
should be the last things for soft freeze. Includes:
* Numerous error handling cleanups from Greg Kurz
* Cleanups to cpu realization and hotplug handling from Greg Kurz
* A handful of other small fixes and cleanups
This does include a change to pc_dimm_plug() that isn't in my normal
areas of concern. That's there as a prerequisite for ppc specific
changes, and has an ack from Igor.
# gpg: Signature made Tue 27 Oct 2020 14:13:21 GMT
# gpg: using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg: aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg: aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" [full]
# gpg: aka "David Gibson (kernel.org) <dwg@kernel.org>" [unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E 87DC 6C38 CACA 20D9 B392
* remotes/dgibson/tags/ppc-for-5.2-20201028:
ppc/: fix some comment spelling errors
spapr: Improve spapr_reallocate_hpt() error reporting
target/ppc: Fix kvmppc_load_htab_chunk() error reporting
spapr: Use error_append_hint() in spapr_reallocate_hpt()
spapr: Simplify error handling in spapr_memory_plug()
spapr: Pass &error_abort when getting some PC DIMM properties
spapr: Use appropriate getter for PC_DIMM_SLOT_PROP
spapr: Use appropriate getter for PC_DIMM_ADDR_PROP
pc-dimm: Drop @errp argument of pc_dimm_plug()
spapr: Simplify spapr_cpu_core_realize() and spapr_cpu_core_unrealize()
spapr: Make spapr_cpu_core_unrealize() idempotent
spapr: Drop spapr_delete_vcpu() unused argument
spapr: Unrealize vCPUs with qdev_unrealize()
spapr: Fix leak of CPU machine specific data
spapr: Move spapr_create_nvdimm_dr_connectors() to core machine code
hw/net: move allocation to the heap due to very large stack frame
ppc/spapr: re-assert IRQs during event-scan if there are pending
spapr: Clarify why DR connectors aren't user creatable
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The S24/TCX datasheet is listed as "Unable to locate" on [1].
However, revision 1.32 of the NetBSD driver introduced 64-bit accesses
to the stippler and blitter [2]. It is safe to assume these memory
regions are 64-bit accessible.
The QEMU implementation is 32-bit, so fill the 'impl' fields.
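The change is essentially a MemoryRegionOps tweak along these lines; the
callback names are illustrative, the access sizes mirror the description
above:

    /* Sketch only: accept 64-bit guest accesses, implement them as 32-bit. */
    static const MemoryRegionOps tcx_stip_ops_sketch = {
        .read = tcx_stip_readl,
        .write = tcx_stip_writel,
        .endianness = DEVICE_NATIVE_ENDIAN,
        .valid = {
            .min_access_size = 4,
            .max_access_size = 8,   /* guests may issue 64-bit accesses */
        },
        .impl = {
            .min_access_size = 4,
            .max_access_size = 4,   /* QEMU splits them into 32-bit ops */
        },
    };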
Michael Lorenz (author of the NetBSD code [2]) provided us with more
information in [3]:
> IIRC the real hardware *requires* 64bit accesses for stipple and
> blitter operations to work. For stipples you write a 64bit word into
> STIP space, the address defines where in the framebuffer you want to
> draw, the data contain a 32bit bitmask, foreground colour and a ROP.
> BLIT space works similarly, the 64bit word contains an offset were to
> read pixels from, and how many you want to copy.
>
> One more thing since there seems to be some confusion - 64bit accesses
> on the framebuffer are fine as well. TCX/S24 is *not* an SBus device,
> even though its node says it is.
> S24 is a card that plugs into a special slot on the SS5 mainboard,
> which is shared with an SBus slot and looks a lot like a horizontal
> UPA slot. Both S24 and TCX are accessed through the Micro/TurboSPARC's
> AFX bus which is 64bit wide and intended for graphics.
> Early FFB docs even mentioned connecting to both AFX and UPA,
> no idea if that was ever realized in hardware though.
[1] http://web.archive.org/web/20111209011516/http://wikis.sun.com/display/FOSSdocs/Home
[2] http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/sbus/tcx.c.diff?r1=1.31&r2=1.32
[3] https://www.mail-archive.com/qemu-devel@nongnu.org/msg734928.html
Cc: qemu-stable@nongnu.org
Reported-by: Andreas Gustafsson <gson@gson.org>
Buglink: https://bugs.launchpad.net/bugs/1892540
Fixes: 55d7bfe229 ("tcx: Implement hardware acceleration")
Tested-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Andreas Gustafsson <gson@gson.org>
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20201024205100.3623006-1-f4bug@amsat.org>
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
The SRST protocol states that after diagnostics are complete and the
status is posted, we should clear the SRST bit if it should so happen to
be set.
The reset method itself should handle this, but just in case -- make our
intention explicit here.
Signed-off-by: John Snow <jsnow@redhat.com>
Tested-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-id: 20201020200242.1497705-4-jsnow@redhat.com
Signed-off-by: John Snow <jsnow@redhat.com>
spapr_reallocate_hpt() has three users: two of them pass &error_fatal,
while the third one, htab_load(), passes &local_err, uses it to detect
failures and simply propagates -EINVAL up to vmstate_load(), which will
cause QEMU to exit. It is thus confusing that spapr_reallocate_hpt()
doesn't return right away when an error is detected in some cases. Also,
the comment suggesting that the caller is welcome to try to carry on
seems like a remnant in this respect.
This can be improved:
- change spapr_reallocate_hpt() to always report a negative errno on
  failure, either as reported by KVM or -ENOSPC if the HPT is smaller
  than what was asked (see the sketch after this list),
- use that to detect failures in htab_load() which is preferred over
checking &local_err,
- propagate this negative errno to vmstate_load() because it is more
accurate than propagating -EINVAL for all possible errors.
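A sketch of the first point, assuming the existing kvmppc_reset_htab()
helper; the messages and structure are illustrative, not the patch itself:

    /* Sketch only: always return a negative errno on failure. */
    static int spapr_reallocate_hpt_sketch(int shift, Error **errp)
    {
        int rc = kvmppc_reset_htab(shift);

        if (rc < 0) {
            error_setg_errno(errp, -rc, "Failed to allocate KVM HPT");
            return rc;                /* negative errno reported by KVM */
        }
        if (rc > 0 && rc < shift) {
            error_setg(errp, "KVM only allocated a HPT of order %d", rc);
            return -ENOSPC;           /* HPT smaller than requested */
        }
        return 0;
    }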
[dwg: Fix compile error due to omitted prelim patch]
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160371605460.305923.5890143959901241157.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If kvmppc_load_htab_chunk() fails, its return value is propagated up
to vmstate_load(). It should thus be a negative errno, not -1 (which
maps to EPERM and would lure the user into thinking that the problem
is necessarily related to a lack of privilege).
Return the error reported by KVM or ENOSPC in case of short write.
While here, propagate the error message through an @errp argument
and have the caller print it with error_report_err() instead of
relying on fprintf().
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160371604713.305923.5264900354159029580.stgit@bahia.lan>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>