Migration Pull request (20230509 vintage) take 2
Hi
In this take 2:
- Change uint -> uint32_t to fix mingw32 compilation.
Please apply.
[take 1]
In this PULL request:
- 1st part of colo support for multifd (lukas)
- 1st part of disabling colo option (vladimir)
Please, apply.
# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEGJn/jt6/WMzuA0uC9IfvGFhy1yMFAmRb3dgACgkQ9IfvGFhy
# 1yNLBxAAwHiAOdSPS7TqJXH2/PkBKsd42XMtWzC9UowZ6SUdQi0Q2bQUBnygJ8BA
# 59yLOTPdwUhaPWk4KsyKM2znOCJ+f9MF5V4QXbyILf1WCAq6d+mtPwArnYF1TRwi
# XIewVDeRopdOO5lnWGcfAKZZ5WIDzA/bn6NiGLi+pQa5HGyk84Bk+tFa8kJI6xBL
# 5CWfhNTcxDNYRFg/z/9YVirkuxIXEEL6VEeRFV+pmFuj05q9bysWJkLFoEcFNawO
# gp1foHDkU7wHmHDJ3D4AVTm3TW641ft1wdlHIHZRoOiIIu3EUOoDEVVsaCfdxrY8
# pPJZ5m37wb52GIaCJmigG8rkHxIJ8xKLk4HKu4umDqFq5jZQ2krnnj7AkQhpp7p2
# aEIOXJQQq7XCsKpuvSUIexPv4gbN5SEYKi7XKoOPe3sZ03Rkn0I5xY3KSyMQMamP
# jtk8tNlRA+9Wug82eb/FtIKDj3//4SbuQOJEdRXjKJBldd3mtWTT/FRj/8oo96/p
# hmTu/cGDrP5qgtWpz0kKI/xaBf8at1nwpDgdEzOjRw4zf6xQHFjbXgJ7tQBH/JUI
# T3A9pdiXN6QdRupcWUSV0iJsfS/5i3mOUTA/C529qGXabSnZzfMK+unL/I8N02yt
# 83o7jSg22etMjaS1c+VuDmzKCAfuZloDZv2Bms/+yM/8k8Xe5S4=
# =vbqf
# -----END PGP SIGNATURE-----
# gpg: Signature made Wed 10 May 2023 07:09:28 PM BST
# gpg: using RSA key 1899FF8EDEBF58CCEE034B82F487EF185872D723
# gpg: Good signature from "Juan Quintela <quintela@redhat.com>" [undefined]
# gpg: aka "Juan Quintela <quintela@trasno.org>" [undefined]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 1899 FF8E DEBF 58CC EE03 4B82 F487 EF18 5872 D723
* tag 'migration-20230509-pull-request' of https://gitlab.com/juan.quintela/qemu:
migration: block incoming colo when capability is disabled
migration: disallow change capabilities in COLO state
migration: process_incoming_migration_co: simplify code flow around ret
migration: drop colo_incoming_thread from MigrationIncomingState
build: move COLO under CONFIG_REPLICATION
colo: make colo_checkpoint_notify static and provide simpler API
block/meson.build: prefer positive condition for replication
multifd: Add the ramblock to MultiFDRecvParams
ram: Let colo_flush_ram_cache take the bitmap_mutex
ram: Add public helper to set colo bitmap
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
COLO is not listed as running state in migrate_is_running(), so, it's
theoretically possible to disable colo capability in COLO state and the
unexpected error in migration_iteration_finish() is reachable.
Let's disallow that in qmp_migrate_set_capabilities. Than the error
becomes absolutely unreachable: we can get into COLO state only with
enabled capability and can't disable it while we are in COLO state. So
substitute the error by simple assertion.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <20230428194928.1426370-10-vsementsov@yandex-team.ru>
Signed-off-by: Juan Quintela <quintela@redhat.com>
We don't allow to use x-colo capability when replication is not
configured. So, no reason to build COLO when replication is disabled,
it's unusable in this case.
Note also that the check in migrate_caps_check() is not the only
restriction: some functions in migration/colo.c will just abort if
called with not defined CONFIG_REPLICATION, for example:
migration_iteration_finish()
case MIGRATION_STATUS_COLO:
migrate_start_colo_process()
colo_process_checkpoint()
abort()
It could probably make sense to have possibility to enable COLO without
REPLICATION, but this requires deeper audit of colo & replication code,
which may be done later if needed.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Acked-by: Dr. David Alan Gilbert <dave@treblig.org>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20230428194928.1426370-4-vsementsov@yandex-team.ru>
Signed-off-by: Juan Quintela <quintela@redhat.com>
colo_checkpoint_notify() is mostly used in colo.c. Outside we use it
once when x-checkpoint-delay migration parameter is set. So, let's
simplify the external API to only that function - notify COLO that
parameter was set. This make external API more robust and hides
implementation details from external callers. Also this helps us to
make COLO module optional in further patch (i.e. we are going to add
possibility not build the COLO module).
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Message-Id: <20230428194928.1426370-3-vsementsov@yandex-team.ru>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Testing updates:
- fix up xtensa docker container base to current Debian
- document breakpoint and watchpoint support
- clean up the ansible scripts for Ubuntu 22.04
- add a minimal device profile
- drop https on mipsdistros URL
- fix Kconfig bug for XLNX_VERSAL
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEEZoWumedRZ7yvyN81+9DbCVqeKkQFAmRbspsACgkQ+9DbCVqe
# KkSBowf+JjcVxZMb2kS8pV8WEdAq+fceBYI7mDBSEu0DFqZF+w0XSM+T+VZHyZ8+
# QmPeE+McKBUXvq/V4osPnDVVZfBKmwzFN548M6qIMLUbHjbDp94DtudNkAZ0ejhc
# +Ack73vzTiTWsGmBaqQxZlcYkZNZiZAhQsTF6cPwna74cDkcRghvd/Zxzy831rVB
# gVWhbEkk7SBQhJ+PqRIeso60DbWvCaVDMrkPc2WX8kup6QltbUpoayS/eNOtBkfA
# C557eOBxoM8s0cu33O780K5mCPCyk1IaIynvZtmkty0DXUSd5y9SNpsofhAY7BGy
# 4QdlolLygDgEC3s4bMULGy04nzaylw==
# =a+97
# -----END PGP SIGNATURE-----
# gpg: Signature made Wed 10 May 2023 04:04:59 PM BST
# gpg: using RSA key 6685AE99E75167BCAFC8DF35FBD0DB095A9E2A44
# gpg: Good signature from "Alex Bennée (Master Work Key) <alex.bennee@linaro.org>" [undefined]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 6685 AE99 E751 67BC AFC8 DF35 FBD0 DB09 5A9E 2A44
* tag 'pull-testing-updates-100523-1' of https://gitlab.com/stsquad/qemu:
hw/arm: Select XLNX_USB_SUBSYS for xlnx-zcu102 machine
tests/avocado: use http for mipsdistros.mips.com
gitlab: enable minimal device profile for aarch64 --disable-tcg
gitlab: add ubuntu-22.04-aarch64-without-defaults
scripts/ci: clean-up the 20.04/22.04 confusion in ansible
scripts/ci: add gitlab-runner to kvm group
docs: document breakpoint and watchpoint support
tests/docker: bump the xtensa base to debian:11-slim
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
This machine hardcodes initialization of the USB device, so select the
corresponding Kconfig. It is not enough to have it as "default y if
XLNX_VERSAL" at usb/Kconfig because building --without-default-devices
disables the default selection resulting in:
$ ./qemu-system-aarch64 -M xlnx-zcu102
qemu-system-aarch64: missing object type 'usb_dwc3'
Aborted (core dumped)
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20230208192654.8854-8-farosas@suse.de>
Message-Id: <20230503091244.1450613-8-alex.bennee@linaro.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
As the cached assets have fallen out of our cache new attempts to
fetch these binaries fail hard due to certificate expiry. It's hard
to find a contact email for the domain as the root page of mipsdistros
throws up some random XML. I suspect Amazon are merely the hosts.
The checksums should protect us from any man-in-the-middle type
attacks.
Message-Id: <20230503091244.1450613-22-alex.bennee@linaro.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
We have a bunch of references to 20.04 (which s390x is still on)
although we are basically building on 22.04 now. Clean up the textual
references and use lcitool to generate the full package list to be
consistent.
We can drop "Install packages to build QEMU on Ubuntu on non-s390x" as
when we upgrade the s390x builder to 22.04 it won't need this
workaround.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20230503091244.1450613-19-alex.bennee@linaro.org>
Block layer patches
- Graph locking, part 3 (more block drivers)
- Compile out assert_bdrv_graph_readable() by default
- Add configure options for vmdk, vhdx and vpc
- Fix use after free in blockdev_mark_auto_del()
- migration: Attempt disk reactivation in more failure scenarios
- Coroutine correctness fixes
# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmRbi6ERHGt3b2xmQHJl
# ZGhhdC5jb20ACgkQfwmycsiPL9Y66A//ZRk/0M6EZUJPAKG6m/XLTDNrOCNBZ1Tu
# kBGvxXsVQZMt4gGpBad4l2INN6IQKTIdIf+lK71EpxMPmFG6xK32btn38yywCAfQ
# lr1p5nR0Y/zSlT+XzP4yKy/CtQl6U0rkysmjCIk35bZc7uLy6eo4oFR4vmhRRt2M
# UGltB50/Nicx12YFufVjodbhv+apxTGwS2XHatmwqtjKeYReSz8mJHslEy6DvC8m
# ziNThD6YBy7hMktAhNaqUqtZD0OSWz66VMObco/4i2++sOAMZIspXQkjv3AjH74e
# lmgMhNc/xgJKPwFBPsj6F7dOKxwhdKD9jzZlx3yaBtAU18hpWX54QWuA3/CFlySc
# 5QbbqIstFTC8lqoRWThQrcHHRKbDBJCP4ImRXUIKhuPaxEzXA9zb3+f3QPTIjLSA
# KO7nxuSmO+tC7hQ1K9kAjRZHWlxxAk4clk+7UrK4UrWgGxfCUKgFg4Tyx7RrpwA6
# j4L5vwAY60LW74tikWe9xJx2QbdRoWBTTZhUyirbO7rLX1e8mS1nUWmtIsFSQxAq
# Z7nX7ygN0WEF+8qIsk3jTGaEeJoCM7+7B+X2RpSy0sftFjFYmybIiUgLMO7e+ozK
# rvUPnwlHAbGCVIJOKrUDj3cGt6k3/xnrTajUc7pCB3KKqG4pe+IlZuHyKIUMActb
# dBLaBnj0M2o=
# =hw9E
# -----END PGP SIGNATURE-----
# gpg: Signature made Wed 10 May 2023 01:18:41 PM BST
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
* tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (28 commits)
block: compile out assert_bdrv_graph_readable() by default
block: Mark bdrv_refresh_limits() and callers GRAPH_RDLOCK
block: Mark bdrv_recurse_can_replace() and callers GRAPH_RDLOCK
block: Mark bdrv_query_block_graph_info() and callers GRAPH_RDLOCK
block: Mark bdrv_query_bds_stats() and callers GRAPH_RDLOCK
block: Mark BlockDriver callbacks for amend job GRAPH_RDLOCK
block: Mark bdrv_co_debug_event() GRAPH_RDLOCK
block: Mark bdrv_co_get_info() and callers GRAPH_RDLOCK
block: Mark bdrv_co_get_allocated_file_size() and callers GRAPH_RDLOCK
mirror: Require GRAPH_RDLOCK for accessing a node's parent list
vhdx: Require GRAPH_RDLOCK for accessing a node's parent list
nbd: Mark nbd_co_do_establish_connection() and callers GRAPH_RDLOCK
nbd: Remove nbd_co_flush() wrapper function
block: .bdrv_open is non-coroutine and unlocked
graph-lock: Fix GRAPH_RDLOCK_GUARD*() to be reader lock
graph-lock: Add GRAPH_UNLOCKED(_PTR)
test-bdrv-drain: Don't modify the graph in coroutines
iotests: Test resizing image attached to an iothread
block: Don't call no_coroutine_fns in qmp_block_resize()
block: bdrv/blk_co_unref() for calls in coroutine context
...
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
reader_count() is a performance bottleneck because the global
aio_context_list_lock mutex causes thread contention. Put this debugging
assertion behind a new ./configure --enable-debug-graph-lock option and
disable it by default.
The --enable-debug-graph-lock option is also enabled by the more general
--enable-debug option.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230501173443.153062-1-stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This adds GRAPH_RDLOCK annotations to declare that callers of
bdrv_co_debug_event() need to hold a reader lock for the graph.
Unfortunately we cannot use a co_wrapper_bdrv_rdlock (i.e. make the
coroutine wrapper a no_coroutine_fn), because the function is called
(using the BLKDBG_EVENT macro) by mixed functions that run both in
coroutine and non-coroutine context (for example many of the functions
in qcow2-cluster.c and qcow2-refcount.c).
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230504115750.54437-16-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Drivers were a bit confused about whether .bdrv_open can run in a
coroutine and whether or not it holds a graph lock.
It cannot keep a graph lock from the caller across the whole function
because it both changes the graph (requires a writer lock) and does I/O
(requires a reader lock). Therefore, it should take these locks
internally as needed.
The functions used to be called in coroutine context during image
creation. This was buggy for other reasons, and as of commit 32192301,
all block drivers go through no_co_wrappers. So it is not called in
coroutine context any more.
Fix qcow2 and qed to work with the correct assumptions: The graph lock
needs to be taken internally instead of just assuming it's already
there, and the coroutine path is dead code that can be removed.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230504115750.54437-9-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For some functions, it is part of their interface to be called without
holding the graph lock. Add a new macro to document this.
The macro expands to TSA_EXCLUDES(), which is a relatively weak check
because it passes in cases where the compiler just doesn't know if the
lock is held. Function pointers can't be checked at all. Therefore, its
primary purpose is documentation.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230504115750.54437-7-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
test-bdrv-drain contains a few test cases that are run both in coroutine
and non-coroutine context. Running the entire code including the setup
and shutdown in coroutines is incorrect because graph modifications can
generally not happen in coroutines.
Change the test so that creating and destroying the test nodes and
BlockBackends always happens outside of coroutine context.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20230504115750.54437-6-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This tests that trying to resize an image with QMP block_resize doesn't
hang or otherwise fail when the image is attached to a device running in
an iothread.
This is a regression test for the recent fix that changed
qmp_block_resize, which is a coroutine based QMP handler, to avoid
calling no_coroutine_fns directly.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230509134133.373408-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Migration code can call bdrv_activate() in coroutine context, whereas
other callers call it outside of coroutines. As it calls other code that
is not supposed to run in coroutines, standardise on running outside of
coroutines.
This adds a no_co_wrapper to switch to the main loop before calling
bdrv_activate().
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230504115750.54437-3-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is a bdrv_co_getlength() now, which should be used in coroutine
context.
This requires adding GRAPH_RDLOCK to some functions so that this still
compiles with TSA because bdrv_co_getlength() is GRAPH_RDLOCK.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230504115750.54437-2-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit fe904ea824 added a fail_inactivate label, which tries to
reactivate disks on the source after a failure while s->state ==
MIGRATION_STATUS_ACTIVE, but didn't actually use the label if
qemu_savevm_state_complete_precopy() failed. This failure to
reactivate is also present in commit 6039dd5b1c (also covering the new
s->state == MIGRATION_STATUS_DEVICE state) and 403d18ae (ensuring
s->block_inactive is set more reliably).
Consolidate the two labels back into one - no matter HOW migration is
failed, if there is any chance we can reach vm_start() after having
attempted inactivation, it is essential that we have tried to restart
disks before then. This also makes the cleanup more like
migrate_fd_cancel().
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20230502205212.134680-1-eblake@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Socket paths need to be short to avoid failures. This is why there is a
iotests.sock_dir (defaulting to /tmp) separate from the disk image base
directory.
Make use of it to fix failures in too deeply nested test directories.
Fixes: ab7f7e67a7
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230503165019.8867-1-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
job_cancel_locked() drops the job list lock temporarily and it may call
aio_poll(). We must assume that the list has changed after this call.
Also, with unlucky timing, it can end up freeing the job during
job_completed_txn_abort_locked(), making the job pointer invalid, too.
For both reasons, we can't just continue at block_job_next_locked(job).
Instead, start at the head of the list again after job_cancel_locked()
and skip those jobs that we already cancelled (or that are completing
anyway).
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20230503140142.474404-1-kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need for the AioContext lock in aio_wait_bh_oneshot().
It's easy to remove the lock from existing callers and then switch from
AIO_WAIT_WHILE() to AIO_WAIT_WHILE_UNLOCKED() in aio_wait_bh_oneshot().
Document that the AioContext lock should not be held across
aio_wait_bh_oneshot(). Holding a lock across aio_poll() can cause
deadlock so we don't want callers to do that.
This is a step towards getting rid of the AioContext lock.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20230404153307.458883-1-stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>