Commit Graph

756 Commits

Author SHA1 Message Date
Peter Xu
f444eeda71 migration: Move migrate_allow_multifd and helpers into migration.c
This variable, along with its helpers, is used to detect whether multiple
channel will be supported for migration.  In follow up patches, there'll be
other capability that requires multi-channels.  Hence move it outside multifd
specific code and make it public.  Meanwhile rename it from "multifd" to
"multi_channels" to show its real meaning.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220331150857.74406-5-peterx@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2022-04-21 19:36:46 +01:00
Peter Xu
83174765da migration: Postpone releasing MigrationState.hostname
We used to release it right after migrate_fd_connect().  That's not good
enough when there're more than one socket pair required, because it'll be
needed to establish TLS connection for the rest channels.

One example is multifd, where we copied over the hostname for each channel
but that's actually not needed.

Keeping the hostname until the cleanup phase of migration.

Cc: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220331150857.74406-2-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  dgilbert: Fixup checkpatch error; don't need to check for NULL
  around g_free
2022-04-21 19:36:46 +01:00
Marc-André Lureau
8e3b0cbb72 Replace qemu_real_host_page variables with inlined functions
Replace the global variables with inlined helper functions. getpagesize() is very
likely annotated with a "const" function attribute (at least with glibc), and thus
optimization should apply even better.

This avoids the need for a constructor initialization too.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20220323155743.1585078-12-marcandre.lureau@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-06 10:50:38 +02:00
Emanuele Giuseppe Esposito
3b71719462 block: rename bdrv_invalidate_cache_all, blk_invalidate_cache and test_sync_op_invalidate_cache
Following the bdrv_activate renaming, change also the name
of the respective callers.

bdrv_invalidate_cache_all -> bdrv_activate_all
blk_invalidate_cache -> blk_activate
test_sync_op_invalidate_cache -> test_sync_op_activate

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:40 +01:00
Peter Xu
e031149c78 migration: Add migration_incoming_transport_cleanup()
Add a helper to cleanup the transport listener.

When do it, we should also null-ify the cleanup hook and the data, then it's
even safe to call it multiple times.

Move the socket_address_list cleanup altogether, because that's a mirror of the
listener channels and only for the purpose of query-migrate.  Hence when
someone wants to cleanup the listener transport, it should also want to cleanup
the socket list too, always.

No functional change intended.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220301083925.33483-15-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2022-03-02 18:20:45 +00:00
Peter Xu
d5c8f2afe8 migration: Enlarge postcopy recovery to capture !-EIO too
We used to have quite a few places making sure -EIO happened and that's the
only way to trigger postcopy recovery.  That's based on the assumption that
we'll only return -EIO for channel issues.

It'll work in 99.99% cases but logically that won't cover some corner cases.
One example is e.g. ram_block_from_stream() could fail with an interrupted
network, then -EINVAL will be returned instead of -EIO.

I remembered Dave Gilbert pointed that out before, but somehow this is
overlooked.  Neither did I encounter anything outside the -EIO error.

However we'd better touch that up before it triggers a rare VM data loss during
live migrating.

To cover as much those cases as possible, remove the -EIO restriction on
triggering the postcopy recovery, because even if it's not a channel failure,
we can't do anything better than halting QEMU anyway - the corpse of the
process may even be used by a good hand to dig out useful memory regions, or
the admin could simply kill the process later on.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20220301083925.33483-11-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2022-03-02 18:20:45 +00:00
David Edmondson
ae68066880 migration: Tally pre-copy, downtime and post-copy bytes independently
Provide information on the number of bytes copied in the pre-copy,
downtime and post-copy phases of migration.

Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2022-01-28 15:38:23 +01:00
Peter Xu
739fcc1b0e migration: Drop return code for disgard ram process
It will just never fail.  Drop those return values where they're constantly
zeros.

A tiny touch-up on the tracepoint so trace_ram_postcopy_send_discard_bitmap()
is called after the logic itself (which sounds more reasonable).

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2022-01-28 15:38:23 +01:00
Zhang Chen
01ee5e3556 migration/migration.c: Remove the MIGRATION_STATUS_ACTIVE when migration finished
The MIGRATION_STATUS_ACTIVE indicates that migration is running.
Remove it to be handled by the default operation,
It should be part of the unknown ending states.

Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2022-01-28 15:38:23 +01:00
Zhang Chen
eeeb48ee33 migration/migration.c: Avoid COLO boot in postcopy migration
COLO dose not support postcopy migration and remove the Fixme.

Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2022-01-28 15:38:23 +01:00
Zhang Chen
444252b96a migration/migration.c: Add missed default error handler for migration state
In the migration_completion() no other status is expected, for
example MIGRATION_STATUS_CANCELLING, MIGRATION_STATUS_CANCELLED, etc.

Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2022-01-28 15:38:23 +01:00
Laurent Vivier
1b529d908d failover: Silence warning messages during qtest
virtio-net-failover test tries several device combinations that produces
some expected warnings.
These warning can be confusing, so we disable them during the qtest
sequence.

Reported-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20211220145314.390697-1-lvivier@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
[thuth: Fix memory leak by using error_free()]
Signed-off-by: Thomas Huth <thuth@redhat.com>
2021-12-22 08:12:45 +01:00
Juan Quintela
144fa06b34 migration: Never call twice qemu_target_page_size()
Signed-off-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-12-15 10:31:42 +01:00
Zhang Chen
751fe4c608 migration/colo: Optimize COLO primary node start code path
Optimize COLO primary start path from:
MIGRATION_STATUS_XXX --> MIGRATION_STATUS_ACTIVE --> MIGRATION_STATUS_COLO --> MIGRATION_STATUS_COMPLETED
To:
MIGRATION_STATUS_XXX --> MIGRATION_STATUS_COLO --> MIGRATION_STATUS_COMPLETED
No need to start primary COLO through "MIGRATION_STATUS_ACTIVE".

Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-12-15 10:31:42 +01:00
Rao, Lei
795969ab1f Fixed a QEMU hang when guest poweroff in COLO mode
When the PVM guest poweroff, the COLO thread may wait a semaphore
in colo_process_checkpoint().So, we should wake up the COLO thread
before migration shutdown.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-12-15 10:31:42 +01:00
Rao, Lei
684bfd1820 Fixed SVM hang when do failover before PVM crash
This patch fixed as follows:
    Thread 1 (Thread 0x7f34ee738d80 (LWP 11212)):
    #0 __pthread_clockjoin_ex (threadid=139847152957184, thread_return=0x7f30b1febf30, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
    #1 0x0000563401998e36 in qemu_thread_join (thread=0x563402d66610) at util/qemu-thread-posix.c:587
    #2 0x00005634017a79fa in process_incoming_migration_co (opaque=0x0) at migration/migration.c:502
    #3 0x00005634019b59c9 in coroutine_trampoline (i0=63395504, i1=22068) at util/coroutine-ucontext.c:115
    #4 0x00007f34ef860660 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from /lib/x86_64-linux-gnu/libc.so.6
    #5 0x00007f30b21ee730 in ?? ()
    #6 0x0000000000000000 in ?? ()

    Thread 13 (Thread 0x7f30b3dff700 (LWP 11747)):
    #0  __lll_lock_wait (futex=futex@entry=0x56340218ffa0 <qemu_global_mutex>, private=0) at lowlevellock.c:52
    #1  0x00007f34efa000a3 in _GI__pthread_mutex_lock (mutex=0x56340218ffa0 <qemu_global_mutex>) at ../nptl/pthread_mutex_lock.c:80
    #2  0x0000563401997f99 in qemu_mutex_lock_impl (mutex=0x56340218ffa0 <qemu_global_mutex>, file=0x563401b7a80e "migration/colo.c", line=806) at util/qemu-thread-posix.c:78
    #3  0x0000563401407144 in qemu_mutex_lock_iothread_impl (file=0x563401b7a80e "migration/colo.c", line=806) at /home/workspace/colo-qemu/cpus.c:1899
    #4  0x00005634017ba8e8 in colo_process_incoming_thread (opaque=0x563402d664c0) at migration/colo.c:806
    #5  0x0000563401998b72 in qemu_thread_start (args=0x5634039f8370) at util/qemu-thread-posix.c:519
    #6  0x00007f34ef9fd609 in start_thread (arg=<optimized out>) at pthread_create.c:477
    #7  0x00007f34ef924293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

    The QEMU main thread is holding the lock:
    (gdb) p qemu_global_mutex
    $1 = {lock = {_data = {lock = 2, __count = 0, __owner = 11212, __nusers = 9, __kind = 0, __spins = 0, __elision = 0, __list = {_prev = 0x0, __next = 0x0}},
     __size = "\002\000\000\000\000\000\000\000\314+\000\000\t", '\000' <repeats 26 times>, __align = 2}, file = 0x563401c07e4b "util/main-loop.c", line = 240,
    initialized = true}

>From the call trace, we can see it is a deadlock bug. and the QEMU main thread holds the global mutex to wait until the COLO thread ends. and the colo thread
wants to acquire the global mutex, which will cause a deadlock. So, we should release the qemu_global_mutex before waiting colo thread ends.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-03 09:38:53 +01:00
Rao, Lei
aa505f8e0e Fixed qemu crash when guest power off in COLO mode
This patch fixes the following:
qemu-system-x86_64: invalid runstate transition: 'shutdown' -> 'running'
Aborted (core dumped)
The gdb bt as following:
0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
1  0x00007faa3d613859 in __GI_abort () at abort.c:79
2  0x000055c5a21268fd in runstate_set (new_state=RUN_STATE_RUNNING) at vl.c:723
3  0x000055c5a1f8cae4 in vm_prepare_start () at /home/workspace/colo-qemu/cpus.c:2206
4  0x000055c5a1f8cb1b in vm_start () at /home/workspace/colo-qemu/cpus.c:2213
5  0x000055c5a2332bba in migration_iteration_finish (s=0x55c5a4658810) at migration/migration.c:3376
6  0x000055c5a2332f3b in migration_thread (opaque=0x55c5a4658810) at migration/migration.c:3527
7  0x000055c5a251d68a in qemu_thread_start (args=0x55c5a5491a70) at util/qemu-thread-posix.c:519
8  0x00007faa3d7e9609 in start_thread (arg=<optimized out>) at pthread_create.c:477
9  0x00007faa3d710293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Signed-off-by: Lei Rao <lei.rao@intel.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-03 09:38:53 +01:00
yuxiating
fa0b31d585 migration: initialise compression_counters for a new migration
If the compression migration fails or is canceled, the query for the value of
compression_counters during the next compression migration is wrong.

Signed-off-by: yuxiating <yuxiating@huawei.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-03 09:38:53 +01:00
Laurent Vivier
458fecca80 migration: provide an error message to migration_cancel()
This avoids to call migrate_get_current() in the caller function
whereas migration_cancel() already needs the pointer to the current
migration state.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-03 09:38:53 +01:00
David Hildenbrand
7648297d40 migration: Simplify alignment and alignment checks
Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to
read.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-01 22:56:44 +01:00
Peter Xu
60fd680193 migration: Add migrate_add_blocker_internal()
An internal version that removes -only-migratable implications.  It can be used
for temporary migration blockers like dump-guest-memory.

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-01 22:56:44 +01:00
Peter Xu
4c170330aa migration: Make migration blocker work for snapshots too
save_snapshot() checks migration blocker, which looks sane.  At the meantime we
should also teach the blocker add helper to fail if during a snapshot, just
like for migrations.

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-11-01 22:56:43 +01:00
Li Zhijian
5ad15e8614 migration: allow enabling mutilfd for specific protocol only
To: <quintela@redhat.com>, <dgilbert@redhat.com>, <qemu-devel@nongnu.org>
CC: Li Zhijian <lizhijian@cn.fujitsu.com>
Date: Sat, 31 Jul 2021 22:05:52 +0800 (5 weeks, 4 days, 17 hours ago)

And change the default to true so that in '-incoming defer' case, user is able
to change multifd capability.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-10-19 08:39:04 +02:00
Li Zhijian
b7acd65707 migration: allow multifd for socket protocol only
To: <quintela@redhat.com>, <dgilbert@redhat.com>, <qemu-devel@nongnu.org>
CC: Li Zhijian <lizhijian@cn.fujitsu.com>
Date: Sat, 31 Jul 2021 22:05:51 +0800 (5 weeks, 4 days, 17 hours ago)

multifd with unsupported protocol will cause a segment fault.
(gdb) bt
 #0  0x0000563b4a93faf8 in socket_connect (addr=0x0, errp=0x7f7f02675410) at ../util/qemu-sockets.c:1190
 #1 0x0000563b4a797a03 in qio_channel_socket_connect_sync
(ioc=0x563b4d16e8c0, addr=0x0, errp=0x7f7f02675410) at
../io/channel-socket.c:145
 #2  0x0000563b4a797abf in qio_channel_socket_connect_worker (task=0x563b4cd86c30, opaque=0x0) at ../io/channel-socket.c:168
 #3  0x0000563b4a792631 in qio_task_thread_worker (opaque=0x563b4cd86c30) at ../io/task.c:124
 #4  0x0000563b4a91da69 in qemu_thread_start (args=0x563b4c44bb80) at ../util/qemu-thread-posix.c:541
 #5  0x00007f7fe9b5b3f9 in ?? ()
 #6  0x0000000000000000 in ?? ()

It's enough to check migrate_multifd_is_allowed() in multifd cleanup() and
multifd setup() though there are so many other places using migrate_use_multifd().

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
2021-10-19 08:39:04 +02:00
Emanuele Giuseppe Esposito
68b88468f6 migration: add missing qemu_mutex_lock_iothread in migration_completion
qemu_savevm_state_complete_postcopy assumes the iothread lock (BQL)
to be held, but instead it isn't.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20211005080751.3797161-3-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-05 13:10:29 +02:00
Markus Armbruster
7d6f6933aa migration: Handle migration_incoming_setup() errors consistently
Commit b673eab4e2 "multifd: Make multifd_load_setup() get an Error
parameter" changed migration_incoming_setup() to take an Error **
argument, and adjusted the callers accordingly.  It neglected to
change adjust multifd_load_setup(): it still exit()s on error.  Clean
that up.

The error now gets propagated up two call chains: via
migration_fd_process_incoming() to rdma_accept_incoming_migration(),
and via migration_ioc_process_incoming() to
migration_channel_process_incoming().  Both chain ends report the
error with error_report_err(), but otherwise ignore it.  Behavioral
change: we no longer exit() on this error.

This is consistent with how we handle other errors here, e.g. from
multifd_recv_new_channel() via migration_ioc_process_incoming() to
migration_channel_process_incoming().  Whether it's consistently right
or consistently wrong I can't tell.

Also clean up the return value from the unusual 0 on success, 1 on
error to the more common true on success, false on error.

Cc: Juan Quintela <quintela@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20210720125408.387910-11-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2021-08-26 17:15:28 +02:00
Markus Armbruster
f9734d5d40 error: Use error_fatal to simplify obvious fatal errors (again)
We did this with scripts/coccinelle/use-error_fatal.cocci before, in
commit 50beeb6809 and 007b06578a.  This commit cleans up rarer
variations that don't seem worth matching with Coccinelle.

Cc: Thomas Huth <thuth@redhat.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Marc-André Lureau <marcandre.lureau@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20210720125408.387910-2-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2021-08-26 17:15:28 +02:00
Peter Xu
39675ffffb migration: Move the yank unregister of channel_close out
It's efficient, but hackish to call yank unregister calls in channel_close(),
especially it'll be hard to debug when qemu crashed with some yank function
leaked.

Remove that hack, but instead explicitly unregister yank functions at the
places where needed, they are:

  (on src)
  - migrate_fd_cleanup
  - postcopy_pause

  (on dst)
  - migration_incoming_state_destroy
  - postcopy_pause_incoming

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210722175841.938739-6-peterx@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-26 12:45:03 +01:00
Peter Xu
43044ac0ee migration: Make from_dst_file accesses thread-safe
Accessing from_dst_file is potentially racy in current code base like below:

  if (s->from_dst_file)
    do_something(s->from_dst_file);

Because from_dst_file can be reset right after the check in another
thread (rp_thread).  One example is migrate_fd_cancel().

Use the same qemu_file_lock to protect it too, just like to_dst_file.

When it's safe to access without lock, comment it.

There's one special reference in migration_thread() that can be replaced by
the newly introduced rp_thread_created flag.

Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Message-Id: <20210722175841.938739-3-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  with Peter's fixup
2021-07-26 12:44:46 +01:00
Peter Xu
53021ea165 migration: Fix missing join() of rp_thread
It's possible that the migration thread skip the join() of the rp_thread in
below race and crash on src right at finishing migration:

       migration_thread                     rp_thread
       ----------------                     ---------
    migration_completion()
                                        (before rp_thread quits)
                                        from_dst_file=NULL
                                        [thread got scheduled out]
      s->rp_state.from_dst_file==NULL
        (skip join() of rp_thread)
    migrate_fd_cleanup()
      qemu_fclose(s->to_dst_file)
      yank_unregister_instance()
        assert(yank_find_entry())  <------- crash

It could mostly happen with postcopy, but that shouldn't be required, e.g., I
think it could also trigger with MIGRATION_CAPABILITY_RETURN_PATH set.

It's suspected that above race could be the root cause of a recent (but rare)
migration-test break reported by either Dave or PMM:

https://lore.kernel.org/qemu-devel/YPamXAHwan%2FPPXLf@work-vm/

The issue is: from_dst_file is reset in the rp_thread, so if the thread reset
it to NULL fast enough then the migration thread will assume there's no
rp_thread at all.

This could potentially cause more severe issue (e.g. crash) after the yank code.

Fix it by using a boolean to keep "whether we've created rp_thread".

Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210722175841.938739-2-peterx@redhat.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-26 12:44:26 +01:00
Peter Xu
ca7bd0821b migration: Clear error at entry of migrate_fd_connect()
For each "migrate" command, remember to clear the s->error before going on.
For one reason, when there's a new error it'll be still remembered; see
migrate_set_error() who only sets the error if error==NULL.  Meanwhile if a
failed migration completes (e.g., postcopy recovered and finished), we
shouldn't dump an error when calling migrate_fd_cleanup() at last.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-4-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-13 16:21:57 +01:00
Peter Xu
ca30f24d12 migration: Don't do migrate cleanup if during postcopy resume
Below process could crash qemu with postcopy recovery:

  1. (hmp) migrate -d ..
  2. (hmp) migrate_start_postcopy
  3. [network down, postcopy paused]
  4. (hmp) migrate -r $WRONG_PORT
     when try the recover on an invalid $WRONG_PORT, cleanup_bh will be cleared
  5. (hmp) migrate -r $RIGHT_PORT
     [qemu crash on assert(cleanup_bh)]

The thing is we shouldn't cleanup if it's postcopy resume; the error is set
mostly because the channel is wrong, so we return directly waiting for the user
to retry.

migrate_fd_cleanup() should only be called when migration is cancelled or
completed.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-3-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-13 16:21:57 +01:00
Peter Xu
2e3e3da3c2 migration: Release return path early for paused postcopy
When postcopy pause triggered, we rely on the migration thread to cleanup the
to_dst_file handle, and the return path thread to cleanup the from_dst_file
handle (which is stored in the local variable "rp").

Within the process, from_dst_file cleanup (qemu_fclose) is postponed until it's
setup again due to a postcopy recovery.

It used to work before yank was born; after yank is introduced we rely on the
refcount of IOC to correctly unregister yank function in channel_close().  If
without the early and on-time release of from_dst_file handle the yank function
will be leftover during paused postcopy.

Without this patch, below steps (quoted from Xiaohui) could trigger qemu src
crash:

  1.Boot vm on src host
  2.Boot vm on dst host
  3.Enable postcopy on src&dst host
  4.Load stressapptest in vm and set postcopy speed to 50M
  5.Start migration from src to dst host, change into postcopy mode when migration is active.
  6.When postcopy is active, down the network card(do migration via this network) on dst host.
  7.Wait untill postcopy is paused on src&dst host.
  8.Before up network card, recover migration on dst host, will get error like following.
  9.Ignore the error of step 8, go on recovering migration on src host:

  After step 9, qemu on src host will core dump after some seconds:
  qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
  1.sh: line 38: 44662 Aborted                 (core dumped)

Reported-by: Li Xiaohui <xiaohli@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210708190653.252961-2-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-13 16:21:57 +01:00
Laurent Vivier
a51dcef08b migration: failover: emit a warning when the card is not fully unplugged
When the migration fails or is canceled we wait the end of the unplug
operation to be able to plug it back. But if the unplug operation
is never finished we stop to wait and QEMU emits a warning to inform
the user.

Based-on: 20210629155007.629086-1-lvivier@redhat.com
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20210701131458.112036-1-lvivier@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-13 16:21:57 +01:00
Laurent Vivier
944bc52842 migration: failover: continue to wait card unplug on error
If the user cancels the migration in the unplug-wait state,
QEMU will try to plug back the card and this fails because the card
is partially unplugged.
To avoid the problem, continue to wait the card unplug, but to
allow the migration to be canceled if the card never finishes to unplug
use a timeout.

Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1976852
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20210629155007.629086-3-lvivier@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-05 10:51:26 +01:00
Laurent Vivier
fde93d99d9 migration: move wait-unplug loop to its own function
The loop is used in migration_thread() and bg_migration_thread(),
so we can move it to its own function and call it from these both places.

Moreover, in migration_thread() we have a wrong state transition from
SETUP to ACTIVE while state could be WAIT_UNPLUG. This is correctly
managed in bg_migration_thread() so use this code instead.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-Id: <20210629155007.629086-2-lvivier@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-05 10:51:26 +01:00
Peter Xu
b7f9afd48e migration: Allow reset of postcopy_recover_triggered when failed
It's possible qemu_start_incoming_migration() failed at any point, when it
happens we should reset postcopy_recover_triggered to false so that the user
can still retry with a saner incoming port.

Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20210629181356.217312-3-peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-05 10:51:26 +01:00
Peter Xu
cc48c587d2 migration: Move yank outside qemu_start_incoming_migration()
Starting from commit b5eea99ec2, qmp_migrate_recover() calls unregister
before calling qemu_start_incoming_migration(). I believe it wanted to mitigate
the next call to yank_register_instance(), but I think that's wrong.

Firstly, if during recover, we should keep the yank instance there, not
"quickly removing and adding it back".

Meanwhile, calling qmp_migrate_recover() twice with b5eea99ec2 will directly
crash the dest qemu (right now it can't; but it'll start to work right after
the next patch) because the 1st call of qmp_migrate_recover() will unregister
permanently when the channel failed to establish, then the 2nd call of
qmp_migrate_recover() crashes at yank_unregister_instance().

This patch fixes it by moving yank ops out of qemu_start_incoming_migration()
into qmp_migrate_incoming.  For qmp_migrate_recover(), drop the unregister of
yank instance too since we keep it there during the recovery phase.

Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <20210629181356.217312-2-peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-07-05 10:51:26 +01:00
Daniel P. Berrangé
626ff6515d migration: add trace point when vm_stop_force_state fails
This is a critical failure scenario for migration that is hard to
diagnose from existing probes. Most likely it is caused by an error
from bdrv_flush(), but we're not logging the errno anywhere, hence
this new probe.

Reviewed-by: Connor Kuehl <ckuehl@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-06-14 13:28:50 +01:00
Rao, Lei
3ba024457f Remove migrate_set_block_enabled in checkpoint
We can detect disk migration in migrate_prepare, if disk migration
is enabled in COLO mode, we can directly report an error.and there
is no need to disable block migration at every checkpoint.

Signed-off-by: Lei Rao <lei.rao@intel.com>
Signed-off-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Li Zhijian <lizhijian@fujitsu.com>
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
Reviewed-by: Lukas Straub <lukasstraub2@web.de>
Tested-by: Lukas Straub <lukasstraub2@web.de>
Signed-off-by: Jason Wang <jasowang@redhat.com>
2021-06-11 10:30:13 +08:00
Dr. David Alan Gilbert
1df6ddb43b migration: Add cleanup hook for inwards migration
Add a cleanup hook for incoming migration that gets called
at the end as a way for a transport to allow cleanup.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210421112834.107651-4-dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-06-08 19:36:18 +01:00
Peter Maydell
9b1e81d1c2 Merge remote-tracking branch 'remotes/thuth-gitlab/tags/pull-request-2021-05-14' into staging
* Replace YAML anchors by extends in the gitlab-CI yaml files
* Many small qtest fixes (e.g. to fix issues discovered by Coverity)
* Poison more config switches in common code
* Fix the failing Travis-CI and Cirrus-CI tasks

# gpg: Signature made Fri 14 May 2021 12:17:39 BST
# gpg:                using RSA key 27B88847EEE0250118F3EAB92ED9D774FE702DB5
# gpg:                issuer "thuth@redhat.com"
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [full]
# gpg:                 aka "Thomas Huth <thuth@redhat.com>" [full]
# gpg:                 aka "Thomas Huth <huth@tuxfamily.org>" [full]
# gpg:                 aka "Thomas Huth <th.huth@posteo.de>" [unknown]
# Primary key fingerprint: 27B8 8847 EEE0 2501 18F3  EAB9 2ED9 D774 FE70 2DB5

* remotes/thuth-gitlab/tags/pull-request-2021-05-14:
  cirrus.yml: Fix the MSYS2 task
  pc-bios/s390-ccw: Fix inline assembly for older versions of Clang
  tests/qtest/migration-test: Use g_autofree to avoid leaks on error paths
  configure: Poison all current target-specific #defines
  migration: Move populate_vfio_info() into a separate file
  include/sysemu: Poison all accelerator CONFIG switches in common code
  tests: Avoid side effects inside g_assert() arguments
  tests/qtest/rtc-test: Remove pointless NULL check
  tests/qtest/tpm-util.c: Free memory with correct free function
  tests/migration-test: Fix "true" vs true
  tests/qtest/npcm7xx_pwm-test.c: Avoid g_assert_true() for non-test assertions
  tests/qtest/ahci-test.c: Calculate iso_size with 64-bit arithmetic
  util/compatfd.c: Replaced a malloc call with g_malloc.
  libqtest: refuse QTEST_QEMU_BINARY=qemu-kvm
  docs/devel/qgraph: add troubleshooting information
  libqos/qgraph: fix "UNAVAILBLE" typo
  gitlab-ci: Replace YAML anchors by extends (native_test_job)
  gitlab-ci: Replace YAML anchors by extends (native_build_job)
  gitlab-ci: Replace YAML anchors by extends (container_job)
  tests/docker/dockerfiles: Add ccache to containers where it was missing

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-05-14 19:33:23 +01:00
Thomas Huth
43bd0bf30f migration: Move populate_vfio_info() into a separate file
The CONFIG_VFIO switch only works in target specific code. Since
migration/migration.c is common code, the #ifdef does not have
the intended behavior here. Move the related code to a separate
file now which gets compiled via specific_ss instead.

Fixes: 3710586caa ("qapi: Add VFIO devices migration stats in Migration stats")
Message-Id: <20210414112004.943383-3-thuth@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
2021-05-14 12:31:51 +02:00
David Hildenbrand
c7c0e72408 migration/ram: Handle RAM block resizes during precopy
Resizing while migrating is dangerous and does not work as expected.
The whole migration code works on the usable_length of ram blocks and does
not expect this to change at random points in time.

In the case of precopy, the ram block size must not change on the source,
after syncing the RAM block list in ram_save_setup(), so as long as the
guest is still running on the source.

Resizing can be trigger *after* (but not during) a reset in
ACPI code by the guest
- hw/arm/virt-acpi-build.c:acpi_ram_update()
- hw/i386/acpi-build.c:acpi_ram_update()

Use the ram block notifier to get notified about resizes. Let's simply
cancel migration and indicate the reason. We'll continue running on the
source. No harm done.

Update the documentation. Postcopy will be handled separately.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210429112708.12291-5-david@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  Manual merge
2021-05-13 18:21:13 +01:00
Markus Armbruster
372043f389 migration: Drop redundant query-migrate result @blocked
Result @blocked is redundant.  Unfortunately, we realized this too
close to the release to risk dropping it, so we deprecated it
instead, in commit e11ce6c06.

Since it was deprecated from the start, we can delete it without
the customary grace period.  Do so.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210429140424.2802929-1-armbru@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-05-13 18:21:13 +01:00
Andrey Gruzdev
eeccb99c9d migration: Pre-fault memory before starting background snasphot
This commit solves the issue with userfault_fd WP feature that
background snapshot is based on. For any never poluated or discarded
memory page, the UFFDIO_WRITEPROTECT ioctl() would skip updating
PTE for that page, thereby loosing WP setting for it.

So we need to pre-fault pages for each RAM block to be protected
before making a userfault_fd wr-protect ioctl().

Fixes: 278e2f551a (migration: support
  UFFD write fault processing in ram_save_iterate())
Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210401092226.102804-4-andrey.gruzdev@virtuozzo.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
dgilbert:
  Bodged ifdef __linux__ on ram_write_tracking_prepare, should really
      go in a stub
2021-04-07 18:37:28 +01:00
Andrey Gruzdev
1a8e44a89f migration: Inhibit virtio-balloon for the duration of background snapshot
The same thing as for incoming postcopy - we cannot deal with concurrent
RAM discards when using background snapshot feature in outgoing migration.

Fixes: 8518278a6a (migration: implementation
  of background snapshot thread)
Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Reported-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-Id: <20210401092226.102804-3-andrey.gruzdev@virtuozzo.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-04-06 18:56:01 +01:00
Andrey Gruzdev
ecb23efea0 migration: Fix missing qemu_fflush() on buffer file in bg_migration_thread
Added missing qemu_fflush() on buffer file holding precopy device state.
Increased initial QIOChannelBuffer allocation to 512KB to avoid reallocs.
Typical configurations often require >200KB for device state and VMDESC.

Fixes: 8518278a6a (migration: implementation
  of background snapshot thread)
Signed-off-by: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Message-Id: <20210401092226.102804-2-andrey.gruzdev@virtuozzo.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-04-06 18:56:01 +01:00
Daniel P. Berrangé
cbde7be900 migrate: remove QMP/HMP commands for speed, downtime and cache size
The generic 'migrate_set_parameters' command handle all types of param.

Only the QMP commands were documented in the deprecations page, but the
rationale for deprecating applies equally to HMP, and the replacements
exist. Furthermore the HMP commands are just shims to the QMP commands,
so removing the latter breaks the former unless they get re-implemented.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-03-18 09:22:55 +00:00
Mahmoud Mandour
373969507a migration: Replaced qemu_mutex_lock calls with QEMU_LOCK_GUARD
Replaced various qemu_mutex_lock calls and their respective
qemu_mutex_unlock calls with QEMU_LOCK_GUARD macro. This simplifies
the code by eliminating the respective qemu_mutex_unlock calls.

Signed-off-by: Mahmoud Mandour <ma.mandourr@gmail.com>
Message-Id: <20210311031538.5325-7-ma.mandourr@gmail.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-03-15 20:01:55 +00:00