| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | /*
 | 
					
						
							|  |  |  |  * QEMU coroutines | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Copyright IBM, Corp. 2011 | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * Authors: | 
					
						
							|  |  |  |  *  Stefan Hajnoczi    <stefanha@linux.vnet.ibm.com> | 
					
						
							|  |  |  |  *  Kevin Wolf         <kwolf@redhat.com> | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  * This work is licensed under the terms of the GNU LGPL, version 2 or later. | 
					
						
							|  |  |  |  * See the COPYING.LIB file in the top-level directory. | 
					
						
							|  |  |  |  * | 
					
						
							|  |  |  |  */ | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2016-01-29 17:49:55 +00:00
										 |  |  | #include "qemu/osdep.h"
 | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | #include "trace.h"
 | 
					
						
							| 
									
										
										
										
											2013-05-17 15:51:25 +02:00
										 |  |  | #include "qemu/thread.h"
 | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  | #include "qemu/atomic.h"
 | 
					
						
							| 
									
										
										
										
											2015-09-01 14:48:02 +01:00
										 |  |  | #include "qemu/coroutine_int.h"
 | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  | #include "qemu/coroutine-tls.h"
 | 
					
						
							| 
									
										
										
										
											2017-02-13 14:52:19 +01:00
										 |  |  | #include "block/aio.h"
 | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  | /**
 | 
					
						
							|  |  |  |  * The minimal batch size is always 64, coroutines from the release_pool are | 
					
						
							|  |  |  |  * reused as soon as there are 64 coroutines in it. The maximum pool size starts | 
					
						
							|  |  |  |  * with 64 and is increased on demand so that coroutines are not deleted even if | 
					
						
							|  |  |  |  * they are not immediately reused. | 
					
						
							|  |  |  |  */ | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  | enum { | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  |     POOL_MIN_BATCH_SIZE = 64, | 
					
						
							|  |  |  |     POOL_INITIAL_MAX_SIZE = 64, | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  | }; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | /** Free list to speed up creation */ | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  | static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  | static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE; | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  | static unsigned int release_pool_size; | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  | 
 | 
					
						
							|  |  |  | typedef QSLIST_HEAD(, Coroutine) CoroutineQSList; | 
					
						
							|  |  |  | QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool); | 
					
						
							|  |  |  | QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size); | 
					
						
							|  |  |  | QEMU_DEFINE_STATIC_CO_TLS(Notifier, coroutine_pool_cleanup_notifier); | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  | 
 | 
					
						
							|  |  |  | static void coroutine_pool_cleanup(Notifier *n, void *value) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     Coroutine *co; | 
					
						
							|  |  |  |     Coroutine *tmp; | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |     CoroutineQSList *alloc_pool = get_ptr_alloc_pool(); | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |     QSLIST_FOREACH_SAFE(co, alloc_pool, pool_next, tmp) { | 
					
						
							|  |  |  |         QSLIST_REMOVE_HEAD(alloc_pool, pool_next); | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |         qemu_coroutine_delete(co); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | } | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
											  
											
												coroutine: move entry argument to qemu_coroutine_create
In practice the entry argument is always known at creation time, and
it is confusing that sometimes qemu_coroutine_enter is used with a
non-NULL argument to re-enter a coroutine (this happens in
block/sheepdog.c and tests/test-coroutine.c).  So pass the opaque value
at creation time, for consistency with e.g. aio_bh_new.
Mostly done with the following semantic patch:
@ entry1 @
expression entry, arg, co;
@@
- co = qemu_coroutine_create(entry);
+ co = qemu_coroutine_create(entry, arg);
  ...
- qemu_coroutine_enter(co, arg);
+ qemu_coroutine_enter(co);
@ entry2 @
expression entry, arg;
identifier co;
@@
- Coroutine *co = qemu_coroutine_create(entry);
+ Coroutine *co = qemu_coroutine_create(entry, arg);
  ...
- qemu_coroutine_enter(co, arg);
+ qemu_coroutine_enter(co);
@ entry3 @
expression entry, arg;
@@
- qemu_coroutine_enter(qemu_coroutine_create(entry), arg);
+ qemu_coroutine_enter(qemu_coroutine_create(entry, arg));
@ reentry @
expression co;
@@
- qemu_coroutine_enter(co, NULL);
+ qemu_coroutine_enter(co);
except for the aforementioned few places where the semantic patch
stumbled (as expected) and for test_co_queue, which would otherwise
produce an uninitialized variable warning.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2016-07-04 19:10:01 +02:00
										 |  |  | Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque) | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2013-09-11 16:42:35 +02:00
										 |  |  |     Coroutine *co = NULL; | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-10-05 14:31:27 +02:00
										 |  |  |     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) { | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |         CoroutineQSList *alloc_pool = get_ptr_alloc_pool(); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         co = QSLIST_FIRST(alloc_pool); | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |         if (!co) { | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  |             if (release_pool_size > POOL_MIN_BATCH_SIZE) { | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |                 /* Slow path; a good place to register the destructor, too.  */ | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |                 Notifier *notifier = get_ptr_coroutine_pool_cleanup_notifier(); | 
					
						
							|  |  |  |                 if (!notifier->notify) { | 
					
						
							|  |  |  |                     notifier->notify = coroutine_pool_cleanup; | 
					
						
							|  |  |  |                     qemu_thread_atexit_add(notifier); | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |                 } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |                 /* This is not exact; there could be a little skew between
 | 
					
						
							|  |  |  |                  * release_pool_size and the actual size of release_pool.  But | 
					
						
							|  |  |  |                  * it is just a heuristic, it does not need to be perfect. | 
					
						
							|  |  |  |                  */ | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |                 set_alloc_pool_size(qatomic_xchg(&release_pool_size, 0)); | 
					
						
							|  |  |  |                 QSLIST_MOVE_ATOMIC(alloc_pool, &release_pool); | 
					
						
							|  |  |  |                 co = QSLIST_FIRST(alloc_pool); | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |             } | 
					
						
							|  |  |  |         } | 
					
						
							| 
									
										
										
										
											2013-09-11 16:42:35 +02:00
										 |  |  |         if (co) { | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |             QSLIST_REMOVE_HEAD(alloc_pool, pool_next); | 
					
						
							|  |  |  |             set_alloc_pool_size(get_alloc_pool_size() - 1); | 
					
						
							| 
									
										
										
										
											2013-09-11 16:42:35 +02:00
										 |  |  |         } | 
					
						
							| 
									
										
										
										
											2013-05-17 15:51:25 +02:00
										 |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     if (!co) { | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  |         co = qemu_coroutine_new(); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  |     co->entry = entry; | 
					
						
							| 
									
										
											  
											
												coroutine: move entry argument to qemu_coroutine_create
In practice the entry argument is always known at creation time, and
it is confusing that sometimes qemu_coroutine_enter is used with a
non-NULL argument to re-enter a coroutine (this happens in
block/sheepdog.c and tests/test-coroutine.c).  So pass the opaque value
at creation time, for consistency with e.g. aio_bh_new.
Mostly done with the following semantic patch:
@ entry1 @
expression entry, arg, co;
@@
- co = qemu_coroutine_create(entry);
+ co = qemu_coroutine_create(entry, arg);
  ...
- qemu_coroutine_enter(co, arg);
+ qemu_coroutine_enter(co);
@ entry2 @
expression entry, arg;
identifier co;
@@
- Coroutine *co = qemu_coroutine_create(entry);
+ Coroutine *co = qemu_coroutine_create(entry, arg);
  ...
- qemu_coroutine_enter(co, arg);
+ qemu_coroutine_enter(co);
@ entry3 @
expression entry, arg;
@@
- qemu_coroutine_enter(qemu_coroutine_create(entry), arg);
+ qemu_coroutine_enter(qemu_coroutine_create(entry, arg));
@ reentry @
expression co;
@@
- qemu_coroutine_enter(co, NULL);
+ qemu_coroutine_enter(co);
except for the aforementioned few places where the semantic patch
stumbled (as expected) and for test_co_queue, which would otherwise
produce an uninitialized variable warning.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2016-07-04 19:10:01 +02:00
										 |  |  |     co->entry_arg = opaque; | 
					
						
							| 
									
										
										
										
											2016-07-04 19:09:59 +02:00
										 |  |  |     QSIMPLEQ_INIT(&co->co_queue_wakeup); | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  |     return co; | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  | static void coroutine_delete(Coroutine *co) | 
					
						
							|  |  |  | { | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |     co->caller = NULL; | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-10-05 14:31:27 +02:00
										 |  |  |     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) { | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  |         if (release_pool_size < qatomic_read(&pool_max_size) * 2) { | 
					
						
							| 
									
										
											  
											
												coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool.  Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU.  Otherwise you have to stick
with adding to a list, and emptying it completely.  This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool.  This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage.  Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg.  Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive.  The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size.  (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous.  But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
No matter if we end with some kind of coroutine bypass scheme or not,
it cannot hurt to optimize hot code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2014-12-02 12:05:48 +01:00
										 |  |  |             QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); | 
					
						
							| 
									
										
										
										
											2020-09-23 11:56:46 +01:00
										 |  |  |             qatomic_inc(&release_pool_size); | 
					
						
							| 
									
										
										
										
											2013-09-11 16:42:35 +02:00
										 |  |  |             return; | 
					
						
							|  |  |  |         } | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  |         if (get_alloc_pool_size() < qatomic_read(&pool_max_size)) { | 
					
						
							| 
									
										
										
										
											2022-03-07 15:38:52 +00:00
										 |  |  |             QSLIST_INSERT_HEAD(get_ptr_alloc_pool(), co, pool_next); | 
					
						
							|  |  |  |             set_alloc_pool_size(get_alloc_pool_size() + 1); | 
					
						
							| 
									
										
										
										
											2014-12-02 12:05:50 +01:00
										 |  |  |             return; | 
					
						
							|  |  |  |         } | 
					
						
							| 
									
										
										
										
											2013-02-19 11:59:09 +01:00
										 |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     qemu_coroutine_delete(co); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2017-04-10 20:06:12 +08:00
										 |  |  | void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co) | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | { | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |     QSIMPLEQ_HEAD(, Coroutine) pending = QSIMPLEQ_HEAD_INITIALIZER(pending); | 
					
						
							|  |  |  |     Coroutine *from = qemu_coroutine_self(); | 
					
						
							| 
									
										
										
										
											2017-11-17 22:27:09 -05:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |     QSIMPLEQ_INSERT_TAIL(&pending, co, co_queue_next); | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |     /* Run co and any queued coroutines */ | 
					
						
							|  |  |  |     while (!QSIMPLEQ_EMPTY(&pending)) { | 
					
						
							|  |  |  |         Coroutine *to = QSIMPLEQ_FIRST(&pending); | 
					
						
							|  |  |  |         CoroutineAction ret; | 
					
						
							| 
									
										
										
										
											2017-11-17 22:27:09 -05:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2023-03-03 11:00:43 +01:00
										 |  |  |         /*
 | 
					
						
							|  |  |  |          * Read to before to->scheduled; pairs with qatomic_cmpxchg in | 
					
						
							|  |  |  |          * qemu_co_sleep(), aio_co_schedule() etc. | 
					
						
							|  |  |  |          */ | 
					
						
							|  |  |  |         smp_read_barrier_depends(); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         const char *scheduled = qatomic_read(&to->scheduled); | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |         QSIMPLEQ_REMOVE_HEAD(&pending, co_queue_next); | 
					
						
							| 
									
										
										
										
											2017-02-13 14:52:19 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |         trace_qemu_aio_coroutine_enter(ctx, from, to, to->entry_arg); | 
					
						
							| 
									
										
										
										
											2017-02-13 14:52:19 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |         /* if the Coroutine has already been scheduled, entering it again will
 | 
					
						
							|  |  |  |          * cause us to enter it twice, potentially even after the coroutine has | 
					
						
							|  |  |  |          * been deleted */ | 
					
						
							|  |  |  |         if (scheduled) { | 
					
						
							|  |  |  |             fprintf(stderr, | 
					
						
							|  |  |  |                     "%s: Co-routine was already scheduled in '%s'\n", | 
					
						
							|  |  |  |                     __func__, scheduled); | 
					
						
							|  |  |  |             abort(); | 
					
						
							|  |  |  |         } | 
					
						
							| 
									
										
										
										
											2015-02-10 11:31:52 +01:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |         if (to->caller) { | 
					
						
							|  |  |  |             fprintf(stderr, "Co-routine re-entered recursively\n"); | 
					
						
							|  |  |  |             abort(); | 
					
						
							|  |  |  |         } | 
					
						
							| 
									
										
											  
											
												coroutine-lock: do not touch coroutine after another one has been entered
Submission of requests on linux aio is a bit tricky and can lead to
requests completions on submission path:
44713c9e8547 ("linux-aio: Handle io_submit() failure gracefully")
0ed93d84edab ("linux-aio: process completions from ioq_submit()")
That means that any coroutine which has been yielded in order to wait
for completion can be resumed from submission path and be eventually
terminated (freed).
The following use-after-free crash was observed when IO throttling
was enabled:
 Program received signal SIGSEGV, Segmentation fault.
 [Switching to Thread 0x7f5813dff700 (LWP 56417)]
 virtqueue_unmap_sg (elem=0x7f5804009a30, len=1, vq=<optimized out>) at virtio.c:252
 (gdb) bt
 #0  virtqueue_unmap_sg (elem=0x7f5804009a30, len=1, vq=<optimized out>) at virtio.c:252
                              ^^^^^^^^^^^^^^
                              remember the address
 #1  virtqueue_fill (vq=0x5598b20d21b0, elem=0x7f5804009a30, len=1, idx=0) at virtio.c:282
 #2  virtqueue_push (vq=0x5598b20d21b0, elem=elem@entry=0x7f5804009a30, len=<optimized out>) at virtio.c:308
 #3  virtio_blk_req_complete (req=req@entry=0x7f5804009a30, status=status@entry=0 '\000') at virtio-blk.c:61
 #4  virtio_blk_rw_complete (opaque=<optimized out>, ret=0) at virtio-blk.c:126
 #5  blk_aio_complete (acb=0x7f58040068d0) at block-backend.c:923
 #6  coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:78
 (gdb) p * elem
 $8 = {index = 77, out_num = 2, in_num = 1,
       in_addr = 0x7f5804009ad8, out_addr = 0x7f5804009ae0,
       in_sg = 0x0, out_sg = 0x7f5804009a50}
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       'in_sg' and 'out_sg' are invalid.
       e.g. it is impossible that 'in_sg' is zero,
       instead its value must be equal to:
       (gdb) p/x 0x7f5804009ad8 + sizeof(elem->in_addr[0]) + 2 * sizeof(elem->out_addr[0])
       $26 = 0x7f5804009af0
Seems 'elem' was corrupted.  Meanwhile another thread raised an abort:
 Thread 12 (Thread 0x7f57f2ffd700 (LWP 56426)):
 #0  raise () from /lib/x86_64-linux-gnu/libc.so.6
 #1  abort () from /lib/x86_64-linux-gnu/libc.so.6
 #2  qemu_coroutine_enter (co=0x7f5804009af0) at qemu-coroutine.c:113
 #3  qemu_co_queue_run_restart (co=0x7f5804009a30) at qemu-coroutine-lock.c:60
 #4  qemu_coroutine_enter (co=0x7f5804009a30) at qemu-coroutine.c:119
                           ^^^^^^^^^^^^^^^^^^
                           WTF?? this is equal to elem from crashed thread
 #5  qemu_co_queue_run_restart (co=0x7f57e7f16ae0) at qemu-coroutine-lock.c:60
 #6  qemu_coroutine_enter (co=0x7f57e7f16ae0) at qemu-coroutine.c:119
 #7  qemu_co_queue_run_restart (co=0x7f5807e112a0) at qemu-coroutine-lock.c:60
 #8  qemu_coroutine_enter (co=0x7f5807e112a0) at qemu-coroutine.c:119
 #9  qemu_co_queue_run_restart (co=0x7f5807f17820) at qemu-coroutine-lock.c:60
 #10 qemu_coroutine_enter (co=0x7f5807f17820) at qemu-coroutine.c:119
 #11 qemu_co_queue_run_restart (co=0x7f57e7f18e10) at qemu-coroutine-lock.c:60
 #12 qemu_coroutine_enter (co=0x7f57e7f18e10) at qemu-coroutine.c:119
 #13 qemu_co_enter_next (queue=queue@entry=0x5598b1e742d0) at qemu-coroutine-lock.c:106
 #14 timer_cb (blk=0x5598b1e74280, is_write=<optimized out>) at throttle-groups.c:419
Crash can be explained by access of 'co' object from the loop inside
qemu_co_queue_run_restart():
  while ((next = QSIMPLEQ_FIRST(&co->co_queue_wakeup))) {
      QSIMPLEQ_REMOVE_HEAD(&co->co_queue_wakeup, co_queue_next);
                           ^^^^^^^^^^^^^^^^^^^^
                           on each iteration 'co' is accessed,
                           but 'co' can be already freed
      qemu_coroutine_enter(next);
  }
When 'next' coroutine is resumed (entered) it can in its turn resume
'co', and eventually free it.  That's why we see 'co' (which was freed)
has the same address as 'elem' from the first backtrace.
The fix is obvious: use temporary queue and do not touch coroutine after
first qemu_coroutine_enter() is invoked.
The issue is quite rare and happens every ~12 hours on very high IO
and CPU load (building linux kernel with -j512 inside guest) when IO
throttling is enabled.  With the fix applied guest is running ~35 hours
and is still alive so far.
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20170601160847.23720-1-roman.penyaev@profitbricks.com
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
											
										 
											2017-06-01 18:08:47 +02:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-03-22 15:28:33 +00:00
										 |  |  |         to->caller = from; | 
					
						
							|  |  |  |         to->ctx = ctx; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         /* Store to->ctx before anything that stores to.  Matches
 | 
					
						
							|  |  |  |          * barrier in aio_co_wake and qemu_co_mutex_wake. | 
					
						
							|  |  |  |          */ | 
					
						
							|  |  |  |         smp_wmb(); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         ret = qemu_coroutine_switch(from, to, COROUTINE_ENTER); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         /* Queued coroutines are run depth-first; previously pending coroutines
 | 
					
						
							|  |  |  |          * run after those queued more recently. | 
					
						
							|  |  |  |          */ | 
					
						
							|  |  |  |         QSIMPLEQ_PREPEND(&pending, &to->co_queue_wakeup); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |         switch (ret) { | 
					
						
							|  |  |  |         case COROUTINE_YIELD: | 
					
						
							|  |  |  |             break; | 
					
						
							|  |  |  |         case COROUTINE_TERMINATE: | 
					
						
							|  |  |  |             assert(!to->locks_held); | 
					
						
							|  |  |  |             trace_qemu_coroutine_terminate(to); | 
					
						
							|  |  |  |             coroutine_delete(to); | 
					
						
							|  |  |  |             break; | 
					
						
							|  |  |  |         default: | 
					
						
							|  |  |  |             abort(); | 
					
						
							|  |  |  |         } | 
					
						
							| 
									
										
										
										
											2015-02-10 11:31:52 +01:00
										 |  |  |     } | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2017-04-10 20:06:12 +08:00
										 |  |  | void qemu_coroutine_enter(Coroutine *co) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     qemu_aio_coroutine_enter(qemu_get_current_aio_context(), co); | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2016-11-07 16:34:35 +01:00
										 |  |  | void qemu_coroutine_enter_if_inactive(Coroutine *co) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     if (!qemu_coroutine_entered(co)) { | 
					
						
							|  |  |  |         qemu_coroutine_enter(co); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | void coroutine_fn qemu_coroutine_yield(void) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     Coroutine *self = qemu_coroutine_self(); | 
					
						
							|  |  |  |     Coroutine *to = self->caller; | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     trace_qemu_coroutine_yield(self, to); | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     if (!to) { | 
					
						
							|  |  |  |         fprintf(stderr, "Co-routine is yielding to no one\n"); | 
					
						
							|  |  |  |         abort(); | 
					
						
							|  |  |  |     } | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  |     self->caller = NULL; | 
					
						
							| 
									
										
										
										
											2015-02-10 11:17:53 +01:00
										 |  |  |     qemu_coroutine_switch(self, to, COROUTINE_YIELD); | 
					
						
							| 
									
										
										
										
											2011-01-17 16:08:14 +00:00
										 |  |  | } | 
					
						
							| 
									
										
										
										
											2016-09-27 16:18:34 +01:00
										 |  |  | 
 | 
					
						
							|  |  |  | bool qemu_coroutine_entered(Coroutine *co) | 
					
						
							|  |  |  | { | 
					
						
							|  |  |  |     return co->caller; | 
					
						
							|  |  |  | } | 
					
						
							| 
									
										
										
										
											2018-08-17 18:54:18 +02:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2022-09-22 10:49:03 +02:00
										 |  |  | AioContext *qemu_coroutine_get_aio_context(Coroutine *co) | 
					
						
							| 
									
										
										
										
											2018-08-17 18:54:18 +02:00
										 |  |  | { | 
					
						
							|  |  |  |     return co->ctx; | 
					
						
							|  |  |  | } | 
					
						
							| 
									
										
										
										
											2022-02-14 20:53:02 +09:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2022-05-10 17:10:19 +02:00
										 |  |  | void qemu_coroutine_inc_pool_size(unsigned int additional_pool_size) | 
					
						
							| 
									
										
										
										
											2022-02-14 20:53:02 +09:00
										 |  |  | { | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  |     qatomic_add(&pool_max_size, additional_pool_size); | 
					
						
							| 
									
										
										
										
											2022-02-14 20:53:02 +09:00
										 |  |  | } | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2022-05-10 17:10:19 +02:00
										 |  |  | void qemu_coroutine_dec_pool_size(unsigned int removing_pool_size) | 
					
						
							| 
									
										
										
										
											2022-02-14 20:53:02 +09:00
										 |  |  | { | 
					
						
							| 
									
										
											  
											
												coroutine: Revert to constant batch size
Commit 4c41c69e changed the way the coroutine pool is sized because for
virtio-blk devices with a large queue size and heavy I/O, it was just
too small and caused coroutines to be deleted and reallocated soon
afterwards. The change made the size dynamic based on the number of
queues and the queue size of virtio-blk devices.
There are two important numbers here: Slightly simplified, when a
coroutine terminates, it is generally stored in the global release pool
up to a certain pool size, and if the pool is full, it is freed.
Conversely, when allocating a new coroutine, the coroutines in the
release pool are reused if the pool already has reached a certain
minimum size (the batch size), otherwise we allocate new coroutines.
The problem after commit 4c41c69e is that it not only increases the
maximum pool size (which is the intended effect), but also the batch
size for reusing coroutines (which is a bug). It means that in cases
with many devices and/or a large queue size (which defaults to the
number of vcpus for virtio-blk-pci), many thousand coroutines could be
sitting in the release pool without being reused.
This is not only a waste of memory and allocations, but it actually
makes the QEMU process likely to hit the vm.max_map_count limit on Linux
because each coroutine requires two mappings (its stack and the guard
page for the stack), causing it to abort() in qemu_alloc_stack() because
when the limit is hit, mprotect() starts to fail with ENOMEM.
In order to fix the problem, change the batch size back to 64 to avoid
uselessly accumulating coroutines in the release pool, but keep the
dynamic maximum pool size so that coroutines aren't freed too early
in heavy I/O scenarios.
Note that this fix doesn't strictly make it impossible to hit the limit,
but this would only happen if most of the coroutines are actually in use
at the same time, not just sitting in a pool. This is the same behaviour
as we already had before commit 4c41c69e. Fully preventing this would
require allowing qemu_coroutine_create() to return an error, but it
doesn't seem to be a scenario that people hit in practice.
Cc: qemu-stable@nongnu.org
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2079938
Fixes: 4c41c69e05fe28c0f95f8abd2ebf407e95a4f04b
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220510151020.105528-3-kwolf@redhat.com>
Tested-by: Hiroki Narukawa <hnarukaw@yahoo-corp.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
											
										 
											2022-05-10 17:10:20 +02:00
										 |  |  |     qatomic_sub(&pool_max_size, removing_pool_size); | 
					
						
							| 
									
										
										
										
											2022-02-14 20:53:02 +09:00
										 |  |  | } |