spapr: reset DRCs after devices

A DRC with a pending unplug request releases its associated device at
machine reset time.

In the case of LMB, when all DRCs for a DIMM device have been reset,
the DIMM gets unplugged, causing guest memory to disappear. This may
be very confusing for anything still using this memory.

This is exactly what happens with vhost backends, and QEMU aborts
with:

  qemu-system-ppc64: used ring relocated for ring 2
  qemu-system-ppc64: qemu/hw/virtio/vhost.c:649: vhost_commit: Assertion `r >= 0' failed.

The issue is that each DRC registers a QEMU reset handler, and we
don't control the order in which these handlers are called (i.e.,
an LMB DRC will unplug a DIMM before the virtio device using the
memory on this DIMM could stop its vhost backend).

To avoid such situations, let's reset DRCs after all devices
have been reset.

Reported-by: Mallesh N. Koti <mallesh@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Reviewed-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
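
The ordering problem described in this commit message can be illustrated with a small toy model. This is only a sketch and is not QEMU code: `event_log`, `vhost_device_reset`, `lmb_drc_reset`, `machine_reset_unordered` and `machine_reset_ordered` are hypothetical names invented for the illustration, and the real reset path is not shown in this excerpt. The first variant keeps the DRC handler in the same list as the device handlers, so the DIMM may go away first; the second runs device handlers first and DRC handlers afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: every name below is hypothetical, none is a QEMU API. */
#define MAX_EVENTS 8

static const char *event_log[MAX_EVENTS];
static size_t event_count;

static void log_event(const char *what)
{
    assert(event_count < MAX_EVENTS);
    event_log[event_count++] = what;
}

typedef void reset_fn(void);

/* Stands in for a virtio device stopping its vhost backend on reset. */
static void vhost_device_reset(void)
{
    log_event("vhost backend stopped");
}

/* Stands in for an LMB DRC releasing its DIMM on reset. */
static void lmb_drc_reset(void)
{
    log_event("DIMM unplugged");
}

/* Call each handler in the order it appears in the array. */
static void run_handlers(reset_fn *const *handlers, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        handlers[i]();
    }
}

/* Single list: the DRC happens to be registered before the device,
 * so the memory disappears while the backend is still running. */
static void machine_reset_unordered(void)
{
    reset_fn *const handlers[] = { lmb_drc_reset, vhost_device_reset };

    event_count = 0;
    run_handlers(handlers, 2);
}

/* Two phases: all devices are reset first, DRCs only afterwards. */
static void machine_reset_ordered(void)
{
    reset_fn *const devices[] = { vhost_device_reset };
    reset_fn *const drcs[] = { lmb_drc_reset };

    event_count = 0;
    run_handlers(devices, 1);
    run_handlers(drcs, 1);
}
```

Calling `machine_reset_ordered()` records the backend stop before the unplug regardless of how the handlers were registered, which is the property the commit is after.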
target/ppc: Update setting of cpu features to account for compat modes
2017-11-20 10:10:56 +11:00 · 2017-11-20 10:07:49 +11:00 · 2017-11-17 19:08:07 +00:00 · 2017-11-17 18:24:30 +01:00 · 2017-11-17 18:21:31 +01:00 · 2017-11-17 18:21:31 +01:00
964 changed files with 57702 additions and 18065 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -14,6 +14,8 @@
 /trace/generated-tcg-tracers.h
 /ui/shader/texture-blit-frag.h
 /ui/shader/texture-blit-vert.h
+/ui/shader/texture-blit-flip-vert.h
+/ui/input-keymap-*.c
 *-timestamp
 /*-softmmu
 /*-darwin-user
@@ -44,14 +46,17 @@
 /qemu-io
 /qemu-ga
 /qemu-bridge-helper
+/qemu-keymap
 /qemu-monitor.texi
 /qemu-monitor-info.texi
 /qemu-version.h
 /qemu-version.h.tmp
 /module_block.h
+/scsi/qemu-pr-helper
 /vscclient
 /vhost-user-scsi
 /fsdev/virtfs-proxy-helper
+*.tmp
 *.[1-9]
 *.a
 *.aux
@@ -111,6 +116,7 @@
 /docs/version.texi
 *.tps
 .stgit-*
+.git-submodule-status
 cscope.*
 tags
 TAGS
--- a/.gitmodules
+++ b/.gitmodules
@@ -34,3 +34,9 @@
 [submodule "roms/QemuMacDrivers"]
 	path = roms/QemuMacDrivers
 	url = git://git.qemu.org/QemuMacDrivers.git
+[submodule "ui/keycodemapdb"]
+	path = ui/keycodemapdb
+	url = git://git.qemu.org/keycodemapdb.git
+[submodule "capstone"]
+	path = capstone
+	url = git://git.qemu.org/capstone.git
--- a/.mailmap
+++ b/.mailmap
@@ -8,8 +8,11 @@ Aurelien Jarno <aurelien@aurel32.net> aurel32 <aurel32@c046a42c-6fe2-441c-8c8c-7
 Blue Swirl <blauwirbel@gmail.com> blueswir1 <blueswir1@c046a42c-6fe2-441c-8c8c-71466251a162>
 Edgar E. Iglesias <edgar.iglesias@gmail.com> edgar_igl <edgar_igl@c046a42c-6fe2-441c-8c8c-71466251a162>
 Fabrice Bellard <fabrice@bellard.org> bellard <bellard@c046a42c-6fe2-441c-8c8c-71466251a162>
+James Hogan <jhogan@kernel.org> <james.hogan@imgtec.com>
 Jocelyn Mayer <l_indien@magic.fr> j_mayer <j_mayer@c046a42c-6fe2-441c-8c8c-71466251a162>
 Paul Brook <paul@codesourcery.com> pbrook <pbrook@c046a42c-6fe2-441c-8c8c-71466251a162>
+Paul Burton <paul.burton@mips.com> <paul.burton@imgtec.com>
+Paul Burton <paul.burton@mips.com> <paul@archlinuxmips.org>
 Thiemo Seufer <ths@networkno.de> ths <ths@c046a42c-6fe2-441c-8c8c-71466251a162>
 malc <av1474@comtv.ru> malc <malc@c046a42c-6fe2-441c-8c8c-71466251a162>
 # There is also a:
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -162,7 +162,7 @@ F: disas/microblaze.c

 MIPS
 M: Aurelien Jarno <aurelien@aurel32.net>
-M: Yongbok Kim <yongbok.kim@imgtec.com>
+M: Yongbok Kim <yongbok.kim@mips.com>
 S: Maintained
 F: target/mips/
 F: hw/mips/
@@ -216,6 +216,7 @@ S: Maintained
 F: target/s390x/
 F: hw/s390x/
 F: disas/s390.c
+L: qemu-s390x@nongnu.org

 SH4
 M: Aurelien Jarno <aurelien@aurel32.net>
@@ -284,7 +285,7 @@ S: Maintained
 F: target/arm/kvm.c

 MIPS
-M: James Hogan <james.hogan@imgtec.com>
+M: James Hogan <jhogan@kernel.org>
 S: Maintained
 F: target/mips/kvm.c

@@ -299,14 +300,18 @@ M: Cornelia Huck <cohuck@redhat.com>
 M: Alexander Graf <agraf@suse.de>
 S: Maintained
 F: target/s390x/kvm.c
+F: target/s390x/kvm_s390x.h
+F: target/s390x/kvm-stub.c
 F: target/s390x/ioinst.[ch]
 F: target/s390x/machine.c
+F: target/s390x/sigp.c
 F: hw/intc/s390_flic.c
 F: hw/intc/s390_flic_kvm.c
 F: include/hw/s390x/s390_flic.h
 F: gdb-xml/s390*.xml
 T: git git://github.com/cohuck/qemu.git s390-next
 T: git git://github.com/borntraeger/qemu.git s390-next
+L: qemu-s390x@nongnu.org

 X86
 M: Paolo Bonzini <pbonzini@redhat.com>
@@ -380,6 +385,7 @@ M: Peter Maydell <peter.maydell@linaro.org>
 L: qemu-arm@nongnu.org
 S: Maintained
 F: hw/char/pl011.c
+F: include/hw/char/pl011.h
 F: hw/display/pl110*
 F: hw/dma/pl080.c
 F: hw/dma/pl330.c
@@ -403,13 +409,15 @@ F: hw/intc/gic_internal.h
 F: hw/misc/a9scu.c
 F: hw/misc/arm11scu.c
 F: hw/timer/a9gtimer*
-F: hw/timer/arm_*
-F: include/hw/arm/arm.h
+F: hw/timer/arm*
+F: include/hw/arm/arm*.h
 F: include/hw/intc/arm*
 F: include/hw/misc/a9scu.h
 F: include/hw/misc/arm11scu.h
 F: include/hw/timer/a9gtimer.h
 F: include/hw/timer/arm_mptimer.h
+F: include/hw/timer/armv7m_systick.h
+F: tests/test-arm-mptimer.c

 Exynos
 M: Igor Mitsyanko <i.mitsyanko@gmail.com>
@@ -512,6 +520,7 @@ M: Peter Maydell <peter.maydell@linaro.org>
 L: qemu-arm@nongnu.org
 S: Maintained
 F: hw/*/versatile*
+F: hw/misc/arm_sysctl.c

 Xilinx Zynq
 M: Edgar E. Iglesias <edgar.iglesias@gmail.com>
@@ -548,12 +557,30 @@ F: hw/char/stm32f2xx_usart.c
 F: hw/timer/stm32f2xx_timer.c
 F: hw/adc/*
 F: hw/ssi/stm32f2xx_spi.c
+F: include/hw/*/stm32*.h

 Netduino 2
 M: Alistair Francis <alistair@alistair23.me>
 S: Maintained
 F: hw/arm/netduino2.c

+SmartFusion2
+M: Subbaraya Sundeep <sundeep.lkml@gmail.com>
+S: Maintained
+F: hw/arm/msf2-soc.c
+F: hw/misc/msf2-sysreg.c
+F: hw/timer/mss-timer.c
+F: hw/ssi/mss-spi.c
+F: include/hw/arm/msf2-soc.h
+F: include/hw/misc/msf2-sysreg.h
+F: include/hw/timer/mss-timer.h
+F: include/hw/ssi/mss-spi.h
+
+Emcraft M2S-FG484
+M: Subbaraya Sundeep <sundeep.lkml@gmail.com>
+S: Maintained
+F: hw/arm/msf2-som.c
+
 CRIS Machines
 -------------
 Axis Dev88
@@ -616,7 +643,7 @@ S: Maintained
 F: hw/mips/mips_malta.c

 Mipssim
-M: Yongbok Kim <yongbok.kim@imgtec.com>
+M: Yongbok Kim <yongbok.kim@mips.com>
 S: Odd Fixes
 F: hw/mips/mips_mipssim.c
 F: hw/net/mipsnet.c
@@ -627,12 +654,12 @@ S: Maintained
 F: hw/mips/mips_r4k.c

 Fulong 2E
-M: Yongbok Kim <yongbok.kim@imgtec.com>
+M: Yongbok Kim <yongbok.kim@mips.com>
 S: Odd Fixes
 F: hw/mips/mips_fulong2e.c

 Boston
-M: Paul Burton <paul.burton@imgtec.com>
+M: Paul Burton <paul.burton@mips.com>
 S: Maintained
 F: hw/core/loader-fit.c
 F: hw/mips/boston.c
@@ -771,7 +798,7 @@ F: pc-bios/openbios-sparc64
 Sun4v
 M: Artyom Tarasenko <atar4qemu@gmail.com>
 S: Maintained
-F: hw/sparc64/sun4v.c
+F: hw/sparc64/niagara.c
 F: hw/timer/sun4v-rtc.c
 F: include/hw/timer/sun4v-rtc.h

@@ -800,6 +827,7 @@ F: pc-bios/s390-ccw.img
 F: default-configs/s390x-softmmu.mak
 T: git git://github.com/cohuck/qemu.git s390-next
 T: git git://github.com/borntraeger/qemu.git s390-next
+L: qemu-s390x@nongnu.org

 UniCore32 Machines
 -------------
@@ -925,6 +953,9 @@ F: include/hw/pci/*
 F: hw/misc/pci-testdev.c
 F: hw/pci/*
 F: hw/pci-bridge/*
+F: docs/pci*
+F: docs/specs/*pci*
+F: default-configs/pci.mak

 ACPI/SMBIOS
 M: Michael S. Tsirkin <mst@redhat.com>
@@ -971,22 +1002,19 @@ SCSI
 M: Paolo Bonzini <pbonzini@redhat.com>
 S: Supported
 F: include/hw/scsi/*
-F: include/scsi/*
 F: hw/scsi/*
-F: util/scsi*
 F: tests/virtio-scsi-test.c
 T: git git://github.com/bonzini/qemu.git scsi-next

-LSI53C895A
-S: Orphan
-F: hw/scsi/lsi53c895a.c
-
 SSI
 M: Peter Crosthwaite <crosthwaite.peter@gmail.com>
+M: Alistair Francis <alistair.francis@xilinx.com>
 S: Maintained
 F: hw/ssi/*
 F: hw/block/m25p80.c
+F: include/hw/ssi/ssi.h
 X: hw/ssi/xilinx_*
+F: tests/m25p80-test.c

 Xilinx SPI
 M: Alistair Francis <alistair.francis@xilinx.com>
@@ -1024,11 +1052,13 @@ F: hw/vfio/ccw.c
 F: hw/s390x/s390-ccw.c
 F: include/hw/s390x/s390-ccw.h
 T: git git://github.com/cohuck/qemu.git s390-next
+L: qemu-s390x@nongnu.org

 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
 F: hw/*/*vhost*
+F: docs/interop/vhost-user.txt

 virtio
 M: Michael S. Tsirkin <mst@redhat.com>
@@ -1066,6 +1096,7 @@ S: Supported
 F: hw/s390x/virtio-ccw.[hc]
 T: git git://github.com/cohuck/qemu.git s390-next
 T: git git://github.com/borntraeger/qemu.git s390-next
+L: qemu-s390x@nongnu.org

 virtio-input
 M: Gerd Hoffmann <kraxel@redhat.com>
@@ -1126,6 +1157,7 @@ M: Dmitry Fleytman <dmitry@daynix.com>
 S: Maintained
 F: hw/net/vmxnet*
 F: hw/scsi/vmw_pvscsi*
+F: tests/vmxnet3-test.c

 Rocker
 M: Jiri Pirko <jiri@resnulli.us>
@@ -1156,6 +1188,7 @@ M: Alistair Francis <alistair.francis@xilinx.com>
 S: Maintained
 F: hw/core/generic-loader.c
 F: include/hw/core/generic-loader.h
+F: docs/generic-loader.txt

 CHRP NVRAM
 M: Thomas Huth <thuth@redhat.com>
@@ -1217,6 +1250,7 @@ F: util/aio-*.c
 F: block/io.c
 F: migration/block*
 F: include/block/aio.h
+F: scripts/qemugdb/aio.py
 T: git git://github.com/stefanha/qemu.git block

 Block SCSI subsystem
@@ -1257,7 +1291,7 @@ F: block/dirty-bitmap.c
 F: include/qemu/hbitmap.h
 F: include/block/dirty-bitmap.h
 F: tests/test-hbitmap.c
-F: docs/bitmaps.md
+F: docs/interop/bitmaps.rst
 T: git git://github.com/famz/qemu.git bitmaps
 T: git git://github.com/jnsnow/qemu.git bitmaps

@@ -1301,6 +1335,17 @@ S: Maintained
 F: device_tree.c
 F: include/sysemu/device_tree.h

+Dump
+S: Supported
+M: Marc-André Lureau <marcandre.lureau@redhat.com>
+F: dump.c
+F: hw/misc/vmcoreinfo.c
+F: include/hw/misc/vmcoreinfo.h
+F: include/sysemu/dump-arch.h
+F: include/sysemu/dump.h
+F: scripts/dump-guest-memory.py
+F: stubs/dump.c
+
 Error reporting
 M: Markus Armbruster <armbru@redhat.com>
 S: Supported
@@ -1426,7 +1471,7 @@ F: tests/test-qapi-*.c
 F: tests/test-qmp-*.c
 F: tests/test-visitor-serialization.c
 F: scripts/qapi*
-F: docs/qapi*
+F: docs/devel/qapi*
 T: git git://repo.or.cz/qemu/armbru.git qapi-next

 QAPI Schema
@@ -1455,6 +1500,10 @@ QEMU Guest Agent
 M: Michael Roth <mdroth@linux.vnet.ibm.com>
 S: Maintained
 F: qga/
+F: qemu-ga.texi
+F: scripts/qemu-guest-agent/
+F: tests/test-qga.c
+F: docs/interop/qemu-ga-ref.texi
 T: git git://github.com/mdroth/qemu.git qga

 QOM
@@ -1474,7 +1523,7 @@ M: Markus Armbruster <armbru@redhat.com>
 S: Supported
 F: qmp.c
 F: monitor.c
-F: docs/*qmp-*
+F: docs/devel/*qmp-*
 F: scripts/qmp/
 F: tests/qmp-test.c
 T: git git://repo.or.cz/qemu/armbru.git qapi-next
@@ -1505,16 +1554,19 @@ S: Maintained
 F: trace/
 F: scripts/tracetool.py
 F: scripts/tracetool/
-F: docs/tracing.txt
+F: docs/devel/tracing.txt
 T: git git://github.com/stefanha/qemu.git tracing

 TPM
-S: Orphan
+M: Stefan Berger <stefanb@linux.vnet.ibm.com>
+S: Maintained
 F: tpm.c
+F: stubs/tpm.c
 F: hw/tpm/*
 F: include/hw/acpi/tpm.h
 F: include/sysemu/tpm*
 F: qapi/tpm.json
+F: backends/tpm.c

 Checkpatch
 S: Odd Fixes
@@ -1528,7 +1580,8 @@ F: include/migration/
 F: migration/
 F: scripts/vmstate-static-checker.py
 F: tests/vmstate-static-checker-data/
-F: docs/migration.txt
+F: tests/migration-test.c
+F: docs/devel/migration.txt
 F: qapi/migration.json

 Seccomp
@@ -1543,6 +1596,7 @@ S: Maintained
 F: crypto/
 F: include/crypto/
 F: tests/test-crypto-*
+F: tests/benchmark-crypto-*
 F: qemu.sasl

 Coroutines
@@ -1579,8 +1633,10 @@ M: Alberto Garcia <berto@igalia.com>
 S: Supported
 F: block/throttle-groups.c
 F: include/block/throttle-groups.h
-F: include/qemu/throttle.h
+F: include/qemu/throttle*.h
 F: util/throttle.c
+F: docs/throttle.txt
+F: tests/test-throttle.c
 L: qemu-block@nongnu.org

 UUID
@@ -1686,6 +1742,7 @@ M: Richard Henderson <rth@twiddle.net>
 S: Maintained
 F: tcg/s390/
 F: disas/s390.c
+L: qemu-s390x@nongnu.org

 SPARC target
 S: Odd Fixes
@@ -1836,7 +1893,7 @@ M: Denis V. Lunev <den@openvz.org>
 L: qemu-block@nongnu.org
 S: Supported
 F: block/parallels.c
-F: docs/specs/parallels.txt
+F: docs/interop/parallels.txt

 qed
 M: Stefan Hajnoczi <stefanha@redhat.com>
@@ -1861,6 +1918,7 @@ M: Max Reitz <mreitz@redhat.com>
 L: qemu-block@nongnu.org
 S: Supported
 F: block/qcow2*
+F: docs/interop/qcow2.txt

 qcow
 M: Kevin Wolf <kwolf@redhat.com>
@@ -1904,6 +1962,7 @@ F: docs/block-replication.txt

 Build and test automation
 -------------------------
+Build and test automation
 M: Alex Bennée <alex.bennee@linaro.org>
 M: Fam Zheng <famz@redhat.com>
 R: Philippe Mathieu-Daudé <f4bug@amsat.org>
@@ -1912,6 +1971,7 @@ S: Maintained
 F: .travis.yml
 F: .shippable.yml
 F: tests/docker/
+F: tests/vm/
 W: https://travis-ci.org/qemu/qemu
 W: https://app.shippable.com/github/qemu/qemu
 W: http://patchew.org/QEMU/
@@ -1921,5 +1981,11 @@ Documentation
 Build system architecture
 M: Daniel P. Berrange <berrange@redhat.com>
 S: Odd Fixes
-F: docs/build-system.txt
+F: docs/devel/build-system.txt

+Build System
+------------
+GIT submodules
+M: Daniel P. Berrange <berrange@redhat.com>
+S: Odd Fixes
+F: scripts/git-submodule.sh
--- a/Makefile
+++ b/Makefile
@@ -6,7 +6,7 @@ BUILD_DIR=$(CURDIR)
 # Before including a proper config-host.mak, assume we are in the source tree
 SRC_PATH=.

-UNCHECKED_GOALS := %clean TAGS cscope ctags docker docker-%
+UNCHECKED_GOALS := %clean TAGS cscope ctags docker docker-% help

 # All following code might depend on configuration variables
 ifneq ($(wildcard config-host.mak),)
@@ -14,6 +14,36 @@ ifneq ($(wildcard config-host.mak),)
 all:
 include config-host.mak

+git-submodule-update:
+
+.PHONY: git-submodule-update
+
+git_module_status := $(shell \
+  cd '$(SRC_PATH)' && \
+  GIT="$(GIT)" ./scripts/git-submodule.sh status $(GIT_SUBMODULES); \
+  echo $$?; \
+)
+
+ifeq (1,$(git_module_status))
+ifeq (no,$(GIT_UPDATE))
+git-submodule-update:
+	$(call quiet-command, \
+            echo && \
+            echo "GIT submodule checkout is out of date. Please run" && \
+            echo "  scripts/git-submodule.sh update $(GIT_SUBMODULES)" && \
+            echo "from the source directory checkout $(SRC_PATH)" && \
+            echo && \
+            exit 1)
+else
+git-submodule-update:
+	$(call quiet-command, \
+          (cd $(SRC_PATH) && GIT="$(GIT)" ./scripts/git-submodule.sh update $(GIT_SUBMODULES)), \
+          "GIT","$(GIT_SUBMODULES)")
+endif
+endif
+
+.git-submodule-status: git-submodule-update config-host.mak
+
 # Check that we're not trying to do an out-of-tree build from
 # a tree that's been used for an in-tree build.
 ifneq ($(realpath $(SRC_PATH)),$(realpath .))
@@ -84,6 +114,7 @@ endif
 GENERATED_FILES += $(TRACE_HEADERS)
 GENERATED_FILES += $(TRACE_SOURCES)
 GENERATED_FILES += $(BUILD_DIR)/trace-events-all
+GENERATED_FILES += .git-submodule-status

 trace-group-name = $(shell dirname $1 | sed -e 's/[^a-zA-Z0-9]/_/g')

@@ -191,6 +222,31 @@ trace-dtrace-root.h: trace-dtrace-root.dtrace

 trace-dtrace-root.o: trace-dtrace-root.dtrace

+KEYCODEMAP_GEN = $(SRC_PATH)/ui/keycodemapdb/tools/keymap-gen
+KEYCODEMAP_CSV = $(SRC_PATH)/ui/keycodemapdb/data/keymaps.csv
+
+KEYCODEMAP_FILES = \
+		 ui/input-keymap-linux-to-qcode.c \
+		 ui/input-keymap-qcode-to-qnum.c \
+		 ui/input-keymap-qnum-to-qcode.c \
+		 $(NULL)
+
+GENERATED_FILES += $(KEYCODEMAP_FILES)
+
+ui/input-keymap-%.c: $(KEYCODEMAP_GEN) $(KEYCODEMAP_CSV) $(SRC_PATH)/ui/Makefile.objs
+	$(call quiet-command,\
+	    src=$$(echo $@ | sed -E -e "s,^ui/input-keymap-(.+)-to-(.+)\.c$$,\1,") && \
+	    dst=$$(echo $@ | sed -E -e "s,^ui/input-keymap-(.+)-to-(.+)\.c$$,\2,") && \
+	    test -e $(KEYCODEMAP_GEN) && \
+	    $(PYTHON) $(KEYCODEMAP_GEN) \
+	          --lang glib2 \
+	          --varname qemu_input_map_$${src}_to_$${dst} \
+	          code-map $(KEYCODEMAP_CSV) $${src} $${dst} \
+	        > $@ || rm -f $@, "GEN", "$@")
+
+$(KEYCODEMAP_GEN): .git-submodule-status
+$(KEYCODEMAP_CSV): .git-submodule-status
+
 # Don't try to regenerate Makefile or configure
 # We don't generate any of them
 Makefile: ;
@@ -209,6 +265,7 @@ ifdef BUILD_DOCS
 DOCS=qemu-doc.html qemu-doc.txt qemu.1 qemu-img.1 qemu-nbd.8 qemu-ga.8
 DOCS+=docs/interop/qemu-qmp-ref.html docs/interop/qemu-qmp-ref.txt docs/interop/qemu-qmp-ref.7
 DOCS+=docs/interop/qemu-ga-ref.html docs/interop/qemu-ga-ref.txt docs/interop/qemu-ga-ref.7
+DOCS+=docs/qemu-block-drivers.7
 ifdef CONFIG_VIRTFS
 DOCS+=fsdev/virtfs-proxy-helper.1
 endif
@@ -329,12 +386,27 @@ DTC_MAKE_ARGS=-I$(SRC_PATH)/dtc VPATH=$(SRC_PATH)/dtc -C dtc V="$(V)" LIBFDT_src
 DTC_CFLAGS=$(CFLAGS) $(QEMU_CFLAGS)
 DTC_CPPFLAGS=-I$(BUILD_DIR)/dtc -I$(SRC_PATH)/dtc -I$(SRC_PATH)/dtc/libfdt

-subdir-dtc:dtc/libfdt dtc/tests
+subdir-dtc: .git-submodule-status dtc/libfdt dtc/tests
 	$(call quiet-command,$(MAKE) $(DTC_MAKE_ARGS) CPPFLAGS="$(DTC_CPPFLAGS)" CFLAGS="$(DTC_CFLAGS)" LDFLAGS="$(LDFLAGS)" ARFLAGS="$(ARFLAGS)" CC="$(CC)" AR="$(AR)" LD="$(LD)" $(SUBDIR_MAKEFLAGS) libfdt/libfdt.a,)

-dtc/%:
+dtc/%: .git-submodule-status
 	mkdir -p $@

+# Overriding CFLAGS causes us to lose defines added in the sub-makefile.
+# Not overriding CFLAGS leads to mis-matches between compilation modes.
+# Therefore we replicate some of the logic in the sub-makefile.
+# Remove all the extra -Warning flags that QEMU uses that Capstone doesn't;
+# no need to annoy QEMU developers with such things.
+CAP_CFLAGS = $(patsubst -W%,,$(CFLAGS) $(QEMU_CFLAGS))
+CAP_CFLAGS += -DCAPSTONE_USE_SYS_DYN_MEM
+CAP_CFLAGS += -DCAPSTONE_HAS_ARM
+CAP_CFLAGS += -DCAPSTONE_HAS_ARM64
+CAP_CFLAGS += -DCAPSTONE_HAS_POWERPC
+CAP_CFLAGS += -DCAPSTONE_HAS_X86
+
+subdir-capstone: .git-submodule-status
+	$(call quiet-command,$(MAKE) -C $(SRC_PATH)/capstone CAPSTONE_SHARED=no BUILDDIR="$(BUILD_DIR)/capstone" CC="$(CC)" AR="$(AR)" LD="$(LD)" RANLIB="$(RANLIB)" CFLAGS="$(CAP_CFLAGS)" $(SUBDIR_MAKEFLAGS) $(BUILD_DIR)/capstone/$(LIBCAPSTONE))
+
 $(SUBDIR_RULES): libqemuutil.a $(common-obj-y) $(chardev-obj-y) \
 	$(qom-obj-y) $(crypto-aes-obj-$(CONFIG_USER_ONLY))

@@ -356,6 +428,7 @@ Makefile: $(version-obj-y)
 # Build libraries

 libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
+libvhost-user.a: $(libvhost-user-obj-y)

 ######################################################################

@@ -369,15 +442,25 @@ qemu-io$(EXESUF): qemu-io.o $(block-obj-y) $(crypto-obj-y) $(io-obj-y) $(qom-obj

 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o $(COMMON_LDADDS)

+qemu-keymap$(EXESUF): qemu-keymap.o ui/input-keymap.o $(COMMON_LDADDS)
+
 fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o fsdev/9p-marshal.o fsdev/9p-iov-marshal.o $(COMMON_LDADDS)
 fsdev/virtfs-proxy-helper$(EXESUF): LIBS += -lcap

+scsi/qemu-pr-helper$(EXESUF): scsi/qemu-pr-helper.o scsi/utils.o $(crypto-obj-y) $(io-obj-y) $(qom-obj-y) $(COMMON_LDADDS)
+ifdef CONFIG_MPATH
+scsi/qemu-pr-helper$(EXESUF): LIBS += -ludev -lmultipath -lmpathpersist
+endif
+
 qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool
 	$(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > $@,"GEN","$@")

 qemu-ga$(EXESUF): LIBS = $(LIBS_QGA)
 qemu-ga$(EXESUF): QEMU_CFLAGS += -I qga/qapi-generated

+qemu-keymap$(EXESUF): LIBS += $(XKBCOMMON_LIBS)
+qemu-keymap$(EXESUF): QEMU_CFLAGS += $(XKBCOMMON_CFLAGS)
+
 gen-out-type = $(subst .,-,$(suffix $@))

 qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py
@@ -473,7 +556,7 @@ ivshmem-client$(EXESUF): $(ivshmem-client-obj-y) $(COMMON_LDADDS)
 ivshmem-server$(EXESUF): $(ivshmem-server-obj-y) $(COMMON_LDADDS)
 	$(call LINK, $^)
 endif
-vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y)
+vhost-user-scsi$(EXESUF): $(vhost-user-scsi-obj-y) libvhost-user.a
 	$(call LINK, $^)

 module_block.h: $(SRC_PATH)/scripts/modules/module_block.py config-host.mak
@@ -488,7 +571,7 @@ clean:
 	rm -f *.msi
 	find . \( -name '*.so' -o -name '*.dll' -o -name '*.mo' -o -name '*.[oda]' \) -type f -exec rm {} +
 	rm -f $(filter-out %.tlb,$(TOOLS)) $(HELPERS-y) qemu-ga TAGS cscope.* *.pod *~ */*~
-	rm -f fsdev/*.pod
+	rm -f fsdev/*.pod scsi/*.pod
 	rm -f qemu-img-cmds.h
 	rm -f ui/shader/*-vert.h ui/shader/*-frag.h
 	@# May not be present in GENERATED_FILES
@@ -527,6 +610,7 @@ distclean: clean
 	rm -f docs/interop/qemu-qmp-ref.txt docs/interop/qemu-ga-ref.txt
 	rm -f docs/interop/qemu-qmp-ref.pdf docs/interop/qemu-ga-ref.pdf
 	rm -f docs/interop/qemu-qmp-ref.html docs/interop/qemu-ga-ref.html
+	rm -f docs/qemu-block-drivers.7
 	for d in $(TARGET_DIRS); do \
 	rm -rf $$d || exit 1 ; \
        done
@@ -571,6 +655,7 @@ ifdef CONFIG_POSIX
 	$(INSTALL_DATA) qemu.1 "$(DESTDIR)$(mandir)/man1"
 	$(INSTALL_DIR) "$(DESTDIR)$(mandir)/man7"
 	$(INSTALL_DATA) docs/interop/qemu-qmp-ref.7 "$(DESTDIR)$(mandir)/man7"
+	$(INSTALL_DATA) docs/qemu-block-drivers.7 "$(DESTDIR)$(mandir)/man7"
 ifneq ($(TOOLS),)
 	$(INSTALL_DATA) qemu-img.1 "$(DESTDIR)$(mandir)/man1"
 	$(INSTALL_DIR) "$(DESTDIR)$(mandir)/man8"
@@ -663,8 +748,10 @@ ui/shader/%-frag.h: $(SRC_PATH)/ui/shader/%.frag $(SRC_PATH)/scripts/shaderinclu
 		perl $(SRC_PATH)/scripts/shaderinclude.pl $< > $@,\
 		"FRAG","$@")

-ui/console-gl.o: $(SRC_PATH)/ui/console-gl.c \
-	ui/shader/texture-blit-vert.h ui/shader/texture-blit-frag.h
+ui/shader.o: $(SRC_PATH)/ui/shader.c \
+	ui/shader/texture-blit-vert.h \
+	ui/shader/texture-blit-flip-vert.h \
+	ui/shader/texture-blit-frag.h

 # documentation
 MAKEINFO=makeinfo
@@ -716,6 +803,7 @@ qemu-img.1: qemu-img.texi qemu-option-trace.texi qemu-img-cmds.texi
 fsdev/virtfs-proxy-helper.1: fsdev/virtfs-proxy-helper.texi
 qemu-nbd.8: qemu-nbd.texi qemu-option-trace.texi
 qemu-ga.8: qemu-ga.texi
+docs/qemu-block-drivers.7: docs/qemu-block-drivers.texi

 html: qemu-doc.html docs/interop/qemu-qmp-ref.html docs/interop/qemu-ga-ref.html
 info: qemu-doc.info docs/interop/qemu-qmp-ref.info docs/interop/qemu-ga-ref.info
@@ -725,7 +813,7 @@ txt: qemu-doc.txt docs/interop/qemu-qmp-ref.txt docs/interop/qemu-ga-ref.txt
 qemu-doc.html qemu-doc.info qemu-doc.pdf qemu-doc.txt: \
 	qemu-img.texi qemu-nbd.texi qemu-options.texi qemu-option-trace.texi \
 	qemu-monitor.texi qemu-img-cmds.texi qemu-ga.texi \
-	qemu-monitor-info.texi
+	qemu-monitor-info.texi docs/qemu-block-drivers.texi

 docs/interop/qemu-ga-ref.dvi docs/interop/qemu-ga-ref.html \
    docs/interop/qemu-ga-ref.info docs/interop/qemu-ga-ref.pdf \
@@ -811,6 +899,7 @@ endif
 -include $(wildcard *.d tests/*.d)

 include $(SRC_PATH)/tests/docker/Makefile.include
+include $(SRC_PATH)/tests/vm/Makefile.include

 .PHONY: help
 help:
@@ -834,6 +923,7 @@ help:
 	@echo  'Test targets:'
 	@echo  '  check           - Run all tests (check-help for details)'
 	@echo  '  docker          - Help about targets running tests inside Docker containers'
+	@echo  '  vm-test         - Help about targets running tests inside VM'
 	@echo  ''
 	@echo  'Documentation targets:'
 	@echo  '  html info pdf txt'
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -62,7 +62,7 @@ bt-host.o-cflags := $(BLUEZ_CFLAGS)
 common-obj-y += dma-helpers.o
 common-obj-y += vl.o
 vl.o-cflags := $(GPROF_CFLAGS) $(SDL_CFLAGS)
-common-obj-y += tpm.o
+common-obj-$(CONFIG_TPM) += tpm.o

 common-obj-$(CONFIG_SLIRP) += slirp/

@@ -115,7 +115,6 @@ libvhost-user-obj-y = contrib/libvhost-user/
 vhost-user-scsi.o-cflags := $(LIBISCSI_CFLAGS)
 vhost-user-scsi.o-libs := $(LIBISCSI_LIBS)
 vhost-user-scsi-obj-y = contrib/vhost-user-scsi/
-vhost-user-scsi-obj-y += contrib/libvhost-user/libvhost-user.o

 ######################################################################
 trace-events-subdirs =
@@ -171,6 +170,7 @@ trace-events-subdirs += qapi
 trace-events-subdirs += accel/tcg
 trace-events-subdirs += accel/kvm
 trace-events-subdirs += nbd
+trace-events-subdirs += scsi

 trace-events-files = $(SRC_PATH)/trace-events $(trace-events-subdirs:%=$(SRC_PATH)/%/trace-events)

--- a/Makefile.target
+++ b/Makefile.target
@@ -102,12 +102,6 @@ obj-y += target/$(TARGET_BASE_ARCH)/
 obj-y += disas.o
 obj-$(call notempty,$(TARGET_XML_FILES)) += gdbstub-xml.o

-obj-$(CONFIG_LIBDECNUMBER) += libdecnumber/decContext.o
-obj-$(CONFIG_LIBDECNUMBER) += libdecnumber/decNumber.o
-obj-$(CONFIG_LIBDECNUMBER) += libdecnumber/dpd/decimal32.o
-obj-$(CONFIG_LIBDECNUMBER) += libdecnumber/dpd/decimal64.o
-obj-$(CONFIG_LIBDECNUMBER) += libdecnumber/dpd/decimal128.o
-
 #########################################################
 # Linux user emulator target

--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-2.10.50
+2.10.91
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -87,6 +87,7 @@ struct KVMState
 #endif
    int many_ioeventfds;
    int intx_set_mask;
+    bool sync_mmu;
    /* The man page (and posix) say ioctl numbers are signed int, but
     * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
     * unsigned, and treating them as signed here can break things */
@@ -196,26 +197,20 @@ static hwaddr kvm_align_section(MemoryRegionSection *section,
                                hwaddr *start)
 {
    hwaddr size = int128_get64(section->size);
-    hwaddr delta;
-
-    *start = section->offset_within_address_space;
+    hwaddr delta, aligned;

    /* kvm works in page size chunks, but the function may be called
       with sub-page size and unaligned start address. Pad the start
       address to next and truncate size to previous page boundary. */
-    delta = qemu_real_host_page_size - (*start & ~qemu_real_host_page_mask);
-    delta &= ~qemu_real_host_page_mask;
-    *start += delta;
+    aligned = ROUND_UP(section->offset_within_address_space,
+                       qemu_real_host_page_size);
+    delta = aligned - section->offset_within_address_space;
+    *start = aligned;
    if (delta > size) {
        return 0;
    }
-    size -= delta;
-    size &= qemu_real_host_page_mask;
-    if (*start & ~qemu_real_host_page_mask) {
-        return 0;
-    }

-    return size;
+    return (size - delta) & qemu_real_host_page_mask;
 }

 int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
@@ -393,8 +388,8 @@ static int kvm_section_update_flags(KVMMemoryListener *kml,

    mem = kvm_lookup_matching_slot(kml, start_addr, size);
    if (!mem) {
-        fprintf(stderr, "%s: error finding slot\n", __func__);
-        abort();
+        /* We don't have a slot if we want to trap every access. */
+        return 0;
    }

    return kvm_slot_update_flags(kml, mem, section->mr);
@@ -469,8 +464,8 @@ static int kvm_physical_sync_dirty_bitmap(KVMMemoryListener *kml,
    if (size) {
        mem = kvm_lookup_matching_slot(kml, start_addr, size);
        if (!mem) {
-            fprintf(stderr, "%s: error finding slot\n", __func__);
-            abort();
+            /* We don't have a slot if we want to trap every access. */
+            return 0;
        }

        /* XXX bad kernel interface alert
@@ -716,13 +711,13 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
        return;
    }

+    /* use aligned delta to align the ram address */
    ram = memory_region_get_ram_ptr(mr) + section->offset_within_region +
-          (section->offset_within_address_space - start_addr);
+          (start_addr - section->offset_within_address_space);

-    mem = kvm_lookup_matching_slot(kml, start_addr, size);
    if (!add) {
+        mem = kvm_lookup_matching_slot(kml, start_addr, size);
        if (!mem) {
-            g_assert(!memory_region_is_ram(mr) && !writeable && !mr->romd_mode);
            return;
        }
        if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
@@ -733,19 +728,13 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
        mem->memory_size = 0;
        err = kvm_set_user_memory_region(kml, mem);
        if (err) {
-            fprintf(stderr, "%s: error unregistering overlapping slot: %s\n",
+            fprintf(stderr, "%s: error unregistering slot: %s\n",
                    __func__, strerror(-err));
            abort();
        }
        return;
    }

-    if (mem) {
-        /* update the slot */
-        kvm_slot_update_flags(kml, mem, mr);
-        return;
-    }
-
    /* register the new slot */
    mem = kvm_alloc_slot(kml);
    mem->memory_size = size;
@@ -1440,7 +1429,7 @@ static void kvm_irqchip_create(MachineState *machine, KVMState *s)
 */
 static int kvm_recommended_vcpus(KVMState *s)
 {
-    int ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS);
+    int ret = kvm_vm_check_extension(s, KVM_CAP_NR_VCPUS);
    return (ret) ? ret : 4;
 }

@@ -1530,26 +1519,6 @@ static int kvm_init(MachineState *ms)
        s->nr_slots = 32;
    }

-    /* check the vcpu limits */
-    soft_vcpus_limit = kvm_recommended_vcpus(s);
-    hard_vcpus_limit = kvm_max_vcpus(s);
-
-    while (nc->name) {
-        if (nc->num > soft_vcpus_limit) {
-            warn_report("Number of %s cpus requested (%d) exceeds "
-                        "the recommended cpus supported by KVM (%d)",
-                        nc->name, nc->num, soft_vcpus_limit);
-
-            if (nc->num > hard_vcpus_limit) {
-                fprintf(stderr, "Number of %s cpus requested (%d) exceeds "
-                        "the maximum cpus supported by KVM (%d)\n",
-                        nc->name, nc->num, hard_vcpus_limit);
-                exit(1);
-            }
-        }
-        nc++;
-    }
-
    kvm_type = qemu_opt_get(qemu_get_machine_opts(), "kvm-type");
    if (mc->kvm_type) {
        type = mc->kvm_type(kvm_type);
@@ -1584,6 +1553,27 @@ static int kvm_init(MachineState *ms)
    }

    s->vmfd = ret;
+
+    /* check the vcpu limits */
+    soft_vcpus_limit = kvm_recommended_vcpus(s);
+    hard_vcpus_limit = kvm_max_vcpus(s);
+
+    while (nc->name) {
+        if (nc->num > soft_vcpus_limit) {
+            warn_report("Number of %s cpus requested (%d) exceeds "
+                        "the recommended cpus supported by KVM (%d)",
+                        nc->name, nc->num, soft_vcpus_limit);
+
+            if (nc->num > hard_vcpus_limit) {
+                fprintf(stderr, "Number of %s cpus requested (%d) exceeds "
+                        "the maximum cpus supported by KVM (%d)\n",
+                        nc->name, nc->num, hard_vcpus_limit);
+                exit(1);
+            }
+        }
+        nc++;
+    }
+
    missing_cap = kvm_check_extension_list(s, kvm_required_capabilites);
    if (!missing_cap) {
        missing_cap =
@@ -1665,6 +1655,8 @@ static int kvm_init(MachineState *ms)

    s->many_ioeventfds = kvm_check_many_ioeventfds();

+    s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
+
    return 0;

 err:
@@ -2131,10 +2123,9 @@ int kvm_device_access(int fd, int group, uint64_t attr,
    return err;
 }

-/* Return 1 on success, 0 on failure */
-int kvm_has_sync_mmu(void)
+bool kvm_has_sync_mmu(void)
 {
-    return kvm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
+    return kvm_state->sync_mmu;
 }

 int kvm_has_vcpu_events(void)
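
The alignment arithmetic in the reworked `kvm_align_section()` hunk above can be checked in isolation. The following is a minimal sketch, assuming a power-of-two page size: `round_up` and `align_section` are hypothetical stand-ins for `ROUND_UP` and `kvm_align_section`, plain `uint64_t` replaces `hwaddr`, and `~(page_size - 1)` plays the role of `qemu_real_host_page_mask`. The start address is padded up to the next page boundary and the remaining size is truncated down to a whole number of pages.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next multiple of d; d must be a power of two. */
static uint64_t round_up(uint64_t n, uint64_t d)
{
    return (n + d - 1) & ~(d - 1);
}

/* Hypothetical stand-in for kvm_align_section(): returns the usable,
 * page-aligned size and stores the aligned start address in *start. */
static uint64_t align_section(uint64_t offset, uint64_t size,
                              uint64_t page_size, uint64_t *start)
{
    uint64_t aligned, delta;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    aligned = round_up(offset, page_size);
    delta = aligned - offset;   /* bytes skipped to reach the boundary */
    *start = aligned;
    if (delta > size) {
        return 0;               /* section ends before the boundary */
    }

    /* Drop the padding, then truncate to a whole number of pages. */
    return (size - delta) & ~(page_size - 1);
}
```

With a 4 KiB page, a section at offset 0x1800 of size 0x3000 yields a start of 0x2000 and a size of 0x2000. The same `delta` (aligned start minus original offset) is what the `kvm_set_phys_mem()` hunk adds to the RAM pointer, so the host address stays in step with the aligned guest address.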
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -64,9 +64,9 @@ int kvm_cpu_exec(CPUState *cpu)
    abort();
 }

-int kvm_has_sync_mmu(void)
+bool kvm_has_sync_mmu(void)
 {
-    return 0;
+    return false;
 }

 int kvm_has_many_ioeventfds(void)
--- a/accel/tcg/atomic_template.h
+++ b/accel/tcg/atomic_template.h
@@ -62,7 +62,9 @@ ABI_TYPE ATOMIC_NAME(cmpxchg)(CPUArchState *env, target_ulong addr,
                              ABI_TYPE cmpv, ABI_TYPE newv EXTRA_ARGS)
 {
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-    return atomic_cmpxchg__nocheck(haddr, cmpv, newv);
+    DATA_TYPE ret = atomic_cmpxchg__nocheck(haddr, cmpv, newv);
+    ATOMIC_MMU_CLEANUP;
+    return ret;
 }

 #if DATA_SIZE >= 16
@@ -70,6 +72,7 @@ ABI_TYPE ATOMIC_NAME(ld)(CPUArchState *env, target_ulong addr EXTRA_ARGS)
 {
    DATA_TYPE val, *haddr = ATOMIC_MMU_LOOKUP;
    __atomic_load(haddr, &val, __ATOMIC_RELAXED);
+    ATOMIC_MMU_CLEANUP;
    return val;
 }

@@ -78,13 +81,16 @@ void ATOMIC_NAME(st)(CPUArchState *env, target_ulong addr,
 {
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
    __atomic_store(haddr, &val, __ATOMIC_RELAXED);
+    ATOMIC_MMU_CLEANUP;
 }
 #else
 ABI_TYPE ATOMIC_NAME(xchg)(CPUArchState *env, target_ulong addr,
                           ABI_TYPE val EXTRA_ARGS)
 {
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-    return atomic_xchg__nocheck(haddr, val);
+    DATA_TYPE ret = atomic_xchg__nocheck(haddr, val);
+    ATOMIC_MMU_CLEANUP;
+    return ret;
 }

 #define GEN_ATOMIC_HELPER(X)                                        \
@@ -92,8 +98,10 @@ ABI_TYPE ATOMIC_NAME(X)(CPUArchState *env, target_ulong addr,       \
                 ABI_TYPE val EXTRA_ARGS)                           \
 {                                                                   \
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;                           \
-    return atomic_##X(haddr, val);                                  \
-}                                                                   \
+    DATA_TYPE ret = atomic_##X(haddr, val);                         \
+    ATOMIC_MMU_CLEANUP;                                             \
+    return ret;                                                     \
+}

 GEN_ATOMIC_HELPER(fetch_add)
 GEN_ATOMIC_HELPER(fetch_and)
@@ -123,7 +131,9 @@ ABI_TYPE ATOMIC_NAME(cmpxchg)(CPUArchState *env, target_ulong addr,
                              ABI_TYPE cmpv, ABI_TYPE newv EXTRA_ARGS)
 {
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-    return BSWAP(atomic_cmpxchg__nocheck(haddr, BSWAP(cmpv), BSWAP(newv)));
+    DATA_TYPE ret = atomic_cmpxchg__nocheck(haddr, BSWAP(cmpv), BSWAP(newv));
+    ATOMIC_MMU_CLEANUP;
+    return BSWAP(ret);
 }

 #if DATA_SIZE >= 16
@@ -131,6 +141,7 @@ ABI_TYPE ATOMIC_NAME(ld)(CPUArchState *env, target_ulong addr EXTRA_ARGS)
 {
    DATA_TYPE val, *haddr = ATOMIC_MMU_LOOKUP;
    __atomic_load(haddr, &val, __ATOMIC_RELAXED);
+    ATOMIC_MMU_CLEANUP;
    return BSWAP(val);
 }

@@ -140,13 +151,16 @@ void ATOMIC_NAME(st)(CPUArchState *env, target_ulong addr,
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
    val = BSWAP(val);
    __atomic_store(haddr, &val, __ATOMIC_RELAXED);
+    ATOMIC_MMU_CLEANUP;
 }
 #else
 ABI_TYPE ATOMIC_NAME(xchg)(CPUArchState *env, target_ulong addr,
                           ABI_TYPE val EXTRA_ARGS)
 {
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;
-    return BSWAP(atomic_xchg__nocheck(haddr, BSWAP(val)));
+    ABI_TYPE ret = atomic_xchg__nocheck(haddr, BSWAP(val));
+    ATOMIC_MMU_CLEANUP;
+    return BSWAP(ret);
 }

 #define GEN_ATOMIC_HELPER(X)                                        \
@@ -154,7 +168,9 @@ ABI_TYPE ATOMIC_NAME(X)(CPUArchState *env, target_ulong addr,       \
                 ABI_TYPE val EXTRA_ARGS)                           \
 {                                                                   \
    DATA_TYPE *haddr = ATOMIC_MMU_LOOKUP;                           \
-    return BSWAP(atomic_##X(haddr, BSWAP(val)));                    \
+    DATA_TYPE ret = atomic_##X(haddr, BSWAP(val));                  \
+    ATOMIC_MMU_CLEANUP;                                             \
+    return BSWAP(ret);                                              \
 }

 GEN_ATOMIC_HELPER(fetch_and)
@@ -180,6 +196,7 @@ ABI_TYPE ATOMIC_NAME(fetch_add)(CPUArchState *env, target_ulong addr,
        sto = BSWAP(ret + val);
        ldn = atomic_cmpxchg__nocheck(haddr, ldo, sto);
        if (ldn == ldo) {
+            ATOMIC_MMU_CLEANUP;
            return ret;
        }
        ldo = ldn;
@@ -198,6 +215,7 @@ ABI_TYPE ATOMIC_NAME(add_fetch)(CPUArchState *env, target_ulong addr,
        sto = BSWAP(ret);
        ldn = atomic_cmpxchg__nocheck(haddr, ldo, sto);
        if (ldn == ldo) {
+            ATOMIC_MMU_CLEANUP;
            return ret;
        }
        ldo = ldn;
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -28,6 +28,7 @@
 #include "exec/address-spaces.h"
 #include "qemu/rcu.h"
 #include "exec/tb-hash.h"
+#include "exec/tb-lookup.h"
 #include "exec/log.h"
 #include "qemu/main-loop.h"
 #if defined(TARGET_I386) && !defined(CONFIG_USER_ONLY)
@@ -142,11 +143,11 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
    uintptr_t ret;
    TranslationBlock *last_tb;
    int tb_exit;
-    uint8_t *tb_ptr = itb->tc_ptr;
+    uint8_t *tb_ptr = itb->tc.ptr;

    qemu_log_mask_and_addr(CPU_LOG_EXEC, itb->pc,
                           "Trace %p [%d: " TARGET_FMT_lx "] %s\n",
-                           itb->tc_ptr, cpu->cpu_index, itb->pc,
+                           itb->tc.ptr, cpu->cpu_index, itb->pc,
                           lookup_symbol(itb->pc));

 #if defined(DEBUG_DISAS)
@@ -178,7 +179,7 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
        qemu_log_mask_and_addr(CPU_LOG_EXEC, last_tb->pc,
                               "Stopped execution of TB chain before %p ["
                               TARGET_FMT_lx "] %s\n",
-                               last_tb->tc_ptr, last_tb->pc,
+                               last_tb->tc.ptr, last_tb->pc,
                               lookup_symbol(last_tb->pc));
        if (cc->synchronize_from_tb) {
            cc->synchronize_from_tb(cpu, last_tb);
@@ -197,16 +198,19 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
                             TranslationBlock *orig_tb, bool ignore_icount)
 {
    TranslationBlock *tb;
+    uint32_t cflags = curr_cflags() | CF_NOCACHE;
+
+    if (ignore_icount) {
+        cflags &= ~CF_USE_ICOUNT;
+    }

    /* Should never happen.
       We only end up here when an existing TB is too long.  */
-    if (max_cycles > CF_COUNT_MASK)
-        max_cycles = CF_COUNT_MASK;
+    cflags |= MIN(max_cycles, CF_COUNT_MASK);

    tb_lock();
-    tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base, orig_tb->flags,
-                     max_cycles | CF_NOCACHE
-                         | (ignore_icount ? CF_IGNORE_ICOUNT : 0));
+    tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base,
+                     orig_tb->flags, cflags);
    tb->orig_tb = orig_tb;
    tb_unlock();

@@ -216,39 +220,45 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,

    tb_lock();
    tb_phys_invalidate(tb, -1);
-    tb_free(tb);
+    tb_remove(tb);
    tb_unlock();
 }
 #endif

-static void cpu_exec_step(CPUState *cpu)
+void cpu_exec_step_atomic(CPUState *cpu)
 {
    CPUClass *cc = CPU_GET_CLASS(cpu);
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
    TranslationBlock *tb;
    target_ulong cs_base, pc;
    uint32_t flags;
+    uint32_t cflags = 1;
+    uint32_t cf_mask = cflags & CF_HASH_MASK;
+    /* volatile because we modify it between setjmp and longjmp */
+    volatile bool in_exclusive_region = false;

-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
    if (sigsetjmp(cpu->jmp_env, 0) == 0) {
+        tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
+        if (tb == NULL) {
            mmap_lock();
            tb_lock();
-        tb = tb_gen_code(cpu, pc, cs_base, flags,
-                         1 | CF_NOCACHE | CF_IGNORE_ICOUNT);
-        tb->orig_tb = NULL;
+            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
+            if (likely(tb == NULL)) {
+                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
+            }
            tb_unlock();
            mmap_unlock();
+        }

+        start_exclusive();
+
+        /* Since we got here, we know that parallel_cpus must be true.  */
+        parallel_cpus = false;
+        in_exclusive_region = true;
        cc->cpu_exec_enter(cpu);
        /* execute the generated code */
-        trace_exec_tb_nocache(tb, pc);
+        trace_exec_tb(tb, pc);
        cpu_tb_exec(cpu, tb);
        cc->cpu_exec_exit(cpu);
-
-        tb_lock();
-        tb_phys_invalidate(tb, -1);
-        tb_free(tb);
-        tb_unlock();
    } else {
        /* We may have exited due to another problem here, so we need
         * to reset any tb_locks we may have taken but didn't release.
@@ -260,18 +270,15 @@ static void cpu_exec_step(CPUState *cpu)
 #endif
        tb_lock_reset();
    }
-}

-void cpu_exec_step_atomic(CPUState *cpu)
-{
-    start_exclusive();
-
-    /* Since we got here, we know that parallel_cpus must be true.  */
-    parallel_cpus = false;
-    cpu_exec_step(cpu);
+    if (in_exclusive_region) {
+        /* We might longjmp out of either the codegen or the
+         * execution, so must make sure we only end the exclusive
+         * region if we started it.
+         */
        parallel_cpus = true;
-
        end_exclusive();
+    }
 }

 struct tb_desc {
@@ -280,6 +287,7 @@ struct tb_desc {
    CPUArchState *env;
    tb_page_addr_t phys_page1;
    uint32_t flags;
+    uint32_t cf_mask;
    uint32_t trace_vcpu_dstate;
 };

@@ -293,7 +301,7 @@ static bool tb_cmp(const void *p, const void *d)
        tb->cs_base == desc->cs_base &&
        tb->flags == desc->flags &&
        tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
-        !atomic_read(&tb->invalid)) {
+        (tb_cflags(tb) & (CF_HASH_MASK | CF_INVALID)) == desc->cf_mask) {
        /* check next page if needed */
        if (tb->page_addr[1] == -1) {
            return true;
@@ -312,7 +320,8 @@ static bool tb_cmp(const void *p, const void *d)
 }

 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-                                   target_ulong cs_base, uint32_t flags)
+                                   target_ulong cs_base, uint32_t flags,
+                                   uint32_t cf_mask)
 {
    tb_page_addr_t phys_pc;
    struct tb_desc desc;
@@ -321,19 +330,20 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
    desc.env = (CPUArchState *)cpu->env_ptr;
    desc.cs_base = cs_base;
    desc.flags = flags;
+    desc.cf_mask = cf_mask;
    desc.trace_vcpu_dstate = *cpu->trace_dstate;
    desc.pc = pc;
    phys_pc = get_page_addr_code(desc.env, pc);
    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_hash_func(phys_pc, pc, flags, *cpu->trace_dstate);
-    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
+    h = tb_hash_func(phys_pc, pc, flags, cf_mask, *cpu->trace_dstate);
+    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
 }

 void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr)
 {
    if (TCG_TARGET_HAS_direct_jump) {
        uintptr_t offset = tb->jmp_target_arg[n];
-        uintptr_t tc_ptr = (uintptr_t)tb->tc_ptr;
+        uintptr_t tc_ptr = (uintptr_t)tb->tc.ptr;
        tb_target_set_jmp_target(tc_ptr, tc_ptr + offset, addr);
    } else {
        tb->jmp_target_arg[n] = addr;
@@ -353,11 +363,11 @@ static inline void tb_add_jump(TranslationBlock *tb, int n,
    qemu_log_mask_and_addr(CPU_LOG_EXEC, tb->pc,
                           "Linking TBs %p [" TARGET_FMT_lx
                           "] index %d -> %p [" TARGET_FMT_lx "]\n",
-                           tb->tc_ptr, tb->pc, n,
-                           tb_next->tc_ptr, tb_next->pc);
+                           tb->tc.ptr, tb->pc, n,
+                           tb_next->tc.ptr, tb_next->pc);

    /* patch the native jump address */
-    tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc_ptr);
+    tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);

    /* add in TB jmp circular list */
    tb->jmp_list_next[n] = tb_next->jmp_list_first;
@@ -366,45 +376,33 @@ static inline void tb_add_jump(TranslationBlock *tb, int n,

 static inline TranslationBlock *tb_find(CPUState *cpu,
                                        TranslationBlock *last_tb,
-                                        int tb_exit)
+                                        int tb_exit, uint32_t cf_mask)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
    TranslationBlock *tb;
    target_ulong cs_base, pc;
    uint32_t flags;
-    bool have_tb_lock = false;
-
-    /* we record a subset of the CPU state. It will
-       always be the same before a given translated block
-       is executed. */
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
-    if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
-                 tb->flags != flags ||
-                 tb->trace_vcpu_dstate != *cpu->trace_dstate)) {
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
-        if (!tb) {
+    bool acquired_tb_lock = false;

+    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
+    if (tb == NULL) {
        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
         * taken outside tb_lock. As system emulation is currently
         * single threaded the locks are NOPs.
         */
        mmap_lock();
        tb_lock();
-            have_tb_lock = true;
+        acquired_tb_lock = true;

        /* There's a chance that our desired tb has been translated while
         * taking the locks so we check again inside the lock.
         */
-            tb = tb_htable_lookup(cpu, pc, cs_base, flags);
-            if (!tb) {
+        tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
+        if (likely(tb == NULL)) {
            /* if no translated code available, then translate it now */
-                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+            tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
        }

        mmap_unlock();
-        }
-
        /* We add the TB in the virtual pc hash table for the fast lookup */
        atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
    }
@@ -419,15 +417,15 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
 #endif
    /* See if we can patch the calling TB. */
    if (last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        if (!have_tb_lock) {
+        if (!acquired_tb_lock) {
            tb_lock();
-            have_tb_lock = true;
+            acquired_tb_lock = true;
        }
-        if (!tb->invalid) {
+        if (!(tb->cflags & CF_INVALID)) {
            tb_add_jump(last_tb, tb_exit, tb);
        }
    }
-    if (have_tb_lock) {
+    if (acquired_tb_lock) {
        tb_unlock();
    }
    return tb;
@@ -472,7 +470,19 @@ static inline void cpu_handle_debug_exception(CPUState *cpu)

 static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
 {
-    if (cpu->exception_index >= 0) {
+    if (cpu->exception_index < 0) {
+#ifndef CONFIG_USER_ONLY
+        if (replay_has_exception()
+               && cpu->icount_decr.u16.low + cpu->icount_extra == 0) {
+            /* try to cause an exception pending in the log */
+            cpu_exec_nocache(cpu, 1, tb_find(cpu, NULL, 0, curr_cflags()), true);
+        }
+#endif
+        if (cpu->exception_index < 0) {
+            return false;
+        }
+    }
+
    if (cpu->exception_index >= EXCP_INTERRUPT) {
        /* exit request from the cpu execution loop */
        *ret = cpu->exception_index;
@@ -505,15 +515,6 @@ static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
            *ret = EXCP_INTERRUPT;
            return true;
        }
-#endif
-        }
-#ifndef CONFIG_USER_ONLY
-    } else if (replay_has_exception()
-               && cpu->icount_decr.u16.low + cpu->icount_extra == 0) {
-        /* try to cause an exception pending in the log */
-        cpu_exec_nocache(cpu, 1, tb_find(cpu, NULL, 0), true);
-        *ret = -1;
-        return true;
 #endif
    }

@@ -524,6 +525,19 @@ static inline bool cpu_handle_interrupt(CPUState *cpu,
                                        TranslationBlock **last_tb)
 {
    CPUClass *cc = CPU_GET_CLASS(cpu);
+    int32_t insns_left;
+
+    /* Clear the interrupt flag now since we're processing
+     * cpu->interrupt_request and cpu->exit_request.
+     */
+    insns_left = atomic_read(&cpu->icount_decr.u32);
+    atomic_set(&cpu->icount_decr.u16.high, 0);
+    if (unlikely(insns_left < 0)) {
+        /* Ensure the zeroing of icount_decr comes before the next read
+         * of cpu->exit_request or cpu->interrupt_request.
+         */
+        smp_mb();
+    }

    if (unlikely(atomic_read(&cpu->interrupt_request))) {
        int interrupt_request;
@@ -596,7 +610,9 @@ static inline bool cpu_handle_interrupt(CPUState *cpu,
    if (unlikely(atomic_read(&cpu->exit_request)
        || (use_icount && cpu->icount_decr.u16.low + cpu->icount_extra == 0))) {
        atomic_set(&cpu->exit_request, 0);
+        if (cpu->exception_index == -1) {
            cpu->exception_index = EXCP_INTERRUPT;
+        }
        return true;
    }

@@ -620,17 +636,14 @@ static inline void cpu_loop_exec_tb(CPUState *cpu, TranslationBlock *tb,

    *last_tb = NULL;
    insns_left = atomic_read(&cpu->icount_decr.u32);
-    atomic_set(&cpu->icount_decr.u16.high, 0);
    if (insns_left < 0) {
        /* Something asked us to stop executing chained TBs; just
         * continue round the main loop. Whatever requested the exit
         * will also have set something else (eg exit_request or
-         * interrupt_request) which we will handle next time around
-         * the loop.  But we need to ensure the zeroing of icount_decr
-         * comes before the next read of cpu->exit_request
-         * or cpu->interrupt_request.
+         * interrupt_request) which will be handled by
+         * cpu_handle_interrupt.  cpu_handle_interrupt will also
+         * clear cpu->icount_decr.u16.high.
         */
-        smp_mb();
        return;
    }

@@ -707,7 +720,21 @@ int cpu_exec(CPUState *cpu)
        int tb_exit = 0;

        while (!cpu_handle_interrupt(cpu, &last_tb)) {
-            TranslationBlock *tb = tb_find(cpu, last_tb, tb_exit);
+            uint32_t cflags = cpu->cflags_next_tb;
+            TranslationBlock *tb;
+
+            /* When requested, use an exact setting for cflags for the next
+               execution.  This is used for icount, precise smc, and stop-
+               after-access watchpoints.  Since this request should never
+               have CF_INVALID set, -1 is a convenient invalid value that
+               does not require tcg headers for cpu_common_reset.  */
+            if (cflags == -1) {
+                cflags = curr_cflags();
+            } else {
+                cpu->cflags_next_tb = -1;
+            }
+
+            tb = tb_find(cpu, last_tb, tb_exit, cflags);
            cpu_loop_exec_tb(cpu, tb, &last_tb, &tb_exit);
            /* Try to align the host and virtual clocks
               if the guest is in advance */
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -92,8 +92,18 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
    }
 }

-/* statistics */
-int tlb_flush_count;
+size_t tlb_flush_count(void)
+{
+    CPUState *cpu;
+    size_t count = 0;
+
+    CPU_FOREACH(cpu) {
+        CPUArchState *env = cpu->env_ptr;
+
+        count += atomic_read(&env->tlb_flush_count);
+    }
+    return count;
+}

 /* This is OK because CPU architectures generally permit an
 * implementation to drop entries from the TLB at any time, so
@@ -112,7 +122,8 @@ static void tlb_flush_nocheck(CPUState *cpu)
    }

    assert_cpu_is_self(cpu);
-    tlb_debug("(count: %d)\n", tlb_flush_count++);
+    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
+    tlb_debug("(count: %zu)\n", tlb_flush_count());

    tb_lock();

@@ -683,6 +694,9 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
        } else {
            tn.addr_write = address;
        }
+        if (prot & PAGE_WRITE_INV) {
+            tn.addr_write |= TLB_INVALID_MASK;
+        }
    }

    /* Pairs with flag setting in tlb_reset_dirty_range */
@@ -765,7 +779,7 @@ static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry,

    cpu->mem_io_vaddr = addr;

-    if (mr->global_locking) {
+    if (mr->global_locking && !qemu_mutex_iothread_locked()) {
        qemu_mutex_lock_iothread();
        locked = true;
    }
@@ -800,7 +814,7 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
    cpu->mem_io_vaddr = addr;
    cpu->mem_io_pc = retaddr;

-    if (mr->global_locking) {
+    if (mr->global_locking && !qemu_mutex_iothread_locked()) {
        qemu_mutex_lock_iothread();
        locked = true;
    }
@@ -967,7 +981,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
        if (!VICTIM_TLB_HIT(addr_write, addr)) {
            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
        }
-        tlb_addr = tlbe->addr_write;
+        tlb_addr = tlbe->addr_write & ~TLB_INVALID_MASK;
    }

    /* Check notdirty */
@@ -1027,6 +1041,7 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
 #define ATOMIC_NAME(X) \
    HELPER(glue(glue(glue(atomic_ ## X, SUFFIX), END), _mmu))
 #define ATOMIC_MMU_LOOKUP  atomic_mmu_lookup(env, addr, oi, retaddr)
+#define ATOMIC_MMU_CLEANUP do { } while (0)

 #define DATA_SIZE 1
 #include "atomic_template.h"
--- a/accel/tcg/softmmu_template.h
+++ b/accel/tcg/softmmu_template.h
@@ -285,7 +285,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
        if (!VICTIM_TLB_HIT(addr_write, addr)) {
            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
        }
-        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write & ~TLB_INVALID_MASK;
    }

    /* Handle an IO access.  */
@@ -361,7 +361,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
        if (!VICTIM_TLB_HIT(addr_write, addr)) {
            tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE, mmu_idx, retaddr);
        }
-        tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
+        tlb_addr = env->tlb_table[mmu_idx][index].addr_write & ~TLB_INVALID_MASK;
    }

    /* Handle an IO access.  */
--- a/accel/tcg/tcg-runtime.c
+++ b/accel/tcg/tcg-runtime.c
@@ -27,7 +27,7 @@
 #include "exec/helper-proto.h"
 #include "exec/cpu_ldst.h"
 #include "exec/exec-all.h"
-#include "exec/tb-hash.h"
+#include "exec/tb-lookup.h"
 #include "disas/disas.h"
 #include "exec/log.h"

@@ -144,34 +144,22 @@ uint64_t HELPER(ctpop_i64)(uint64_t arg)
    return ctpop64(arg);
 }

-void *HELPER(lookup_tb_ptr)(CPUArchState *env, target_ulong addr)
+void *HELPER(lookup_tb_ptr)(CPUArchState *env)
 {
    CPUState *cpu = ENV_GET_CPU(env);
    TranslationBlock *tb;
    target_ulong cs_base, pc;
-    uint32_t flags, addr_hash;
+    uint32_t flags;

-    addr_hash = tb_jmp_cache_hash_func(addr);
-    tb = atomic_rcu_read(&cpu->tb_jmp_cache[addr_hash]);
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-
-    if (unlikely(!(tb
-                   && tb->pc == addr
-                   && tb->cs_base == cs_base
-                   && tb->flags == flags
-                   && tb->trace_vcpu_dstate == *cpu->trace_dstate))) {
-        tb = tb_htable_lookup(cpu, addr, cs_base, flags);
-        if (!tb) {
-            return tcg_ctx.code_gen_epilogue;
+    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, curr_cflags());
+    if (tb == NULL) {
+        return tcg_ctx->code_gen_epilogue;
    }
-        atomic_set(&cpu->tb_jmp_cache[addr_hash], tb);
-    }
-
-    qemu_log_mask_and_addr(CPU_LOG_EXEC, addr,
+    qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
                           "Chain %p [%d: " TARGET_FMT_lx "] %s\n",
-                           tb->tc_ptr, cpu->cpu_index, addr,
-                           lookup_symbol(addr));
-    return tb->tc_ptr;
+                           tb->tc.ptr, cpu->cpu_index, pc,
+                           lookup_symbol(pc));
+    return tb->tc.ptr;
 }

 void HELPER(exit_atomic)(CPUArchState *env)
--- a/accel/tcg/tcg-runtime.h
+++ b/accel/tcg/tcg-runtime.h
@@ -24,7 +24,7 @@ DEF_HELPER_FLAGS_1(clrsb_i64, TCG_CALL_NO_RWG_SE, i64, i64)
 DEF_HELPER_FLAGS_1(ctpop_i32, TCG_CALL_NO_RWG_SE, i32, i32)
 DEF_HELPER_FLAGS_1(ctpop_i64, TCG_CALL_NO_RWG_SE, i64, i64)

-DEF_HELPER_FLAGS_2(lookup_tb_ptr, TCG_CALL_NO_WG_SE, ptr, env, tl)
+DEF_HELPER_FLAGS_1(lookup_tb_ptr, TCG_CALL_NO_WG_SE, ptr, env)

 DEF_HELPER_FLAGS_1(exit_atomic, TCG_CALL_NO_WG, noreturn, env)

--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
--- a/accel/tcg/translator.c
+++ b/accel/tcg/translator.c
@@ -45,7 +45,7 @@ void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
    db->singlestep_enabled = cpu->singlestep_enabled;

    /* Instruction counting */
-    max_insns = db->tb->cflags & CF_COUNT_MASK;
+    max_insns = tb_cflags(db->tb) & CF_COUNT_MASK;
    if (max_insns == 0) {
        max_insns = CF_COUNT_MASK;
    }
@@ -95,7 +95,7 @@ void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
           update db->pc_next and db->is_jmp to indicate what should be
           done next -- either exiting this loop or locate the start of
           the next instruction.  */
-        if (db->num_insns == max_insns && (db->tb->cflags & CF_LAST_IO)) {
+        if (db->num_insns == max_insns && (tb_cflags(db->tb) & CF_LAST_IO)) {
            /* Accept I/O on the last instruction.  */
            gen_io_start();
            ops->translate_insn(db, cpu);
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -39,6 +39,8 @@
 #include <sys/ucontext.h>
 #endif

+__thread uintptr_t helper_retaddr;
+
 //#define DEBUG_SIGNAL

 /* exit the current TB from a signal handler. The host registers are
@@ -62,6 +64,27 @@ static inline int handle_cpu_signal(uintptr_t pc, unsigned long address,
    CPUClass *cc;
    int ret;

+    /* We must handle PC addresses from two different sources:
+     * a call return address and a signal frame address.
+     *
+     * Within cpu_restore_state_from_tb we assume the former and adjust
+     * the address by -GETPC_ADJ so that the address is within the call
+     * insn so that addr does not accidentally match the beginning of the
+     * next guest insn.
+     *
+     * However, when the PC comes from the signal frame, it points to
+     * the actual faulting host insn and not a call insn.  Subtracting
+     * GETPC_ADJ in that case may accidentally match the previous guest insn.
+     *
+     * So for the latter case, adjust forward to compensate for what
+     * will be done later by cpu_restore_state_from_tb.
+     */
+    if (helper_retaddr) {
+        pc = helper_retaddr;
+    } else {
+        pc += GETPC_ADJ;
+    }
+
    /* For synchronous signals we expect to be coming from the vCPU
     * thread (so current_cpu should be valid) and either from running
     * code or during translation which can fault as we cross pages.
@@ -84,21 +107,24 @@ static inline int handle_cpu_signal(uintptr_t pc, unsigned long address,
        switch (page_unprotect(h2g(address), pc)) {
        case 0:
            /* Fault not caused by a page marked unwritable to protect
-             * cached translations, must be the guest binary's problem
+             * cached translations, must be the guest binary's problem.
             */
            break;
        case 1:
            /* Fault caused by protection of cached translation; TBs
-             * invalidated, so resume execution
+             * invalidated, so resume execution.  Retain helper_retaddr
+             * for a possible second fault.
             */
            return 1;
        case 2:
            /* Fault caused by protection of cached translation, and the
             * currently executing TB was modified and must be exited
-             * immediately.
+             * immediately.  Clear helper_retaddr for next execution.
             */
+            helper_retaddr = 0;
            cpu_exit_tb_from_sighandler(cpu, old_set);
-            g_assert_not_reached();
+            /* NORETURN */
+
        default:
            g_assert_not_reached();
        }
@@ -112,17 +138,25 @@ static inline int handle_cpu_signal(uintptr_t pc, unsigned long address,
    /* see if it is an MMU fault */
    g_assert(cc->handle_mmu_fault);
    ret = cc->handle_mmu_fault(cpu, address, is_write, MMU_USER_IDX);
+
+    if (ret == 0) {
+        /* The MMU fault was handled without causing a real CPU fault.
+         * Retain helper_retaddr for a possible second fault.
+         */
+        return 1;
+    }
+
+    /* All other paths lead to cpu_exit; clear helper_retaddr
+     * for next execution.
+     */
+    helper_retaddr = 0;
+
    if (ret < 0) {
        return 0; /* not an MMU fault */
    }
-    if (ret == 0) {
-        return 1; /* the MMU fault was handled without causing real CPU fault */
-    }

-    /* Now we have a real cpu fault.  Since this is the exact location of
-     * the exception, we must undo the adjustment done by cpu_restore_state
-     * for handling call return addresses.  */
-    cpu_restore_state(cpu, pc + GETPC_ADJ);
+    /* Now we have a real cpu fault.  */
+    cpu_restore_state(cpu, pc);

    sigprocmask(SIG_SETMASK, old_set, NULL);
    cpu_loop_exit(cpu);
@@ -585,11 +619,13 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
    if (unlikely(addr & (size - 1))) {
        cpu_loop_exit_atomic(ENV_GET_CPU(env), retaddr);
    }
+    helper_retaddr = retaddr;
    return g2h(addr);
 }

 /* Macro to call the above, with local variables from the use context.  */
 #define ATOMIC_MMU_LOOKUP  atomic_mmu_lookup(env, addr, DATA_SIZE, GETPC())
+#define ATOMIC_MMU_CLEANUP do { helper_retaddr = 0; } while (0)

 #define ATOMIC_NAME(X)   HELPER(glue(glue(atomic_ ## X, SUFFIX), END))
 #define EXTRA_ARGS
--- a/audio/Makefile.objs
+++ b/audio/Makefile.objs
@@ -11,3 +11,9 @@ common-obj-$(CONFIG_AUDIO_WIN_INT) += audio_win_int.o
 common-obj-y += wavcapture.o

 sdlaudio.o-cflags := $(SDL_CFLAGS)
+sdlaudio.o-libs := $(SDL_LIBS)
+alsaaudio.o-libs := $(ALSA_LIBS)
+paaudio.o-libs := $(PULSE_LIBS)
+coreaudio.o-libs := $(COREAUDIO_LIBS)
+dsoundaudio.o-libs := $(DSOUND_LIBS)
+ossaudio.o-libs := $(OSS_LIBS)
--- a/backends/tpm.c
+++ b/backends/tpm.c
@@ -17,99 +17,128 @@
 #include "qapi/error.h"
 #include "qapi/qmp/qerror.h"
 #include "sysemu/tpm.h"
+#include "hw/tpm/tpm_int.h"
 #include "qemu/thread.h"
-#include "sysemu/tpm_backend_int.h"
+
+static void tpm_backend_worker_thread(gpointer data, gpointer user_data)
+{
+    TPMBackend *s = TPM_BACKEND(user_data);
+    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
+
+    assert(k->handle_request != NULL);
+    k->handle_request(s, (TPMBackendCmd *)data);
+}
+
+static void tpm_backend_thread_end(TPMBackend *s)
+{
+    if (s->thread_pool) {
+        g_thread_pool_free(s->thread_pool, FALSE, TRUE);
+        s->thread_pool = NULL;
+    }
+}

 enum TpmType tpm_backend_get_type(TPMBackend *s)
 {
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    return k->ops->type;
+    return k->type;
 }

-const char *tpm_backend_get_desc(TPMBackend *s)
+int tpm_backend_init(TPMBackend *s, TPMState *state)
 {
-    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
+    s->tpm_state = state;
+    s->had_startup_error = false;

-    return k->ops->desc();
-}
-
-void tpm_backend_destroy(TPMBackend *s)
-{
-    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
-
-    k->ops->destroy(s);
-}
-
-int tpm_backend_init(TPMBackend *s, TPMState *state,
-                     TPMRecvDataCB *datacb)
-{
-    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
-
-    return k->ops->init(s, state, datacb);
+    return 0;
 }

 int tpm_backend_startup_tpm(TPMBackend *s)
 {
+    int res = 0;
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    return k->ops->startup_tpm(s);
+    /* terminate a running TPM */
+    tpm_backend_thread_end(s);
+
+    s->thread_pool = g_thread_pool_new(tpm_backend_worker_thread, s, 1, TRUE,
+                                       NULL);
+
+    res = k->startup_tpm ? k->startup_tpm(s) : 0;
+
+    s->had_startup_error = (res != 0);
+
+    return res;
 }

 bool tpm_backend_had_startup_error(TPMBackend *s)
 {
-    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
-
-    return k->ops->had_startup_error(s);
+    return s->had_startup_error;
 }

-size_t tpm_backend_realloc_buffer(TPMBackend *s, TPMSizedBuffer *sb)
+void tpm_backend_deliver_request(TPMBackend *s, TPMBackendCmd *cmd)
 {
-    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
-
-    return k->ops->realloc_buffer(sb);
-}
-
-void tpm_backend_deliver_request(TPMBackend *s)
-{
-    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
-
-    k->ops->deliver_request(s);
+    g_thread_pool_push(s->thread_pool, cmd, NULL);
 }

 void tpm_backend_reset(TPMBackend *s)
 {
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    k->ops->reset(s);
+    if (k->reset) {
+        k->reset(s);
+    }
+
+    tpm_backend_thread_end(s);
+
+    s->had_startup_error = false;
 }

 void tpm_backend_cancel_cmd(TPMBackend *s)
 {
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    k->ops->cancel_cmd(s);
+    assert(k->cancel_cmd);
+
+    k->cancel_cmd(s);
 }

 bool tpm_backend_get_tpm_established_flag(TPMBackend *s)
 {
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    return k->ops->get_tpm_established_flag(s);
+    return k->get_tpm_established_flag ?
+           k->get_tpm_established_flag(s) : false;
 }

 int tpm_backend_reset_tpm_established_flag(TPMBackend *s, uint8_t locty)
 {
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    return k->ops->reset_tpm_established_flag(s, locty);
+    return k->reset_tpm_established_flag ?
+           k->reset_tpm_established_flag(s, locty) : 0;
 }

 TPMVersion tpm_backend_get_tpm_version(TPMBackend *s)
 {
    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);

-    return k->ops->get_tpm_version(s);
+    assert(k->get_tpm_version);
+
+    return k->get_tpm_version(s);
+}
+
+TPMInfo *tpm_backend_query_tpm(TPMBackend *s)
+{
+    TPMInfo *info = g_new0(TPMInfo, 1);
+    TPMBackendClass *k = TPM_BACKEND_GET_CLASS(s);
+
+    info->id = g_strdup(s->id);
+    info->model = s->fe_model;
+    if (k->get_tpm_options) {
+        info->options = k->get_tpm_options(s);
+    }
+
+    return info;
 }

 static bool tpm_backend_prop_get_opened(Object *obj, Error **errp)
@@ -152,33 +181,21 @@ static void tpm_backend_prop_set_opened(Object *obj, bool value, Error **errp)

 static void tpm_backend_instance_init(Object *obj)
 {
+    TPMBackend *s = TPM_BACKEND(obj);
+
    object_property_add_bool(obj, "opened",
                             tpm_backend_prop_get_opened,
                             tpm_backend_prop_set_opened,
                             NULL);
+    s->fe_model = -1;
 }

-void tpm_backend_thread_deliver_request(TPMBackendThread *tbt)
+static void tpm_backend_instance_finalize(Object *obj)
 {
-   g_thread_pool_push(tbt->pool, (gpointer)TPM_BACKEND_CMD_PROCESS_CMD, NULL);
-}
+    TPMBackend *s = TPM_BACKEND(obj);

-void tpm_backend_thread_create(TPMBackendThread *tbt,
-                               GFunc func, gpointer user_data)
-{
-    if (!tbt->pool) {
-        tbt->pool = g_thread_pool_new(func, user_data, 1, TRUE, NULL);
-        g_thread_pool_push(tbt->pool, (gpointer)TPM_BACKEND_CMD_INIT, NULL);
-    }
-}
-
-void tpm_backend_thread_end(TPMBackendThread *tbt)
-{
-    if (tbt->pool) {
-        g_thread_pool_push(tbt->pool, (gpointer)TPM_BACKEND_CMD_END, NULL);
-        g_thread_pool_free(tbt->pool, FALSE, TRUE);
-        tbt->pool = NULL;
-    }
+    g_free(s->id);
+    tpm_backend_thread_end(s);
 }

 static const TypeInfo tpm_backend_info = {
@@ -186,13 +203,21 @@ static const TypeInfo tpm_backend_info = {
    .parent = TYPE_OBJECT,
    .instance_size = sizeof(TPMBackend),
    .instance_init = tpm_backend_instance_init,
+    .instance_finalize = tpm_backend_instance_finalize,
    .class_size = sizeof(TPMBackendClass),
    .abstract = true,
 };

+static const TypeInfo tpm_if_info = {
+    .name = TYPE_TPM_IF,
+    .parent = TYPE_INTERFACE,
+    .class_size = sizeof(TPMIfClass),
+};
+
 static void register_types(void)
 {
    type_register_static(&tpm_backend_info);
+    type_register_static(&tpm_if_info);
 }

 type_init(register_types);
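
Reviewer note on the TPM backend hunks above: the wrappers now treat most class callbacks as optional and substitute a default when one is missing (`k->startup_tpm ? k->startup_tpm(s) : 0`), while `cancel_cmd` and `get_tpm_version` stay mandatory and are guarded by `assert()`. The following is a minimal sketch of that dispatch pattern only. `Backend`, `BackendClass`, and the `example_*` callbacks are hypothetical stand-ins, not QEMU's types, and the thread-pool handling is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for TPMBackend / TPMBackendClass (not QEMU's types). */
typedef struct Backend Backend;

typedef struct BackendClass {
    int  (*startup)(Backend *s);               /* optional, default: 0     */
    bool (*get_established_flag)(Backend *s);  /* optional, default: false */
    void (*cancel_cmd)(Backend *s);            /* mandatory                */
} BackendClass;

struct Backend {
    const BackendClass *klass;
    bool had_startup_error;
    int cancel_count;
};

/* Optional hook: a missing callback counts as success. */
int backend_startup(Backend *s)
{
    const BackendClass *k = s->klass;
    int res = k->startup ? k->startup(s) : 0;

    /* The error state is cached on the instance, as in the patch. */
    s->had_startup_error = (res != 0);
    return res;
}

/* Optional hook with a fixed default value. */
bool backend_get_established_flag(Backend *s)
{
    const BackendClass *k = s->klass;

    return k->get_established_flag ? k->get_established_flag(s) : false;
}

/* Mandatory hook: a missing callback is a programming error. */
void backend_cancel_cmd(Backend *s)
{
    const BackendClass *k = s->klass;

    assert(k->cancel_cmd);
    k->cancel_cmd(s);
}

/* Example callbacks for a hypothetical backend. */
int example_failing_startup(Backend *s)
{
    (void)s;
    return -1;
}

void example_cancel(Backend *s)
{
    s->cancel_count++;
}
```

A class that only fills in `cancel_cmd` can then be started and queried without further NULL checks at the call sites, which appears to be the point of moving the defaults into the wrappers.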
--- a/block.c
+++ b/block.c
@@ -239,12 +239,6 @@ bool bdrv_is_read_only(BlockDriverState *bs)
    return bs->read_only;
 }

-/* Returns whether the image file can be written to right now */
-bool bdrv_is_writable(BlockDriverState *bs)
-{
-    return !bdrv_is_read_only(bs) && !(bs->open_flags & BDRV_O_INACTIVE);
-}
-
 int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
                           bool ignore_allow_rdw, Error **errp)
 {
@@ -267,6 +261,11 @@ int bdrv_can_set_read_only(BlockDriverState *bs, bool read_only,
    return 0;
 }

+/* TODO Remove (deprecated since 2.11)
+ * Block drivers are not supposed to automatically change bs->read_only.
+ * Instead, they should just check whether they can provide what the user
+ * explicitly requested and error out if read-write is requested, but they can
+ * only provide read-only access. */
 int bdrv_set_read_only(BlockDriverState *bs, bool read_only, Error **errp)
 {
    int ret = 0;
@@ -721,6 +720,10 @@ static int refresh_total_sectors(BlockDriverState *bs, int64_t hint)
 {
    BlockDriver *drv = bs->drv;

+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+
    /* Do not attempt drv->bdrv_getlength() on scsi-generic devices */
    if (bdrv_is_sg(bs))
        return 0;
@@ -987,6 +990,33 @@ static void bdrv_backing_options(int *child_flags, QDict *child_options,
    *child_flags = flags;
 }

+static int bdrv_backing_update_filename(BdrvChild *c, BlockDriverState *base,
+                                        const char *filename, Error **errp)
+{
+    BlockDriverState *parent = c->opaque;
+    int orig_flags = bdrv_get_flags(parent);
+    int ret;
+
+    if (!(orig_flags & BDRV_O_RDWR)) {
+        ret = bdrv_reopen(parent, orig_flags | BDRV_O_RDWR, errp);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    ret = bdrv_change_backing_file(parent, filename,
+                                   base->drv ? base->drv->format_name : "");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Could not update backing file link");
+    }
+
+    if (!(orig_flags & BDRV_O_RDWR)) {
+        bdrv_reopen(parent, orig_flags, NULL);
+    }
+
+    return ret;
+}
+
 const BdrvChildRole child_backing = {
    .get_parent_desc = bdrv_child_get_parent_desc,
    .attach          = bdrv_backing_attach,
@@ -995,6 +1025,7 @@ const BdrvChildRole child_backing = {
    .drained_begin   = bdrv_child_cb_drained_begin,
    .drained_end     = bdrv_child_cb_drained_end,
    .inactivate      = bdrv_child_cb_inactivate,
+    .update_filename = bdrv_backing_update_filename,
 };

 static int bdrv_open_flags(BlockDriverState *bs, int flags)
@@ -1531,22 +1562,59 @@ static int bdrv_fill_options(QDict **options, const char *filename,
    return 0;
 }

-static int bdrv_child_check_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
+static int bdrv_child_check_perm(BdrvChild *c, BlockReopenQueue *q,
+                                 uint64_t perm, uint64_t shared,
                                 GSList *ignore_children, Error **errp);
 static void bdrv_child_abort_perm_update(BdrvChild *c);
 static void bdrv_child_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared);

+typedef struct BlockReopenQueueEntry {
+     bool prepared;
+     BDRVReopenState state;
+     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
+} BlockReopenQueueEntry;
+
+/*
+ * Return the flags that @bs will have after the reopens in @q have
+ * successfully completed. If @q is NULL (or @bs is not contained in @q),
+ * return the current flags.
+ */
+static int bdrv_reopen_get_flags(BlockReopenQueue *q, BlockDriverState *bs)
+{
+    BlockReopenQueueEntry *entry;
+
+    if (q != NULL) {
+        QSIMPLEQ_FOREACH(entry, q, entry) {
+            if (entry->state.bs == bs) {
+                return entry->state.flags;
+            }
+        }
+    }
+
+    return bs->open_flags;
+}
+
+/* Returns whether the image file can be written to after the reopen queue @q
+ * has been successfully applied, or right now if @q is NULL. */
+static bool bdrv_is_writable(BlockDriverState *bs, BlockReopenQueue *q)
+{
+    int flags = bdrv_reopen_get_flags(q, bs);
+
+    return (flags & (BDRV_O_RDWR | BDRV_O_INACTIVE)) == BDRV_O_RDWR;
+}
+
 static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
-                            BdrvChild *c,
-                            const BdrvChildRole *role,
+                            BdrvChild *c, const BdrvChildRole *role,
+                            BlockReopenQueue *reopen_queue,
                            uint64_t parent_perm, uint64_t parent_shared,
                            uint64_t *nperm, uint64_t *nshared)
 {
    if (bs->drv && bs->drv->bdrv_child_perm) {
-        bs->drv->bdrv_child_perm(bs, c, role,
+        bs->drv->bdrv_child_perm(bs, c, role, reopen_queue,
                                 parent_perm, parent_shared,
                                 nperm, nshared);
    }
+    /* TODO Take force_share from reopen_queue */
    if (child_bs && child_bs->force_share) {
        *nshared = BLK_PERM_ALL;
    }
@@ -1561,7 +1629,8 @@ static void bdrv_child_perm(BlockDriverState *bs, BlockDriverState *child_bs,
 * A call to this function must always be followed by a call to bdrv_set_perm()
 * or bdrv_abort_perm_update().
 */
-static int bdrv_check_perm(BlockDriverState *bs, uint64_t cumulative_perms,
+static int bdrv_check_perm(BlockDriverState *bs, BlockReopenQueue *q,
+                           uint64_t cumulative_perms,
                           uint64_t cumulative_shared_perms,
                           GSList *ignore_children, Error **errp)
 {
@@ -1571,7 +1640,7 @@ static int bdrv_check_perm(BlockDriverState *bs, uint64_t cumulative_perms,

    /* Write permissions never work with read-only images */
    if ((cumulative_perms & (BLK_PERM_WRITE | BLK_PERM_WRITE_UNCHANGED)) &&
-        !bdrv_is_writable(bs))
+        !bdrv_is_writable(bs, q))
    {
        error_setg(errp, "Block node is read-only");
        return -EPERM;
@@ -1596,11 +1665,11 @@ static int bdrv_check_perm(BlockDriverState *bs, uint64_t cumulative_perms,
    /* Check all children */
    QLIST_FOREACH(c, &bs->children, next) {
        uint64_t cur_perm, cur_shared;
-        bdrv_child_perm(bs, c->bs, c, c->role,
+        bdrv_child_perm(bs, c->bs, c, c->role, q,
                        cumulative_perms, cumulative_shared_perms,
                        &cur_perm, &cur_shared);
-        ret = bdrv_child_check_perm(c, cur_perm, cur_shared, ignore_children,
-                                    errp);
+        ret = bdrv_child_check_perm(c, q, cur_perm, cur_shared,
+                                    ignore_children, errp);
        if (ret < 0) {
            return ret;
        }
@@ -1658,7 +1727,7 @@ static void bdrv_set_perm(BlockDriverState *bs, uint64_t cumulative_perms,
    /* Update all children */
    QLIST_FOREACH(c, &bs->children, next) {
        uint64_t cur_perm, cur_shared;
-        bdrv_child_perm(bs, c->bs, c, c->role,
+        bdrv_child_perm(bs, c->bs, c, c->role, NULL,
                        cumulative_perms, cumulative_shared_perms,
                        &cur_perm, &cur_shared);
        bdrv_child_set_perm(c, cur_perm, cur_shared);
@@ -1726,7 +1795,8 @@ char *bdrv_perm_names(uint64_t perm)
 *
 * Needs to be followed by a call to either bdrv_set_perm() or
 * bdrv_abort_perm_update(). */
-static int bdrv_check_update_perm(BlockDriverState *bs, uint64_t new_used_perm,
+static int bdrv_check_update_perm(BlockDriverState *bs, BlockReopenQueue *q,
+                                  uint64_t new_used_perm,
                                  uint64_t new_shared_perm,
                                  GSList *ignore_children, Error **errp)
 {
@@ -1768,19 +1838,20 @@ static int bdrv_check_update_perm(BlockDriverState *bs, uint64_t new_used_perm,
        cumulative_shared_perms &= c->shared_perm;
    }

-    return bdrv_check_perm(bs, cumulative_perms, cumulative_shared_perms,
+    return bdrv_check_perm(bs, q, cumulative_perms, cumulative_shared_perms,
                           ignore_children, errp);
 }

 /* Needs to be followed by a call to either bdrv_child_set_perm() or
 * bdrv_child_abort_perm_update(). */
-static int bdrv_child_check_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
+static int bdrv_child_check_perm(BdrvChild *c, BlockReopenQueue *q,
+                                 uint64_t perm, uint64_t shared,
                                 GSList *ignore_children, Error **errp)
 {
    int ret;

    ignore_children = g_slist_prepend(g_slist_copy(ignore_children), c);
-    ret = bdrv_check_update_perm(c->bs, perm, shared, ignore_children, errp);
+    ret = bdrv_check_update_perm(c->bs, q, perm, shared, ignore_children, errp);
    g_slist_free(ignore_children);

    return ret;
@@ -1808,7 +1879,7 @@ int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,
 {
    int ret;

-    ret = bdrv_child_check_perm(c, perm, shared, NULL, errp);
+    ret = bdrv_child_check_perm(c, NULL, perm, shared, NULL, errp);
    if (ret < 0) {
        bdrv_child_abort_perm_update(c);
        return ret;
@@ -1827,6 +1898,7 @@ int bdrv_child_try_set_perm(BdrvChild *c, uint64_t perm, uint64_t shared,

 void bdrv_filter_default_perms(BlockDriverState *bs, BdrvChild *c,
                               const BdrvChildRole *role,
+                               BlockReopenQueue *reopen_queue,
                               uint64_t perm, uint64_t shared,
                               uint64_t *nperm, uint64_t *nshared)
 {
@@ -1844,6 +1916,7 @@ void bdrv_filter_default_perms(BlockDriverState *bs, BdrvChild *c,

 void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
                               const BdrvChildRole *role,
+                               BlockReopenQueue *reopen_queue,
                               uint64_t perm, uint64_t shared,
                               uint64_t *nperm, uint64_t *nshared)
 {
@@ -1853,10 +1926,11 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
    if (!backing) {
        /* Apart from the modifications below, the same permissions are
         * forwarded and left alone as for filters */
-        bdrv_filter_default_perms(bs, c, role, perm, shared, &perm, &shared);
+        bdrv_filter_default_perms(bs, c, role, reopen_queue, perm, shared,
+                                  &perm, &shared);

        /* Format drivers may touch metadata even if the guest doesn't write */
-        if (bdrv_is_writable(bs)) {
+        if (bdrv_is_writable(bs, reopen_queue)) {
            perm |= BLK_PERM_WRITE | BLK_PERM_RESIZE;
        }

@@ -1945,7 +2019,7 @@ static void bdrv_replace_child(BdrvChild *child, BlockDriverState *new_bs)
         * because we're just taking a parent away, so we're loosening
         * restrictions. */
        bdrv_get_cumulative_perm(old_bs, &perm, &shared_perm);
-        bdrv_check_perm(old_bs, perm, shared_perm, NULL, &error_abort);
+        bdrv_check_perm(old_bs, NULL, perm, shared_perm, NULL, &error_abort);
        bdrv_set_perm(old_bs, perm, shared_perm);
    }

@@ -1964,7 +2038,7 @@ BdrvChild *bdrv_root_attach_child(BlockDriverState *child_bs,
    BdrvChild *child;
    int ret;

-    ret = bdrv_check_update_perm(child_bs, perm, shared_perm, NULL, errp);
+    ret = bdrv_check_update_perm(child_bs, NULL, perm, shared_perm, NULL, errp);
    if (ret < 0) {
        bdrv_abort_perm_update(child_bs);
        return NULL;
@@ -1999,7 +2073,7 @@ BdrvChild *bdrv_attach_child(BlockDriverState *parent_bs,

    assert(parent_bs->drv);
    assert(bdrv_get_aio_context(parent_bs) == bdrv_get_aio_context(child_bs));
-    bdrv_child_perm(parent_bs, child_bs, NULL, child_role,
+    bdrv_child_perm(parent_bs, child_bs, NULL, child_role, NULL,
                    perm, shared_perm, &perm, &shared_perm);

    child = bdrv_root_attach_child(child_bs, child_name, child_role,
@@ -2180,7 +2254,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *parent_options,
        goto free_exit;
    }

-    if (bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
+    if (!reference &&
+        bs->backing_format[0] != '\0' && !qdict_haskey(options, "driver")) {
        qdict_put_str(options, "driver", bs->backing_format);
    }

@@ -2633,12 +2708,6 @@ BlockDriverState *bdrv_open(const char *filename, const char *reference,
                             NULL, errp);
 }

-typedef struct BlockReopenQueueEntry {
-     bool prepared;
-     BDRVReopenState state;
-     QSIMPLEQ_ENTRY(BlockReopenQueueEntry) entry;
-} BlockReopenQueueEntry;
-
 /*
 * Adds a BlockDriverState to a simple queue for an atomic, transactional
 * reopen of multiple devices.
@@ -2737,6 +2806,23 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
        flags |= BDRV_O_ALLOW_RDWR;
    }

+    if (!bs_entry) {
+        bs_entry = g_new0(BlockReopenQueueEntry, 1);
+        QSIMPLEQ_INSERT_TAIL(bs_queue, bs_entry, entry);
+    } else {
+        QDECREF(bs_entry->state.options);
+        QDECREF(bs_entry->state.explicit_options);
+    }
+
+    bs_entry->state.bs = bs;
+    bs_entry->state.options = options;
+    bs_entry->state.explicit_options = explicit_options;
+    bs_entry->state.flags = flags;
+
+    /* This needs to be overwritten in bdrv_reopen_prepare() */
+    bs_entry->state.perm = UINT64_MAX;
+    bs_entry->state.shared_perm = 0;
+
    QLIST_FOREACH(child, &bs->children, next) {
        QDict *new_child_options;
        char *child_key_dot;
@@ -2756,19 +2842,6 @@ static BlockReopenQueue *bdrv_reopen_queue_child(BlockReopenQueue *bs_queue,
                                child->role, options, flags);
    }

-    if (!bs_entry) {
-        bs_entry = g_new0(BlockReopenQueueEntry, 1);
-        QSIMPLEQ_INSERT_TAIL(bs_queue, bs_entry, entry);
-    } else {
-        QDECREF(bs_entry->state.options);
-        QDECREF(bs_entry->state.explicit_options);
-    }
-
-    bs_entry->state.bs = bs;
-    bs_entry->state.options = options;
-    bs_entry->state.explicit_options = explicit_options;
-    bs_entry->state.flags = flags;
-
    return bs_queue;
 }

@@ -2856,6 +2929,52 @@ int bdrv_reopen(BlockDriverState *bs, int bdrv_flags, Error **errp)
    return ret;
 }

+static BlockReopenQueueEntry *find_parent_in_reopen_queue(BlockReopenQueue *q,
+                                                          BdrvChild *c)
+{
+    BlockReopenQueueEntry *entry;
+
+    QSIMPLEQ_FOREACH(entry, q, entry) {
+        BlockDriverState *bs = entry->state.bs;
+        BdrvChild *child;
+
+        QLIST_FOREACH(child, &bs->children, next) {
+            if (child == c) {
+                return entry;
+            }
+        }
+    }
+
+    return NULL;
+}
+
+static void bdrv_reopen_perm(BlockReopenQueue *q, BlockDriverState *bs,
+                             uint64_t *perm, uint64_t *shared)
+{
+    BdrvChild *c;
+    BlockReopenQueueEntry *parent;
+    uint64_t cumulative_perms = 0;
+    uint64_t cumulative_shared_perms = BLK_PERM_ALL;
+
+    QLIST_FOREACH(c, &bs->parents, next_parent) {
+        parent = find_parent_in_reopen_queue(q, c);
+        if (!parent) {
+            cumulative_perms |= c->perm;
+            cumulative_shared_perms &= c->shared_perm;
+        } else {
+            uint64_t nperm, nshared;
+
+            bdrv_child_perm(parent->state.bs, bs, c, c->role, q,
+                            parent->state.perm, parent->state.shared_perm,
+                            &nperm, &nshared);
+
+            cumulative_perms |= nperm;
+            cumulative_shared_perms &= nshared;
+        }
+    }
+    *perm = cumulative_perms;
+    *shared = cumulative_shared_perms;
+}

 /*
 * Prepares a BlockDriverState for reopen. All changes are staged in the
@@ -2921,6 +3040,9 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
        goto error;
    }

+    /* Calculate required permissions after reopening */
+    bdrv_reopen_perm(queue, reopen_state->bs,
+                     &reopen_state->perm, &reopen_state->shared_perm);

    ret = bdrv_flush(reopen_state->bs);
    if (ret) {
@@ -2956,19 +3078,26 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
        const QDictEntry *entry = qdict_first(reopen_state->options);

        do {
-            QString *new_obj = qobject_to_qstring(entry->value);
-            const char *new = qstring_get_str(new_obj);
-            /*
-             * Caution: while qdict_get_try_str() is fine, getting
-             * non-string types would require more care.  When
-             * bs->options come from -blockdev or blockdev_add, its
-             * members are typed according to the QAPI schema, but
-             * when they come from -drive, they're all QString.
-             */
-            const char *old = qdict_get_try_str(reopen_state->bs->options,
-                                                entry->key);
+            QObject *new = entry->value;
+            QObject *old = qdict_get(reopen_state->bs->options, entry->key);

-            if (!old || strcmp(new, old)) {
+            /*
+             * TODO: When using -drive to specify blockdev options, all values
+             * will be strings; however, when using -blockdev, blockdev-add or
+             * filenames using the json:{} pseudo-protocol, they will be
+             * correctly typed.
+             * In contrast, reopening options are (currently) always strings
+             * (because you can only specify them through qemu-io; all other
+             * callers do not specify any options).
+             * Therefore, when using anything other than -drive to create a BDS,
+             * this cannot detect non-string options as unchanged, because
+             * qobject_is_equal() always returns false for objects of different
+             * type.  In the future, this should be remedied by correctly typing
+             * all options.  For now, this is not too big of an issue because
+             * the user can simply omit options which cannot be changed anyway,
+             * so they will stay unchanged.
+             */
+            if (!qobject_is_equal(new, old)) {
                error_setg(errp, "Cannot change the option '%s'", entry->key);
                ret = -EINVAL;
                goto error;
@@ -2976,6 +3105,12 @@ int bdrv_reopen_prepare(BDRVReopenState *reopen_state, BlockReopenQueue *queue,
        } while ((entry = qdict_next(reopen_state->options, entry)));
    }

+    ret = bdrv_check_perm(reopen_state->bs, queue, reopen_state->perm,
+                          reopen_state->shared_perm, NULL, errp);
+    if (ret < 0) {
+        goto error;
+    }
+
    ret = 0;

 error:
@@ -3016,6 +3151,9 @@ void bdrv_reopen_commit(BDRVReopenState *reopen_state)

    bdrv_refresh_limits(bs, NULL);

+    bdrv_set_perm(reopen_state->bs, reopen_state->perm,
+                  reopen_state->shared_perm);
+
    new_can_write =
        !bdrv_is_read_only(bs) && !(bdrv_get_flags(bs) & BDRV_O_INACTIVE);
    if (!old_can_write && new_can_write && drv->bdrv_reopen_bitmaps_rw) {
@@ -3049,6 +3187,8 @@ void bdrv_reopen_abort(BDRVReopenState *reopen_state)
    }

    QDECREF(reopen_state->explicit_options);
+
+    bdrv_abort_perm_update(reopen_state->bs);
 }


@@ -3179,7 +3319,7 @@ void bdrv_replace_node(BlockDriverState *from, BlockDriverState *to,

    /* Check whether the required permissions can be granted on @to, ignoring
     * all BdrvChild in @list so that they can't block themselves. */
-    ret = bdrv_check_update_perm(to, perm, shared, list, errp);
+    ret = bdrv_check_update_perm(to, NULL, perm, shared, list, errp);
    if (ret < 0) {
        bdrv_abort_perm_update(to);
        goto out;
@@ -3295,6 +3435,10 @@ int bdrv_change_backing_file(BlockDriverState *bs,
    BlockDriver *drv = bs->drv;
    int ret;

+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+
    /* Backing file format doesn't make sense without a backing file */
    if (backing_fmt && !backing_file) {
        return -EINVAL;
@@ -3368,53 +3512,62 @@ BlockDriverState *bdrv_find_base(BlockDriverState *bs)
 *  if active == top, that is considered an error
 *
 */
-int bdrv_drop_intermediate(BlockDriverState *active, BlockDriverState *top,
-                           BlockDriverState *base, const char *backing_file_str)
+int bdrv_drop_intermediate(BlockDriverState *top, BlockDriverState *base,
+                           const char *backing_file_str)
 {
-    BlockDriverState *new_top_bs = NULL;
+    BdrvChild *c, *next;
    Error *local_err = NULL;
    int ret = -EIO;

+    bdrv_ref(top);
+
    if (!top->drv || !base->drv) {
        goto exit;
    }

-    new_top_bs = bdrv_find_overlay(active, top);
-
-    if (new_top_bs == NULL) {
-        /* we could not find the image above 'top', this is an error */
-        goto exit;
-    }
-
-    /* special case of new_top_bs->backing->bs already pointing to base - nothing
-     * to do, no intermediate images */
-    if (backing_bs(new_top_bs) == base) {
-        ret = 0;
-        goto exit;
-    }
-
    /* Make sure that base is in the backing chain of top */
    if (!bdrv_chain_contains(top, base)) {
        goto exit;
    }

    /* success - we can delete the intermediate states, and link top->base */
+    /* TODO Check graph modification op blockers (BLK_PERM_GRAPH_MOD) once
+     * we've figured out how they should work. */
    backing_file_str = backing_file_str ? backing_file_str : base->filename;
-    ret = bdrv_change_backing_file(new_top_bs, backing_file_str,
-                                   base->drv ? base->drv->format_name : "");
-    if (ret) {
-        goto exit;
-    }

-    bdrv_set_backing_hd(new_top_bs, base, &local_err);
+    QLIST_FOREACH_SAFE(c, &top->parents, next_parent, next) {
+        /* Check whether we are allowed to switch c from top to base */
+        GSList *ignore_children = g_slist_prepend(NULL, c);
+        bdrv_check_update_perm(base, NULL, c->perm, c->shared_perm,
+                               ignore_children, &local_err);
+        g_slist_free(ignore_children);
        if (local_err) {
            ret = -EPERM;
            error_report_err(local_err);
            goto exit;
        }
+
+        /* If so, update the backing file path in the image file */
+        if (c->role->update_filename) {
+            ret = c->role->update_filename(c, base, backing_file_str,
+                                           &local_err);
+            if (ret < 0) {
+                bdrv_abort_perm_update(base);
+                error_report_err(local_err);
+                goto exit;
+            }
+        }
+
+        /* Do the actual switch in the in-memory graph.
+         * Completes bdrv_check_update_perm() transaction internally. */
+        bdrv_ref(base);
+        bdrv_replace_child(c, base);
+        bdrv_unref(top);
+    }

    ret = 0;
 exit:
+    bdrv_unref(top);
    return ret;
 }

@@ -3450,12 +3603,18 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, PreallocMode prealloc,
    assert(!(bs->open_flags & BDRV_O_INACTIVE));

    ret = drv->bdrv_truncate(bs, offset, prealloc, errp);
-    if (ret == 0) {
+    if (ret < 0) {
+        return ret;
+    }
    ret = refresh_total_sectors(bs, offset >> BDRV_SECTOR_BITS);
-        bdrv_dirty_bitmap_truncate(bs);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Could not refresh total sector count");
+    } else {
+        offset = bs->total_sectors * BDRV_SECTOR_SIZE;
+    }
+    bdrv_dirty_bitmap_truncate(bs, offset);
    bdrv_parent_cb_resize(bs);
    atomic_inc(&bs->write_gen);
-    }
    return ret;
 }

@@ -3765,7 +3924,9 @@ int bdrv_has_zero_init_1(BlockDriverState *bs)

 int bdrv_has_zero_init(BlockDriverState *bs)
 {
-    assert(bs->drv);
+    if (!bs->drv) {
+        return 0;
+    }

    /* If BS is a copy on write image, it is initialized to
       the contents of the base image, which may not be zeroes.  */
@@ -4030,7 +4191,29 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
        }
    }

+    /*
+     * Update permissions, they may differ for inactive nodes.
+     *
+     * Note that the required permissions of inactive images are always a
+     * subset of the permissions required after activating the image. This
+     * allows us to just get the permissions upfront without restricting
+     * drv->bdrv_invalidate_cache().
+     *
+     * It also means that in error cases, we don't have to try and revert to
+     * the old permissions (which is an operation that could fail, too). We can
+     * just keep the extended permissions for the next time that an activation
+     * of the image is tried.
+     */
    bs->open_flags &= ~BDRV_O_INACTIVE;
+    bdrv_get_cumulative_perm(bs, &perm, &shared_perm);
+    ret = bdrv_check_perm(bs, NULL, perm, shared_perm, NULL, &local_err);
+    if (ret < 0) {
+        bs->open_flags |= BDRV_O_INACTIVE;
+        error_propagate(errp, local_err);
+        return;
+    }
+    bdrv_set_perm(bs, perm, shared_perm);
+
    if (bs->drv->bdrv_invalidate_cache) {
        bs->drv->bdrv_invalidate_cache(bs, &local_err);
        if (local_err) {
@@ -4047,16 +4230,6 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
        return;
    }

-    /* Update permissions, they may differ for inactive nodes */
-    bdrv_get_cumulative_perm(bs, &perm, &shared_perm);
-    ret = bdrv_check_perm(bs, perm, shared_perm, NULL, &local_err);
-    if (ret < 0) {
-        bs->open_flags |= BDRV_O_INACTIVE;
-        error_propagate(errp, local_err);
-        return;
-    }
-    bdrv_set_perm(bs, perm, shared_perm);
-
    QLIST_FOREACH(parent, &bs->parents, next_parent) {
        if (parent->role->activate) {
            parent->role->activate(parent, &local_err);
@@ -4082,6 +4255,7 @@ void bdrv_invalidate_cache_all(Error **errp)
        aio_context_release(aio_context);
        if (local_err) {
            error_propagate(errp, local_err);
+            bdrv_next_cleanup(&it);
            return;
        }
    }
@@ -4093,6 +4267,10 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,
    BdrvChild *child, *parent;
    int ret;

+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
+
    if (!setting_flag && bs->drv->bdrv_inactivate) {
        ret = bs->drv->bdrv_inactivate(bs);
        if (ret < 0) {
@@ -4116,7 +4294,7 @@ static int bdrv_inactivate_recurse(BlockDriverState *bs,

        /* Update permissions, they may differ for inactive nodes */
        bdrv_get_cumulative_perm(bs, &perm, &shared_perm);
-        bdrv_check_perm(bs, perm, shared_perm, NULL, &error_abort);
+        bdrv_check_perm(bs, NULL, perm, shared_perm, NULL, &error_abort);
        bdrv_set_perm(bs, perm, shared_perm);
    }

@@ -4153,6 +4331,7 @@ int bdrv_inactivate_all(void)
        for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
            ret = bdrv_inactivate_recurse(bs, pass);
            if (ret < 0) {
+                bdrv_next_cleanup(&it);
                goto out;
            }
        }
@@ -4393,7 +4572,7 @@ void bdrv_img_create(const char *filename, const char *fmt,

    /* The size for the image must always be specified, unless we have a backing
     * file and we have not been forbidden from opening it. */
-    size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
+    size = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, img_size);
    if (backing_file && !(flags & BDRV_O_NO_BACKING)) {
        BlockDriverState *bs;
        char *full_backing = g_new0(char, PATH_MAX);
@@ -4627,6 +4806,9 @@ void bdrv_remove_aio_context_notifier(BlockDriverState *bs,
 int bdrv_amend_options(BlockDriverState *bs, QemuOpts *opts,
                       BlockDriverAmendStatusCB *status_cb, void *cb_opaque)
 {
+    if (!bs->drv) {
+        return -ENOMEDIUM;
+    }
    if (!bs->drv->bdrv_amend_options) {
        return -ENOTSUP;
    }
@@ -4684,6 +4866,7 @@ bool bdrv_is_first_non_filter(BlockDriverState *candidate)

        /* candidate is the first non filter */
        if (perm) {
+            bdrv_next_cleanup(&it);
            return true;
        }
    }
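
Reviewer note on the block.c hunks above: `bdrv_is_writable()` now answers the question for the state a node will have once a reopen queue has been applied, by looking up the queued flags first and falling back to the current ones. A minimal sketch of that lookup and of the mask test follows. The flag values, `Node`, and `ReopenEntry` are hypothetical simplifications (a plain singly linked list instead of QSIMPLEQ), not QEMU's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; QEMU's real BDRV_O_* constants differ. */
enum {
    O_RDWR     = 1 << 0,
    O_INACTIVE = 1 << 1,
};

/* Hypothetical stand-in for BlockDriverState. */
typedef struct Node {
    int open_flags;
} Node;

/* Hypothetical stand-in for a reopen queue entry. */
typedef struct ReopenEntry {
    Node *bs;
    int flags;                  /* flags the node will have after reopen */
    struct ReopenEntry *next;
} ReopenEntry;

/* Return the queued flags for bs if it is in the queue, else the current
 * ones. A NULL queue means "no reopen pending". */
int reopen_get_flags(const ReopenEntry *q, const Node *bs)
{
    assert(bs != NULL);

    for (const ReopenEntry *e = q; e != NULL; e = e->next) {
        if (e->bs == bs) {
            return e->flags;
        }
    }
    return bs->open_flags;
}

/* Writable means: read-write requested AND not inactive. Masking both bits
 * and comparing against O_RDWR checks both conditions in one step. */
bool is_writable(const Node *bs, const ReopenEntry *q)
{
    int flags = reopen_get_flags(q, bs);

    return (flags & (O_RDWR | O_INACTIVE)) == O_RDWR;
}
```

With this shape, a permission check run during reopen preparation can ask about the future state of a read-only node that is about to become read-write, instead of failing on its current flags.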
--- a/block/backup.c
+++ b/block/backup.c
@@ -372,10 +372,10 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)

    granularity = bdrv_dirty_bitmap_granularity(job->sync_bitmap);
    clusters_per_iter = MAX((granularity / job->cluster_size), 1);
-    dbi = bdrv_dirty_iter_new(job->sync_bitmap, 0);
+    dbi = bdrv_dirty_iter_new(job->sync_bitmap);

    /* Find the next dirty sector(s) */
-    while ((offset = bdrv_dirty_iter_next(dbi) * BDRV_SECTOR_SIZE) >= 0) {
+    while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
        cluster = offset / job->cluster_size;

        /* Fake progress updates for any clusters we skipped */
@@ -403,8 +403,7 @@ static int coroutine_fn backup_run_incremental(BackupBlockJob *job)
        /* If the bitmap granularity is smaller than the backup granularity,
         * we need to advance the iterator pointer to the next cluster. */
        if (granularity < job->cluster_size) {
-            bdrv_set_dirty_iter(dbi,
-                                cluster * job->cluster_size / BDRV_SECTOR_SIZE);
+            bdrv_set_dirty_iter(dbi, cluster * job->cluster_size);
        }

        last_cluster = cluster - 1;
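
Note on the block/backup.c hunks above: the dirty-bitmap iterator now yields byte offsets rather than sector indices, so the caller no longer scales by BDRV_SECTOR_SIZE. The following is a minimal standalone sketch of that unit change, not QEMU code; SECTOR_SIZE and the three helper names are hypothetical stand-ins introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512 /* hypothetical stand-in for BDRV_SECTOR_SIZE */

/* Old convention: the iterator yields a sector index, the caller scales it. */
static int64_t cluster_from_sector(int64_t sector, int64_t cluster_size)
{
    assert(cluster_size > 0);
    return sector * SECTOR_SIZE / cluster_size;
}

/* New convention: the iterator already yields a byte offset. */
static int64_t cluster_from_offset(int64_t offset, int64_t cluster_size)
{
    assert(cluster_size > 0);
    return offset / cluster_size;
}

/* Byte offset at which a cluster starts, e.g. to reposition an iterator. */
static int64_t cluster_start(int64_t cluster, int64_t cluster_size)
{
    assert(cluster_size > 0);
    return cluster * cluster_size;
}
```

Both conventions map the same on-disk position to the same cluster; only the unit carried across the iterator interface differs.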
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -244,7 +244,6 @@ static int read_config(BDRVBlkdebugState *s, const char *filename,
        ret = qemu_config_parse(f, config_groups, filename);
        if (ret < 0) {
            error_setg(errp, "Could not parse blkdebug config file");
-            ret = -EINVAL;
            goto fail;
        }
    }
@@ -628,6 +627,17 @@ static int coroutine_fn blkdebug_co_pdiscard(BlockDriverState *bs,
    return bdrv_co_pdiscard(bs->file->bs, offset, bytes);
 }

+static int64_t coroutine_fn blkdebug_co_get_block_status(
+    BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
+    BlockDriverState **file)
+{
+    assert(QEMU_IS_ALIGNED(sector_num | nb_sectors,
+                           DIV_ROUND_UP(bs->bl.request_alignment,
+                                        BDRV_SECTOR_SIZE)));
+    return bdrv_co_get_block_status_from_file(bs, sector_num, nb_sectors,
+                                              pnum, file);
+}
+
 static void blkdebug_close(BlockDriverState *bs)
 {
    BDRVBlkdebugState *s = bs->opaque;
@@ -897,7 +907,7 @@ static BlockDriver bdrv_blkdebug = {
    .bdrv_co_flush_to_disk  = blkdebug_co_flush,
    .bdrv_co_pwrite_zeroes  = blkdebug_co_pwrite_zeroes,
    .bdrv_co_pdiscard       = blkdebug_co_pdiscard,
-    .bdrv_co_get_block_status = bdrv_co_get_block_status_from_file,
+    .bdrv_co_get_block_status = blkdebug_co_get_block_status,

    .bdrv_debug_event           = blkdebug_debug_event,
    .bdrv_debug_breakpoint      = blkdebug_debug_breakpoint,
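
Note on the blkdebug_co_get_block_status() assertion above: OR-ing the two sector values and testing the result against a power-of-two alignment checks both operands in one step. The sketch below illustrates the idea in isolation. The macros are local stand-ins, assumed to behave like QEMU's DIV_ROUND_UP and QEMU_IS_ALIGNED, and both_aligned() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins, assumed equivalent to the QEMU macros of similar name. */
#define SKETCH_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define SKETCH_IS_ALIGNED(n, m)   (((n) % (m)) == 0)

/* Returns 1 when both sector_num and nb_sectors are multiples of the
 * request alignment expressed in 512-byte sectors, else 0. */
static int both_aligned(int64_t sector_num, int64_t nb_sectors,
                        uint32_t request_alignment)
{
    int64_t align = SKETCH_DIV_ROUND_UP(request_alignment, 512);

    /* The OR trick is only valid for a power-of-two alignment. */
    assert(align > 0 && (align & (align - 1)) == 0);
    return SKETCH_IS_ALIGNED(sector_num | nb_sectors, align);
}
```

With a request alignment of 4096 bytes the sector alignment is 8, so any low bit set in either operand survives the OR and fails the check.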
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -442,21 +442,37 @@ BlockBackend *blk_next(BlockBackend *blk)
 * the monitor or attached to a BlockBackend */
 BlockDriverState *bdrv_next(BdrvNextIterator *it)
 {
-    BlockDriverState *bs;
+    BlockDriverState *bs, *old_bs;
+
+    /* Must be called from the main loop */
+    assert(qemu_get_current_aio_context() == qemu_get_aio_context());

    /* First, return all root nodes of BlockBackends. In order to avoid
     * returning a BDS twice when multiple BBs refer to it, we only return it
     * if the BB is the first one in the parent list of the BDS. */
    if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
+        BlockBackend *old_blk = it->blk;
+
+        old_bs = old_blk ? blk_bs(old_blk) : NULL;
+
        do {
            it->blk = blk_all_next(it->blk);
            bs = it->blk ? blk_bs(it->blk) : NULL;
        } while (it->blk && (bs == NULL || bdrv_first_blk(bs) != it->blk));

+        if (it->blk) {
+            blk_ref(it->blk);
+        }
+        blk_unref(old_blk);
+
        if (bs) {
+            bdrv_ref(bs);
+            bdrv_unref(old_bs);
            return bs;
        }
        it->phase = BDRV_NEXT_MONITOR_OWNED;
+    } else {
+        old_bs = it->bs;
    }

    /* Then return the monitor-owned BDSes without a BB attached. Ignore all
@@ -467,18 +483,46 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
        bs = it->bs;
    } while (bs && bdrv_has_blk(bs));

+    if (bs) {
+        bdrv_ref(bs);
+    }
+    bdrv_unref(old_bs);
+
    return bs;
 }

-BlockDriverState *bdrv_first(BdrvNextIterator *it)
+static void bdrv_next_reset(BdrvNextIterator *it)
 {
    *it = (BdrvNextIterator) {
        .phase = BDRV_NEXT_BACKEND_ROOTS,
    };
+}

+BlockDriverState *bdrv_first(BdrvNextIterator *it)
+{
+    bdrv_next_reset(it);
    return bdrv_next(it);
 }

+/* Must be called when aborting a bdrv_next() iteration before
+ * bdrv_next() returns NULL */
+void bdrv_next_cleanup(BdrvNextIterator *it)
+{
+    /* Must be called from the main loop */
+    assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+
+    if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
+        if (it->blk) {
+            bdrv_unref(blk_bs(it->blk));
+            blk_unref(it->blk);
+        }
+    } else {
+        bdrv_unref(it->bs);
+    }
+
+    bdrv_next_reset(it);
+}
+
 /*
 * Add a BlockBackend into the list of backends referenced by the monitor, with
 * the given @name acting as the handle for the monitor.
@@ -655,12 +699,16 @@ BlockBackend *blk_by_public(BlockBackendPublic *public)
 */
 void blk_remove_bs(BlockBackend *blk)
 {
-    ThrottleTimers *tt;
+    ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
+    BlockDriverState *bs;

    notifier_list_notify(&blk->remove_bs_notifiers, blk);
-    if (blk->public.throttle_group_member.throttle_state) {
-        tt = &blk->public.throttle_group_member.throttle_timers;
-        throttle_timers_detach_aio_context(tt);
+    if (tgm->throttle_state) {
+        bs = blk_bs(blk);
+        bdrv_drained_begin(bs);
+        throttle_group_detach_aio_context(tgm);
+        throttle_group_attach_aio_context(tgm, qemu_get_aio_context());
+        bdrv_drained_end(bs);
    }

    blk_update_root_state(blk);
@@ -674,6 +722,7 @@ void blk_remove_bs(BlockBackend *blk)
 */
 int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
 {
+    ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
    blk->root = bdrv_root_attach_child(bs, "root", &child_root,
                                       blk->perm, blk->shared_perm, blk, errp);
    if (blk->root == NULL) {
@@ -682,10 +731,9 @@ int blk_insert_bs(BlockBackend *blk, BlockDriverState *bs, Error **errp)
    bdrv_ref(bs);

    notifier_list_notify(&blk->insert_bs_notifiers, blk);
-    if (blk->public.throttle_group_member.throttle_state) {
-        throttle_timers_attach_aio_context(
-            &blk->public.throttle_group_member.throttle_timers,
-            bdrv_get_aio_context(bs));
+    if (tgm->throttle_state) {
+        throttle_group_detach_aio_context(tgm);
+        throttle_group_attach_aio_context(tgm, bdrv_get_aio_context(bs));
    }

    return 0;
@@ -1748,8 +1796,10 @@ void blk_set_aio_context(BlockBackend *blk, AioContext *new_context)

    if (bs) {
        if (tgm->throttle_state) {
+            bdrv_drained_begin(bs);
            throttle_group_detach_aio_context(tgm);
            throttle_group_attach_aio_context(tgm, new_context);
+            bdrv_drained_end(bs);
        }
        bdrv_set_aio_context(bs, new_context);
    }
@@ -1974,10 +2024,16 @@ void blk_set_io_limits(BlockBackend *blk, ThrottleConfig *cfg)

 void blk_io_limits_disable(BlockBackend *blk)
 {
-    assert(blk->public.throttle_group_member.throttle_state);
-    bdrv_drained_begin(blk_bs(blk));
-    throttle_group_unregister_tgm(&blk->public.throttle_group_member);
-    bdrv_drained_end(blk_bs(blk));
+    BlockDriverState *bs = blk_bs(blk);
+    ThrottleGroupMember *tgm = &blk->public.throttle_group_member;
+    assert(tgm->throttle_state);
+    if (bs) {
+        bdrv_drained_begin(bs);
+    }
+    throttle_group_unregister_tgm(tgm);
+    if (bs) {
+        bdrv_drained_end(bs);
+    }
 }

 /* should be called before blk_set_io_limits if a limit is set */
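
Note on the bdrv_next() and bdrv_next_cleanup() hunks above: the iterator now holds a reference on the element it last returned and drops it when advancing, so a caller that leaves the loop early has to release that reference explicitly. The following is a simplified, self-contained sketch of the pattern on a plain linked list. Node, Iter and all functions here are hypothetical and do not reproduce the QEMU data structures.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    int refcnt;
    int value;
    struct Node *next;
} Node;

typedef struct Iter {
    Node *cur; /* element currently referenced by the iterator, or NULL */
} Iter;

static void node_ref(Node *n)
{
    if (n) {
        n->refcnt++;
    }
}

static void node_unref(Node *n)
{
    if (n) {
        assert(n->refcnt > 0);
        n->refcnt--;
    }
}

/* Start iterating: take a reference on the first element. */
static Node *iter_first(Iter *it, Node *head)
{
    it->cur = head;
    node_ref(it->cur);
    return it->cur;
}

/* Advance: reference the new element before dropping the old one. */
static Node *iter_next(Iter *it)
{
    Node *old = it->cur;

    it->cur = old ? old->next : NULL;
    node_ref(it->cur);
    node_unref(old);
    return it->cur;
}

/* Must be called when leaving the loop before iter_next() returns NULL. */
static void iter_cleanup(Iter *it)
{
    node_unref(it->cur);
    it->cur = NULL;
}

/* Early-exit caller: without iter_cleanup() the match would leak a ref. */
static int find_value(Node *head, int wanted)
{
    Iter it;
    Node *n;

    for (n = iter_first(&it, head); n; n = iter_next(&it)) {
        if (n->value == wanted) {
            iter_cleanup(&it);
            return 1;
        }
    }
    return 0;
}
```

A full traversal needs no cleanup because the final iter_next() call drops the last reference itself, which mirrors why only the early-exit paths in this patch gain a bdrv_next_cleanup() call.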
--- a/block/bochs.c
+++ b/block/bochs.c
@@ -28,6 +28,7 @@
 #include "block/block_int.h"
 #include "qemu/module.h"
 #include "qemu/bswap.h"
+#include "qemu/error-report.h"

 /**************************************************************/

@@ -110,10 +111,16 @@ static int bochs_open(BlockDriverState *bs, QDict *options, int flags,
        return -EINVAL;
    }

+    if (!bdrv_is_read_only(bs)) {
+        error_report("Opening bochs images without an explicit read-only=on "
+                     "option is deprecated. Future versions will refuse to "
+                     "open the image instead of automatically marking the "
+                     "image read-only.");
        ret = bdrv_set_read_only(bs, true, errp); /* no write support yet */
        if (ret < 0) {
            return ret;
        }
+    }

    ret = bdrv_pread(bs->file, 0, &bochs, sizeof(bochs));
    if (ret < 0) {
--- a/block/cloop.c
+++ b/block/cloop.c
@@ -23,6 +23,7 @@
 */
 #include "qemu/osdep.h"
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 #include "qemu-common.h"
 #include "block/block_int.h"
 #include "qemu/module.h"
@@ -72,10 +73,16 @@ static int cloop_open(BlockDriverState *bs, QDict *options, int flags,
        return -EINVAL;
    }

+    if (!bdrv_is_read_only(bs)) {
+        error_report("Opening cloop images without an explicit read-only=on "
+                     "option is deprecated. Future versions will refuse to "
+                     "open the image instead of automatically marking the "
+                     "image read-only.");
        ret = bdrv_set_read_only(bs, true, errp);
        if (ret < 0) {
            return ret;
        }
+    }

    /* read header */
    ret = bdrv_pread(bs->file, 128, &s->block_size, 4);
--- a/block/commit.c
+++ b/block/commit.c
@@ -36,13 +36,11 @@ enum {
 typedef struct CommitBlockJob {
    BlockJob common;
    RateLimit limit;
-    BlockDriverState *active;
    BlockDriverState *commit_top_bs;
    BlockBackend *top;
    BlockBackend *base;
    BlockdevOnError on_error;
    int base_flags;
-    int orig_overlay_flags;
    char *backing_file_str;
 } CommitBlockJob;

@@ -81,18 +79,15 @@ static void commit_complete(BlockJob *job, void *opaque)
 {
    CommitBlockJob *s = container_of(job, CommitBlockJob, common);
    CommitCompleteData *data = opaque;
-    BlockDriverState *active = s->active;
    BlockDriverState *top = blk_bs(s->top);
    BlockDriverState *base = blk_bs(s->base);
-    BlockDriverState *overlay_bs = bdrv_find_overlay(active, s->commit_top_bs);
+    BlockDriverState *commit_top_bs = s->commit_top_bs;
    int ret = data->ret;
    bool remove_commit_top_bs = false;

-    /* Make sure overlay_bs and top stay around until bdrv_set_backing_hd() */
+    /* Make sure commit_top_bs and top stay around until bdrv_replace_node() */
    bdrv_ref(top);
-    if (overlay_bs) {
-        bdrv_ref(overlay_bs);
-    }
+    bdrv_ref(commit_top_bs);

    /* Remove base node parent that still uses BLK_PERM_WRITE/RESIZE before
     * the normal backing chain can be restored. */
@@ -100,9 +95,9 @@ static void commit_complete(BlockJob *job, void *opaque)

    if (!block_job_is_cancelled(&s->common) && ret == 0) {
        /* success */
-        ret = bdrv_drop_intermediate(active, s->commit_top_bs, base,
+        ret = bdrv_drop_intermediate(s->commit_top_bs, base,
                                     s->backing_file_str);
-    } else if (overlay_bs) {
+    } else {
        /* XXX Can (or should) we somehow keep 'consistent read' blocked even
         * after the failed/cancelled commit job is gone? If we already wrote
         * something to base, the intermediate images aren't valid any more. */
@@ -115,9 +110,6 @@ static void commit_complete(BlockJob *job, void *opaque)
    if (s->base_flags != bdrv_get_flags(base)) {
        bdrv_reopen(base, s->base_flags, NULL);
    }
-    if (overlay_bs && s->orig_overlay_flags != bdrv_get_flags(overlay_bs)) {
-        bdrv_reopen(overlay_bs, s->orig_overlay_flags, NULL);
-    }
    g_free(s->backing_file_str);
    blk_unref(s->top);

@@ -134,10 +126,13 @@ static void commit_complete(BlockJob *job, void *opaque)
     * filter driver from the backing chain. Do this as the final step so that
     * the 'consistent read' permission can be granted.  */
    if (remove_commit_top_bs) {
-        bdrv_set_backing_hd(overlay_bs, top, &error_abort);
+        bdrv_child_try_set_perm(commit_top_bs->backing, 0, BLK_PERM_ALL,
+                                &error_abort);
+        bdrv_replace_node(commit_top_bs, backing_bs(commit_top_bs),
+                          &error_abort);
    }

-    bdrv_unref(overlay_bs);
+    bdrv_unref(commit_top_bs);
    bdrv_unref(top);
 }

@@ -257,6 +252,7 @@ static void bdrv_commit_top_close(BlockDriverState *bs)

 static void bdrv_commit_top_child_perm(BlockDriverState *bs, BdrvChild *c,
                                       const BdrvChildRole *role,
+                                       BlockReopenQueue *reopen_queue,
                                       uint64_t perm, uint64_t shared,
                                       uint64_t *nperm, uint64_t *nshared)
 {
@@ -282,10 +278,8 @@ void commit_start(const char *job_id, BlockDriverState *bs,
 {
    CommitBlockJob *s;
    BlockReopenQueue *reopen_queue = NULL;
-    int orig_overlay_flags;
    int orig_base_flags;
    BlockDriverState *iter;
-    BlockDriverState *overlay_bs;
    BlockDriverState *commit_top_bs = NULL;
    Error *local_err = NULL;
    int ret;
@@ -296,31 +290,19 @@ void commit_start(const char *job_id, BlockDriverState *bs,
        return;
    }

-    overlay_bs = bdrv_find_overlay(bs, top);
-
-    if (overlay_bs == NULL) {
-        error_setg(errp, "Could not find overlay image for %s:", top->filename);
-        return;
-    }
-
    s = block_job_create(job_id, &commit_job_driver, bs, 0, BLK_PERM_ALL,
                         speed, BLOCK_JOB_DEFAULT, NULL, NULL, errp);
    if (!s) {
        return;
    }

+    /* convert base to r/w, if necessary */
    orig_base_flags = bdrv_get_flags(base);
-    orig_overlay_flags = bdrv_get_flags(overlay_bs);
-
-    /* convert base & overlay_bs to r/w, if necessary */
    if (!(orig_base_flags & BDRV_O_RDWR)) {
        reopen_queue = bdrv_reopen_queue(reopen_queue, base, NULL,
                                         orig_base_flags | BDRV_O_RDWR);
    }
-    if (!(orig_overlay_flags & BDRV_O_RDWR)) {
-        reopen_queue = bdrv_reopen_queue(reopen_queue, overlay_bs, NULL,
-                                         orig_overlay_flags | BDRV_O_RDWR);
-    }
+
    if (reopen_queue) {
        bdrv_reopen_multiple(bdrv_get_aio_context(bs), reopen_queue, &local_err);
        if (local_err != NULL) {
@@ -349,7 +331,7 @@ void commit_start(const char *job_id, BlockDriverState *bs,
        error_propagate(errp, local_err);
        goto fail;
    }
-    bdrv_set_backing_hd(overlay_bs, commit_top_bs, &local_err);
+    bdrv_replace_node(top, commit_top_bs, &local_err);
    if (local_err) {
        bdrv_unref(commit_top_bs);
        commit_top_bs = NULL;
@@ -381,14 +363,6 @@ void commit_start(const char *job_id, BlockDriverState *bs,
        goto fail;
    }

-    /* overlay_bs must be blocked because it needs to be modified to
-     * update the backing image string. */
-    ret = block_job_add_bdrv(&s->common, "overlay of top", overlay_bs,
-                             BLK_PERM_GRAPH_MOD, BLK_PERM_ALL, errp);
-    if (ret < 0) {
-        goto fail;
-    }
-
    s->base = blk_new(BLK_PERM_CONSISTENT_READ
                      | BLK_PERM_WRITE
                      | BLK_PERM_RESIZE,
@@ -407,13 +381,8 @@ void commit_start(const char *job_id, BlockDriverState *bs,
        goto fail;
    }

-    s->active = bs;
-
    s->base_flags = orig_base_flags;
-    s->orig_overlay_flags  = orig_overlay_flags;
-
    s->backing_file_str = g_strdup(backing_file_str);
-
    s->on_error = on_error;

    trace_commit_start(bs, base, top, s);
@@ -428,7 +397,7 @@ fail:
        blk_unref(s->top);
    }
    if (commit_top_bs) {
-        bdrv_set_backing_hd(overlay_bs, top, &error_abort);
+        bdrv_replace_node(commit_top_bs, top, &error_abort);
    }
    block_job_early_fail(&s->common);
 }
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -279,6 +279,9 @@ static int block_crypto_open_generic(QCryptoBlockFormat format,
        return -EINVAL;
    }

+    bs->supported_write_flags = BDRV_REQ_FUA &
+        bs->file->bs->supported_write_flags;
+
    opts = qemu_opts_create(opts_spec, NULL, 0, &error_abort);
    qemu_opts_absorb_qdict(opts, options, &local_err);
    if (local_err) {
@@ -364,8 +367,9 @@ static int block_crypto_truncate(BlockDriverState *bs, int64_t offset,
                                 PreallocMode prealloc, Error **errp)
 {
    BlockCrypto *crypto = bs->opaque;
-    size_t payload_offset =
+    uint64_t payload_offset =
        qcrypto_block_get_payload_offset(crypto->block);
+    assert(payload_offset < (INT64_MAX - offset));

    offset += payload_offset;

@@ -379,66 +383,65 @@ static void block_crypto_close(BlockDriverState *bs)
 }


-#define BLOCK_CRYPTO_MAX_SECTORS 32
+/*
+ * 1 MB bounce buffer gives good performance / memory tradeoff
+ * when using cache=none|directsync.
+ */
+#define BLOCK_CRYPTO_MAX_IO_SIZE (1024 * 1024)

 static coroutine_fn int
-block_crypto_co_readv(BlockDriverState *bs, int64_t sector_num,
-                      int remaining_sectors, QEMUIOVector *qiov)
+block_crypto_co_preadv(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+                       QEMUIOVector *qiov, int flags)
 {
    BlockCrypto *crypto = bs->opaque;
-    int cur_nr_sectors; /* number of sectors in current iteration */
+    uint64_t cur_bytes; /* number of bytes in current iteration */
    uint64_t bytes_done = 0;
    uint8_t *cipher_data = NULL;
    QEMUIOVector hd_qiov;
    int ret = 0;
-    size_t payload_offset =
-        qcrypto_block_get_payload_offset(crypto->block) / 512;
+    uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
+    uint64_t payload_offset = qcrypto_block_get_payload_offset(crypto->block);
+
+    assert(!flags);
+    assert(payload_offset < INT64_MAX);
+    assert(QEMU_IS_ALIGNED(offset, sector_size));
+    assert(QEMU_IS_ALIGNED(bytes, sector_size));

    qemu_iovec_init(&hd_qiov, qiov->niov);

-    /* Bounce buffer so we have a linear mem region for
-     * entire sector. XXX optimize so we avoid bounce
-     * buffer in case that qiov->niov == 1
+    /* Bounce buffer because we don't wish to expose cipher text
+     * in qiov which points to guest memory.
     */
    cipher_data =
-        qemu_try_blockalign(bs->file->bs, MIN(BLOCK_CRYPTO_MAX_SECTORS * 512,
+        qemu_try_blockalign(bs->file->bs, MIN(BLOCK_CRYPTO_MAX_IO_SIZE,
                                              qiov->size));
    if (cipher_data == NULL) {
        ret = -ENOMEM;
        goto cleanup;
    }

-    while (remaining_sectors) {
-        cur_nr_sectors = remaining_sectors;
-
-        if (cur_nr_sectors > BLOCK_CRYPTO_MAX_SECTORS) {
-            cur_nr_sectors = BLOCK_CRYPTO_MAX_SECTORS;
-        }
+    while (bytes) {
+        cur_bytes = MIN(bytes, BLOCK_CRYPTO_MAX_IO_SIZE);

        qemu_iovec_reset(&hd_qiov);
-        qemu_iovec_add(&hd_qiov, cipher_data, cur_nr_sectors * 512);
+        qemu_iovec_add(&hd_qiov, cipher_data, cur_bytes);

-        ret = bdrv_co_readv(bs->file,
-                            payload_offset + sector_num,
-                            cur_nr_sectors, &hd_qiov);
+        ret = bdrv_co_preadv(bs->file, payload_offset + offset + bytes_done,
+                             cur_bytes, &hd_qiov, 0);
        if (ret < 0) {
            goto cleanup;
        }

-        if (qcrypto_block_decrypt(crypto->block,
-                                  sector_num,
-                                  cipher_data, cur_nr_sectors * 512,
-                                  NULL) < 0) {
+        if (qcrypto_block_decrypt(crypto->block, offset + bytes_done,
+                                  cipher_data, cur_bytes, NULL) < 0) {
            ret = -EIO;
            goto cleanup;
        }

-        qemu_iovec_from_buf(qiov, bytes_done,
-                            cipher_data, cur_nr_sectors * 512);
+        qemu_iovec_from_buf(qiov, bytes_done, cipher_data, cur_bytes);

-        remaining_sectors -= cur_nr_sectors;
-        sector_num += cur_nr_sectors;
-        bytes_done += cur_nr_sectors * 512;
+        bytes -= cur_bytes;
+        bytes_done += cur_bytes;
    }

 cleanup:
@@ -450,63 +453,58 @@ block_crypto_co_readv(BlockDriverState *bs, int64_t sector_num,


 static coroutine_fn int
-block_crypto_co_writev(BlockDriverState *bs, int64_t sector_num,
-                       int remaining_sectors, QEMUIOVector *qiov)
+block_crypto_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
+                        QEMUIOVector *qiov, int flags)
 {
    BlockCrypto *crypto = bs->opaque;
-    int cur_nr_sectors; /* number of sectors in current iteration */
+    uint64_t cur_bytes; /* number of bytes in current iteration */
    uint64_t bytes_done = 0;
    uint8_t *cipher_data = NULL;
    QEMUIOVector hd_qiov;
    int ret = 0;
-    size_t payload_offset =
-        qcrypto_block_get_payload_offset(crypto->block) / 512;
+    uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
+    uint64_t payload_offset = qcrypto_block_get_payload_offset(crypto->block);
+
+    assert(!(flags & ~BDRV_REQ_FUA));
+    assert(payload_offset < INT64_MAX);
+    assert(QEMU_IS_ALIGNED(offset, sector_size));
+    assert(QEMU_IS_ALIGNED(bytes, sector_size));

    qemu_iovec_init(&hd_qiov, qiov->niov);

-    /* Bounce buffer so we have a linear mem region for
-     * entire sector. XXX optimize so we avoid bounce
-     * buffer in case that qiov->niov == 1
+    /* Bounce buffer because we're not permitted to touch
+     * contents of qiov - it points to guest memory.
     */
    cipher_data =
-        qemu_try_blockalign(bs->file->bs, MIN(BLOCK_CRYPTO_MAX_SECTORS * 512,
+        qemu_try_blockalign(bs->file->bs, MIN(BLOCK_CRYPTO_MAX_IO_SIZE,
                                              qiov->size));
    if (cipher_data == NULL) {
        ret = -ENOMEM;
        goto cleanup;
    }

-    while (remaining_sectors) {
-        cur_nr_sectors = remaining_sectors;
+    while (bytes) {
+        cur_bytes = MIN(bytes, BLOCK_CRYPTO_MAX_IO_SIZE);

-        if (cur_nr_sectors > BLOCK_CRYPTO_MAX_SECTORS) {
-            cur_nr_sectors = BLOCK_CRYPTO_MAX_SECTORS;
-        }
+        qemu_iovec_to_buf(qiov, bytes_done, cipher_data, cur_bytes);

-        qemu_iovec_to_buf(qiov, bytes_done,
-                          cipher_data, cur_nr_sectors * 512);
-
-        if (qcrypto_block_encrypt(crypto->block,
-                                  sector_num,
-                                  cipher_data, cur_nr_sectors * 512,
-                                  NULL) < 0) {
+        if (qcrypto_block_encrypt(crypto->block, offset + bytes_done,
+                                  cipher_data, cur_bytes, NULL) < 0) {
            ret = -EIO;
            goto cleanup;
        }

        qemu_iovec_reset(&hd_qiov);
-        qemu_iovec_add(&hd_qiov, cipher_data, cur_nr_sectors * 512);
+        qemu_iovec_add(&hd_qiov, cipher_data, cur_bytes);

-        ret = bdrv_co_writev(bs->file,
-                             payload_offset + sector_num,
-                             cur_nr_sectors, &hd_qiov);
+        ret = bdrv_co_pwritev(bs->file, payload_offset + offset + bytes_done,
+                              cur_bytes, &hd_qiov, flags);
        if (ret < 0) {
            goto cleanup;
        }

-        remaining_sectors -= cur_nr_sectors;
-        sector_num += cur_nr_sectors;
-        bytes_done += cur_nr_sectors * 512;
+        bytes -= cur_bytes;
+        bytes_done += cur_bytes;
    }

 cleanup:
@@ -516,13 +514,22 @@ block_crypto_co_writev(BlockDriverState *bs, int64_t sector_num,
    return ret;
 }

+static void block_crypto_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    BlockCrypto *crypto = bs->opaque;
+    uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
+    bs->bl.request_alignment = sector_size; /* No sub-sector I/O */
+}
+

 static int64_t block_crypto_getlength(BlockDriverState *bs)
 {
    BlockCrypto *crypto = bs->opaque;
    int64_t len = bdrv_getlength(bs->file->bs);

-    ssize_t offset = qcrypto_block_get_payload_offset(crypto->block);
+    uint64_t offset = qcrypto_block_get_payload_offset(crypto->block);
+    assert(offset < INT64_MAX);
+    assert(offset < len);

    len -= offset;

@@ -613,8 +620,9 @@ BlockDriver bdrv_crypto_luks = {
    .bdrv_truncate      = block_crypto_truncate,
    .create_opts        = &block_crypto_create_opts_luks,

-    .bdrv_co_readv      = block_crypto_co_readv,
-    .bdrv_co_writev     = block_crypto_co_writev,
+    .bdrv_refresh_limits = block_crypto_refresh_limits,
+    .bdrv_co_preadv     = block_crypto_co_preadv,
+    .bdrv_co_pwritev    = block_crypto_co_pwritev,
    .bdrv_getlength     = block_crypto_getlength,
    .bdrv_get_info      = block_crypto_get_info_luks,
    .bdrv_get_specific_info = block_crypto_get_specific_info_luks,
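
Note on the block_crypto_co_preadv()/block_crypto_co_pwritev() hunks above: the request is processed through a bounded bounce buffer in byte-sized chunks, with bytes_done tracking progress instead of a sector counter. A minimal sketch of that loop shape follows. It is not the QEMU implementation: MAX_IO_SIZE is a deliberately tiny stand-in for BLOCK_CRYPTO_MAX_IO_SIZE, the XOR step is a placeholder for real encryption, and transform_chunked() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_IO_SIZE 8 /* tiny stand-in for BLOCK_CRYPTO_MAX_IO_SIZE */
#define MIN_SZ(a, b) ((a) < (b) ? (a) : (b))

/* Copies src to dst through a bounce buffer, transforming each chunk.
 * Returns the number of chunks processed, or -1 if allocation fails. */
static int transform_chunked(const uint8_t *src, uint8_t *dst,
                             size_t bytes, uint8_t key)
{
    size_t bytes_done = 0;
    int chunks = 0;
    /* Never allocate more than one chunk; avoid a zero-sized malloc. */
    uint8_t *bounce = malloc(MIN_SZ((size_t)MAX_IO_SIZE, bytes ? bytes : 1));

    assert(src && dst);
    if (!bounce) {
        return -1;
    }

    while (bytes) {
        size_t cur_bytes = MIN_SZ(bytes, (size_t)MAX_IO_SIZE);
        size_t i;

        /* Work on a private copy so the caller's source stays untouched. */
        memcpy(bounce, src + bytes_done, cur_bytes);
        for (i = 0; i < cur_bytes; i++) {
            bounce[i] ^= key; /* placeholder for the cipher step */
        }
        memcpy(dst + bytes_done, bounce, cur_bytes);

        bytes -= cur_bytes;
        bytes_done += cur_bytes;
        chunks++;
    }

    free(bounce);
    return chunks;
}
```

Sizing the buffer as the smaller of the chunk limit and the request keeps small requests cheap while capping memory use for large ones, which is the tradeoff the new BLOCK_CRYPTO_MAX_IO_SIZE comment describes.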
--- a/block/dirty-bitmap.c
+++ b/block/dirty-bitmap.c
@@ -1,7 +1,7 @@
 /*
 * Block Dirty Bitmap
 *
- * Copyright (c) 2016 Red Hat. Inc
+ * Copyright (c) 2016-2017 Red Hat. Inc
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
@@ -38,11 +38,11 @@
 */
 struct BdrvDirtyBitmap {
    QemuMutex *mutex;
-    HBitmap *bitmap;            /* Dirty sector bitmap implementation */
+    HBitmap *bitmap;            /* Dirty bitmap implementation */
    HBitmap *meta;              /* Meta dirty bitmap */
    BdrvDirtyBitmap *successor; /* Anonymous child; implies frozen status */
    char *name;                 /* Optional non-empty unique ID */
-    int64_t size;               /* Size of the bitmap (Number of sectors) */
+    int64_t size;               /* Size of the bitmap, in bytes */
    bool disabled;              /* Bitmap is disabled. It ignores all writes to
                                   the device */
    int active_iterators;       /* How many iterators are active */
@@ -115,17 +115,14 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
 {
    int64_t bitmap_size;
    BdrvDirtyBitmap *bitmap;
-    uint32_t sector_granularity;

-    assert((granularity & (granularity - 1)) == 0);
+    assert(is_power_of_2(granularity) && granularity >= BDRV_SECTOR_SIZE);

    if (name && bdrv_find_dirty_bitmap(bs, name)) {
        error_setg(errp, "Bitmap already exists: %s", name);
        return NULL;
    }
-    sector_granularity = granularity >> BDRV_SECTOR_BITS;
-    assert(sector_granularity);
-    bitmap_size = bdrv_nb_sectors(bs);
+    bitmap_size = bdrv_getlength(bs);
    if (bitmap_size < 0) {
        error_setg_errno(errp, -bitmap_size, "could not get length of device");
        errno = -bitmap_size;
@@ -133,7 +130,7 @@ BdrvDirtyBitmap *bdrv_create_dirty_bitmap(BlockDriverState *bs,
    }
    bitmap = g_new0(BdrvDirtyBitmap, 1);
    bitmap->mutex = &bs->dirty_bitmap_mutex;
-    bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(sector_granularity));
+    bitmap->bitmap = hbitmap_alloc(bitmap_size, ctz32(granularity));
    bitmap->size = bitmap_size;
    bitmap->name = g_strdup(name);
    bitmap->disabled = false;
@@ -173,45 +170,6 @@ void bdrv_release_meta_dirty_bitmap(BdrvDirtyBitmap *bitmap)
    qemu_mutex_unlock(bitmap->mutex);
 }

-int bdrv_dirty_bitmap_get_meta_locked(BlockDriverState *bs,
-                                      BdrvDirtyBitmap *bitmap, int64_t sector,
-                                      int nb_sectors)
-{
-    uint64_t i;
-    int sectors_per_bit = 1 << hbitmap_granularity(bitmap->meta);
-
-    /* To optimize: we can make hbitmap to internally check the range in a
-     * coarse level, or at least do it word by word. */
-    for (i = sector; i < sector + nb_sectors; i += sectors_per_bit) {
-        if (hbitmap_get(bitmap->meta, i)) {
-            return true;
-        }
-    }
-    return false;
-}
-
-int bdrv_dirty_bitmap_get_meta(BlockDriverState *bs,
-                               BdrvDirtyBitmap *bitmap, int64_t sector,
-                               int nb_sectors)
-{
-    bool dirty;
-
-    qemu_mutex_lock(bitmap->mutex);
-    dirty = bdrv_dirty_bitmap_get_meta_locked(bs, bitmap, sector, nb_sectors);
-    qemu_mutex_unlock(bitmap->mutex);
-
-    return dirty;
-}
-
-void bdrv_dirty_bitmap_reset_meta(BlockDriverState *bs,
-                                  BdrvDirtyBitmap *bitmap, int64_t sector,
-                                  int nb_sectors)
-{
-    qemu_mutex_lock(bitmap->mutex);
-    hbitmap_reset(bitmap->meta, sector, nb_sectors);
-    qemu_mutex_unlock(bitmap->mutex);
-}
-
 int64_t bdrv_dirty_bitmap_size(const BdrvDirtyBitmap *bitmap)
 {
    return bitmap->size;
@@ -341,17 +299,16 @@ BdrvDirtyBitmap *bdrv_reclaim_dirty_bitmap(BlockDriverState *bs,
 * Truncates _all_ bitmaps attached to a BDS.
 * Called with BQL taken.
 */
-void bdrv_dirty_bitmap_truncate(BlockDriverState *bs)
+void bdrv_dirty_bitmap_truncate(BlockDriverState *bs, int64_t bytes)
 {
    BdrvDirtyBitmap *bitmap;
-    uint64_t size = bdrv_nb_sectors(bs);

    bdrv_dirty_bitmaps_lock(bs);
    QLIST_FOREACH(bitmap, &bs->dirty_bitmaps, list) {
        assert(!bdrv_dirty_bitmap_frozen(bitmap));
        assert(!bitmap->active_iterators);
-        hbitmap_truncate(bitmap->bitmap, size);
-        bitmap->size = size;
+        hbitmap_truncate(bitmap->bitmap, bytes);
+        bitmap->size = bytes;
    }
    bdrv_dirty_bitmaps_unlock(bs);
 }
@@ -461,7 +418,7 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
    QLIST_FOREACH(bm, &bs->dirty_bitmaps, list) {
        BlockDirtyInfo *info = g_new0(BlockDirtyInfo, 1);
        BlockDirtyInfoList *entry = g_new0(BlockDirtyInfoList, 1);
-        info->count = bdrv_get_dirty_count(bm) << BDRV_SECTOR_BITS;
+        info->count = bdrv_get_dirty_count(bm);
        info->granularity = bdrv_dirty_bitmap_granularity(bm);
        info->has_name = !!bm->name;
        info->name = g_strdup(bm->name);
@@ -476,13 +433,13 @@ BlockDirtyInfoList *bdrv_query_dirty_bitmaps(BlockDriverState *bs)
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
-int bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
-                          int64_t sector)
+bool bdrv_get_dirty_locked(BlockDriverState *bs, BdrvDirtyBitmap *bitmap,
+                           int64_t offset)
 {
    if (bitmap) {
-        return hbitmap_get(bitmap->bitmap, sector);
+        return hbitmap_get(bitmap->bitmap, offset);
    } else {
-        return 0;
+        return false;
    }
 }

@@ -508,19 +465,13 @@ uint32_t bdrv_get_default_bitmap_granularity(BlockDriverState *bs)

 uint32_t bdrv_dirty_bitmap_granularity(const BdrvDirtyBitmap *bitmap)
 {
-    return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->bitmap);
+    return 1U << hbitmap_granularity(bitmap->bitmap);
 }

-uint32_t bdrv_dirty_bitmap_meta_granularity(BdrvDirtyBitmap *bitmap)
-{
-    return BDRV_SECTOR_SIZE << hbitmap_granularity(bitmap->meta);
-}
-
-BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap,
-                                         uint64_t first_sector)
+BdrvDirtyBitmapIter *bdrv_dirty_iter_new(BdrvDirtyBitmap *bitmap)
 {
    BdrvDirtyBitmapIter *iter = g_new(BdrvDirtyBitmapIter, 1);
-    hbitmap_iter_init(&iter->hbi, bitmap->bitmap, first_sector);
+    hbitmap_iter_init(&iter->hbi, bitmap->bitmap, 0);
    iter->bitmap = bitmap;
    bitmap->active_iterators++;
    return iter;
@@ -552,35 +503,35 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-                                  int64_t cur_sector, int64_t nr_sectors)
+                                  int64_t offset, int64_t bytes)
 {
    assert(bdrv_dirty_bitmap_enabled(bitmap));
    assert(!bdrv_dirty_bitmap_readonly(bitmap));
-    hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
+    hbitmap_set(bitmap->bitmap, offset, bytes);
 }

 void bdrv_set_dirty_bitmap(BdrvDirtyBitmap *bitmap,
-                           int64_t cur_sector, int64_t nr_sectors)
+                           int64_t offset, int64_t bytes)
 {
    bdrv_dirty_bitmap_lock(bitmap);
-    bdrv_set_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors);
+    bdrv_set_dirty_bitmap_locked(bitmap, offset, bytes);
    bdrv_dirty_bitmap_unlock(bitmap);
 }

 /* Called within bdrv_dirty_bitmap_lock..unlock */
 void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
-                                    int64_t cur_sector, int64_t nr_sectors)
+                                    int64_t offset, int64_t bytes)
 {
    assert(bdrv_dirty_bitmap_enabled(bitmap));
    assert(!bdrv_dirty_bitmap_readonly(bitmap));
-    hbitmap_reset(bitmap->bitmap, cur_sector, nr_sectors);
+    hbitmap_reset(bitmap->bitmap, offset, bytes);
 }

 void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
-                             int64_t cur_sector, int64_t nr_sectors)
+                             int64_t offset, int64_t bytes)
 {
    bdrv_dirty_bitmap_lock(bitmap);
-    bdrv_reset_dirty_bitmap_locked(bitmap, cur_sector, nr_sectors);
+    bdrv_reset_dirty_bitmap_locked(bitmap, offset, bytes);
    bdrv_dirty_bitmap_unlock(bitmap);
 }

@@ -610,42 +561,42 @@ void bdrv_undo_clear_dirty_bitmap(BdrvDirtyBitmap *bitmap, HBitmap *in)
 }

 uint64_t bdrv_dirty_bitmap_serialization_size(const BdrvDirtyBitmap *bitmap,
-                                              uint64_t start, uint64_t count)
+                                              uint64_t offset, uint64_t bytes)
 {
-    return hbitmap_serialization_size(bitmap->bitmap, start, count);
+    return hbitmap_serialization_size(bitmap->bitmap, offset, bytes);
 }

 uint64_t bdrv_dirty_bitmap_serialization_align(const BdrvDirtyBitmap *bitmap)
 {
-    return hbitmap_serialization_granularity(bitmap->bitmap);
+    return hbitmap_serialization_align(bitmap->bitmap);
 }

 void bdrv_dirty_bitmap_serialize_part(const BdrvDirtyBitmap *bitmap,
-                                      uint8_t *buf, uint64_t start,
-                                      uint64_t count)
+                                      uint8_t *buf, uint64_t offset,
+                                      uint64_t bytes)
 {
-    hbitmap_serialize_part(bitmap->bitmap, buf, start, count);
+    hbitmap_serialize_part(bitmap->bitmap, buf, offset, bytes);
 }

 void bdrv_dirty_bitmap_deserialize_part(BdrvDirtyBitmap *bitmap,
-                                        uint8_t *buf, uint64_t start,
-                                        uint64_t count, bool finish)
+                                        uint8_t *buf, uint64_t offset,
+                                        uint64_t bytes, bool finish)
 {
-    hbitmap_deserialize_part(bitmap->bitmap, buf, start, count, finish);
+    hbitmap_deserialize_part(bitmap->bitmap, buf, offset, bytes, finish);
 }

 void bdrv_dirty_bitmap_deserialize_zeroes(BdrvDirtyBitmap *bitmap,
-                                          uint64_t start, uint64_t count,
+                                          uint64_t offset, uint64_t bytes,
                                          bool finish)
 {
-    hbitmap_deserialize_zeroes(bitmap->bitmap, start, count, finish);
+    hbitmap_deserialize_zeroes(bitmap->bitmap, offset, bytes, finish);
 }

 void bdrv_dirty_bitmap_deserialize_ones(BdrvDirtyBitmap *bitmap,
-                                        uint64_t start, uint64_t count,
+                                        uint64_t offset, uint64_t bytes,
                                        bool finish)
 {
-    hbitmap_deserialize_ones(bitmap->bitmap, start, count, finish);
+    hbitmap_deserialize_ones(bitmap->bitmap, offset, bytes, finish);
 }

 void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap)
@@ -653,8 +604,7 @@ void bdrv_dirty_bitmap_deserialize_finish(BdrvDirtyBitmap *bitmap)
    hbitmap_deserialize_finish(bitmap->bitmap);
 }

-void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
-                    int64_t nr_sectors)
+void bdrv_set_dirty(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
    BdrvDirtyBitmap *bitmap;

@@ -668,7 +618,7 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
            continue;
        }
        assert(!bdrv_dirty_bitmap_readonly(bitmap));
-        hbitmap_set(bitmap->bitmap, cur_sector, nr_sectors);
+        hbitmap_set(bitmap->bitmap, offset, bytes);
    }
    bdrv_dirty_bitmaps_unlock(bs);
 }
@@ -676,9 +626,9 @@ void bdrv_set_dirty(BlockDriverState *bs, int64_t cur_sector,
 /**
 * Advance a BdrvDirtyBitmapIter to an arbitrary offset.
 */
-void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t sector_num)
+void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *iter, int64_t offset)
 {
-    hbitmap_iter_init(&iter->hbi, iter->hbi.hb, sector_num);
+    hbitmap_iter_init(&iter->hbi, iter->hbi.hb, offset);
 }

 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap)
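
The granularity hunk above replaces `BDRV_SECTOR_SIZE << hbitmap_granularity(...)` with `1U << hbitmap_granularity(...)`. Both forms return the same byte granularity only if the stored shift grows by the sector shift. A minimal standalone sketch of that arithmetic follows, assuming 512-byte sectors. The `SKETCH_*` constants and helper names are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors, i.e. a sector shift of 9. */
#define SKETCH_SECTOR_BITS 9
#define SKETCH_SECTOR_SIZE (1U << SKETCH_SECTOR_BITS)

/* Old convention: the shift counts sectors per bitmap bit. */
static inline uint32_t granularity_from_sector_shift(unsigned shift)
{
    assert(shift + SKETCH_SECTOR_BITS < 32);
    return SKETCH_SECTOR_SIZE << shift;
}

/* New convention: the shift counts bytes per bitmap bit. */
static inline uint32_t granularity_from_byte_shift(unsigned shift)
{
    assert(shift < 32);
    return 1U << shift;
}

/* Translate a sector-based shift into the equivalent byte-based shift. */
static inline unsigned sector_shift_to_byte_shift(unsigned shift)
{
    return shift + SKETCH_SECTOR_BITS;
}
```

For a 64 KiB granularity, the sector-based shift would be 7 and the byte-based shift 16; both helpers then return 65536.
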
--- a/block/dmg.c
+++ b/block/dmg.c
@@ -419,10 +419,16 @@ static int dmg_open(BlockDriverState *bs, QDict *options, int flags,
        return -EINVAL;
    }

+    if (!bdrv_is_read_only(bs)) {
+        error_report("Opening dmg images without an explicit read-only=on "
+                     "option is deprecated. Future versions will refuse to "
+                     "open the image instead of automatically marking the "
+                     "image read-only.");
        ret = bdrv_set_read_only(bs, true, errp);
        if (ret < 0) {
            return ret;
        }
+    }

    block_module_load_one("dmg-bz2");

--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -33,6 +33,9 @@
 #include "block/raw-aio.h"
 #include "qapi/qmp/qstring.h"

+#include "scsi/pr-manager.h"
+#include "scsi/constants.h"
+
 #if defined(__APPLE__) && (__MACH__)
 #include <paths.h>
 #include <sys/param.h>
@@ -155,6 +158,8 @@ typedef struct BDRVRawState {
    bool page_cache_inconsistent:1;
    bool has_fallocate;
    bool needs_alignment;
+
+    PRManager *pr_mgr;
 } BDRVRawState;

 typedef struct BDRVRawReopenState {
@@ -402,6 +407,11 @@ static QemuOptsList raw_runtime_opts = {
            .type = QEMU_OPT_STRING,
            .help = "file locking mode (on/off/auto, default: auto)",
        },
+        {
+            .name = "pr-manager",
+            .type = QEMU_OPT_STRING,
+            .help = "id of persistent reservation manager object (default: none)",
+        },
        { /* end of list */ }
    },
 };
@@ -413,6 +423,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
    QemuOpts *opts;
    Error *local_err = NULL;
    const char *filename = NULL;
+    const char *str;
    BlockdevAioOptions aio, aio_default;
    int fd, ret;
    struct stat st;
@@ -476,6 +487,16 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
        abort();
    }

+    str = qemu_opt_get(opts, "pr-manager");
+    if (str) {
+        s->pr_mgr = pr_manager_lookup(str, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            ret = -EINVAL;
+            goto fail;
+        }
+    }
+
    s->open_flags = open_flags;
    raw_parse_flags(bdrv_flags, &s->open_flags);

@@ -2597,6 +2618,15 @@ static BlockAIOCB *hdev_aio_ioctl(BlockDriverState *bs,
    if (fd_open(bs) < 0)
        return NULL;

+    if (req == SG_IO && s->pr_mgr) {
+        struct sg_io_hdr *io_hdr = buf;
+        if (io_hdr->cmdp[0] == PERSISTENT_RESERVE_OUT ||
+            io_hdr->cmdp[0] == PERSISTENT_RESERVE_IN) {
+            return pr_manager_execute(s->pr_mgr, bdrv_get_aio_context(bs),
+                                      s->fd, io_hdr, cb, opaque);
+        }
+    }
+
    acb = g_new(RawPosixAIOData, 1);
    acb->bs = bs;
    acb->aio_type = QEMU_AIO_IOCTL;
@@ -2700,6 +2730,16 @@ static int hdev_create(const char *filename, QemuOpts *opts,
        ret = -ENOSPC;
    }

+    if (!ret && total_size) {
+        uint8_t buf[BDRV_SECTOR_SIZE] = { 0 };
+        int64_t zero_size = MIN(BDRV_SECTOR_SIZE, total_size);
+        if (lseek(fd, 0, SEEK_SET) == -1) {
+            ret = -errno;
+        } else {
+            ret = qemu_write_full(fd, buf, zero_size);
+            ret = ret == zero_size ? 0 : -errno;
+        }
+    }
    qemu_close(fd);
    return ret;
 }
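
The `hdev_aio_ioctl` hunk above routes an `SG_IO` request to the persistent reservation manager when the first CDB byte is `PERSISTENT_RESERVE_IN` or `PERSISTENT_RESERVE_OUT`. The opcode test can be sketched in isolation as below. The opcode values are the standard SCSI ones (0x5e and 0x5f); the `SKETCH_*` macros and the helper name are hypothetical, and the explicit length and NULL checks are an addition of this sketch, not something the patch performs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Standard SCSI opcodes for the two persistent reservation commands. */
#define SKETCH_PERSISTENT_RESERVE_IN  0x5e
#define SKETCH_PERSISTENT_RESERVE_OUT 0x5f

/*
 * Return true when the first byte of the CDB is a persistent
 * reservation opcode. An empty or missing CDB never matches.
 */
static inline bool sketch_is_pr_command(const uint8_t *cdb, size_t len)
{
    if (cdb == NULL || len == 0) {
        return false;
    }
    return cdb[0] == SKETCH_PERSISTENT_RESERVE_IN ||
           cdb[0] == SKETCH_PERSISTENT_RESERVE_OUT;
}
```

Any other opcode, for example READ(10) at 0x28, falls through to the regular ioctl path.
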
--- a/block/io.c
+++ b/block/io.c
@@ -34,6 +34,9 @@

 #define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */

+/* Maximum bounce buffer for copy-on-read and write zeroes, in bytes */
+#define MAX_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
+
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
    int64_t offset, int bytes, BdrvRequestFlags flags);

@@ -153,6 +156,7 @@ typedef struct {
    Coroutine *co;
    BlockDriverState *bs;
    bool done;
+    bool begin;
 } BdrvCoDrainData;

 static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
@@ -160,18 +164,23 @@ static void coroutine_fn bdrv_drain_invoke_entry(void *opaque)
    BdrvCoDrainData *data = opaque;
    BlockDriverState *bs = data->bs;

-    bs->drv->bdrv_co_drain(bs);
+    if (data->begin) {
+        bs->drv->bdrv_co_drain_begin(bs);
+    } else {
+        bs->drv->bdrv_co_drain_end(bs);
+    }

    /* Set data->done before reading bs->wakeup.  */
    atomic_mb_set(&data->done, true);
    bdrv_wakeup(bs);
 }

-static void bdrv_drain_invoke(BlockDriverState *bs)
+static void bdrv_drain_invoke(BlockDriverState *bs, bool begin)
 {
-    BdrvCoDrainData data = { .bs = bs, .done = false };
+    BdrvCoDrainData data = { .bs = bs, .done = false, .begin = begin };

-    if (!bs->drv || !bs->drv->bdrv_co_drain) {
+    if (!bs->drv || (begin && !bs->drv->bdrv_co_drain_begin) ||
+            (!begin && !bs->drv->bdrv_co_drain_end)) {
        return;
    }

@@ -180,15 +189,16 @@ static void bdrv_drain_invoke(BlockDriverState *bs)
    BDRV_POLL_WHILE(bs, !data.done);
 }

-static bool bdrv_drain_recurse(BlockDriverState *bs)
+static bool bdrv_drain_recurse(BlockDriverState *bs, bool begin)
 {
    BdrvChild *child, *tmp;
    bool waited;

-    waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
-
    /* Ensure any pending metadata writes are submitted to bs->file.  */
-    bdrv_drain_invoke(bs);
+    bdrv_drain_invoke(bs, begin);
+
+    /* Wait for drained requests to finish */
+    waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);

    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
        BlockDriverState *bs = child->bs;
@@ -205,7 +215,7 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
             */
            bdrv_ref(bs);
        }
-        waited |= bdrv_drain_recurse(bs);
+        waited |= bdrv_drain_recurse(bs, begin);
        if (in_main_loop) {
            bdrv_unref(bs);
        }
@@ -221,12 +231,18 @@ static void bdrv_co_drain_bh_cb(void *opaque)
    BlockDriverState *bs = data->bs;

    bdrv_dec_in_flight(bs);
+    if (data->begin) {
        bdrv_drained_begin(bs);
+    } else {
+        bdrv_drained_end(bs);
+    }
+
    data->done = true;
    aio_co_wake(co);
 }

-static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
+static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs,
+                                                bool begin)
 {
    BdrvCoDrainData data;

@@ -239,6 +255,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
        .co = qemu_coroutine_self(),
        .bs = bs,
        .done = false,
+        .begin = begin,
    };
    bdrv_inc_in_flight(bs);
    aio_bh_schedule_oneshot(bdrv_get_aio_context(bs),
@@ -253,7 +270,7 @@ static void coroutine_fn bdrv_co_yield_to_drain(BlockDriverState *bs)
 void bdrv_drained_begin(BlockDriverState *bs)
 {
    if (qemu_in_coroutine()) {
-        bdrv_co_yield_to_drain(bs);
+        bdrv_co_yield_to_drain(bs, true);
        return;
    }

@@ -262,17 +279,22 @@ void bdrv_drained_begin(BlockDriverState *bs)
        bdrv_parent_drained_begin(bs);
    }

-    bdrv_drain_recurse(bs);
+    bdrv_drain_recurse(bs, true);
 }

 void bdrv_drained_end(BlockDriverState *bs)
 {
+    if (qemu_in_coroutine()) {
+        bdrv_co_yield_to_drain(bs, false);
+        return;
+    }
    assert(bs->quiesce_counter > 0);
    if (atomic_fetch_dec(&bs->quiesce_counter) > 1) {
        return;
    }

    bdrv_parent_drained_end(bs);
+    bdrv_drain_recurse(bs, false);
    aio_enable_external(bdrv_get_aio_context(bs));
 }

@@ -350,7 +372,7 @@ void bdrv_drain_all_begin(void)
            aio_context_acquire(aio_context);
            for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
                if (aio_context == bdrv_get_aio_context(bs)) {
-                    waited |= bdrv_drain_recurse(bs);
+                    waited |= bdrv_drain_recurse(bs, true);
                }
            }
            aio_context_release(aio_context);
@@ -371,6 +393,7 @@ void bdrv_drain_all_end(void)
        aio_context_acquire(aio_context);
        aio_enable_external(aio_context);
        bdrv_parent_drained_end(bs);
+        bdrv_drain_recurse(bs, false);
        aio_context_release(aio_context);
    }

@@ -446,9 +469,9 @@ static void mark_request_serialising(BdrvTrackedRequest *req, uint64_t align)
 * Round a region to cluster boundaries
 */
 void bdrv_round_to_clusters(BlockDriverState *bs,
-                            int64_t offset, unsigned int bytes,
+                            int64_t offset, int64_t bytes,
                            int64_t *cluster_offset,
-                            unsigned int *cluster_bytes)
+                            int64_t *cluster_bytes)
 {
    BlockDriverInfo bdi;

@@ -693,39 +716,37 @@ int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
 */
 int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
-    int64_t target_sectors, ret, nb_sectors, sector_num = 0;
+    int ret;
+    int64_t target_size, bytes, offset = 0;
    BlockDriverState *bs = child->bs;
-    BlockDriverState *file;
-    int n;

-    target_sectors = bdrv_nb_sectors(bs);
-    if (target_sectors < 0) {
-        return target_sectors;
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
    }

    for (;;) {
-        nb_sectors = MIN(target_sectors - sector_num, BDRV_REQUEST_MAX_SECTORS);
-        if (nb_sectors <= 0) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
            return 0;
        }
-        ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &n, &file);
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
        if (ret < 0) {
-            error_report("error getting block status at sector %" PRId64 ": %s",
-                         sector_num, strerror(-ret));
+            error_report("error getting block status at offset %" PRId64 ": %s",
+                         offset, strerror(-ret));
            return ret;
        }
        if (ret & BDRV_BLOCK_ZERO) {
-            sector_num += n;
+            offset += bytes;
            continue;
        }
-        ret = bdrv_pwrite_zeroes(child, sector_num << BDRV_SECTOR_BITS,
-                                 n << BDRV_SECTOR_BITS, flags);
+        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
        if (ret < 0) {
-            error_report("error writing zeroes at sector %" PRId64 ": %s",
-                         sector_num, strerror(-ret));
+            error_report("error writing zeroes at offset %" PRId64 ": %s",
+                         offset, strerror(-ret));
            return ret;
        }
-        sector_num += n;
+        offset += bytes;
    }
 }

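
The rewritten `bdrv_make_zero` loop above works purely in bytes: it clamps each request, asks for the block status, lets the status call shrink `bytes`, and advances `offset` by that amount. The same loop shape can be sketched against a toy image model. Everything named `Sketch*` or `sketch_*` is hypothetical; the model assumes fixed-size extents and only counts the bytes that would need explicit zeroing instead of issuing writes.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical image model: fixed-size extents, each zero or data. */
typedef struct SketchImage {
    int64_t size;         /* image length in bytes */
    int64_t extent_size;  /* bytes per extent, must be > 0 */
    const bool *is_zero;  /* one flag per extent */
} SketchImage;

/*
 * Report whether 'offset' reads as zero. '*pnum' receives how many
 * bytes, at most 'bytes', share that status; the run stops at the
 * end of the extent containing 'offset'.
 */
static inline bool sketch_status(const SketchImage *img, int64_t offset,
                                 int64_t bytes, int64_t *pnum)
{
    int64_t idx = offset / img->extent_size;
    int64_t extent_end = (idx + 1) * img->extent_size;

    *pnum = SKETCH_MIN(bytes, extent_end - offset);
    return img->is_zero[idx];
}

/* Walk the image in byte units; return the bytes needing zeroing. */
static inline int64_t sketch_bytes_to_zero(const SketchImage *img,
                                           int64_t max_request)
{
    int64_t offset = 0;
    int64_t total = 0;

    for (;;) {
        int64_t bytes = SKETCH_MIN(img->size - offset, max_request);

        if (bytes <= 0) {
            return total;
        }
        /* The status call may shrink 'bytes', as in the loop above. */
        if (!sketch_status(img, offset, bytes, &bytes)) {
            total += bytes;
        }
        offset += bytes;
    }
}
```

The result does not depend on `max_request`: a smaller limit only splits the walk into more iterations.
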
@@ -832,6 +853,10 @@ static int coroutine_fn bdrv_driver_preadv(BlockDriverState *bs,

    assert(!(flags & ~BDRV_REQ_MASK));

+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+
    if (drv->bdrv_co_preadv) {
        return drv->bdrv_co_preadv(bs, offset, bytes, qiov, flags);
    }
@@ -873,6 +898,10 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs,

    assert(!(flags & ~BDRV_REQ_MASK));

+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+
    if (drv->bdrv_co_pwritev) {
        ret = drv->bdrv_co_pwritev(bs, offset, bytes, qiov,
                                   flags & bs->supported_write_flags);
@@ -924,6 +953,10 @@ bdrv_driver_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
 {
    BlockDriver *drv = bs->drv;

+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+
    if (!drv->bdrv_co_pwritev_compressed) {
        return -ENOTSUP;
    }
@@ -945,68 +978,118 @@ static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,

    BlockDriver *drv = bs->drv;
    struct iovec iov;
-    QEMUIOVector bounce_qiov;
+    QEMUIOVector local_qiov;
    int64_t cluster_offset;
-    unsigned int cluster_bytes;
+    int64_t cluster_bytes;
    size_t skip_bytes;
    int ret;
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+                                    BDRV_REQUEST_MAX_BYTES);
+    unsigned int progress = 0;
+
+    if (!drv) {
+        return -ENOMEDIUM;
+    }

    /* FIXME We cannot require callers to have write permissions when all they
     * are doing is a read request. If we did things right, write permissions
     * would be obtained anyway, but internally by the copy-on-read code. As
-     * long as it is implemented here rather than in a separat filter driver,
+     * long as it is implemented here rather than in a separate filter driver,
     * the copy-on-read code doesn't have its own BdrvChild, however, for which
     * it could request permissions. Therefore we have to bypass the permission
     * system for the moment. */
    // assert(child->perm & (BLK_PERM_WRITE_UNCHANGED | BLK_PERM_WRITE));

    /* Cover entire cluster so no additional backing file I/O is required when
-     * allocating cluster in the image file.
+     * allocating cluster in the image file.  Note that this value may exceed
+     * BDRV_REQUEST_MAX_BYTES (even when the original read did not), which
+     * is one reason we loop rather than doing it all at once.
     */
    bdrv_round_to_clusters(bs, offset, bytes, &cluster_offset, &cluster_bytes);
+    skip_bytes = offset - cluster_offset;

    trace_bdrv_co_do_copy_on_readv(bs, offset, bytes,
                                   cluster_offset, cluster_bytes);

-    iov.iov_len = cluster_bytes;
-    iov.iov_base = bounce_buffer = qemu_try_blockalign(bs, iov.iov_len);
+    bounce_buffer = qemu_try_blockalign(bs,
+                                        MIN(MIN(max_transfer, cluster_bytes),
+                                            MAX_BOUNCE_BUFFER));
    if (bounce_buffer == NULL) {
        ret = -ENOMEM;
        goto err;
    }

-    qemu_iovec_init_external(&bounce_qiov, &iov, 1);
+    while (cluster_bytes) {
+        int64_t pnum;

-    ret = bdrv_driver_preadv(bs, cluster_offset, cluster_bytes,
-                             &bounce_qiov, 0);
+        ret = bdrv_is_allocated(bs, cluster_offset,
+                                MIN(cluster_bytes, max_transfer), &pnum);
+        if (ret < 0) {
+            /* Safe to treat errors in querying allocation as if
+             * unallocated; we'll probably fail again soon on the
+             * read, but at least that will set a decent errno.
+             */
+            pnum = MIN(cluster_bytes, max_transfer);
+        }
+
+        assert(skip_bytes < pnum);
+
+        if (ret <= 0) {
+            /* Must copy-on-read; use the bounce buffer */
+            iov.iov_base = bounce_buffer;
+            iov.iov_len = pnum = MIN(pnum, MAX_BOUNCE_BUFFER);
+            qemu_iovec_init_external(&local_qiov, &iov, 1);
+
+            ret = bdrv_driver_preadv(bs, cluster_offset, pnum,
+                                     &local_qiov, 0);
            if (ret < 0) {
                goto err;
            }

+            bdrv_debug_event(bs, BLKDBG_COR_WRITE);
            if (drv->bdrv_co_pwrite_zeroes &&
-        buffer_is_zero(bounce_buffer, iov.iov_len)) {
+                buffer_is_zero(bounce_buffer, pnum)) {
                /* FIXME: Should we (perhaps conditionally) be setting
                 * BDRV_REQ_MAY_UNMAP, if it will allow for a sparser copy
                 * that still correctly reads as zero? */
-        ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, cluster_bytes, 0);
+                ret = bdrv_co_do_pwrite_zeroes(bs, cluster_offset, pnum, 0);
            } else {
-        /* This does not change the data on the disk, it is not necessary
-         * to flush even in cache=writethrough mode.
+                /* This does not change the data on the disk, it is not
+                 * necessary to flush even in cache=writethrough mode.
                 */
-        ret = bdrv_driver_pwritev(bs, cluster_offset, cluster_bytes,
-                                  &bounce_qiov, 0);
+                ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+                                          &local_qiov, 0);
            }

            if (ret < 0) {
-        /* It might be okay to ignore write errors for guest requests.  If this
-         * is a deliberate copy-on-read then we don't want to ignore the error.
-         * Simply report it in all cases.
+                /* It might be okay to ignore write errors for guest
+                 * requests.  If this is a deliberate copy-on-read
+                 * then we don't want to ignore the error.  Simply
+                 * report it in all cases.
                 */
                goto err;
            }

-    skip_bytes = offset - cluster_offset;
-    qemu_iovec_from_buf(qiov, 0, bounce_buffer + skip_bytes, bytes);
+            qemu_iovec_from_buf(qiov, progress, bounce_buffer + skip_bytes,
+                                pnum - skip_bytes);
+        } else {
+            /* Read directly into the destination */
+            qemu_iovec_init(&local_qiov, qiov->niov);
+            qemu_iovec_concat(&local_qiov, qiov, progress, pnum - skip_bytes);
+            ret = bdrv_driver_preadv(bs, offset + progress, local_qiov.size,
+                                     &local_qiov, 0);
+            qemu_iovec_destroy(&local_qiov);
+            if (ret < 0) {
+                goto err;
+            }
+        }
+
+        cluster_offset += pnum;
+        cluster_bytes -= pnum;
+        progress += pnum - skip_bytes;
+        skip_bytes = 0;
+    }
+    ret = 0;

 err:
    qemu_vfree(bounce_buffer);
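
The copy-on-read hunk above replaces a single cluster-sized bounce read with a loop that tracks `skip_bytes` (the head of the first cluster that precedes the request) and `progress` (bytes already delivered to the caller). The bookkeeping can be sketched over an in-memory buffer as below. This is a simplification under stated assumptions: the function name and parameters are hypothetical, the image size is a multiple of the cluster size, every chunk goes through the bounce buffer, and the allocation query and write-back from the patch are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Copy [offset, offset + bytes) of 'image' into 'dest', reading whole
 * clusters through a bounce buffer of 'bounce_size' bytes. Returns the
 * number of bounce reads that were needed.
 */
static inline int sketch_cor_read(const uint8_t *image, int64_t image_size,
                                  int64_t cluster_size,
                                  int64_t offset, int64_t bytes,
                                  uint8_t *bounce, int64_t bounce_size,
                                  uint8_t *dest)
{
    /* Round the request out to cluster boundaries. */
    int64_t cluster_offset = offset - offset % cluster_size;
    int64_t cluster_end = (offset + bytes + cluster_size - 1)
                          / cluster_size * cluster_size;
    int64_t cluster_bytes = cluster_end - cluster_offset;
    int64_t skip_bytes = offset - cluster_offset;
    int64_t progress = 0;
    int reads = 0;

    assert(cluster_end <= image_size);

    while (cluster_bytes) {
        int64_t pnum = SKETCH_MIN(cluster_bytes, bounce_size);
        int64_t n;

        /* As in the patch, the first chunk must cover the skipped head. */
        assert(skip_bytes < pnum);

        /* Stand-in for the driver read into the bounce buffer. */
        memcpy(bounce, image + cluster_offset, (size_t)pnum);
        reads++;

        /* Deliver only the part the caller asked for. */
        n = SKETCH_MIN(pnum - skip_bytes, bytes - progress);
        memcpy(dest + progress, bounce + skip_bytes, (size_t)n);

        cluster_offset += pnum;
        cluster_bytes -= pnum;
        progress += n;
        skip_bytes = 0;
    }
    return reads;
}
```

Only the first chunk has a non-zero `skip_bytes`; later chunks start on a chunk boundary and copy from the beginning of the bounce buffer.
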
@@ -1057,18 +1140,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
    }

    if (flags & BDRV_REQ_COPY_ON_READ) {
-        /* TODO: Simplify further once bdrv_is_allocated no longer
-         * requires sector alignment */
-        int64_t start = QEMU_ALIGN_DOWN(offset, BDRV_SECTOR_SIZE);
-        int64_t end = QEMU_ALIGN_UP(offset + bytes, BDRV_SECTOR_SIZE);
        int64_t pnum;

-        ret = bdrv_is_allocated(bs, start, end - start, &pnum);
+        ret = bdrv_is_allocated(bs, offset, bytes, &pnum);
        if (ret < 0) {
            goto out;
        }

-        if (!ret || pnum != end - start) {
+        if (!ret || pnum != bytes) {
            ret = bdrv_co_do_copy_on_readv(child, offset, bytes, qiov);
            goto out;
        }
@@ -1212,9 +1291,6 @@ int coroutine_fn bdrv_co_readv(BdrvChild *child, int64_t sector_num,
    return bdrv_co_do_readv(child, sector_num, nb_sectors, qiov, 0);
 }

-/* Maximum buffer for write zeroes fallback, in bytes */
-#define MAX_WRITE_ZEROES_BOUNCE_BUFFER (32768 << BDRV_SECTOR_BITS)
-
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
    int64_t offset, int bytes, BdrvRequestFlags flags)
 {
@@ -1229,8 +1305,11 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
    int max_write_zeroes = MIN_NON_ZERO(bs->bl.max_pwrite_zeroes, INT_MAX);
    int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                        bs->bl.request_alignment);
-    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-                                    MAX_WRITE_ZEROES_BOUNCE_BUFFER);
+    int max_transfer = MIN_NON_ZERO(bs->bl.max_transfer, MAX_BOUNCE_BUFFER);
+
+    if (!drv) {
+        return -ENOMEDIUM;
+    }

    assert(alignment % bs->bl.request_alignment == 0);
    head = offset % alignment;
@@ -1334,11 +1413,14 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
    bool waited;
    int ret;

-    int64_t start_sector = offset >> BDRV_SECTOR_BITS;
    int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
    uint64_t bytes_remaining = bytes;
    int max_transfer;

+    if (!drv) {
+        return -ENOMEDIUM;
+    }
+
    if (bdrv_has_readonly_bitmaps(bs)) {
        return -EPERM;
    }
@@ -1409,7 +1491,7 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
    bdrv_debug_event(bs, BLKDBG_PWRITEV_DONE);

    atomic_inc(&bs->write_gen);
-    bdrv_set_dirty(bs, start_sector, end_sector - start_sector);
+    bdrv_set_dirty(bs, offset, bytes);

    stat64_max(&bs->wr_highest_offset, offset + bytes);

@@ -1703,16 +1785,18 @@ int bdrv_flush_all(void)
 }


-typedef struct BdrvCoGetBlockStatusData {
+typedef struct BdrvCoBlockStatusData {
    BlockDriverState *bs;
    BlockDriverState *base;
+    bool want_zero;
+    int64_t offset;
+    int64_t bytes;
+    int64_t *pnum;
+    int64_t *map;
    BlockDriverState **file;
-    int64_t sector_num;
-    int nb_sectors;
-    int *pnum;
-    int64_t ret;
+    int ret;
    bool done;
-} BdrvCoGetBlockStatusData;
+} BdrvCoBlockStatusData;

 int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
                                                        int64_t sector_num,
@@ -1745,95 +1829,159 @@ int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
 * Drivers not implementing the functionality are assumed to not support
 * backing files, hence all their sectors are reported as allocated.
 *
- * If 'sector_num' is beyond the end of the disk image the return value is
+ * If 'want_zero' is true, the caller is querying for mapping purposes,
+ * and the result should include BDRV_BLOCK_OFFSET_VALID and
+ * BDRV_BLOCK_ZERO where possible; otherwise, the result may omit those
+ * bits particularly if it allows for a larger value in 'pnum'.
+ *
+ * If 'offset' is beyond the end of the disk image the return value is
 * BDRV_BLOCK_EOF and 'pnum' is set to 0.
 *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
- * allocated/unallocated state.
- *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
+ * 'bytes' is the max value 'pnum' should be set to.  If bytes goes
 * beyond the end of the disk image it will be clamped; if 'pnum' is set to
 * the end of the image, then the returned value will include BDRV_BLOCK_EOF.
 *
- * If returned value is positive and BDRV_BLOCK_OFFSET_VALID bit is set, 'file'
- * points to the BDS which the sector range is allocated in.
+ * 'pnum' is set to the number of bytes (including and immediately
+ * following the specified offset) that are easily known to be in the
+ * same allocated/unallocated state.  Note that a second call starting
+ * at the original offset plus returned pnum may have the same status.
+ * The returned value is non-zero on success except at end-of-file.
+ *
+ * Returns negative errno on failure.  Otherwise, if the
+ * BDRV_BLOCK_OFFSET_VALID bit is set, 'map' and 'file' (if non-NULL) are
+ * set to the host mapping and BDS corresponding to the guest offset.
 */
-static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
-                                                     int64_t sector_num,
-                                                     int nb_sectors, int *pnum,
+static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
                                             BlockDriverState **file)
 {
-    int64_t total_sectors;
-    int64_t n;
-    int64_t ret, ret2;
+    int64_t total_size;
+    int64_t n; /* bytes */
+    int ret;
+    int64_t local_map = 0;
+    BlockDriverState *local_file = NULL;
+    int64_t aligned_offset, aligned_bytes;
+    uint32_t align;

-    *file = NULL;
-    total_sectors = bdrv_nb_sectors(bs);
-    if (total_sectors < 0) {
-        return total_sectors;
-    }
-
-    if (sector_num >= total_sectors) {
+    assert(pnum);
    *pnum = 0;
-        return BDRV_BLOCK_EOF;
+    total_size = bdrv_getlength(bs);
+    if (total_size < 0) {
+        ret = total_size;
+        goto early_out;
    }

-    n = total_sectors - sector_num;
-    if (n < nb_sectors) {
-        nb_sectors = n;
+    if (offset >= total_size) {
+        ret = BDRV_BLOCK_EOF;
+        goto early_out;
+    }
+    if (!bytes) {
+        ret = 0;
+        goto early_out;
    }

+    n = total_size - offset;
+    if (n < bytes) {
+        bytes = n;
+    }
+
+    /* Must be non-NULL or bdrv_getlength() would have failed */
+    assert(bs->drv);
    if (!bs->drv->bdrv_co_get_block_status) {
-        *pnum = nb_sectors;
+        *pnum = bytes;
        ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
-        if (sector_num + nb_sectors == total_sectors) {
+        if (offset + bytes == total_size) {
            ret |= BDRV_BLOCK_EOF;
        }
        if (bs->drv->protocol_name) {
-            ret |= BDRV_BLOCK_OFFSET_VALID | (sector_num * BDRV_SECTOR_SIZE);
-            *file = bs;
+            ret |= BDRV_BLOCK_OFFSET_VALID;
+            local_map = offset;
+            local_file = bs;
        }
-        return ret;
+        goto early_out;
    }

    bdrv_inc_in_flight(bs);
-    ret = bs->drv->bdrv_co_get_block_status(bs, sector_num, nb_sectors, pnum,
-                                            file);
-    if (ret < 0) {
-        *pnum = 0;
+
+    /* Round out to request_alignment boundaries */
+    /* TODO: until we have a byte-based driver callback, we also have to
+     * round out to sectors, even if that is bigger than request_alignment */
+    align = MAX(bs->bl.request_alignment, BDRV_SECTOR_SIZE);
+    aligned_offset = QEMU_ALIGN_DOWN(offset, align);
+    aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
+
+    {
+        int count; /* sectors */
+        int64_t longret;
+
+        assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
+                               BDRV_SECTOR_SIZE));
+        /*
+         * The contract allows us to return pnum smaller than bytes, even
+         * if the next query would see the same status; we truncate the
+         * request to avoid overflowing the driver's 32-bit interface.
+         */
+        longret = bs->drv->bdrv_co_get_block_status(
+            bs, aligned_offset >> BDRV_SECTOR_BITS,
+            MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, &count,
+            &local_file);
+        if (longret < 0) {
+            assert(INT_MIN <= longret);
+            ret = longret;
            goto out;
        }
+        if (longret & BDRV_BLOCK_OFFSET_VALID) {
+            local_map = longret & BDRV_BLOCK_OFFSET_MASK;
+        }
+        ret = longret & ~BDRV_BLOCK_OFFSET_MASK;
+        *pnum = count * BDRV_SECTOR_SIZE;
+    }
+
+    /*
+     * The driver's result must be a multiple of request_alignment.
+     * Clamp pnum and adjust map to original request.
+     */
+    assert(QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset);
+    *pnum -= offset - aligned_offset;
+    if (*pnum > bytes) {
+        *pnum = bytes;
+    }
+    if (ret & BDRV_BLOCK_OFFSET_VALID) {
+        local_map += offset - aligned_offset;
+    }

    if (ret & BDRV_BLOCK_RAW) {
-        assert(ret & BDRV_BLOCK_OFFSET_VALID && *file);
-        ret = bdrv_co_get_block_status(*file, ret >> BDRV_SECTOR_BITS,
-                                       *pnum, pnum, file);
+        assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
+        ret = bdrv_co_block_status(local_file, want_zero, local_map,
+                                   *pnum, pnum, &local_map, &local_file);
        goto out;
    }

    if (ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
        ret |= BDRV_BLOCK_ALLOCATED;
-    } else {
+    } else if (want_zero) {
        if (bdrv_unallocated_blocks_are_zero(bs)) {
            ret |= BDRV_BLOCK_ZERO;
        } else if (bs->backing) {
            BlockDriverState *bs2 = bs->backing->bs;
-            int64_t nb_sectors2 = bdrv_nb_sectors(bs2);
-            if (nb_sectors2 >= 0 && sector_num >= nb_sectors2) {
+            int64_t size2 = bdrv_getlength(bs2);
+
+            if (size2 >= 0 && offset >= size2) {
                ret |= BDRV_BLOCK_ZERO;
            }
        }
    }

-    if (*file && *file != bs &&
+    if (want_zero && local_file && local_file != bs &&
        (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
        (ret & BDRV_BLOCK_OFFSET_VALID)) {
-        BlockDriverState *file2;
-        int file_pnum;
+        int64_t file_pnum;
+        int ret2;

-        ret2 = bdrv_co_get_block_status(*file, ret >> BDRV_SECTOR_BITS,
-                                        *pnum, &file_pnum, &file2);
+        ret2 = bdrv_co_block_status(local_file, want_zero, local_map,
+                                    *pnum, &file_pnum, NULL, NULL);
        if (ret2 >= 0) {
            /* Ignore errors.  This is just providing extra information, it
             * is useful but not necessary.
@@ -1856,26 +2004,36 @@ static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,

 out:
    bdrv_dec_in_flight(bs);
-    if (ret >= 0 && sector_num + *pnum == total_sectors) {
+    if (ret >= 0 && offset + *pnum == total_size) {
        ret |= BDRV_BLOCK_EOF;
    }
+early_out:
+    if (file) {
+        *file = local_file;
+    }
+    if (map) {
+        *map = local_map;
+    }
    return ret;
 }

-static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
+static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
                                                   BlockDriverState *base,
-        int64_t sector_num,
-        int nb_sectors,
-        int *pnum,
+                                                   bool want_zero,
+                                                   int64_t offset,
+                                                   int64_t bytes,
+                                                   int64_t *pnum,
+                                                   int64_t *map,
                                                   BlockDriverState **file)
 {
    BlockDriverState *p;
-    int64_t ret = 0;
+    int ret = 0;
    bool first = true;

    assert(bs != base);
    for (p = bs; p != base; p = backing_bs(p)) {
-        ret = bdrv_co_get_block_status(p, sector_num, nb_sectors, pnum, file);
+        ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
+                                   file);
        if (ret < 0) {
            break;
        }
@@ -1886,94 +2044,94 @@ static int64_t coroutine_fn bdrv_co_get_block_status_above(BlockDriverState *bs,
             * unallocated length we learned from an earlier
             * iteration.
             */
-            *pnum = nb_sectors;
+            *pnum = bytes;
        }
        if (ret & (BDRV_BLOCK_ZERO | BDRV_BLOCK_DATA)) {
            break;
        }
-        /* [sector_num, pnum] unallocated on this layer, which could be only
-         * the first part of [sector_num, nb_sectors].  */
-        nb_sectors = MIN(nb_sectors, *pnum);
+        /* [offset, pnum] unallocated on this layer, which could be only
+         * the first part of [offset, bytes].  */
+        bytes = MIN(bytes, *pnum);
        first = false;
    }
    return ret;
 }

-/* Coroutine wrapper for bdrv_get_block_status_above() */
-static void coroutine_fn bdrv_get_block_status_above_co_entry(void *opaque)
+/* Coroutine wrapper for bdrv_block_status_above() */
+static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
-    BdrvCoGetBlockStatusData *data = opaque;
+    BdrvCoBlockStatusData *data = opaque;

-    data->ret = bdrv_co_get_block_status_above(data->bs, data->base,
-                                               data->sector_num,
-                                               data->nb_sectors,
-                                               data->pnum,
-                                               data->file);
+    data->ret = bdrv_co_block_status_above(data->bs, data->base,
+                                           data->want_zero,
+                                           data->offset, data->bytes,
+                                           data->pnum, data->map, data->file);
    data->done = true;
 }

 /*
- * Synchronous wrapper around bdrv_co_get_block_status_above().
+ * Synchronous wrapper around bdrv_co_block_status_above().
 *
- * See bdrv_co_get_block_status_above() for details.
+ * See bdrv_co_block_status_above() for details.
 */
-int64_t bdrv_get_block_status_above(BlockDriverState *bs,
+static int bdrv_common_block_status_above(BlockDriverState *bs,
                                          BlockDriverState *base,
-                                    int64_t sector_num,
-                                    int nb_sectors, int *pnum,
+                                          bool want_zero, int64_t offset,
+                                          int64_t bytes, int64_t *pnum,
+                                          int64_t *map,
                                          BlockDriverState **file)
 {
    Coroutine *co;
-    BdrvCoGetBlockStatusData data = {
+    BdrvCoBlockStatusData data = {
        .bs = bs,
        .base = base,
-        .file = file,
-        .sector_num = sector_num,
-        .nb_sectors = nb_sectors,
+        .want_zero = want_zero,
+        .offset = offset,
+        .bytes = bytes,
        .pnum = pnum,
+        .map = map,
+        .file = file,
        .done = false,
    };

    if (qemu_in_coroutine()) {
        /* Fast-path if already in coroutine context */
-        bdrv_get_block_status_above_co_entry(&data);
+        bdrv_block_status_above_co_entry(&data);
    } else {
-        co = qemu_coroutine_create(bdrv_get_block_status_above_co_entry,
-                                   &data);
+        co = qemu_coroutine_create(bdrv_block_status_above_co_entry, &data);
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, !data.done);
    }
    return data.ret;
 }

-int64_t bdrv_get_block_status(BlockDriverState *bs,
-                              int64_t sector_num,
-                              int nb_sectors, int *pnum,
-                              BlockDriverState **file)
+int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
+                            int64_t offset, int64_t bytes, int64_t *pnum,
+                            int64_t *map, BlockDriverState **file)
 {
-    return bdrv_get_block_status_above(bs, backing_bs(bs),
-                                       sector_num, nb_sectors, pnum, file);
+    return bdrv_common_block_status_above(bs, base, true, offset, bytes,
+                                          pnum, map, file);
+}
+
+int bdrv_block_status(BlockDriverState *bs, int64_t offset, int64_t bytes,
+                      int64_t *pnum, int64_t *map, BlockDriverState **file)
+{
+    return bdrv_block_status_above(bs, backing_bs(bs),
+                                   offset, bytes, pnum, map, file);
 }

 int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
                                   int64_t bytes, int64_t *pnum)
 {
-    BlockDriverState *file;
-    int64_t sector_num = offset >> BDRV_SECTOR_BITS;
-    int nb_sectors = bytes >> BDRV_SECTOR_BITS;
-    int64_t ret;
-    int psectors;
+    int ret;
+    int64_t dummy;

-    assert(QEMU_IS_ALIGNED(offset, BDRV_SECTOR_SIZE));
-    assert(QEMU_IS_ALIGNED(bytes, BDRV_SECTOR_SIZE) && bytes < INT_MAX);
-    ret = bdrv_get_block_status(bs, sector_num, nb_sectors, &psectors,
-                                &file);
+    ret = bdrv_common_block_status_above(bs, backing_bs(bs), false, offset,
+                                         bytes, pnum ? pnum : &dummy, NULL,
+                                         NULL);
    if (ret < 0) {
        return ret;
    }
-    if (pnum) {
-        *pnum = psectors * BDRV_SECTOR_SIZE;
-    }
    return !!(ret & BDRV_BLOCK_ALLOCATED);
 }

@@ -2241,6 +2399,12 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
    }

    BLKDBG_EVENT(bs->file, BLKDBG_FLUSH_TO_DISK);
+    if (!bs->drv) {
+        /* bs->drv->bdrv_co_flush() might have ejected the BDS
+         * (even in case of apparent success) */
+        ret = -ENOMEDIUM;
+        goto out;
+    }
    if (bs->drv->bdrv_co_flush_to_disk) {
        ret = bs->drv->bdrv_co_flush_to_disk(bs);
    } else if (bs->drv->bdrv_aio_flush) {
@@ -2410,6 +2574,10 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
            num = max_pdiscard;
        }

+        if (!bs->drv) {
+            ret = -ENOMEDIUM;
+            goto out;
+        }
        if (bs->drv->bdrv_co_pdiscard) {
            ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
        } else {
@@ -2438,8 +2606,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
    ret = 0;
 out:
    atomic_inc(&bs->write_gen);
-    bdrv_set_dirty(bs, req.offset >> BDRV_SECTOR_BITS,
-                   req.bytes >> BDRV_SECTOR_BITS);
+    bdrv_set_dirty(bs, req.offset, req.bytes);
    tracked_request_end(&req);
    bdrv_dec_in_flight(bs);
    return ret;
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -141,8 +141,7 @@ static void mirror_write_complete(void *opaque, int ret)
    if (ret < 0) {
        BlockErrorAction action;

-        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset >> BDRV_SECTOR_BITS,
-                              op->bytes >> BDRV_SECTOR_BITS);
+        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes);
        action = mirror_error_action(s, false, -ret);
        if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
            s->ret = ret;
@@ -161,8 +160,7 @@ static void mirror_read_complete(void *opaque, int ret)
    if (ret < 0) {
        BlockErrorAction action;

-        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset >> BDRV_SECTOR_BITS,
-                              op->bytes >> BDRV_SECTOR_BITS);
+        bdrv_set_dirty_bitmap(s->dirty_bitmap, op->offset, op->bytes);
        action = mirror_error_action(s, true, -ret);
        if (action == BLOCK_ERROR_ACTION_REPORT && s->ret >= 0) {
            s->ret = ret;
@@ -192,10 +190,9 @@ static int mirror_cow_align(MirrorBlockJob *s, int64_t *offset,
    bool need_cow;
    int ret = 0;
    int64_t align_offset = *offset;
-    unsigned int align_bytes = *bytes;
+    int64_t align_bytes = *bytes;
    int max_bytes = s->granularity * s->max_iov;

-    assert(*bytes < INT_MAX);
    need_cow = !test_bit(*offset / s->granularity, s->cow_bitmap);
    need_cow |= !test_bit((*offset + *bytes - 1) / s->granularity,
                          s->cow_bitmap);
@@ -331,17 +328,15 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
    uint64_t delay_ns = 0;
    /* At least the first dirty chunk is mirrored in one iteration. */
    int nb_chunks = 1;
-    int sectors_per_chunk = s->granularity >> BDRV_SECTOR_BITS;
    bool write_zeroes_ok = bdrv_can_write_zeroes_with_unmap(blk_bs(s->target));
    int max_io_bytes = MAX(s->buf_size / MAX_IN_FLIGHT, MAX_IO_BYTES);

    bdrv_dirty_bitmap_lock(s->dirty_bitmap);
-    offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+    offset = bdrv_dirty_iter_next(s->dbi);
    if (offset < 0) {
        bdrv_set_dirty_iter(s->dbi, 0);
-        offset = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
-        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap) *
-                                  BDRV_SECTOR_SIZE);
+        offset = bdrv_dirty_iter_next(s->dbi);
+        trace_mirror_restart_iter(s, bdrv_get_dirty_count(s->dirty_bitmap));
        assert(offset >= 0);
    }
    bdrv_dirty_bitmap_unlock(s->dirty_bitmap);
@@ -362,39 +357,36 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
        int64_t next_offset = offset + nb_chunks * s->granularity;
        int64_t next_chunk = next_offset / s->granularity;
        if (next_offset >= s->bdev_length ||
-            !bdrv_get_dirty_locked(source, s->dirty_bitmap,
-                                   next_offset >> BDRV_SECTOR_BITS)) {
+            !bdrv_get_dirty_locked(source, s->dirty_bitmap, next_offset)) {
            break;
        }
        if (test_bit(next_chunk, s->in_flight_bitmap)) {
            break;
        }

-        next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+        next_dirty = bdrv_dirty_iter_next(s->dbi);
        if (next_dirty > next_offset || next_dirty < 0) {
            /* The bitmap iterator's cache is stale, refresh it */
-            bdrv_set_dirty_iter(s->dbi, next_offset >> BDRV_SECTOR_BITS);
-            next_dirty = bdrv_dirty_iter_next(s->dbi) * BDRV_SECTOR_SIZE;
+            bdrv_set_dirty_iter(s->dbi, next_offset);
+            next_dirty = bdrv_dirty_iter_next(s->dbi);
        }
        assert(next_dirty == next_offset);
        nb_chunks++;
    }

    /* Clear dirty bits before querying the block status, because
-     * calling bdrv_get_block_status_above could yield - if some blocks are
+     * calling bdrv_block_status_above could yield - if some blocks are
     * marked dirty in this window, we need to know.
     */
-    bdrv_reset_dirty_bitmap_locked(s->dirty_bitmap, offset >> BDRV_SECTOR_BITS,
-                                   nb_chunks * sectors_per_chunk);
+    bdrv_reset_dirty_bitmap_locked(s->dirty_bitmap, offset,
+                                   nb_chunks * s->granularity);
    bdrv_dirty_bitmap_unlock(s->dirty_bitmap);

    bitmap_set(s->in_flight_bitmap, offset / s->granularity, nb_chunks);
    while (nb_chunks > 0 && offset < s->bdev_length) {
-        int64_t ret;
-        int io_sectors;
-        unsigned int io_bytes;
+        int ret;
+        int64_t io_bytes;
        int64_t io_bytes_acct;
-        BlockDriverState *file;
        enum MirrorMethod {
            MIRROR_METHOD_COPY,
            MIRROR_METHOD_ZERO,
@@ -402,11 +394,9 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
        } mirror_method = MIRROR_METHOD_COPY;

        assert(!(offset % s->granularity));
-        ret = bdrv_get_block_status_above(source, NULL,
-                                          offset >> BDRV_SECTOR_BITS,
-                                          nb_chunks * sectors_per_chunk,
-                                          &io_sectors, &file);
-        io_bytes = io_sectors * BDRV_SECTOR_SIZE;
+        ret = bdrv_block_status_above(source, NULL, offset,
+                                      nb_chunks * s->granularity,
+                                      &io_bytes, NULL, NULL);
        if (ret < 0) {
            io_bytes = MIN(nb_chunks * s->granularity, max_io_bytes);
        } else if (ret & BDRV_BLOCK_DATA) {
@@ -418,7 +408,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
            io_bytes = s->granularity;
        } else if (ret >= 0 && !(ret & BDRV_BLOCK_DATA)) {
            int64_t target_offset;
-            unsigned int target_bytes;
+            int64_t target_bytes;
            bdrv_round_to_clusters(blk_bs(s->target), offset, io_bytes,
                                   &target_offset, &target_bytes);
            if (target_offset == offset &&
@@ -616,25 +606,23 @@ static void mirror_throttle(MirrorBlockJob *s)

 static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 {
-    int64_t sector_num, end;
+    int64_t offset;
    BlockDriverState *base = s->base;
    BlockDriverState *bs = s->source;
    BlockDriverState *target_bs = blk_bs(s->target);
-    int ret, n;
+    int ret;
    int64_t count;

-    end = s->bdev_length / BDRV_SECTOR_SIZE;
-
    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
        if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
-            bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
+            bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, s->bdev_length);
            return 0;
        }

        s->initial_zeroing_ongoing = true;
-        for (sector_num = 0; sector_num < end; ) {
-            int nb_sectors = MIN(end - sector_num,
-                QEMU_ALIGN_DOWN(INT_MAX, s->granularity) >> BDRV_SECTOR_BITS);
+        for (offset = 0; offset < s->bdev_length; ) {
+            int bytes = MIN(s->bdev_length - offset,
+                            QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

            mirror_throttle(s);

@@ -650,9 +638,8 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
                continue;
            }

-            mirror_do_zero_or_discard(s, sector_num * BDRV_SECTOR_SIZE,
-                                      nb_sectors * BDRV_SECTOR_SIZE, false);
-            sector_num += nb_sectors;
+            mirror_do_zero_or_discard(s, offset, bytes, false);
+            offset += bytes;
        }

        mirror_wait_for_all_io(s);
@@ -660,10 +647,10 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
    }

    /* First part, loop on the sectors and initialize the dirty bitmap.  */
-    for (sector_num = 0; sector_num < end; ) {
+    for (offset = 0; offset < s->bdev_length; ) {
        /* Just to make sure we are not exceeding int limit. */
-        int nb_sectors = MIN(INT_MAX >> BDRV_SECTOR_BITS,
-                             end - sector_num);
+        int bytes = MIN(s->bdev_length - offset,
+                        QEMU_ALIGN_DOWN(INT_MAX, s->granularity));

        mirror_throttle(s);

@@ -671,21 +658,16 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
            return 0;
        }

-        ret = bdrv_is_allocated_above(bs, base, sector_num * BDRV_SECTOR_SIZE,
-                                      nb_sectors * BDRV_SECTOR_SIZE, &count);
+        ret = bdrv_is_allocated_above(bs, base, offset, bytes, &count);
        if (ret < 0) {
            return ret;
        }

-        /* TODO: Relax this once bdrv_is_allocated_above and dirty
-         * bitmaps no longer require sector alignment. */
-        assert(QEMU_IS_ALIGNED(count, BDRV_SECTOR_SIZE));
-        n = count >> BDRV_SECTOR_BITS;
-        assert(n > 0);
+        assert(count);
        if (ret == 1) {
-            bdrv_set_dirty_bitmap(s->dirty_bitmap, sector_num, n);
+            bdrv_set_dirty_bitmap(s->dirty_bitmap, offset, count);
        }
-        sector_num += n;
+        offset += count;
    }
    return 0;
 }
@@ -796,7 +778,7 @@ static void coroutine_fn mirror_run(void *opaque)
    }

    assert(!s->dbi);
-    s->dbi = bdrv_dirty_iter_new(s->dirty_bitmap, 0);
+    s->dbi = bdrv_dirty_iter_new(s->dirty_bitmap);
    for (;;) {
        uint64_t delay_ns = 0;
        int64_t cnt, delta;
@@ -811,11 +793,10 @@ static void coroutine_fn mirror_run(void *opaque)

        cnt = bdrv_get_dirty_count(s->dirty_bitmap);
        /* s->common.offset contains the number of bytes already processed so
-         * far, cnt is the number of dirty sectors remaining and
+         * far, cnt is the number of dirty bytes remaining and
         * s->bytes_in_flight is the number of bytes currently being
         * processed; together those are the current total operation length */
-        s->common.len = s->common.offset + s->bytes_in_flight +
-            cnt * BDRV_SECTOR_SIZE;
+        s->common.len = s->common.offset + s->bytes_in_flight + cnt;

        /* Note that even when no rate limit is applied we need to yield
         * periodically with no pending I/O so that bdrv_drain_all() returns.
@@ -827,8 +808,7 @@ static void coroutine_fn mirror_run(void *opaque)
            s->common.iostatus == BLOCK_DEVICE_IO_STATUS_OK) {
            if (s->in_flight >= MAX_IN_FLIGHT || s->buf_free_count == 0 ||
                (cnt == 0 && s->in_flight > 0)) {
-                trace_mirror_yield(s, cnt * BDRV_SECTOR_SIZE,
-                                   s->buf_free_count, s->in_flight);
+                trace_mirror_yield(s, cnt, s->buf_free_count, s->in_flight);
                mirror_wait_for_io(s);
                continue;
            } else if (cnt != 0) {
@@ -869,7 +849,7 @@ static void coroutine_fn mirror_run(void *opaque)
             * whether to switch to target check one last time if I/O has
             * come in the meanwhile, and if not flush the data to disk.
             */
-            trace_mirror_before_drain(s, cnt * BDRV_SECTOR_SIZE);
+            trace_mirror_before_drain(s, cnt);

            bdrv_drained_begin(bs);
            cnt = bdrv_get_dirty_count(s->dirty_bitmap);
@@ -888,8 +868,7 @@ static void coroutine_fn mirror_run(void *opaque)
        }

        ret = 0;
-        trace_mirror_before_sleep(s, cnt * BDRV_SECTOR_SIZE,
-                                  s->synced, delay_ns);
+        trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
        if (!s->synced) {
            block_job_sleep_ns(&s->common, QEMU_CLOCK_REALTIME, delay_ns);
            if (block_job_is_cancelled(&s->common)) {
@@ -1056,6 +1035,10 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,

 static int coroutine_fn bdrv_mirror_top_flush(BlockDriverState *bs)
 {
+    if (bs->backing == NULL) {
+        /* we can be here after failed bdrv_append in mirror_start_job */
+        return 0;
+    }
    return bdrv_co_flush(bs->backing->bs);
 }

@@ -1073,6 +1056,11 @@ static int coroutine_fn bdrv_mirror_top_pdiscard(BlockDriverState *bs,

 static void bdrv_mirror_top_refresh_filename(BlockDriverState *bs, QDict *opts)
 {
+    if (bs->backing == NULL) {
+        /* we can be here after failed bdrv_attach_child in
+         * bdrv_set_backing_hd */
+        return;
+    }
    bdrv_refresh_filename(bs->backing->bs);
    pstrcpy(bs->exact_filename, sizeof(bs->exact_filename),
            bs->backing->bs->filename);
@@ -1084,6 +1072,7 @@ static void bdrv_mirror_top_close(BlockDriverState *bs)

 static void bdrv_mirror_top_child_perm(BlockDriverState *bs, BdrvChild *c,
                                       const BdrvChildRole *role,
+                                       BlockReopenQueue *reopen_queue,
                                       uint64_t perm, uint64_t shared,
                                       uint64_t *nperm, uint64_t *nshared)
 {
@@ -1138,9 +1127,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
        granularity = bdrv_get_default_bitmap_granularity(target);
    }

-    assert ((granularity & (granularity - 1)) == 0);
-    /* Granularity must be large enough for sector-based dirty bitmap */
-    assert(granularity >= BDRV_SECTOR_SIZE);
+    assert(is_power_of_2(granularity));

    if (buf_size < 0) {
        error_setg(errp, "Invalid parameter 'buf-size'");
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -31,8 +31,8 @@
 #include "qapi/error.h"
 #include "nbd-client.h"

-#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ ((uint64_t)(intptr_t)bs))
-#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ ((uint64_t)(intptr_t)bs))
+#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
+#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))

 static void nbd_recv_coroutines_wake_all(NBDClientSession *s)
 {
@@ -78,7 +78,7 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
    while (!s->quit) {
        assert(s->reply.handle == 0);
        ret = nbd_receive_reply(s->ioc, &s->reply, &local_err);
-        if (ret < 0) {
+        if (local_err) {
            error_report_err(local_err);
        }
        if (ret <= 0) {
@@ -92,7 +92,9 @@ static coroutine_fn void nbd_read_reply_entry(void *opaque)
        i = HANDLE_TO_INDEX(s, s->reply.handle);
        if (i >= MAX_NBD_REQUESTS ||
            !s->requests[i].coroutine ||
-            !s->requests[i].receiving) {
+            !s->requests[i].receiving ||
+            (nbd_reply_is_structured(&s->reply) && !s->info.structured_reply))
+        {
            break;
        }

@@ -139,6 +141,7 @@ static int nbd_co_send_request(BlockDriverState *bs,
    assert(i < MAX_NBD_REQUESTS);

    s->requests[i].coroutine = qemu_coroutine_self();
+    s->requests[i].offset = request->from;
    s->requests[i].receiving = false;

    request->handle = INDEX_TO_HANDLE(s, i);
@@ -156,11 +159,12 @@ static int nbd_co_send_request(BlockDriverState *bs,
        qio_channel_set_cork(s->ioc, true);
        rc = nbd_send_request(s->ioc, request);
        if (rc >= 0 && !s->quit) {
-            assert(request->len == iov_size(qiov->iov, qiov->niov));
            if (qio_channel_writev_all(s->ioc, qiov->iov, qiov->niov,
                                       NULL) < 0) {
                rc = -EIO;
            }
+        } else if (rc >= 0) {
+            rc = -EIO;
        }
        qio_channel_set_cork(s->ioc, false);
    } else {
@@ -178,71 +182,496 @@ err:
    return rc;
 }

-static void nbd_co_receive_reply(NBDClientSession *s,
-                                 NBDRequest *request,
-                                 NBDReply *reply,
-                                 QEMUIOVector *qiov)
+static inline uint16_t payload_advance16(uint8_t **payload)
 {
-    int i = HANDLE_TO_INDEX(s, request->handle);
+    *payload += 2;
+    return lduw_be_p(*payload - 2);
+}
+
+static inline uint32_t payload_advance32(uint8_t **payload)
+{
+    *payload += 4;
+    return ldl_be_p(*payload - 4);
+}
+
+static inline uint64_t payload_advance64(uint8_t **payload)
+{
+    *payload += 8;
+    return ldq_be_p(*payload - 8);
+}
+
+static int nbd_parse_offset_hole_payload(NBDStructuredReplyChunk *chunk,
+                                         uint8_t *payload, uint64_t orig_offset,
+                                         QEMUIOVector *qiov, Error **errp)
+{
+    uint64_t offset;
+    uint32_t hole_size;
+
+    if (chunk->length != sizeof(offset) + sizeof(hole_size)) {
+        error_setg(errp, "Protocol error: invalid payload for "
+                         "NBD_REPLY_TYPE_OFFSET_HOLE");
+        return -EINVAL;
+    }
+
+    offset = payload_advance64(&payload);
+    hole_size = payload_advance32(&payload);
+
+    if (!hole_size || offset < orig_offset || hole_size > qiov->size ||
+        offset > orig_offset + qiov->size - hole_size) {
+        error_setg(errp, "Protocol error: server sent chunk exceeding requested"
+                         " region");
+        return -EINVAL;
+    }
+
+    qemu_iovec_memset(qiov, offset - orig_offset, 0, hole_size);
+
+    return 0;
+}
+
+/* nbd_parse_error_payload
+ * On success, @request_ret holds the server's error as a negative errno.
+ */
+static int nbd_parse_error_payload(NBDStructuredReplyChunk *chunk,
+                                   uint8_t *payload, int *request_ret,
+                                   Error **errp)
+{
+    uint32_t error;
+    uint16_t message_size;
+
+    assert(chunk->type & (1 << 15));
+
+    if (chunk->length < sizeof(error) + sizeof(message_size)) {
+        error_setg(errp,
+                   "Protocol error: invalid payload for structured error");
+        return -EINVAL;
+    }
+
+    error = nbd_errno_to_system_errno(payload_advance32(&payload));
+    if (error == 0) {
+        error_setg(errp, "Protocol error: server sent structured error chunk "
+                         "with error = 0");
+        return -EINVAL;
+    }
+
+    *request_ret = -error;
+    message_size = payload_advance16(&payload);
+
+    if (message_size > chunk->length - sizeof(error) - sizeof(message_size)) {
+        error_setg(errp, "Protocol error: server sent structured error chunk "
+                         "with incorrect message size");
+        return -EINVAL;
+    }
+
+    /* TODO: Add a trace point to mention the server complaint */
+
+    /* TODO handle ERROR_OFFSET */
+
+    return 0;
+}
+
+static int nbd_co_receive_offset_data_payload(NBDClientSession *s,
+                                              uint64_t orig_offset,
+                                              QEMUIOVector *qiov, Error **errp)
+{
+    QEMUIOVector sub_qiov;
+    uint64_t offset;
+    size_t data_size;
+    int ret;
+    NBDStructuredReplyChunk *chunk = &s->reply.structured;
+
+    assert(nbd_reply_is_structured(&s->reply));
+
+    /* The NBD spec requires at least one byte of payload */
+    if (chunk->length <= sizeof(offset)) {
+        error_setg(errp, "Protocol error: invalid payload for "
+                         "NBD_REPLY_TYPE_OFFSET_DATA");
+        return -EINVAL;
+    }
+
+    if (nbd_read(s->ioc, &offset, sizeof(offset), errp) < 0) {
+        return -EIO;
+    }
+    be64_to_cpus(&offset);
+
+    data_size = chunk->length - sizeof(offset);
+    assert(data_size);
+    if (offset < orig_offset || data_size > qiov->size ||
+        offset > orig_offset + qiov->size - data_size) {
+        error_setg(errp, "Protocol error: server sent chunk exceeding requested"
+                         " region");
+        return -EINVAL;
+    }
+
+    qemu_iovec_init(&sub_qiov, qiov->niov);
+    qemu_iovec_concat(&sub_qiov, qiov, offset - orig_offset, data_size);
+    ret = qio_channel_readv_all(s->ioc, sub_qiov.iov, sub_qiov.niov, errp);
+    qemu_iovec_destroy(&sub_qiov);
+
+    return ret < 0 ? -EIO : 0;
+}
+
+#define NBD_MAX_MALLOC_PAYLOAD 1000
+/* nbd_co_receive_structured_payload
+ */
+static coroutine_fn int nbd_co_receive_structured_payload(
+        NBDClientSession *s, void **payload, Error **errp)
+{
+    int ret;
+    uint32_t len;
+
+    assert(nbd_reply_is_structured(&s->reply));
+
+    len = s->reply.structured.length;
+
+    if (len == 0) {
+        return 0;
+    }
+
+    if (payload == NULL) {
+        error_setg(errp, "Unexpected structured payload");
+        return -EINVAL;
+    }
+
+    if (len > NBD_MAX_MALLOC_PAYLOAD) {
+        error_setg(errp, "Payload too large");
+        return -EINVAL;
+    }
+
+    *payload = g_new(char, len);
+    ret = nbd_read(s->ioc, *payload, len, errp);
+    if (ret < 0) {
+        g_free(*payload);
+        *payload = NULL;
+        return ret;
+    }
+
+    return 0;
+}
+
+/* nbd_co_do_receive_one_chunk
+ * for simple reply:
+ *   set request_ret to received reply error
+ *   if qiov is not NULL: read payload to @qiov
+ * for structured reply chunk:
+ *   if error chunk: read payload, set @request_ret, do not set @payload
+ *   else if offset_data chunk: read payload data to @qiov, do not set @payload
+ *   else: read payload to @payload
+ *
+ * If the function fails, @errp contains the corresponding error message,
+ * and the connection with the server is suspect.  If it returns 0, the
+ * transaction succeeded (although @request_ret may be a negative errno
+ * corresponding to the server's error reply), and @errp is unchanged.
+ */
+static coroutine_fn int nbd_co_do_receive_one_chunk(
+        NBDClientSession *s, uint64_t handle, bool only_structured,
+        int *request_ret, QEMUIOVector *qiov, void **payload, Error **errp)
+{
+    int ret;
+    int i = HANDLE_TO_INDEX(s, handle);
+    void *local_payload = NULL;
+    NBDStructuredReplyChunk *chunk;
+
+    if (payload) {
+        *payload = NULL;
+    }
+    *request_ret = 0;

    /* Wait until we're woken up by nbd_read_reply_entry.  */
    s->requests[i].receiving = true;
    qemu_coroutine_yield();
    s->requests[i].receiving = false;
-    *reply = s->reply;
-    if (reply->handle != request->handle || !s->ioc || s->quit) {
-        reply->error = EIO;
-    } else {
-        if (qiov && reply->error == 0) {
-            assert(request->len == iov_size(qiov->iov, qiov->niov));
-            if (qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
-                                      NULL) < 0) {
-                reply->error = EIO;
+    if (!s->ioc || s->quit) {
+        error_setg(errp, "Connection closed");
+        return -EIO;
+    }
+
+    assert(s->reply.handle == handle);
+
+    if (nbd_reply_is_simple(&s->reply)) {
+        if (only_structured) {
+            error_setg(errp, "Protocol error: simple reply when structured "
+                             "reply chunk was expected");
+            return -EINVAL;
+        }
+
+        *request_ret = -nbd_errno_to_system_errno(s->reply.simple.error);
+        if (*request_ret < 0 || !qiov) {
+            return 0;
+        }
+
+        return qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
+                                     errp) < 0 ? -EIO : 0;
+    }
+
+    /* handle structured reply chunk */
+    assert(s->info.structured_reply);
+    chunk = &s->reply.structured;
+
+    if (chunk->type == NBD_REPLY_TYPE_NONE) {
+        if (!(chunk->flags & NBD_REPLY_FLAG_DONE)) {
+            error_setg(errp, "Protocol error: NBD_REPLY_TYPE_NONE chunk without"
+                       " NBD_REPLY_FLAG_DONE flag set");
+            return -EINVAL;
+        }
+        if (chunk->length) {
+            error_setg(errp, "Protocol error: NBD_REPLY_TYPE_NONE chunk with"
+                       " nonzero length");
+            return -EINVAL;
+        }
+        return 0;
+    }
+
+    if (chunk->type == NBD_REPLY_TYPE_OFFSET_DATA) {
+        if (!qiov) {
+            error_setg(errp, "Unexpected NBD_REPLY_TYPE_OFFSET_DATA chunk");
+            return -EINVAL;
+        }
+
+        return nbd_co_receive_offset_data_payload(s, s->requests[i].offset,
+                                                  qiov, errp);
+    }
+
+    if (nbd_reply_type_is_error(chunk->type)) {
+        payload = &local_payload;
+    }
+
+    ret = nbd_co_receive_structured_payload(s, payload, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (nbd_reply_type_is_error(chunk->type)) {
+        ret = nbd_parse_error_payload(chunk, local_payload, request_ret, errp);
+        g_free(local_payload);
+        return ret;
+    }
+
+    return 0;
+}
+
+/* nbd_co_receive_one_chunk
+ * Read reply, wake up read_reply_co and set s->quit if needed.
+ * Return value is a fatal error code or normal nbd reply error code
+ */
+static coroutine_fn int nbd_co_receive_one_chunk(
+        NBDClientSession *s, uint64_t handle, bool only_structured,
+        QEMUIOVector *qiov, NBDReply *reply, void **payload, Error **errp)
+{
+    int request_ret;
+    int ret = nbd_co_do_receive_one_chunk(s, handle, only_structured,
+                                          &request_ret, qiov, payload, errp);
+
+    if (ret < 0) {
        s->quit = true;
+    } else {
+        /* For assert at loop start in nbd_read_reply_entry */
+        if (reply) {
+            *reply = s->reply;
        }
-        }
-
-        /* Tell the read handler to read another header.  */
        s->reply.handle = 0;
+        ret = request_ret;
    }

-    s->requests[i].coroutine = NULL;
-
-    /* Kick the read_reply_co to get the next reply.  */
    if (s->read_reply_co) {
        aio_co_wake(s->read_reply_co);
    }

+    return ret;
+}
+
+typedef struct NBDReplyChunkIter {
+    int ret;
+    Error *err;
+    bool done, only_structured;
+} NBDReplyChunkIter;
+
+static void nbd_iter_error(NBDReplyChunkIter *iter, bool fatal,
+                           int ret, Error **local_err)
+{
+    assert(ret < 0);
+
+    if (fatal || iter->ret == 0) {
+        if (iter->ret != 0) {
+            error_free(iter->err);
+            iter->err = NULL;
+        }
+        iter->ret = ret;
+        error_propagate(&iter->err, *local_err);
+    } else {
+        error_free(*local_err);
+    }
+
+    *local_err = NULL;
+}
+
+/* NBD_FOREACH_REPLY_CHUNK
+ * Loop over the reply chunks for @handle; see nbd_reply_chunk_iter_receive. */
+#define NBD_FOREACH_REPLY_CHUNK(s, iter, handle, structured, \
+                                qiov, reply, payload) \
+    for (iter = (NBDReplyChunkIter) { .only_structured = structured }; \
+         nbd_reply_chunk_iter_receive(s, &iter, handle, qiov, reply, payload);)
+
+/* nbd_reply_chunk_iter_receive
+ * Return true to run the loop body, false once the reply is done or failed. */
+static bool nbd_reply_chunk_iter_receive(NBDClientSession *s,
+                                         NBDReplyChunkIter *iter,
+                                         uint64_t handle,
+                                         QEMUIOVector *qiov, NBDReply *reply,
+                                         void **payload)
+{
+    int ret;
+    NBDReply local_reply;
+    NBDStructuredReplyChunk *chunk;
+    Error *local_err = NULL;
+    if (s->quit) {
+        error_setg(&local_err, "Connection closed");
+        nbd_iter_error(iter, true, -EIO, &local_err);
+        goto break_loop;
+    }
+
+    if (iter->done) {
+        /* Previous iteration was last. */
+        goto break_loop;
+    }
+
+    if (reply == NULL) {
+        reply = &local_reply;
+    }
+
+    ret = nbd_co_receive_one_chunk(s, handle, iter->only_structured,
+                                   qiov, reply, payload, &local_err);
+    if (ret < 0) {
+        /* If it is a fatal error s->quit is set by nbd_co_receive_one_chunk */
+        nbd_iter_error(iter, s->quit, ret, &local_err);
+    }
+
+    /* Do not execute the body of NBD_FOREACH_REPLY_CHUNK for simple reply. */
+    if (nbd_reply_is_simple(&s->reply) || s->quit) {
+        goto break_loop;
+    }
+
+    chunk = &reply->structured;
+    iter->only_structured = true;
+
+    if (chunk->type == NBD_REPLY_TYPE_NONE) {
+        /* NBD_REPLY_FLAG_DONE is already checked in nbd_co_receive_one_chunk */
+        assert(chunk->flags & NBD_REPLY_FLAG_DONE);
+        goto break_loop;
+    }
+
+    if (chunk->flags & NBD_REPLY_FLAG_DONE) {
+        /* This iteration is last. */
+        iter->done = true;
+    }
+
+    /* Execute the loop body */
+    return true;
+
+break_loop:
+    s->requests[HANDLE_TO_INDEX(s, handle)].coroutine = NULL;
+
    qemu_co_mutex_lock(&s->send_mutex);
    s->in_flight--;
    qemu_co_queue_next(&s->free_sema);
    qemu_co_mutex_unlock(&s->send_mutex);
+
+    return false;
 }

-static int nbd_co_request(BlockDriverState *bs,
-                          NBDRequest *request,
-                          QEMUIOVector *qiov)
+static int nbd_co_receive_return_code(NBDClientSession *s, uint64_t handle,
+                                      Error **errp)
 {
-    NBDClientSession *client = nbd_get_client_session(bs);
-    NBDReply reply;
-    int ret;
+    NBDReplyChunkIter iter;

-    assert(!qiov || request->type == NBD_CMD_WRITE ||
-           request->type == NBD_CMD_READ);
-    ret = nbd_co_send_request(bs, request,
-                              request->type == NBD_CMD_WRITE ? qiov : NULL);
-    if (ret < 0) {
-        reply.error = -ret;
-    } else {
-        nbd_co_receive_reply(client, request, &reply,
-                             request->type == NBD_CMD_READ ? qiov : NULL);
+    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, NULL, NULL) {
+        /* nbd_reply_chunk_iter_receive does all the work */
    }
-    return -reply.error;
+
+    error_propagate(errp, iter.err);
+    return iter.ret;
+}
+
+static int nbd_co_receive_cmdread_reply(NBDClientSession *s, uint64_t handle,
+                                        uint64_t offset, QEMUIOVector *qiov,
+                                        Error **errp)
+{
+    NBDReplyChunkIter iter;
+    NBDReply reply;
+    void *payload = NULL;
+    Error *local_err = NULL;
+
+    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
+                            qiov, &reply, &payload)
+    {
+        int ret;
+        NBDStructuredReplyChunk *chunk = &reply.structured;
+
+        assert(nbd_reply_is_structured(&reply));
+
+        switch (chunk->type) {
+        case NBD_REPLY_TYPE_OFFSET_DATA:
+            /* special-cased in nbd_co_receive_one_chunk, data is already
+             * in qiov */
+            break;
+        case NBD_REPLY_TYPE_OFFSET_HOLE:
+            ret = nbd_parse_offset_hole_payload(&reply.structured, payload,
+                                                offset, qiov, &local_err);
+            if (ret < 0) {
+                s->quit = true;
+                nbd_iter_error(&iter, true, ret, &local_err);
+            }
+            break;
+        default:
+            if (!nbd_reply_type_is_error(chunk->type)) {
+                /* not allowed reply type */
+                s->quit = true;
+                error_setg(&local_err,
+                           "Unexpected reply type: %d (%s) for CMD_READ",
+                           chunk->type, nbd_reply_type_lookup(chunk->type));
+                nbd_iter_error(&iter, true, -EINVAL, &local_err);
+            }
+        }
+
+        g_free(payload);
+        payload = NULL;
+    }
+
+    error_propagate(errp, iter.err);
+    return iter.ret;
+}
+
+static int nbd_co_request(BlockDriverState *bs, NBDRequest *request,
+                          QEMUIOVector *write_qiov)
+{
+    int ret;
+    Error *local_err = NULL;
+    NBDClientSession *client = nbd_get_client_session(bs);
+
+    assert(request->type != NBD_CMD_READ);
+    if (write_qiov) {
+        assert(request->type == NBD_CMD_WRITE);
+        assert(request->len == iov_size(write_qiov->iov, write_qiov->niov));
+    } else {
+        assert(request->type != NBD_CMD_WRITE);
+    }
+    ret = nbd_co_send_request(bs, request, write_qiov);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = nbd_co_receive_return_code(client, request->handle, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+    }
+    return ret;
 }

 int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
                         uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
+    int ret;
+    Error *local_err = NULL;
+    NBDClientSession *client = nbd_get_client_session(bs);
    NBDRequest request = {
        .type = NBD_CMD_READ,
        .from = offset,
@@ -252,7 +681,20 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t offset,
    assert(bytes <= NBD_MAX_BUFFER_SIZE);
    assert(!flags);

-    return nbd_co_request(bs, &request, qiov);
+    if (!bytes) {
+        return 0;
+    }
+    ret = nbd_co_send_request(bs, &request, NULL);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = nbd_co_receive_cmdread_reply(client, request.handle, offset, qiov,
+                                       &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+    }
+    return ret;
 }

 int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
@@ -265,6 +707,7 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
        .len = bytes,
    };

+    assert(!(client->info.flags & NBD_FLAG_READ_ONLY));
    if (flags & BDRV_REQ_FUA) {
        assert(client->info.flags & NBD_FLAG_SEND_FUA);
        request.flags |= NBD_CMD_FLAG_FUA;
@@ -272,6 +715,9 @@ int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,

    assert(bytes <= NBD_MAX_BUFFER_SIZE);

+    if (!bytes) {
+        return 0;
+    }
    return nbd_co_request(bs, &request, qiov);
 }

@@ -285,6 +731,7 @@ int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
        .len = bytes,
    };

+    assert(!(client->info.flags & NBD_FLAG_READ_ONLY));
    if (!(client->info.flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
        return -ENOTSUP;
    }
@@ -297,6 +744,9 @@ int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
        request.flags |= NBD_CMD_FLAG_NO_HOLE;
    }

+    if (!bytes) {
+        return 0;
+    }
    return nbd_co_request(bs, &request, NULL);
 }

@@ -324,7 +774,8 @@ int nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset, int bytes)
        .len = bytes,
    };

-    if (!(client->info.flags & NBD_FLAG_SEND_TRIM)) {
+    assert(!(client->info.flags & NBD_FLAG_READ_ONLY));
+    if (!(client->info.flags & NBD_FLAG_SEND_TRIM) || !bytes) {
        return 0;
    }

@@ -374,6 +825,7 @@ int nbd_client_init(BlockDriverState *bs,
    qio_channel_set_blocking(QIO_CHANNEL(sioc), true, NULL);

    client->info.request_sizes = true;
+    client->info.structured_reply = true;
    ret = nbd_receive_negotiate(QIO_CHANNEL(sioc), export,
                                tlscreds, hostname,
                                &client->ioc, &client->info, errp);
@@ -381,6 +833,12 @@ int nbd_client_init(BlockDriverState *bs,
        logout("Failed to negotiate with the NBD server\n");
        return ret;
    }
+    if (client->info.flags & NBD_FLAG_READ_ONLY &&
+        !bdrv_is_read_only(bs)) {
+        error_setg(errp,
+                   "request for write access conflicts with read-only export");
+        return -EACCES;
+    }
    if (client->info.flags & NBD_FLAG_SEND_FUA) {
        bs->supported_write_flags = BDRV_REQ_FUA;
        bs->supported_zero_flags |= BDRV_REQ_FUA;
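Note on nbd_iter_error() in block/nbd-client.c above: the iterator keeps the first error it sees, unless a later error is fatal, in which case the fatal one replaces it. The following is only a minimal standalone sketch of that policy, not the QEMU code. It assumes plain C strings in place of QEMU's Error objects, and every name in it (ChunkIterSketch, sketch_strdup, chunk_iter_sketch_error) is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for NBDReplyChunkIter: a string replaces Error. */
typedef struct ChunkIterSketch {
    int ret;      /* 0 until the first error is recorded */
    char *err;    /* owned message for the recorded error, or NULL */
} ChunkIterSketch;

/* Duplicate a message; the sketch simply asserts that allocation worked. */
static char *sketch_strdup(const char *msg)
{
    size_t len = strlen(msg) + 1;
    char *copy = malloc(len);

    assert(copy != NULL);
    memcpy(copy, msg, len);
    return copy;
}

/* Record an error: a fatal error always wins, otherwise the first sticks. */
static void chunk_iter_sketch_error(ChunkIterSketch *iter, bool fatal,
                                    int ret, const char *msg)
{
    assert(ret < 0);

    if (fatal || iter->ret == 0) {
        free(iter->err);          /* drop any earlier, non-fatal error */
        iter->ret = ret;
        iter->err = sketch_strdup(msg);
    }
    /* else: an error is already recorded and this one is not fatal */
}
```

With this policy a caller that loops over many chunks reports the earliest server-side error, while a broken connection still takes precedence over it.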
--- a/block/nbd-client.h
+++ b/block/nbd-client.h
@@ -19,6 +19,7 @@

 typedef struct {
    Coroutine *coroutine;
+    uint64_t offset;        /* original offset of the request */
    bool receiving;         /* waiting for read_reply_co? */
 } NBDClientRequest;

--- a/block/parallels.c
+++ b/block/parallels.c
@@ -35,6 +35,7 @@
 #include "qemu/module.h"
 #include "qemu/bswap.h"
 #include "qemu/bitmap.h"
+#include "migration/blocker.h"

 /**************************************************************/

@@ -100,6 +101,7 @@ typedef struct BDRVParallelsState {
    unsigned int tracks;

    unsigned int off_multiplier;
+    Error *migration_blocker;
 } BDRVParallelsState;


@@ -708,7 +710,7 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
        s->prealloc_mode = PRL_PREALLOC_MODE_FALLOCATE;
    }

-    if (flags & BDRV_O_RDWR) {
+    if ((flags & BDRV_O_RDWR) && !(flags & BDRV_O_INACTIVE)) {
        s->header->inuse = cpu_to_le32(HEADER_INUSE_MAGIC);
        ret = parallels_update_header(bs);
        if (ret < 0) {
@@ -720,6 +722,16 @@ static int parallels_open(BlockDriverState *bs, QDict *options, int flags,
    s->bat_dirty_bmap =
        bitmap_new(DIV_ROUND_UP(s->header_size, s->bat_dirty_block));

+    /* Disable migration until bdrv_invalidate_cache method is added */
+    error_setg(&s->migration_blocker, "The Parallels format used by node '%s' "
+               "does not support live migration",
+               bdrv_get_device_or_node_name(bs));
+    ret = migrate_add_blocker(s->migration_blocker, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        error_free(s->migration_blocker);
+        goto fail;
+    }
    qemu_co_mutex_init(&s->lock);
    return 0;

@@ -741,18 +753,18 @@ static void parallels_close(BlockDriverState *bs)
 {
    BDRVParallelsState *s = bs->opaque;

-    if (bs->open_flags & BDRV_O_RDWR) {
+    if ((bs->open_flags & BDRV_O_RDWR) && !(bs->open_flags & BDRV_O_INACTIVE)) {
        s->header->inuse = 0;
        parallels_update_header(bs);
-    }
-
-    if (bs->open_flags & BDRV_O_RDWR) {
        bdrv_truncate(bs->file, s->data_end << BDRV_SECTOR_BITS,
                      PREALLOC_MODE_OFF, NULL);
    }

    g_free(s->bat_dirty_bmap);
    qemu_vfree(s->header);
+
+    migrate_del_blocker(s->migration_blocker);
+    error_free(s->migration_blocker);
 }

 static QemuOptsList parallels_create_opts = {
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -39,8 +39,14 @@ BlockDeviceInfo *bdrv_block_device_info(BlockBackend *blk,
 {
    ImageInfo **p_image_info;
    BlockDriverState *bs0;
-    BlockDeviceInfo *info = g_malloc0(sizeof(*info));
+    BlockDeviceInfo *info;

+    if (!bs->drv) {
+        error_setg(errp, "Block device %s is ejected", bs->node_name);
+        return NULL;
+    }
+
+    info = g_malloc0(sizeof(*info));
    info->file                   = g_strdup(bs->filename);
    info->ro                     = bs->read_only;
    info->drv                    = g_strdup(bs->drv->format_name);
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -478,7 +478,9 @@ static int get_cluster_offset(BlockDriverState *bs,
                    for(i = 0; i < s->cluster_sectors; i++) {
                        if (i < n_start || i >= n_end) {
                            memset(s->cluster_data, 0x00, 512);
-                            if (qcrypto_block_encrypt(s->crypto, start_sect + i,
+                            if (qcrypto_block_encrypt(s->crypto,
+                                                      (start_sect + i) *
+                                                      BDRV_SECTOR_SIZE,
                                                      s->cluster_data,
                                                      BDRV_SECTOR_SIZE,
                                                      NULL) < 0) {
@@ -668,7 +670,8 @@ static coroutine_fn int qcow_co_readv(BlockDriverState *bs, int64_t sector_num,
            }
            if (bs->encrypted) {
                assert(s->crypto);
-                if (qcrypto_block_decrypt(s->crypto, sector_num, buf,
+                if (qcrypto_block_decrypt(s->crypto,
+                                          sector_num * BDRV_SECTOR_SIZE, buf,
                                          n * BDRV_SECTOR_SIZE, NULL) < 0) {
                    ret = -EIO;
                    break;
@@ -740,8 +743,8 @@ static coroutine_fn int qcow_co_writev(BlockDriverState *bs, int64_t sector_num,
        }
        if (bs->encrypted) {
            assert(s->crypto);
-            if (qcrypto_block_encrypt(s->crypto, sector_num, buf,
-                                      n * BDRV_SECTOR_SIZE, NULL) < 0) {
+            if (qcrypto_block_encrypt(s->crypto, sector_num * BDRV_SECTOR_SIZE,
+                                      buf, n * BDRV_SECTOR_SIZE, NULL) < 0) {
                ret = -EIO;
                break;
            }
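Note on the block/qcow.c hunks above: the qcrypto_block_encrypt() and qcrypto_block_decrypt() call sites now pass a byte offset where they used to pass a sector number. A minimal sketch of that conversion follows. It assumes a 512-byte sector, which is what BDRV_SECTOR_SIZE is in QEMU; the macro and function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match QEMU's BDRV_SECTOR_SIZE (512 bytes); hypothetical name. */
#define SKETCH_SECTOR_SIZE 512ULL

/* Convert a sector index into the byte offset passed to the crypto calls. */
static uint64_t sketch_sector_to_offset(uint64_t sector_num)
{
    /* Reject inputs whose byte offset would not fit in 64 bits. */
    assert(sector_num <= UINT64_MAX / SKETCH_SECTOR_SIZE);
    return sector_num * SKETCH_SECTOR_SIZE;
}
```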
--- a/block/qcow2-bitmap.c
+++ b/block/qcow2-bitmap.c
@@ -269,15 +269,16 @@ static int free_bitmap_clusters(BlockDriverState *bs, Qcow2BitmapTable *tb)
    return 0;
 }

-/* This function returns the number of disk sectors covered by a single qcow2
- * cluster of bitmap data. */
-static uint64_t sectors_covered_by_bitmap_cluster(const BDRVQcow2State *s,
+/* Return the disk size covered by a single qcow2 cluster of bitmap data. */
+static uint64_t bytes_covered_by_bitmap_cluster(const BDRVQcow2State *s,
                                                const BdrvDirtyBitmap *bitmap)
 {
-    uint32_t sector_granularity =
-            bdrv_dirty_bitmap_granularity(bitmap) >> BDRV_SECTOR_BITS;
+    uint64_t granularity = bdrv_dirty_bitmap_granularity(bitmap);
+    uint64_t limit = granularity * (s->cluster_size << 3);

-    return (uint64_t)sector_granularity * (s->cluster_size << 3);
+    assert(QEMU_IS_ALIGNED(limit,
+                           bdrv_dirty_bitmap_serialization_align(bitmap)));
+    return limit;
 }

 /* load_bitmap_data
@@ -290,7 +291,7 @@ static int load_bitmap_data(BlockDriverState *bs,
 {
    int ret = 0;
    BDRVQcow2State *s = bs->opaque;
-    uint64_t sector, sbc;
+    uint64_t offset, limit;
    uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
    uint8_t *buf = NULL;
    uint64_t i, tab_size =
@@ -302,28 +303,28 @@ static int load_bitmap_data(BlockDriverState *bs,
    }

    buf = g_malloc(s->cluster_size);
-    sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
-    for (i = 0, sector = 0; i < tab_size; ++i, sector += sbc) {
-        uint64_t count = MIN(bm_size - sector, sbc);
+    limit = bytes_covered_by_bitmap_cluster(s, bitmap);
+    for (i = 0, offset = 0; i < tab_size; ++i, offset += limit) {
+        uint64_t count = MIN(bm_size - offset, limit);
        uint64_t entry = bitmap_table[i];
-        uint64_t offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;
+        uint64_t data_offset = entry & BME_TABLE_ENTRY_OFFSET_MASK;

        assert(check_table_entry(entry, s->cluster_size) == 0);

-        if (offset == 0) {
+        if (data_offset == 0) {
            if (entry & BME_TABLE_ENTRY_FLAG_ALL_ONES) {
-                bdrv_dirty_bitmap_deserialize_ones(bitmap, sector, count,
+                bdrv_dirty_bitmap_deserialize_ones(bitmap, offset, count,
                                                   false);
            } else {
                /* No need to deserialize zeros because the dirty bitmap is
                 * already cleared */
            }
        } else {
-            ret = bdrv_pread(bs->file, offset, buf, s->cluster_size);
+            ret = bdrv_pread(bs->file, data_offset, buf, s->cluster_size);
            if (ret < 0) {
                goto finish;
            }
-            bdrv_dirty_bitmap_deserialize_part(bitmap, buf, sector, count,
+            bdrv_dirty_bitmap_deserialize_part(bitmap, buf, offset, count,
                                               false);
        }
    }
@@ -602,7 +603,7 @@ static Qcow2BitmapList *bitmap_list_load(BlockDriverState *bs, uint64_t offset,
            goto fail;
        }

-        bm = g_new(Qcow2Bitmap, 1);
+        bm = g_new0(Qcow2Bitmap, 1);
        bm->table.offset = e->bitmap_table_offset;
        bm->table.size = e->bitmap_table_size;
        bm->flags = e->flags;
@@ -1071,8 +1072,8 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
 {
    int ret;
    BDRVQcow2State *s = bs->opaque;
-    int64_t sector;
-    uint64_t sbc;
+    int64_t offset;
+    uint64_t limit;
    uint64_t bm_size = bdrv_dirty_bitmap_size(bitmap);
    const char *bm_name = bdrv_dirty_bitmap_name(bitmap);
    uint8_t *buf = NULL;
@@ -1095,20 +1096,25 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
        return NULL;
    }

-    dbi = bdrv_dirty_iter_new(bitmap, 0);
+    dbi = bdrv_dirty_iter_new(bitmap);
    buf = g_malloc(s->cluster_size);
-    sbc = sectors_covered_by_bitmap_cluster(s, bitmap);
-    assert(DIV_ROUND_UP(bm_size, sbc) == tb_size);
+    limit = bytes_covered_by_bitmap_cluster(s, bitmap);
+    assert(DIV_ROUND_UP(bm_size, limit) == tb_size);

-    while ((sector = bdrv_dirty_iter_next(dbi)) != -1) {
-        uint64_t cluster = sector / sbc;
+    while ((offset = bdrv_dirty_iter_next(dbi)) >= 0) {
+        uint64_t cluster = offset / limit;
        uint64_t end, write_size;
        int64_t off;

-        sector = cluster * sbc;
-        end = MIN(bm_size, sector + sbc);
-        write_size =
-            bdrv_dirty_bitmap_serialization_size(bitmap, sector, end - sector);
+        /*
+         * We found the first dirty offset, but want to write out the
+         * entire cluster of the bitmap that includes that offset,
+         * including any leading zero bits.
+         */
+        offset = QEMU_ALIGN_DOWN(offset, limit);
+        end = MIN(bm_size, offset + limit);
+        write_size = bdrv_dirty_bitmap_serialization_size(bitmap, offset,
+                                                          end - offset);
        assert(write_size <= s->cluster_size);

        off = qcow2_alloc_clusters(bs, s->cluster_size);
@@ -1120,7 +1126,7 @@ static uint64_t *store_bitmap_data(BlockDriverState *bs,
        }
        tb[cluster] = off;

-        bdrv_dirty_bitmap_serialize_part(bitmap, buf, sector, end - sector);
+        bdrv_dirty_bitmap_serialize_part(bitmap, buf, offset, end - offset);
        if (write_size < s->cluster_size) {
            memset(buf + write_size, 0, s->cluster_size - write_size);
        }
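Note on bytes_covered_by_bitmap_cluster() and store_bitmap_data() in block/qcow2-bitmap.c above: one cluster of bitmap data holds cluster_size * 8 bits, and each bit covers one granularity unit of the disk. The arithmetic can be sketched on its own as shown here. This is an illustration under stated assumptions rather than the QEMU implementation: all function names are hypothetical, and the alignment assertion from the patch is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of guest disk described by one cluster of bitmap data:
 * each of the cluster's bits covers `granularity` bytes. */
static uint64_t sketch_bytes_per_bitmap_cluster(uint64_t granularity,
                                                uint64_t cluster_size)
{
    return granularity * (cluster_size << 3);
}

/* Round `offset` down to the start of the bitmap cluster containing it. */
static uint64_t sketch_align_down(uint64_t offset, uint64_t limit)
{
    assert(limit > 0);
    return offset / limit * limit;
}

/* End of the range serialized for that cluster, clamped to the bitmap size. */
static uint64_t sketch_range_end(uint64_t start, uint64_t limit,
                                 uint64_t bm_size)
{
    uint64_t end = start + limit;

    return end < bm_size ? end : bm_size;
}
```

For example, with a 64 KiB granularity and 64 KiB clusters, one bitmap cluster covers 2^35 bytes (32 GiB) of disk.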
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -62,6 +62,18 @@ static inline int qcow2_cache_get_table_idx(BlockDriverState *bs,
    return idx;
 }

+static inline const char *qcow2_cache_get_name(BDRVQcow2State *s, Qcow2Cache *c)
+{
+    if (c == s->refcount_block_cache) {
+        return "refcount block";
+    } else if (c == s->l2_table_cache) {
+        return "L2 table";
+    } else {
+        /* Do not abort, because this is not critical */
+        return "unknown";
+    }
+}
+
 static void qcow2_cache_table_release(BlockDriverState *bs, Qcow2Cache *c,
                                      int i, int num_tables)
 {
@@ -73,7 +85,7 @@ static void qcow2_cache_table_release(BlockDriverState *bs, Qcow2Cache *c,
    size_t mem_size = (size_t) s->cluster_size * num_tables;
    size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
    size_t length = QEMU_ALIGN_DOWN(mem_size - offset, align);
-    if (length > 0) {
+    if (mem_size > offset && length > 0) {
        madvise((uint8_t *) t + offset, length, MADV_DONTNEED);
    }
 #endif
@@ -314,9 +326,18 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
    uint64_t min_lru_counter = UINT64_MAX;
    int min_lru_index = -1;

+    assert(offset != 0);
+
    trace_qcow2_cache_get(qemu_coroutine_self(), c == s->l2_table_cache,
                          offset, read_from_disk);

+    if (offset_into_cluster(s, offset)) {
+        qcow2_signal_corruption(bs, true, -1, -1, "Cannot get entry from %s "
+                                "cache: Offset %#" PRIx64 " is unaligned",
+                                qcow2_cache_get_name(s, c), offset);
+        return -EIO;
+    }
+
    /* Check if the table is already cached */
    i = lookup_index = (offset / s->cluster_size * 4) % c->size;
    do {
@@ -411,3 +432,29 @@ void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
    assert(c->entries[i].offset != 0);
    c->entries[i].dirty = true;
 }
+
+void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
+                                  uint64_t offset)
+{
+    int i;
+
+    for (i = 0; i < c->size; i++) {
+        if (c->entries[i].offset == offset) {
+            return qcow2_cache_get_table_addr(bs, c, i);
+        }
+    }
+    return NULL;
+}
+
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table)
+{
+    int i = qcow2_cache_get_table_idx(bs, c, table);
+
+    assert(c->entries[i].ref == 0);
+
+    c->entries[i].offset = 0;
+    c->entries[i].lru_counter = 0;
+    c->entries[i].dirty = false;
+
+    qcow2_cache_table_release(bs, c, i, 1);
+}
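Note on the qcow2_cache_table_release() change in block/qcow2-cache.c above: the new `mem_size > offset` test matters because both values are size_t, so `mem_size - offset` wraps around to a very large number whenever the table is smaller than the gap up to the next aligned address. The sketch below isolates that range computation. It is a hypothetical helper written for illustration only; it does not call madvise() and is not part of QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compute the aligned sub-range of [t, t + mem_size) that could be handed
 * to madvise(). Returns false when no whole aligned block fits.
 * `align` must be nonzero; all names here are hypothetical. */
static bool sketch_release_range(uintptr_t t, size_t mem_size, size_t align,
                                 size_t *offset, size_t *length)
{
    assert(align > 0);

    /* Distance from t up to the next aligned address. */
    uintptr_t aligned = (t + align - 1) / align * align;
    *offset = (size_t)(aligned - t);

    /* Without this check, mem_size - *offset would wrap around for a
     * table smaller than the alignment gap, yielding a huge length. */
    if (mem_size <= *offset) {
        *length = 0;
        return false;
    }

    /* Round the remaining size down to a multiple of the alignment. */
    *length = (mem_size - *offset) / align * align;
    return *length > 0;
}
```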
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -32,6 +32,56 @@
 #include "qemu/bswap.h"
 #include "trace.h"

+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t exact_size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int new_l1_size, i, ret;
+
+    if (exact_size >= s->l1_size) {
+        return 0;
+    }
+
+    new_l1_size = exact_size;
+
+#ifdef DEBUG_ALLOC2
+    fprintf(stderr, "shrink l1_table from %d to %d\n", s->l1_size, new_l1_size);
+#endif
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_WRITE_TABLE);
+    ret = bdrv_pwrite_zeroes(bs->file, s->l1_table_offset +
+                                       new_l1_size * sizeof(uint64_t),
+                             (s->l1_size - new_l1_size) * sizeof(uint64_t), 0);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    ret = bdrv_flush(bs->file->bs);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_L1_SHRINK_FREE_L2_CLUSTERS);
+    for (i = s->l1_size - 1; i > new_l1_size - 1; i--) {
+        if ((s->l1_table[i] & L1E_OFFSET_MASK) == 0) {
+            continue;
+        }
+        qcow2_free_clusters(bs, s->l1_table[i] & L1E_OFFSET_MASK,
+                            s->cluster_size, QCOW2_DISCARD_ALWAYS);
+        s->l1_table[i] = 0;
+    }
+    return 0;
+
+fail:
+    /*
+     * If the write to the l1_table failed, the image may contain a partially
+     * overwritten l1_table. In that case, clear the dropped entries of the
+     * in-memory l1_table to avoid possible image corruption.
+     */
+    memset(s->l1_table + new_l1_size, 0,
+           (s->l1_size - new_l1_size) * sizeof(uint64_t));
+    return ret;
+}
+
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                        bool exact_size)
 {
@@ -228,6 +278,14 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
        goto fail;
    }

+    /* If we're allocating the table at offset 0 then something is wrong */
+    if (l2_offset == 0) {
+        qcow2_signal_corruption(bs, true, -1, -1, "Preventing invalid "
+                                "allocation of L2 table at offset 0");
+        ret = -EIO;
+        goto fail;
+    }
+
    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
    if (ret < 0) {
        goto fail;
@@ -396,15 +454,13 @@ static bool coroutine_fn do_perform_cow_encrypt(BlockDriverState *bs,
 {
    if (bytes && bs->encrypted) {
        BDRVQcow2State *s = bs->opaque;
-        int64_t sector = (s->crypt_physical_offset ?
+        int64_t offset = (s->crypt_physical_offset ?
                          (cluster_offset + offset_in_cluster) :
-                          (src_cluster_offset + offset_in_cluster))
-                         >> BDRV_SECTOR_BITS;
+                          (src_cluster_offset + offset_in_cluster));
        assert((offset_in_cluster & ~BDRV_SECTOR_MASK) == 0);
        assert((bytes & ~BDRV_SECTOR_MASK) == 0);
        assert(s->crypto);
-        if (qcrypto_block_encrypt(s->crypto, sector, buffer,
-                                  bytes, NULL) < 0) {
+        if (qcrypto_block_encrypt(s->crypto, offset, buffer, bytes, NULL) < 0) {
            return false;
        }
    }
@@ -1252,10 +1308,21 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset,
        (!*host_offset ||
         start_of_cluster(s, *host_offset) == (entry & L2E_OFFSET_MASK)))
    {
+        int preallocated_nb_clusters;
+
+        if (offset_into_cluster(s, entry & L2E_OFFSET_MASK)) {
+            qcow2_signal_corruption(bs, true, -1, -1, "Preallocated zero "
+                                    "cluster offset %#llx unaligned (guest "
+                                    "offset: %#" PRIx64 ")",
+                                    entry & L2E_OFFSET_MASK, guest_offset);
+            ret = -EIO;
+            goto fail;
+        }
+
        /* Try to reuse preallocated zero clusters; contiguous normal clusters
         * would be fine, too, but count_cow_clusters() above has limited
         * nb_clusters already to a range of COW clusters */
-        int preallocated_nb_clusters =
+        preallocated_nb_clusters =
            count_contiguous_clusters(nb_clusters, s->cluster_size,
                                      &l2_table[l2_index], QCOW_OFLAG_COPIED);
        assert(preallocated_nb_clusters > 0);
@@ -1584,7 +1651,7 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
         * cluster is already marked as zero, or if it's unallocated and we
         * don't have a backing file.
         *
-         * TODO We might want to use bdrv_get_block_status(bs) here, but we're
+         * TODO We might want to use bdrv_block_status(bs) here, but we're
         * holding s->lock, so that doesn't work today.
         *
         * If full_discard is true, the sector should not read back as zeroes,
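Note on qcow2_shrink_l1_table() in block/qcow2-cluster.c above: the function zeroes the tail of the on-disk L1 table, starting at the first dropped entry. The byte range it passes to bdrv_pwrite_zeroes() can be sketched as follows. The struct and function names are hypothetical, and the sketch covers only the arithmetic, not the flush or the freeing of L2 clusters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the byte range of the L1 table to zero on disk. */
typedef struct SketchL1ZeroRange {
    uint64_t start;   /* file offset of the first dropped L1 entry */
    uint64_t length;  /* number of bytes to zero; 0 means nothing to do */
} SketchL1ZeroRange;

static SketchL1ZeroRange sketch_l1_shrink_range(uint64_t l1_table_offset,
                                                uint64_t l1_size,
                                                uint64_t new_l1_size)
{
    SketchL1ZeroRange r = { 0, 0 };

    /* Shrinking to the same or a larger size is a no-op. */
    if (new_l1_size >= l1_size) {
        return r;
    }

    /* Each L1 entry is a 64-bit value. */
    r.start = l1_table_offset + new_l1_size * sizeof(uint64_t);
    r.length = (l1_size - new_l1_size) * sizeof(uint64_t);
    return r;
}
```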
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -29,6 +29,7 @@
 #include "block/qcow2.h"
 #include "qemu/range.h"
 #include "qemu/bswap.h"
+#include "qemu/cutils.h"

 static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
@@ -366,6 +367,13 @@ static int alloc_refcount_block(BlockDriverState *bs,
        return new_block;
    }

+    /* If we're allocating the block at offset 0 then something is wrong */
+    if (new_block == 0) {
+        qcow2_signal_corruption(bs, true, -1, -1, "Preventing invalid "
+                                "allocation of refcount block at offset 0");
+        return -EIO;
+    }
+
 #ifdef DEBUG_ALLOC2
    fprintf(stderr, "qcow2: Allocate refcount block %d for %" PRIx64
        " at %" PRIx64 "\n",
@@ -861,10 +869,26 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
        }
        s->set_refcount(refcount_block, block_index, refcount);

-        if (refcount == 0 && s->discard_passthrough[type]) {
+        if (refcount == 0) {
+            void *table;
+
+            table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+                                                offset);
+            if (table != NULL) {
+                qcow2_cache_put(bs, s->refcount_block_cache, &refcount_block);
+                qcow2_cache_discard(bs, s->refcount_block_cache, table);
+            }
+
+            table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
+            if (table != NULL) {
+                qcow2_cache_discard(bs, s->l2_table_cache, table);
+            }
+
+            if (s->discard_passthrough[type]) {
                update_refcount_discard(bs, cluster_offset, s->cluster_size);
            }
        }
+    }

    ret = 0;
 fail:
@@ -1058,6 +1082,13 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
                return new_cluster;
            }

+            if (new_cluster == 0) {
+                qcow2_signal_corruption(bs, true, -1, -1, "Preventing invalid "
+                                        "allocation of compressed cluster "
+                                        "at offset 0");
+                return -EIO;
+            }
+
            if (!offset || ROUND_UP(offset, s->cluster_size) != new_cluster) {
                offset = new_cluster;
                free_in_cluster = s->cluster_size;
@@ -3045,3 +3076,168 @@ done:
    qemu_vfree(new_refblock);
    return ret;
 }
+
+static int64_t get_refblock_offset(BlockDriverState *bs, uint64_t offset)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint32_t index = offset_to_reftable_index(s, offset);
+    int64_t covering_refblock_offset = 0;
+
+    if (index < s->refcount_table_size) {
+        covering_refblock_offset = s->refcount_table[index] & REFT_OFFSET_MASK;
+    }
+    if (!covering_refblock_offset) {
+        qcow2_signal_corruption(bs, true, -1, -1, "Refblock at %#" PRIx64 " is "
+                                "not covered by the refcount structures",
+                                offset);
+        return -EIO;
+    }
+
+    return covering_refblock_offset;
+}
+
+static int qcow2_discard_refcount_block(BlockDriverState *bs,
+                                        uint64_t discard_block_offs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t refblock_offs;
+    uint64_t cluster_index = discard_block_offs >> s->cluster_bits;
+    uint32_t block_index = cluster_index & (s->refcount_block_size - 1);
+    void *refblock;
+    int ret;
+
+    refblock_offs = get_refblock_offset(bs, discard_block_offs);
+    if (refblock_offs < 0) {
+        return refblock_offs;
+    }
+
+    assert(discard_block_offs != 0);
+
+    ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
+                          &refblock);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (s->get_refcount(refblock, block_index) != 1) {
+        qcow2_signal_corruption(bs, true, -1, -1, "Invalid refcount:"
+                                " refblock offset %#" PRIx64
+                                ", reftable index %u"
+                                ", block offset %#" PRIx64
+                                ", refcount %#" PRIx64,
+                                refblock_offs,
+                                offset_to_reftable_index(s, discard_block_offs),
+                                discard_block_offs,
+                                s->get_refcount(refblock, block_index));
+        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+        return -EINVAL;
+    }
+    s->set_refcount(refblock, block_index, 0);
+
+    qcow2_cache_entry_mark_dirty(bs, s->refcount_block_cache, refblock);
+
+    qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+
+    if (cluster_index < s->free_cluster_index) {
+        s->free_cluster_index = cluster_index;
+    }
+
+    refblock = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+                                           discard_block_offs);
+    if (refblock) {
+        /* discard refblock from the cache if refblock is cached */
+        qcow2_cache_discard(bs, s->refcount_block_cache, refblock);
+    }
+    update_refcount_discard(bs, discard_block_offs, s->cluster_size);
+
+    return 0;
+}
+
+int qcow2_shrink_reftable(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    uint64_t *reftable_tmp =
+        g_malloc(s->refcount_table_size * sizeof(uint64_t));
+    int i, ret;
+
+    for (i = 0; i < s->refcount_table_size; i++) {
+        int64_t refblock_offs = s->refcount_table[i] & REFT_OFFSET_MASK;
+        void *refblock;
+        bool unused_block;
+
+        if (refblock_offs == 0) {
+            reftable_tmp[i] = 0;
+            continue;
+        }
+        ret = qcow2_cache_get(bs, s->refcount_block_cache, refblock_offs,
+                              &refblock);
+        if (ret < 0) {
+            goto out;
+        }
+
+        /* the refblock has own reference */
+        if (i == offset_to_reftable_index(s, refblock_offs)) {
+            uint64_t block_index = (refblock_offs >> s->cluster_bits) &
+                                   (s->refcount_block_size - 1);
+            uint64_t refcount = s->get_refcount(refblock, block_index);
+
+            s->set_refcount(refblock, block_index, 0);
+
+            unused_block = buffer_is_zero(refblock, s->cluster_size);
+
+            s->set_refcount(refblock, block_index, refcount);
+        } else {
+            unused_block = buffer_is_zero(refblock, s->cluster_size);
+        }
+        qcow2_cache_put(bs, s->refcount_block_cache, &refblock);
+
+        reftable_tmp[i] = unused_block ? 0 : cpu_to_be64(s->refcount_table[i]);
+    }
+
+    ret = bdrv_pwrite_sync(bs->file, s->refcount_table_offset, reftable_tmp,
+                           s->refcount_table_size * sizeof(uint64_t));
+    /*
+     * If the write in the reftable failed the image may contain a partially
+     * overwritten reftable. In this case it would be better to clear the
+     * reftable in memory to avoid possible image corruption.
+     */
+    for (i = 0; i < s->refcount_table_size; i++) {
+        if (s->refcount_table[i] && !reftable_tmp[i]) {
+            if (ret == 0) {
+                ret = qcow2_discard_refcount_block(bs, s->refcount_table[i] &
+                                                       REFT_OFFSET_MASK);
+            }
+            s->refcount_table[i] = 0;
+        }
+    }
+
+    if (!s->cache_discards) {
+        qcow2_process_discards(bs, ret);
+    }
+
+out:
+    g_free(reftable_tmp);
+    return ret;
+}
+
+int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int64_t i;
+
+    for (i = size_to_clusters(s, size) - 1; i >= 0; i--) {
+        uint64_t refcount;
+        int ret = qcow2_get_refcount(bs, i, &refcount);
+        if (ret < 0) {
+            fprintf(stderr, "Can't get refcount for cluster %" PRId64 ": %s\n",
+                    i, strerror(-ret));
+            return ret;
+        }
+        if (refcount > 0) {
+            return i;
+        }
+    }
+    qcow2_signal_corruption(bs, true, -1, -1,
+                            "There are no references in the refcount table.");
+    return -EIO;
+}
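A minimal sketch of the backward scan that qcow2_get_last_cluster() performs above, and of how the result becomes a file length to truncate to. The refcount array and the helper names last_used_cluster and truncate_target are hypothetical stand-ins for qcow2_get_refcount() and the image state. This is an illustration only, not the QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: refcounts[i] plays the role of the value that
 * qcow2_get_refcount() would report for cluster i. */
static int64_t last_used_cluster(const uint16_t *refcounts, int64_t nb_clusters)
{
    int64_t i;

    /* Walk from the end of the file towards the start. */
    for (i = nb_clusters - 1; i >= 0; i--) {
        if (refcounts[i] > 0) {
            return i;
        }
    }
    return -1; /* nothing referenced; the patch signals corruption here */
}

/* File length that keeps every cluster up to and including last_cluster. */
static int64_t truncate_target(int64_t last_cluster, int64_t cluster_size)
{
    assert(last_cluster >= 0 && cluster_size > 0);
    return (last_cluster + 1) * cluster_size;
}
```

With refcounts {1, 1, 0, 2, 0, 0}, the scan stops at cluster 3, so only the two trailing unreferenced clusters are candidates for truncation. The hole at cluster 2 stays in the file.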
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -126,6 +126,7 @@ static ssize_t qcow2_crypto_hdr_init_func(QCryptoBlock *block, size_t headerlen,
    /* Zero fill remaining space in cluster so it has predictable
     * content in case of future spec changes */
    clusterlen = size_to_clusters(s, headerlen) * s->cluster_size;
+    assert(qcow2_pre_write_overlap_check(bs, 0, ret, clusterlen) == 0);
    ret = bdrv_pwrite_zeroes(bs->file,
                             ret + headerlen,
                             clusterlen - headerlen, 0);
@@ -375,6 +376,8 @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset,

        default:
            /* unknown magic - save it in case we need to rewrite the header */
+            /* If you add a new feature, make sure to also update the fast
+             * path of qcow2_make_empty() to deal with it. */
            {
                Qcow2UnknownHeaderExtension *uext;

@@ -1139,7 +1142,7 @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,

    s->cluster_bits = header.cluster_bits;
    s->cluster_size = 1 << s->cluster_bits;
-    s->cluster_sectors = 1 << (s->cluster_bits - 9);
+    s->cluster_sectors = 1 << (s->cluster_bits - BDRV_SECTOR_BITS);

    /* Initialise version 3 header fields */
    if (header.version == 2) {
@@ -1280,6 +1283,12 @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
        goto fail;
    }

+    if (header.refcount_table_clusters == 0 && !(flags & BDRV_O_CHECK)) {
+        error_setg(errp, "Image does not contain a reference count table");
+        ret = -EINVAL;
+        goto fail;
+    }
+
    ret = validate_table_offset(bs, s->refcount_table_offset,
                                s->refcount_table_size, sizeof(uint64_t));
    if (ret < 0) {
@@ -1468,7 +1477,10 @@ static int qcow2_do_open(BlockDriverState *bs, QDict *options, int flags,
        BdrvCheckResult result = {0};

        ret = qcow2_check(bs, &result, BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
-        if (ret < 0) {
+        if (ret < 0 || result.check_errors) {
+            if (ret >= 0) {
+                ret = -EIO;
+            }
            error_setg_errno(errp, -ret, "Could not repair dirty image");
            goto fail;
        }
@@ -1636,7 +1648,7 @@ static int64_t coroutine_fn qcow2_co_get_block_status(BlockDriverState *bs,

    bytes = MIN(INT_MAX, nb_sectors * BDRV_SECTOR_SIZE);
    qemu_co_mutex_lock(&s->lock);
-    ret = qcow2_get_cluster_offset(bs, sector_num << 9, &bytes,
+    ret = qcow2_get_cluster_offset(bs, sector_num << BDRV_SECTOR_BITS, &bytes,
                                   &cluster_offset);
    qemu_co_mutex_unlock(&s->lock);
    if (ret < 0) {
@@ -1811,7 +1823,7 @@ static coroutine_fn int qcow2_co_preadv(BlockDriverState *bs, uint64_t offset,
                if (qcrypto_block_decrypt(s->crypto,
                                          (s->crypt_physical_offset ?
                                           cluster_offset + offset_in_cluster :
-                                           offset) >> BDRV_SECTOR_BITS,
+                                           offset),
                                          cluster_data,
                                          cur_bytes,
                                          NULL) < 0) {
@@ -1946,7 +1958,7 @@ static coroutine_fn int qcow2_co_pwritev(BlockDriverState *bs, uint64_t offset,
            if (qcrypto_block_encrypt(s->crypto,
                                      (s->crypt_physical_offset ?
                                       cluster_offset + offset_in_cluster :
-                                       offset) >> BDRV_SECTOR_BITS,
+                                       offset),
                                      cluster_data,
                                      cur_bytes, NULL) < 0) {
                ret = -EIO;
@@ -2460,6 +2472,14 @@ static int qcow2_set_up_encryption(BlockDriverState *bs, const char *encryptfmt,
 }


+typedef struct PreallocCo {
+    BlockDriverState *bs;
+    uint64_t offset;
+    uint64_t new_length;
+
+    int ret;
+} PreallocCo;
+
 /**
 * Preallocates metadata structures for data clusters between @offset (in the
 * guest disk) and @new_length (which is thus generally the new guest disk
@@ -2467,9 +2487,12 @@ static int qcow2_set_up_encryption(BlockDriverState *bs, const char *encryptfmt,
 *
 * Returns: 0 on success, -errno on failure.
 */
-static int preallocate(BlockDriverState *bs,
-                       uint64_t offset, uint64_t new_length)
+static void coroutine_fn preallocate_co(void *opaque)
 {
+    PreallocCo *params = opaque;
+    BlockDriverState *bs = params->bs;
+    uint64_t offset = params->offset;
+    uint64_t new_length = params->new_length;
    BDRVQcow2State *s = bs->opaque;
    uint64_t bytes;
    uint64_t host_offset = 0;
@@ -2477,9 +2500,7 @@ static int preallocate(BlockDriverState *bs,
    int ret;
    QCowL2Meta *meta;

-    if (qemu_in_coroutine()) {
    qemu_co_mutex_lock(&s->lock);
-    }

    assert(offset <= new_length);
    bytes = new_length - offset;
@@ -2533,10 +2554,28 @@ static int preallocate(BlockDriverState *bs,
    ret = 0;

 done:
-    if (qemu_in_coroutine()) {
    qemu_co_mutex_unlock(&s->lock);
+    params->ret = ret;
+}
+
+static int preallocate(BlockDriverState *bs,
+                       uint64_t offset, uint64_t new_length)
+{
+    PreallocCo params = {
+        .bs         = bs,
+        .offset     = offset,
+        .new_length = new_length,
+        .ret        = -EINPROGRESS,
+    };
+
+    if (qemu_in_coroutine()) {
+        preallocate_co(&params);
+    } else {
+        Coroutine *co = qemu_coroutine_create(preallocate_co, &params);
+        bdrv_coroutine_enter(bs, co);
+        BDRV_POLL_WHILE(bs, params.ret == -EINPROGRESS);
    }
-    return ret;
+    return params.ret;
 }

 /* qcow2_refcount_metadata_size:
@@ -2972,23 +3011,21 @@ finish:
 }


-static bool is_zero_sectors(BlockDriverState *bs, int64_t start,
-                            uint32_t count)
+static bool is_zero(BlockDriverState *bs, int64_t offset, int64_t bytes)
 {
-    int nr;
-    BlockDriverState *file;
-    int64_t res;
+    int64_t nr;
+    int res;

-    if (start + count > bs->total_sectors) {
-        count = bs->total_sectors - start;
+    /* Clamp to image length, before checking status of underlying sectors */
+    if (offset + bytes > bs->total_sectors * BDRV_SECTOR_SIZE) {
+        bytes = bs->total_sectors * BDRV_SECTOR_SIZE - offset;
    }

-    if (!count) {
+    if (!bytes) {
        return true;
    }
-    res = bdrv_get_block_status_above(bs, NULL, start, count,
-                                      &nr, &file);
-    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == count;
+    res = bdrv_block_status_above(bs, NULL, offset, bytes, &nr, NULL, NULL);
+    return res >= 0 && (res & BDRV_BLOCK_ZERO) && nr == bytes;
 }

 static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
@@ -3006,24 +3043,21 @@ static coroutine_fn int qcow2_co_pwrite_zeroes(BlockDriverState *bs,
    }

    if (head || tail) {
-        int64_t cl_start = (offset - head) >> BDRV_SECTOR_BITS;
        uint64_t off;
        unsigned int nr;

        assert(head + bytes <= s->cluster_size);

        /* check whether remainder of cluster already reads as zero */
-        if (!(is_zero_sectors(bs, cl_start,
-                              DIV_ROUND_UP(head, BDRV_SECTOR_SIZE)) &&
-              is_zero_sectors(bs, (offset + bytes) >> BDRV_SECTOR_BITS,
-                              DIV_ROUND_UP(-tail & (s->cluster_size - 1),
-                                           BDRV_SECTOR_SIZE)))) {
+        if (!(is_zero(bs, offset - head, head) &&
+              is_zero(bs, offset + bytes,
+                      tail ? s->cluster_size - tail : 0))) {
            return -ENOTSUP;
        }

        qemu_co_mutex_lock(&s->lock);
        /* We can have new write after previous check */
-        offset = cl_start << BDRV_SECTOR_BITS;
+        offset = QEMU_ALIGN_DOWN(offset, s->cluster_size);
        bytes = s->cluster_size;
        nr = s->cluster_size;
        ret = qcow2_get_cluster_offset(bs, offset, &nr, &off);
@@ -3104,19 +3138,68 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
    }

    old_length = bs->total_sectors * 512;
+    new_l1_size = size_to_l1(s, offset);

-    /* shrinking is currently not supported */
    if (offset < old_length) {
-        error_setg(errp, "qcow2 doesn't support shrinking images yet");
-        return -ENOTSUP;
+        int64_t last_cluster, old_file_size;
+        if (prealloc != PREALLOC_MODE_OFF) {
+            error_setg(errp,
+                       "Preallocation can't be used for shrinking an image");
+            return -EINVAL;
        }

-    new_l1_size = size_to_l1(s, offset);
+        ret = qcow2_cluster_discard(bs, ROUND_UP(offset, s->cluster_size),
+                                    old_length - ROUND_UP(offset,
+                                                          s->cluster_size),
+                                    QCOW2_DISCARD_ALWAYS, true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Failed to discard cropped clusters");
+            return ret;
+        }
+
+        ret = qcow2_shrink_l1_table(bs, new_l1_size);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret,
+                             "Failed to reduce the number of L2 tables");
+            return ret;
+        }
+
+        ret = qcow2_shrink_reftable(bs);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret,
+                             "Failed to discard unused refblocks");
+            return ret;
+        }
+
+        old_file_size = bdrv_getlength(bs->file->bs);
+        if (old_file_size < 0) {
+            error_setg_errno(errp, -old_file_size,
+                             "Failed to inquire current file length");
+            return old_file_size;
+        }
+        last_cluster = qcow2_get_last_cluster(bs, old_file_size);
+        if (last_cluster < 0) {
+            error_setg_errno(errp, -last_cluster,
+                             "Failed to find the last cluster");
+            return last_cluster;
+        }
+        if ((last_cluster + 1) * s->cluster_size < old_file_size) {
+            Error *local_err = NULL;
+
+            bdrv_truncate(bs->file, (last_cluster + 1) * s->cluster_size,
+                          PREALLOC_MODE_OFF, &local_err);
+            if (local_err) {
+                warn_reportf_err(local_err,
+                                 "Failed to truncate the tail of the image: ");
+            }
+        }
+    } else {
        ret = qcow2_grow_l1_table(bs, new_l1_size, true);
        if (ret < 0) {
            error_setg_errno(errp, -ret, "Failed to grow the L1 table");
            return ret;
        }
+    }

    switch (prealloc) {
    case PREALLOC_MODE_OFF:
@@ -3142,8 +3225,9 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
        if (old_file_size < 0) {
            error_setg_errno(errp, -old_file_size,
                             "Failed to inquire current file length");
-            return ret;
+            return old_file_size;
        }
+        old_file_size = ROUND_UP(old_file_size, s->cluster_size);

        nb_new_data_clusters = DIV_ROUND_UP(offset - old_length,
                                            s->cluster_size);
@@ -3171,7 +3255,7 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset,
        if (allocation_start < 0) {
            error_setg_errno(errp, -allocation_start,
                             "Failed to resize refcount structures");
-            return -allocation_start;
+            return allocation_start;
        }

        clusters_allocated = qcow2_alloc_clusters_at(bs, allocation_start,
@@ -3277,6 +3361,10 @@ qcow2_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
        return bdrv_truncate(bs->file, cluster_offset, PREALLOC_MODE_OFF, NULL);
    }

+    if (offset_into_cluster(s, offset)) {
+        return -EINVAL;
+    }
+
    buf = qemu_blockalign(bs, s->cluster_size);
    if (bytes != s->cluster_size) {
        if (bytes > s->cluster_size ||
@@ -3521,13 +3609,16 @@ static int qcow2_make_empty(BlockDriverState *bs)

    l1_clusters = DIV_ROUND_UP(s->l1_size, s->cluster_size / sizeof(uint64_t));

-    if (s->qcow_version >= 3 && !s->snapshots &&
-        3 + l1_clusters <= s->refcount_block_size) {
-        /* The following function only works for qcow2 v3 images (it requires
-         * the dirty flag) and only as long as there are no snapshots (because
-         * it completely empties the image). Furthermore, the L1 table and three
-         * additional clusters (image header, refcount table, one refcount
-         * block) have to fit inside one refcount block. */
+    if (s->qcow_version >= 3 && !s->snapshots && !s->nb_bitmaps &&
+        3 + l1_clusters <= s->refcount_block_size &&
+        s->crypt_method_header != QCOW_CRYPT_LUKS) {
+        /* The following function only works for qcow2 v3 images (it
+         * requires the dirty flag) and only as long as there are no
+         * features that reserve extra clusters (such as snapshots,
+         * LUKS header, or persistent bitmaps), because it completely
+         * empties the image.  Furthermore, the L1 table and three
+         * additional clusters (image header, refcount table, one
+         * refcount block) have to fit inside one refcount block. */
        return make_completely_empty(bs);
    }

@@ -3648,21 +3739,15 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
             */
            required = virtual_size;
        } else {
-            int cluster_sectors = cluster_size / BDRV_SECTOR_SIZE;
-            int64_t sector_num;
-            int pnum = 0;
+            int64_t offset;
+            int64_t pnum = 0;

-            for (sector_num = 0;
-                 sector_num < ssize / BDRV_SECTOR_SIZE;
-                 sector_num += pnum) {
-                int nb_sectors = MIN(ssize / BDRV_SECTOR_SIZE - sector_num,
-                                     BDRV_REQUEST_MAX_SECTORS);
-                BlockDriverState *file;
-                int64_t ret;
+            for (offset = 0; offset < ssize; offset += pnum) {
+                int ret;

-                ret = bdrv_get_block_status_above(in_bs, NULL,
-                                                  sector_num, nb_sectors,
-                                                  &pnum, &file);
+                ret = bdrv_block_status_above(in_bs, NULL, offset,
+                                              ssize - offset, &pnum, NULL,
+                                              NULL);
                if (ret < 0) {
                    error_setg_errno(&local_err, -ret,
                                     "Unable to get block status");
@@ -3674,12 +3759,10 @@ static BlockMeasureInfo *qcow2_measure(QemuOpts *opts, BlockDriverState *in_bs,
                } else if ((ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) ==
                           (BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED)) {
                    /* Extend pnum to end of cluster for next iteration */
-                    pnum = ROUND_UP(sector_num + pnum, cluster_sectors) -
-                           sector_num;
+                    pnum = ROUND_UP(offset + pnum, cluster_size) - offset;

                    /* Count clusters we've seen */
-                    required += (sector_num % cluster_sectors + pnum) *
-                                BDRV_SECTOR_SIZE;
+                    required += offset % cluster_size + pnum;
                }
            }
        }
@@ -3998,6 +4081,9 @@ static int qcow2_amend_options(BlockDriverState *bs, QemuOpts *opts,
                error_report("Changing the encryption format is not supported");
                return -ENOTSUP;
            }
+        } else if (g_str_has_prefix(desc->name, "encrypt.")) {
+            error_report("Changing the encryption parameters is not supported");
+            return -ENOTSUP;
        } else if (!strcmp(desc->name, BLOCK_OPT_CLUSTER_SIZE)) {
            cluster_size = qemu_opt_get_size(opts, BLOCK_OPT_CLUSTER_SIZE,
                                             cluster_size);
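The byte-based accounting in the qcow2_measure() hunk above can be checked in isolation. The sketch below mirrors the two statements that extend pnum to the end of the cluster and add to the running total. The helper name account_data_extent and the local SKETCH_ROUND_UP macro are assumptions made for this example, not QEMU identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Local equivalent of a round-up macro, valid for power-of-two alignments. */
#define SKETCH_ROUND_UP(n, d) (((n) + (d) - 1) & ~((uint64_t)(d) - 1))

/*
 * Hypothetical helper: given an allocated data extent
 * [offset, offset + *pnum), extend *pnum to the end of its last cluster
 * and return the number of bytes to add to the running total.
 */
static uint64_t account_data_extent(uint64_t offset, uint64_t *pnum,
                                    uint64_t cluster_size)
{
    /* The mask-based rounding only works for power-of-two cluster sizes. */
    assert(cluster_size && (cluster_size & (cluster_size - 1)) == 0);

    *pnum = SKETCH_ROUND_UP(offset + *pnum, cluster_size) - offset;
    return offset % cluster_size + *pnum;
}
```

For a 500-byte extent starting at byte 1000 of a 64 KiB cluster, pnum grows to 64536 and one full cluster (65536 bytes) is counted, the same result the sector-based code gave after multiplying by BDRV_SECTOR_SIZE.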
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -521,6 +521,12 @@ static inline uint64_t refcount_diff(uint64_t r1, uint64_t r2)
    return r1 > r2 ? r1 - r2 : r2 - r1;
 }

+static inline
+uint32_t offset_to_reftable_index(BDRVQcow2State *s, uint64_t offset)
+{
+    return offset >> (s->refcount_block_bits + s->cluster_bits);
+}
+
 /* qcow2.c functions */
 int qcow2_backing_read1(BlockDriverState *bs, QEMUIOVector *qiov,
                  int64_t sector_num, int nb_sectors);
@@ -584,10 +590,13 @@ int qcow2_inc_refcounts_imrt(BlockDriverState *bs, BdrvCheckResult *res,
 int qcow2_change_refcount_order(BlockDriverState *bs, int refcount_order,
                                BlockDriverAmendStatusCB *status_cb,
                                void *cb_opaque, Error **errp);
+int qcow2_shrink_reftable(BlockDriverState *bs);
+int64_t qcow2_get_last_cluster(BlockDriverState *bs, int64_t size);

 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, uint64_t min_size,
                        bool exact_size);
+int qcow2_shrink_l1_table(BlockDriverState *bs, uint64_t max_size);
 int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index);
 int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset);
 int qcow2_encrypt_sectors(BDRVQcow2State *s, int64_t sector_num,
@@ -649,6 +658,9 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
    void **table);
 void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
+void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
+                                  uint64_t offset);
+void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table);

 /* qcow2-bitmap.c functions */
 int qcow2_check_bitmaps_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
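To make the new offset_to_reftable_index() helper concrete, here is a small sketch of how a host offset splits into a refcount table index and an entry index inside the covering refcount block (the latter as computed in qcow2_discard_refcount_block()). The Geometry struct and both function names are hypothetical. The values 16 and 15 used in the note below are assumed to correspond to 64 KiB clusters with 16-bit refcounts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two BDRVQcow2State fields used here. */
typedef struct {
    int cluster_bits;         /* log2 of the cluster size in bytes */
    int refcount_block_bits;  /* log2 of refcount entries per refblock */
} Geometry;

/* Which refcount table entry covers the cluster at this host offset? */
static uint32_t reftable_index(const Geometry *g, uint64_t offset)
{
    assert(g->cluster_bits + g->refcount_block_bits < 64);
    return offset >> (g->refcount_block_bits + g->cluster_bits);
}

/* Which entry inside that refcount block describes the cluster? */
static uint32_t refblock_entry(const Geometry *g, uint64_t offset)
{
    uint64_t cluster_index = offset >> g->cluster_bits;
    uint64_t refcount_block_size = UINT64_C(1) << g->refcount_block_bits;

    return cluster_index & (refcount_block_size - 1);
}
```

With cluster_bits = 16 and refcount_block_bits = 15, one refcount block covers 2 GiB of the image file, so an offset of 2 GiB plus three clusters maps to table index 1, entry 3.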
--- a/block/qed.c
+++ b/block/qed.c
@@ -265,7 +265,7 @@ static bool qed_plug_allocating_write_reqs(BDRVQEDState *s)
    assert(!s->allocating_write_reqs_plugged);
    if (s->allocating_acb != NULL) {
        /* Another allocating write came concurrently.  This cannot happen
-         * from bdrv_qed_co_drain, but it can happen when the timer runs.
+         * from bdrv_qed_co_drain_begin, but it can happen when the timer runs.
         */
        qemu_co_mutex_unlock(&s->table_lock);
        return false;
@@ -358,7 +358,7 @@ static void bdrv_qed_attach_aio_context(BlockDriverState *bs,
    }
 }

-static void coroutine_fn bdrv_qed_co_drain(BlockDriverState *bs)
+static void coroutine_fn bdrv_qed_co_drain_begin(BlockDriverState *bs)
 {
    BDRVQEDState *s = bs->opaque;

@@ -1608,7 +1608,7 @@ static BlockDriver bdrv_qed = {
    .bdrv_check               = bdrv_qed_check,
    .bdrv_detach_aio_context  = bdrv_qed_detach_aio_context,
    .bdrv_attach_aio_context  = bdrv_qed_attach_aio_context,
-    .bdrv_co_drain            = bdrv_qed_co_drain,
+    .bdrv_co_drain_begin      = bdrv_qed_co_drain_begin,
 };

 static void bdrv_qed_init(void)
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -665,12 +665,18 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
    /* If we are using an rbd snapshot, we must be r/o, otherwise
     * leave as-is */
    if (s->snap != NULL) {
+        if (!bdrv_is_read_only(bs)) {
+            error_report("Opening rbd snapshots without an explicit "
+                         "read-only=on option is deprecated. Future versions "
+                         "will refuse to open the image instead of "
+                         "automatically marking the image read-only.");
            r = bdrv_set_read_only(bs, true, &local_err);
            if (r < 0) {
                error_propagate(errp, local_err);
                goto failed_open;
            }
        }
+    }

    qemu_opts_del(opts);
    return 0;
--- a/block/replication.c
+++ b/block/replication.c
@@ -157,13 +157,17 @@ static void replication_close(BlockDriverState *bs)

 static void replication_child_perm(BlockDriverState *bs, BdrvChild *c,
                                   const BdrvChildRole *role,
+                                   BlockReopenQueue *reopen_queue,
                                   uint64_t perm, uint64_t shared,
                                   uint64_t *nperm, uint64_t *nshared)
 {
-    *nperm = *nshared = BLK_PERM_CONSISTENT_READ \
+    *nperm = BLK_PERM_CONSISTENT_READ;
+    if ((bs->open_flags & (BDRV_O_INACTIVE | BDRV_O_RDWR)) == BDRV_O_RDWR) {
+        *nperm |= BLK_PERM_WRITE;
+    }
+    *nshared = BLK_PERM_CONSISTENT_READ \
               | BLK_PERM_WRITE \
               | BLK_PERM_WRITE_UNCHANGED;
-
    return;
 }

@@ -338,12 +342,24 @@ static void secondary_do_checkpoint(BDRVReplicationState *s, Error **errp)
        return;
    }

+    if (!s->active_disk->bs->drv) {
+        error_setg(errp, "Active disk %s is ejected",
+                   s->active_disk->bs->node_name);
+        return;
+    }
+
    ret = s->active_disk->bs->drv->bdrv_make_empty(s->active_disk->bs);
    if (ret < 0) {
        error_setg(errp, "Cannot make active disk empty");
        return;
    }

+    if (!s->hidden_disk->bs->drv) {
+        error_setg(errp, "Hidden disk %s is ejected",
+                   s->hidden_disk->bs->node_name);
+        return;
+    }
+
    ret = s->hidden_disk->bs->drv->bdrv_make_empty(s->hidden_disk->bs);
    if (ret < 0) {
        error_setg(errp, "Cannot make hidden disk empty");
@@ -507,6 +523,9 @@ static void replication_start(ReplicationState *rs, ReplicationMode mode,
            return;
        }

+        /* Must be true, or the bdrv_getlength() calls would have failed */
+        assert(s->active_disk->bs->drv && s->hidden_disk->bs->drv);
+
        if (!s->active_disk->bs->drv->bdrv_make_empty ||
            !s->hidden_disk->bs->drv->bdrv_make_empty) {
            error_setg(errp,
--- a/block/snapshot.c
+++ b/block/snapshot.c
@@ -181,10 +181,24 @@ int bdrv_snapshot_goto(BlockDriverState *bs,
 {
    BlockDriver *drv = bs->drv;
    int ret, open_ret;
+    int64_t len;

    if (!drv) {
        return -ENOMEDIUM;
    }
+
+    len = bdrv_getlength(bs);
+    if (len < 0) {
+        return len;
+    }
+    /* We should set all bits in all enabled dirty bitmaps, because dirty
+     * bitmaps reflect active state of disk and snapshot switch operation
+     * actually dirties active state.
+     * TODO: It may make sense not to set all bits but analyze block status of
+     * current state and destination snapshot and do not set bits corresponding
+     * to both-zero or both-unallocated areas. */
+    bdrv_set_dirty(bs, 0, len);
+
    if (drv->bdrv_snapshot_goto) {
        return drv->bdrv_snapshot_goto(bs, snapshot_id);
    }
@@ -403,6 +417,7 @@ bool bdrv_all_can_snapshot(BlockDriverState **first_bad_bs)
        }
        aio_context_release(ctx);
        if (!ok) {
+            bdrv_next_cleanup(&it);
            goto fail;
        }
    }
@@ -430,6 +445,7 @@ int bdrv_all_delete_snapshot(const char *name, BlockDriverState **first_bad_bs,
        }
        aio_context_release(ctx);
        if (ret < 0) {
+            bdrv_next_cleanup(&it);
            goto fail;
        }
    }
@@ -455,6 +471,7 @@ int bdrv_all_goto_snapshot(const char *name, BlockDriverState **first_bad_bs)
        }
        aio_context_release(ctx);
        if (err < 0) {
+            bdrv_next_cleanup(&it);
            goto fail;
        }
    }
@@ -480,6 +497,7 @@ int bdrv_all_find_snapshot(const char *name, BlockDriverState **first_bad_bs)
        }
        aio_context_release(ctx);
        if (err < 0) {
+            bdrv_next_cleanup(&it);
            goto fail;
        }
    }
@@ -511,6 +529,7 @@ int bdrv_all_create_snapshot(QEMUSnapshotInfo *sn,
        }
        aio_context_release(ctx);
        if (err < 0) {
+            bdrv_next_cleanup(&it);
            goto fail;
        }
    }
@@ -534,6 +553,7 @@ BlockDriverState *bdrv_all_find_vmstate_bs(void)
        aio_context_release(ctx);

        if (found) {
+            bdrv_next_cleanup(&it);
            break;
        }
    }
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -403,17 +403,19 @@ static void coroutine_fn throttle_group_restart_queue_entry(void *opaque)
        schedule_next_request(tgm, is_write);
        qemu_mutex_unlock(&tg->lock);
    }
+
+    g_free(data);
 }

 static void throttle_group_restart_queue(ThrottleGroupMember *tgm, bool is_write)
 {
    Coroutine *co;
-    RestartData rd = {
-        .tgm = tgm,
-        .is_write = is_write
-    };
+    RestartData *rd = g_new0(RestartData, 1);

-    co = qemu_coroutine_create(throttle_group_restart_queue_entry, &rd);
+    rd->tgm = tgm;
+    rd->is_write = is_write;
+
+    co = qemu_coroutine_create(throttle_group_restart_queue_entry, rd);
    aio_co_enter(tgm->aio_context, co);
 }

@@ -591,7 +593,25 @@ void throttle_group_attach_aio_context(ThrottleGroupMember *tgm,

 void throttle_group_detach_aio_context(ThrottleGroupMember *tgm)
 {
+    ThrottleGroup *tg = container_of(tgm->throttle_state, ThrottleGroup, ts);
    ThrottleTimers *tt = &tgm->throttle_timers;
+    int i;
+
+    /* Requests must have been drained */
+    assert(tgm->pending_reqs[0] == 0 && tgm->pending_reqs[1] == 0);
+    assert(qemu_co_queue_empty(&tgm->throttled_reqs[0]));
+    assert(qemu_co_queue_empty(&tgm->throttled_reqs[1]));
+
+    /* Kick off next ThrottleGroupMember, if necessary */
+    qemu_mutex_lock(&tg->lock);
+    for (i = 0; i < 2; i++) {
+        if (timer_pending(tt->timers[i])) {
+            tg->any_timer_armed[i] = false;
+            schedule_next_request(tgm, i);
+        }
+    }
+    qemu_mutex_unlock(&tg->lock);
+
    throttle_timers_detach_aio_context(tt);
    tgm->aio_context = NULL;
 }
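Note on the RestartData change above: the argument passed to the coroutine entry point moves from a stack variable to a g_new0() allocation that the entry function frees. Presumably the entry function can run after throttle_group_restart_queue() has returned, in which case a pointer to a local would dangle. The sketch below shows the same ownership hand-off with a plain deferred callback; it is an assumption-laden illustration, and RestartArgs, Deferred, schedule_restart and run_deferred are hypothetical names, not QEMU APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical argument bundle, analogous in spirit to RestartData. */
typedef struct RestartArgs {
    int *counter;
    bool is_write;
} RestartArgs;

/* Hypothetical deferred call: a function plus the opaque pointer it owns. */
typedef struct Deferred {
    void (*fn)(void *opaque);
    void *opaque;
} Deferred;

/* Entry point: consumes the arguments and frees them, as the callee owns them. */
static void restart_entry(void *opaque)
{
    RestartArgs *args = opaque;

    *args->counter += args->is_write ? 2 : 1;
    free(args);
}

/*
 * Package the arguments on the heap so they outlive this stack frame.
 * A local RestartArgs here would be invalid by the time the call runs.
 */
static Deferred schedule_restart(int *counter, bool is_write)
{
    RestartArgs *args = calloc(1, sizeof(*args));
    Deferred d;

    if (!args) {
        abort();
    }
    args->counter = counter;
    args->is_write = is_write;

    d.fn = restart_entry;
    d.opaque = args;
    return d;
}

/* Run the deferred call some time later, after the scheduler has returned. */
static void run_deferred(Deferred d)
{
    d.fn(d.opaque);
}
```

The design choice mirrored here is that ownership of the allocation transfers to the callee, so exactly one party frees it regardless of when the call runs.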
--- a/block/throttle.c
+++ b/block/throttle.c
@@ -197,6 +197,21 @@ static bool throttle_recurse_is_first_non_filter(BlockDriverState *bs,
    return bdrv_recurse_is_first_non_filter(bs->file->bs, candidate);
 }

+static void coroutine_fn throttle_co_drain_begin(BlockDriverState *bs)
+{
+    ThrottleGroupMember *tgm = bs->opaque;
+    if (atomic_fetch_inc(&tgm->io_limits_disabled) == 0) {
+        throttle_group_restart_tgm(tgm);
+    }
+}
+
+static void coroutine_fn throttle_co_drain_end(BlockDriverState *bs)
+{
+    ThrottleGroupMember *tgm = bs->opaque;
+    assert(tgm->io_limits_disabled);
+    atomic_dec(&tgm->io_limits_disabled);
+}
+
 static BlockDriver bdrv_throttle = {
    .format_name                        =   "throttle",
    .protocol_name                      =   "throttle",
@@ -226,6 +241,9 @@ static BlockDriver bdrv_throttle = {
    .bdrv_reopen_abort                  =   throttle_reopen_abort,
    .bdrv_co_get_block_status           =   bdrv_co_get_block_status_from_file,

+    .bdrv_co_drain_begin                =   throttle_co_drain_begin,
+    .bdrv_co_drain_end                  =   throttle_co_drain_end,
+
    .is_filter                          =   true,
 };

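Note on throttle_co_drain_begin/throttle_co_drain_end above: the begin hook increments io_limits_disabled and restarts the queued requests only when the previous value was zero, and the end hook decrements it. That reads as a nesting counter where only the outermost drain triggers the restart. The sketch below reproduces that counting logic with C11 atomics; Limiter and its functions are hypothetical names, and the restarts field merely stands in for the real restart call.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical limiter with a nesting counter for "limits disabled". */
typedef struct Limiter {
    atomic_uint limits_disabled;
    unsigned restarts;   /* stand-in for "queued requests were restarted" */
} Limiter;

static void limiter_init(Limiter *l)
{
    atomic_init(&l->limits_disabled, 0);
    l->restarts = 0;
}

/* Only the outermost drain (previous count 0) restarts queued work. */
static void limiter_drain_begin(Limiter *l)
{
    if (atomic_fetch_add(&l->limits_disabled, 1) == 0) {
        l->restarts++;
    }
}

/* Every drain_end must pair with an earlier drain_begin. */
static void limiter_drain_end(Limiter *l)
{
    assert(atomic_load(&l->limits_disabled) > 0);
    atomic_fetch_sub(&l->limits_disabled, 1);
}
```

With nested begin/begin/end/end calls, the restart happens once; a later, separate drain triggers it again.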
--- a/block/trace-events
+++ b/block/trace-events
@@ -12,7 +12,7 @@ blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flag
 bdrv_co_preadv(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) "bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwritev(void *bs, int64_t offset, int64_t nbytes, unsigned int flags) "bs %p offset %"PRId64" nbytes %"PRId64" flags 0x%x"
 bdrv_co_pwrite_zeroes(void *bs, int64_t offset, int count, int flags) "bs %p offset %"PRId64" count %d flags 0x%x"
-bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, unsigned int cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %u"
+bdrv_co_do_copy_on_readv(void *bs, int64_t offset, unsigned int bytes, int64_t cluster_offset, int64_t cluster_bytes) "bs %p offset %"PRId64" bytes %u cluster_offset %"PRId64" cluster_bytes %"PRId64

 # block/stream.c
 stream_one_iteration(void *s, int64_t offset, uint64_t bytes, int is_allocated) "s %p offset %" PRId64 " bytes %" PRIu64 " is_allocated %d"
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1008,13 +1008,6 @@ static int vhdx_open(BlockDriverState *bs, QDict *options, int flags,
        goto fail;
    }

-    if (flags & BDRV_O_RDWR) {
-        ret = vhdx_update_headers(bs, s, false, NULL);
-        if (ret < 0) {
-            goto fail;
-        }
-    }
-
    /* TODO: differencing files */

    return 0;
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -57,15 +57,6 @@

 static void checkpoint(void);

-#ifdef __MINGW32__
-void nonono(const char* file, int line, const char* msg) {
-    fprintf(stderr, "Nonono! %s:%d %s\n", file, line, msg);
-    exit(-5);
-}
-#undef assert
-#define assert(a) do {if (!(a)) nonono(__FILE__, __LINE__, #a);}while(0)
-#endif
-
 #else

 #define DLOG(a)
@@ -1268,7 +1259,11 @@ static int vvfat_open(BlockDriverState *bs, QDict *options, int flags,
                       "Unable to set VVFAT to 'rw' when drive is read-only");
            goto fail;
        }
-    } else  {
+    } else  if (!bdrv_is_read_only(bs)) {
+        error_report("Opening non-rw vvfat images without an explicit "
+                     "read-only=on option is deprecated. Future versions "
+                     "will refuse to open the image instead of "
+                     "automatically marking the image read-only.");
        /* read only is the default for safety */
        ret = bdrv_set_read_only(bs, true, &local_err);
        if (ret < 0) {
@@ -2952,7 +2947,7 @@ static int do_commit(BDRVVVFATState* s)
        return ret;
    }

-    if (s->qcow->bs->drv->bdrv_make_empty) {
+    if (s->qcow->bs->drv && s->qcow->bs->drv->bdrv_make_empty) {
        s->qcow->bs->drv->bdrv_make_empty(s->qcow->bs);
    }

@@ -3211,6 +3206,7 @@ err:

 static void vvfat_child_perm(BlockDriverState *bs, BdrvChild *c,
                             const BdrvChildRole *role,
+                             BlockReopenQueue *reopen_queue,
                             uint64_t perm, uint64_t shared,
                             uint64_t *nperm, uint64_t *nshared)
 {
@@ -3270,24 +3266,11 @@ static void bdrv_vvfat_init(void)
 block_init(bdrv_vvfat_init);

 #ifdef DEBUG
-static void checkpoint(void) {
+static void checkpoint(void)
+{
    assert(((mapping_t*)array_get(&(vvv->mapping), 0))->end == 2);
    check1(vvv);
    check2(vvv);
    assert(!vvv->current_mapping || vvv->current_fd || (vvv->current_mapping->mode & MODE_DIRECTORY));
-#if 0
-    if (((direntry_t*)vvv->directory.pointer)[1].attributes != 0xf)
-        fprintf(stderr, "Nonono!\n");
-    mapping_t* mapping;
-    direntry_t* direntry;
-    assert(vvv->mapping.size >= vvv->mapping.item_size * vvv->mapping.next);
-    assert(vvv->directory.size >= vvv->directory.item_size * vvv->directory.next);
-    if (vvv->mapping.next<47)
-        return;
-    assert((mapping = array_get(&(vvv->mapping), 47)));
-    assert(mapping->dir_index < vvv->directory.next);
-    direntry = array_get(&(vvv->directory), mapping->dir_index);
-    assert(!memcmp(direntry->name, "USB     H  ", 11) || direntry->name[0]==0);
-#endif
 }
 #endif
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -977,7 +977,8 @@ int main(int argc, char **argv)
    /* Now that we've loaded the binary, GUEST_BASE is fixed.  Delay
       generating the prologue until now so that the prologue can take
       the real value of GUEST_BASE into account.  */
-    tcg_prologue_init(&tcg_ctx);
+    tcg_prologue_init(tcg_ctx);
+    tcg_region_init();

    /* build Task State */
    memset(ts, 0, sizeof(TaskState));
--- a/1
+++ b/1
--- a/chardev/Makefile.objs
+++ b/chardev/Makefile.objs
@@ -20,5 +20,6 @@ chardev-obj-$(CONFIG_WIN32) += char-win-stdio.o
 common-obj-y += msmouse.o wctablet.o testdev.o
 common-obj-$(CONFIG_BRLAPI) += baum.o
 baum.o-cflags := $(SDL_CFLAGS)
+baum.o-libs := $(BRLAPI_LIBS)

 common-obj-$(CONFIG_SPICE) += spice.o
--- a/chardev/baum.c
+++ b/chardev/baum.c
@@ -643,6 +643,7 @@ static void baum_chr_open(Chardev *chr,
        error_setg(errp, "brlapi__openConnection: %s",
                   brlapi_strerror(brlapi_error_location()));
        g_free(handle);
+        baum->brlapi = NULL;
        return;
    }
    baum->deferred_init = 0;
--- a/chardev/char-fd.c
+++ b/chardev/char-fd.c
@@ -84,8 +84,7 @@ static GSource *fd_chr_add_watch(Chardev *chr, GIOCondition cond)
    return qio_channel_create_watch(s->ioc_out, cond);
 }

-static void fd_chr_update_read_handler(Chardev *chr,
-                                       GMainContext *context)
+static void fd_chr_update_read_handler(Chardev *chr)
 {
    FDChardev *s = FD_CHARDEV(chr);

@@ -94,7 +93,7 @@ static void fd_chr_update_read_handler(Chardev *chr,
        chr->gsource = io_add_watch_poll(chr, s->ioc_in,
                                           fd_chr_read_poll,
                                           fd_chr_read, chr,
-                                           context);
+                                           chr->gcontext);
    }
 }

--- a/chardev/char-fe.c
+++ b/chardev/char-fe.c
@@ -253,7 +253,6 @@ void qemu_chr_fe_set_handlers(CharBackend *b,
                              bool set_open)
 {
    Chardev *s;
-    ChardevClass *cc;
    int fe_open;

    s = b->chr;
@@ -261,7 +260,6 @@ void qemu_chr_fe_set_handlers(CharBackend *b,
        return;
    }

-    cc = CHARDEV_GET_CLASS(s);
    if (!opaque && !fd_can_read && !fd_read && !fd_event) {
        fe_open = 0;
        remove_fd_in_watch(s);
@@ -273,9 +271,8 @@ void qemu_chr_fe_set_handlers(CharBackend *b,
    b->chr_event = fd_event;
    b->chr_be_change = be_change;
    b->opaque = opaque;
-    if (cc->chr_update_read_handler) {
-        cc->chr_update_read_handler(s, context);
-    }
+
+    qemu_chr_be_update_read_handlers(s, context);

    if (set_open) {
        qemu_chr_fe_set_open(b, fe_open);
--- a/chardev/char-pty.c
+++ b/chardev/char-pty.c
@@ -112,8 +112,7 @@ static void pty_chr_update_read_handler_locked(Chardev *chr)
    }
 }

-static void pty_chr_update_read_handler(Chardev *chr,
-                                        GMainContext *context)
+static void pty_chr_update_read_handler(Chardev *chr)
 {
    qemu_mutex_lock(&chr->chr_write_lock);
    pty_chr_update_read_handler_locked(chr);
@@ -219,7 +218,7 @@ static void pty_chr_state(Chardev *chr, int connected)
            chr->gsource = io_add_watch_poll(chr, s->ioc,
                                               pty_chr_read_poll,
                                               pty_chr_read,
-                                               chr, NULL);
+                                               chr, chr->gcontext);
        }
    }
 }
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -332,10 +332,6 @@ static void tcp_chr_free_connection(Chardev *chr)
    SocketChardev *s = SOCKET_CHARDEV(chr);
    int i;

-    if (!s->connected) {
-        return;
-    }
-
    if (s->read_msgfds_num) {
        for (i = 0; i < s->read_msgfds_num; i++) {
            close(s->read_msgfds[i]);
@@ -394,22 +390,25 @@ static void update_disconnected_filename(SocketChardev *s)
                                         s->is_listen, s->is_telnet);
 }

+/* NB may be called even if tcp_chr_connect has not been
+ * reached, due to TLS or telnet initialization failure,
+ * so can *not* assume s->connected == true
+ */
 static void tcp_chr_disconnect(Chardev *chr)
 {
    SocketChardev *s = SOCKET_CHARDEV(chr);
-
-    if (!s->connected) {
-        return;
-    }
+    bool emit_close = s->connected;

    tcp_chr_free_connection(chr);

-    if (s->listen_ioc) {
+    if (s->listen_ioc && s->listen_tag == 0) {
        s->listen_tag = qio_channel_add_watch(
            QIO_CHANNEL(s->listen_ioc), G_IO_IN, tcp_chr_accept, chr, NULL);
    }
    update_disconnected_filename(s);
+    if (emit_close) {
        qemu_chr_be_event(chr, CHR_EVENT_CLOSED);
+    }
    if (s->reconnect_time) {
        qemu_chr_socket_restart_timer(chr);
    }
@@ -516,13 +515,12 @@ static void tcp_chr_connect(void *opaque)
        chr->gsource = io_add_watch_poll(chr, s->ioc,
                                           tcp_chr_read_poll,
                                           tcp_chr_read,
-                                           chr, NULL);
+                                           chr, chr->gcontext);
    }
    qemu_chr_be_event(chr, CHR_EVENT_OPENED);
 }

-static void tcp_chr_update_read_handler(Chardev *chr,
-                                        GMainContext *context)
+static void tcp_chr_update_read_handler(Chardev *chr)
 {
    SocketChardev *s = SOCKET_CHARDEV(chr);

@@ -535,7 +533,7 @@ static void tcp_chr_update_read_handler(Chardev *chr,
        chr->gsource = io_add_watch_poll(chr, s->ioc,
                                           tcp_chr_read_poll,
                                           tcp_chr_read, chr,
-                                           context);
+                                           chr->gcontext);
    }
 }

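Note on tcp_chr_disconnect above: the early return on !s->connected is gone, the previous connection state is captured in emit_close, the listen watch is re-added only when listen_tag is 0, and CHR_EVENT_CLOSED is sent only if the device had actually been connected. The sketch below models just that control flow so the effect of repeated calls is visible; Sock and sock_disconnect are hypothetical, and the counters are stand-ins for the real watch registration and event delivery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical socket state; the counters replace real side effects. */
typedef struct Sock {
    bool connected;
    bool listening;      /* stand-in for "listen_ioc is set" */
    int listen_tag;      /* 0 means no listen watch installed */
    int next_tag;
    int watches_added;
    int close_events;
} Sock;

/* May run even if the connection never completed, so it must not assume it did. */
static void sock_disconnect(Sock *s)
{
    bool emit_close = s->connected;

    /* Stand-in for freeing the connection; always safe to repeat. */
    s->connected = false;

    /* Re-arm the listener only if no watch is installed already. */
    if (s->listening && s->listen_tag == 0) {
        s->listen_tag = ++s->next_tag;
        s->watches_added++;
    }

    /* Report "closed" only to a front end that previously saw "opened". */
    if (emit_close) {
        s->close_events++;
    }
}
```

Calling sock_disconnect() twice in this model installs one watch and emits one close event, which is the property the guards appear intended to provide.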
--- a/chardev/char-udp.c
+++ b/chardev/char-udp.c
@@ -100,8 +100,7 @@ static gboolean udp_chr_read(QIOChannel *chan, GIOCondition cond, void *opaque)
    return TRUE;
 }

-static void udp_chr_update_read_handler(Chardev *chr,
-                                        GMainContext *context)
+static void udp_chr_update_read_handler(Chardev *chr)
 {
    UdpChardev *s = UDP_CHARDEV(chr);

@@ -110,7 +109,7 @@ static void udp_chr_update_read_handler(Chardev *chr,
        chr->gsource = io_add_watch_poll(chr, s->ioc,
                                           udp_chr_read_poll,
                                           udp_chr_read, chr,
-                                           context);
+                                           chr->gcontext);
    }
 }

--- a/chardev/char.c
+++ b/chardev/char.c
@@ -180,6 +180,17 @@ void qemu_chr_be_write(Chardev *s, uint8_t *buf, int len)
    }
 }

+void qemu_chr_be_update_read_handlers(Chardev *s,
+                                      GMainContext *context)
+{
+    ChardevClass *cc = CHARDEV_GET_CLASS(s);
+
+    s->gcontext = context;
+    if (cc->chr_update_read_handler) {
+        cc->chr_update_read_handler(s);
+    }
+}
+
 int qemu_chr_add_client(Chardev *s, int fd)
 {
    return CHARDEV_GET_CLASS(s)->chr_add_client ?
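Note on the chardev hunks above: the per-backend chr_update_read_handler callbacks lose their GMainContext parameter, and the new qemu_chr_be_update_read_handlers() stores the context in the Chardev (s->gcontext) before dispatching. Backends then read chr->gcontext whenever they install a watch, including on paths that previously passed NULL. The sketch below shows that "store once, read everywhere" shape in isolation; Dev, DevClass and the functions are hypothetical stand-ins, and void * replaces the real context type.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Dev Dev;

/* Hypothetical class vtable; the callback no longer receives the context. */
typedef struct DevClass {
    void (*update_read_handler)(Dev *d);
} DevClass;

struct Dev {
    const DevClass *cls;
    void *context;         /* stored once by the common helper */
    void *watch_context;   /* context the backend attached its watch to */
};

/* Backend callback: reads the context from the object, not from a parameter. */
static void sock_update_read_handler(Dev *d)
{
    d->watch_context = d->context;
}

static const DevClass sock_class = { sock_update_read_handler };
static const DevClass null_class = { NULL };   /* backend without the hook */

/* Common helper: record the context first, then dispatch if the hook exists. */
static void dev_update_read_handlers(Dev *d, void *context)
{
    d->context = context;
    if (d->cls->update_read_handler) {
        d->cls->update_read_handler(d);
    }
}
```

Because the context lives on the object, any later code path in the backend that creates a watch can use the same value without needing it threaded through as an argument.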
--- a/configure
+++ b/configure
@@ -265,6 +265,16 @@ libs_qga=""
 debug_info="yes"
 stack_protector=""

+if test -e "$source_path/.git"
+then
+    git_update=yes
+    git_submodules="ui/keycodemapdb"
+else
+    git_update=no
+    git_submodules=""
+fi
+git="git"
+
 # Don't accept a target_list environment variable.
 unset target_list

@@ -290,12 +300,14 @@ netmap="no"
 sdl=""
 sdlabi=""
 virtfs=""
+mpath=""
 vnc="yes"
 sparse="no"
 vde=""
 vnc_sasl=""
 vnc_jpeg=""
 vnc_png=""
+xkbcommon=""
 xen=""
 xen_ctrl_version=""
 xen_pv_domain_build="no"
@@ -331,6 +343,7 @@ modules="no"
 prefix="/usr/local"
 mandir="\${prefix}/share/man"
 datadir="\${prefix}/share"
+firmwarepath="\${prefix}/share/qemu-firmware"
 qemu_docdir="\${prefix}/share/doc/qemu"
 bindir="\${prefix}/bin"
 libdir="\${prefix}/lib"
@@ -365,6 +378,7 @@ opengl_dmabuf="no"
 cpuid_h="no"
 avx2_opt="no"
 zlib="yes"
+capstone=""
 lzo=""
 snappy=""
 bzip2=""
@@ -468,6 +482,7 @@ ccas="${CCAS-$cc}"
 cpp="${CPP-$cc -E}"
 objcopy="${OBJCOPY-${cross_prefix}objcopy}"
 ld="${LD-${cross_prefix}ld}"
+ranlib="${RANLIB-${cross_prefix}ranlib}"
 nm="${NM-${cross_prefix}nm}"
 strip="${STRIP-${cross_prefix}strip}"
 windres="${WINDRES-${cross_prefix}windres}"
@@ -745,7 +760,6 @@ SunOS)
  solaris="yes"
  make="${MAKE-gmake}"
  install="${INSTALL-ginstall}"
-  ld="gld"
  smbd="${SMBD-/usr/sfw/sbin/smbd}"
  if test -f /usr/include/sys/soundcard.h ; then
    audio_drv_list="oss"
@@ -818,7 +832,7 @@ if test "$mingw32" = "yes" ; then
  sysconfdir="\${prefix}"
  local_statedir=
  confsuffix=""
-  libs_qga="-lws2_32 -lwinmm -lpowrprof -lwtsapi32 -liphlpapi -lnetapi32 $libs_qga"
+  libs_qga="-lws2_32 -lwinmm -lpowrprof -lwtsapi32 -lwininet -liphlpapi -lnetapi32 $libs_qga"
 fi

 werror=""
@@ -914,6 +928,8 @@ for opt do
  ;;
  --localstatedir=*) local_statedir="$optarg"
  ;;
+  --firmwarepath=*) firmwarepath="$optarg"
+  ;;
  --sbindir=*|--sharedstatedir=*|\
  --oldincludedir=*|--datarootdir=*|--infodir=*|--localedir=*|\
  --htmldir=*|--dvidir=*|--pdfdir=*|--psdir=*)
@@ -936,6 +952,10 @@ for opt do
  ;;
  --enable-virtfs) virtfs="yes"
  ;;
+  --disable-mpath) mpath="no"
+  ;;
+  --enable-mpath) mpath="yes"
+  ;;
  --disable-vnc) vnc="no"
  ;;
  --enable-vnc) vnc="yes"
@@ -1279,6 +1299,20 @@ for opt do
          error_exit "vhost-user isn't available on win32"
      fi
  ;;
+  --disable-capstone) capstone="no"
+  ;;
+  --enable-capstone) capstone="yes"
+  ;;
+  --enable-capstone=git) capstone="git"
+  ;;
+  --enable-capstone=system) capstone="system"
+  ;;
+  --with-git=*) git="$optarg"
+  ;;
+  --enable-git-update) git_update=yes
+  ;;
+  --disable-git-update) git_update=no
+  ;;
  *)
      echo "ERROR: unknown option $opt"
      echo "Try '$0 --help' for more information"
@@ -1411,6 +1445,7 @@ Advanced options (experts only):
  --libdir=PATH            install libraries in PATH
  --sysconfdir=PATH        install config in PATH$confsuffix
  --localstatedir=PATH     install local state in PATH (set at runtime on win32)
+  --firmwarepath=PATH      search PATH for firmware files
  --with-confsuffix=SUFFIX suffix for QEMU data inside datadir/libdir/sysconfdir [$confsuffix]
  --enable-debug           enable common debug build options
  --disable-strip          disable stripping binaries
@@ -1479,6 +1514,7 @@ disabled with --disable-FEATURE, default is enabled if available:
  vnc-png         PNG compression for VNC server
  cocoa           Cocoa UI (Mac OS X only)
  virtfs          VirtFS
+  mpath           Multipath persistent reservation passthrough
  xen             xen backend driver support
  xen-pci-passthrough
  brlapi          BrlAPI (Braile)
@@ -1524,6 +1560,7 @@ disabled with --disable-FEATURE, default is enabled if available:
  vxhs            Veritas HyperScale vDisk backend support
  crypto-afalg    Linux AF_ALG crypto backend driver
  vhost-user      vhost-user support
+  capstone        capstone disassembler support

 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -1642,6 +1679,19 @@ EOF
  fi
 fi

+# Disable -Wmissing-braces on older compilers that warn even for
+# the "universal" C zero initializer {0}.
+cat > $TMPC << EOF
+struct {
+  int a[2];
+} x = {0};
+EOF
+if compile_object "-Werror" "" ; then
+  :
+else
+  QEMU_CFLAGS="$QEMU_CFLAGS -Wno-missing-braces"
+fi
+
 # Workaround for http://gcc.gnu.org/PR55489.  Happens with -fPIE/-fPIC and
 # large functions that use global variables.  The bug is in all releases of
 # GCC, but it became particularly acute in 4.6.x and 4.7.x.  It is fixed in
@@ -2788,7 +2838,6 @@ EOF
    sdl_cflags="$sdl_cflags $x11_cflags"
    sdl_libs="$sdl_libs $x11_libs"
  fi
-  libs_softmmu="$sdl_libs $libs_softmmu"
 fi

 ##########################################
@@ -2801,7 +2850,6 @@ EOF
  rdma_libs="-lrdmacm -libverbs"
  if compile_prog "" "$rdma_libs" ; then
    rdma="yes"
-    libs_softmmu="$libs_softmmu $rdma_libs"
  else
    if test "$rdma" = "yes" ; then
        error_exit \
@@ -2893,6 +2941,21 @@ EOF
  fi
 fi

+##########################################
+# xkbcommon probe
+if test "$xkbcommon" != "no" ; then
+  if $pkg_config xkbcommon --exists; then
+    xkbcommon_cflags=$($pkg_config xkbcommon --cflags)
+    xkbcommon_libs=$($pkg_config xkbcommon --libs)
+    xkbcommon=yes
+  else
+    if test "$xkbcommon" = "yes" ; then
+      feature_not_found "xkbcommon" "Install libxkbcommon-devel"
+    fi
+    xkbcommon=no
+  fi
+fi
+
 ##########################################
 # fnmatch() probe, used for ACL routines
 fnmatch="no"
@@ -2946,8 +3009,6 @@ int main(void)
 EOF
  if compile_prog "" "$vde_libs" ; then
    vde=yes
-    libs_softmmu="$vde_libs $libs_softmmu"
-    libs_tools="$vde_libs $libs_tools"
  else
    if test "$vde" = "yes" ; then
      feature_not_found "vde" "Install vde (Virtual Distributed Ethernet) devel"
@@ -3035,13 +3096,13 @@ for drv in $audio_drv_list; do
    alsa)
    audio_drv_probe $drv alsa/asoundlib.h -lasound \
        "return snd_pcm_close((snd_pcm_t *)0);"
-    libs_softmmu="-lasound $libs_softmmu"
+    alsa_libs="-lasound"
    ;;

    pa)
    audio_drv_probe $drv pulse/pulseaudio.h "-lpulse" \
        "pa_context_set_source_output_volume(NULL, 0, NULL, NULL, NULL); return 0;"
-    libs_softmmu="-lpulse $libs_softmmu"
+    pulse_libs="-lpulse"
    audio_pt_int="yes"
    ;;

@@ -3052,16 +3113,16 @@ for drv in $audio_drv_list; do
    ;;

    coreaudio)
-      libs_softmmu="-framework CoreAudio $libs_softmmu"
+      coreaudio_libs="-framework CoreAudio"
    ;;

    dsound)
-      libs_softmmu="-lole32 -ldxguid $libs_softmmu"
+      dsound_libs="-lole32 -ldxguid"
      audio_win_int="yes"
    ;;

    oss)
-      libs_softmmu="$oss_lib $libs_softmmu"
+      oss_libs="$oss_lib"
    ;;

    wav)
@@ -3089,7 +3150,6 @@ int main( void ) { return brlapi__openConnection (NULL, NULL, NULL); }
 EOF
  if compile_prog "" "$brlapi_libs" ; then
    brlapi=yes
-    libs_softmmu="$brlapi_libs $libs_softmmu"
  else
    if test "$brlapi" = "yes" ; then
      feature_not_found "brlapi" "Install brlapi devel"
@@ -3299,6 +3359,38 @@ else
      "Please install the pixman devel package."
 fi

+##########################################
+# libmpathpersist probe
+
+if test "$mpath" != "no" ; then
+  cat > $TMPC <<EOF
+#include <libudev.h>
+#include <mpath_persist.h>
+unsigned mpath_mx_alloc_len = 1024;
+int logsink;
+static struct config *multipath_conf;
+extern struct udev *udev;
+extern struct config *get_multipath_config(void);
+extern void put_multipath_config(struct config *conf);
+struct udev *udev;
+struct config *get_multipath_config(void) { return multipath_conf; }
+void put_multipath_config(struct config *conf) { }
+
+int main(void) {
+    udev = udev_new();
+    multipath_conf = mpath_lib_init();
+    return 0;
+}
+EOF
+  if compile_prog "" "-ludev -lmultipath -lmpathpersist" ; then
+    mpathpersist=yes
+  else
+    mpathpersist=no
+  fi
+else
+  mpathpersist=no
+fi
+
 ##########################################
 # libcap probe

@@ -3467,6 +3559,12 @@ else
  tpm_passthrough=no
 fi

+# TPM emulator is for all posix systems
+if test "$mingw32" != "yes"; then
+  tpm_emulator=$tpm
+else
+  tpm_emulator=no
+fi
 ##########################################
 # attr probe

@@ -3556,8 +3654,12 @@ EOF
  if compile_prog "" "$fdt_libs" ; then
    # system DTC is good - use it
    fdt=yes
-  elif test -d ${source_path}/dtc/libfdt ; then
-    # have submodule DTC - use it
+  else
+      # have GIT checkout, so activate dtc submodule
+      if test -e "${source_path}/.git" ; then
+          git_submodules="${git_submodules} dtc"
+      fi
+      if test -d "${source_path}/dtc/libfdt" || test -e "${source_path}/.git" ; then
          fdt=yes
          dtc_internal="yes"
          mkdir -p dtc
@@ -3568,16 +3670,15 @@ EOF
          fdt_cflags="-I\$(SRC_PATH)/dtc/libfdt"
          fdt_libs="-L\$(BUILD_DIR)/dtc/libfdt $fdt_libs"
      elif test "$fdt" = "yes" ; then
-    # have neither and want - prompt for system/submodule install
-    error_exit "DTC (libfdt) version >= 1.4.2 not present. Your options:" \
-        "  (1) Preferred: Install the DTC (libfdt) devel package" \
-        "  (2) Fetch the DTC submodule, using:" \
-        "      git submodule update --init dtc"
+          # Not a git build & no libfdt found, prompt for system install
+          error_exit "DTC (libfdt) version >= 1.4.2 not present." \
+                     "Please install the DTC (libfdt) devel package"
      else
          # don't have and don't want
          fdt_libs=
          fdt=no
      fi
+  fi
 fi

 libs_softmmu="$libs_softmmu $fdt_libs"
@@ -4204,13 +4305,10 @@ EOF
 fi

 # check for smartcard support
-smartcard_cflags=""
 if test "$smartcard" != "no"; then
    if $pkg_config libcacard; then
        libcacard_cflags=$($pkg_config --cflags libcacard)
        libcacard_libs=$($pkg_config --libs libcacard)
-        QEMU_CFLAGS="$QEMU_CFLAGS $libcacard_cflags"
-        libs_softmmu="$libs_softmmu $libcacard_libs"
        smartcard="yes"
    else
        if test "$smartcard" = "yes"; then
@@ -4226,8 +4324,6 @@ if test "$libusb" != "no" ; then
        libusb="yes"
        libusb_cflags=$($pkg_config --cflags libusb-1.0)
        libusb_libs=$($pkg_config --libs libusb-1.0)
-        QEMU_CFLAGS="$QEMU_CFLAGS $libusb_cflags"
-        libs_softmmu="$libs_softmmu $libusb_libs"
    else
        if test "$libusb" = "yes"; then
            feature_not_found "libusb" "Install libusb devel >= 1.0.13"
@@ -4242,8 +4338,6 @@ if test "$usb_redir" != "no" ; then
        usb_redir="yes"
        usb_redir_cflags=$($pkg_config --cflags libusbredirparser-0.5)
        usb_redir_libs=$($pkg_config --libs libusbredirparser-0.5)
-        QEMU_CFLAGS="$QEMU_CFLAGS $usb_redir_cflags"
-        libs_softmmu="$libs_softmmu $usb_redir_libs"
    else
        if test "$usb_redir" = "yes"; then
            feature_not_found "usb-redir" "Install usbredir devel"
@@ -4349,6 +4443,58 @@ EOF
  fi
 fi

+##########################################
+# capstone
+
+case "$capstone" in
+  "" | yes)
+    if $pkg_config capstone; then
+      capstone=system
+    elif test -e "${source_path}/.git" ; then
+      capstone=git
+    elif test -e "${source_path}/capstone/Makefile" ; then
+      capstone=internal
+    elif test -z "$capstone" ; then
+      capstone=no
+    else
+      feature_not_found "capstone" "Install capstone devel or git submodule"
+    fi
+    ;;
+
+  system)
+    if ! $pkg_config capstone; then
+      feature_not_found "capstone" "Install capstone devel"
+    fi
+    ;;
+esac
+
+case "$capstone" in
+  git | internal)
+    if test "$capstone" = git; then
+      git_submodules="${git_submodules} capstone"
+    fi
+    mkdir -p capstone
+    QEMU_CFLAGS="$QEMU_CFLAGS -I\$(SRC_PATH)/capstone/include"
+    if test "$mingw32" = "yes"; then
+      LIBCAPSTONE=capstone.lib
+    else
+      LIBCAPSTONE=libcapstone.a
+    fi
+    LIBS="-L\$(BUILD_DIR)/capstone -lcapstone $LIBS"
+    ;;
+
+  system)
+    QEMU_CFLAGS="$QEMU_CFLAGS $($pkg_config --cflags capstone)"
+    LIBS="$($pkg_config --libs capstone) $LIBS"
+    ;;
+
+  no)
+    ;;
+  *)
+    error_exit "Unknown state for capstone: $capstone"
+    ;;
+esac
+
 ##########################################
 # check if we have fdatasync

@@ -4406,6 +4552,18 @@ if compile_prog "" "" ; then
    posix_syslog=yes
 fi

+##########################################
+# check if we have sem_timedwait
+
+sem_timedwait=no
+cat > $TMPC << EOF
+#include <semaphore.h>
+int main(void) { return sem_timedwait(0, 0); }
+EOF
+if compile_prog "" "" ; then
+    sem_timedwait=yes
+fi
+
 ##########################################
 # check if trace backend exists

@@ -5034,16 +5192,37 @@ if test "$want_tools" = "yes" ; then
  fi
 fi
 if test "$softmmu" = yes ; then
-  if test "$virtfs" != no ; then
-    if test "$cap" = yes && test "$linux" = yes && test "$attr" = yes ; then
+  if test "$linux" = yes; then
+    if test "$virtfs" != no && test "$cap" = yes && test "$attr" = yes ; then
      virtfs=yes
      tools="$tools fsdev/virtfs-proxy-helper\$(EXESUF)"
    else
      if test "$virtfs" = yes; then
-        error_exit "VirtFS is supported only on Linux and requires libcap devel and libattr devel"
+        error_exit "VirtFS requires libcap devel and libattr devel"
      fi
      virtfs=no
    fi
+    if test "$mpath" != no && test "$mpathpersist" = yes ; then
+      mpath=yes
+    else
+      if test "$mpath" = yes; then
+        error_exit "Multipath requires libmpathpersist devel"
+      fi
+      mpath=no
+    fi
+    tools="$tools scsi/qemu-pr-helper\$(EXESUF)"
+  else
+    if test "$virtfs" = yes; then
+      error_exit "VirtFS is supported only on Linux"
+    fi
+    virtfs=no
+    if test "$mpath" = yes; then
+      error_exit "Multipath is supported only on Linux"
+    fi
+    mpath=no
+  fi
+  if test "$xkbcommon" = "yes"; then
+    tools="qemu-keymap\$(EXESUF) $tools"
  fi
 fi

@@ -5228,6 +5407,7 @@ libs_softmmu="$pixman_libs $libs_softmmu"

 echo "Install prefix    $prefix"
 echo "BIOS directory    $(eval echo $qemu_datadir)"
+echo "firmware path     $(eval echo $firmwarepath)"
 echo "binary directory  $(eval echo $bindir)"
 echo "library directory $(eval echo $libdir)"
 echo "module directory  $(eval echo $qemu_moddir)"
@@ -5243,6 +5423,8 @@ echo "local state directory   queried at runtime"
 echo "Windows SDK       $win_sdk"
 fi
 echo "Source path       $source_path"
+echo "GIT binary        $git"
+echo "GIT submodules    $git_submodules"
 echo "C compiler        $cc"
 echo "Host C compiler   $host_cc"
 echo "C++ compiler      $cxx"
@@ -5289,6 +5471,7 @@ echo "Audio drivers     $audio_drv_list"
 echo "Block whitelist (rw) $block_drv_rw_whitelist"
 echo "Block whitelist (ro) $block_drv_ro_whitelist"
 echo "VirtFS support    $virtfs"
+echo "Multipath support $mpath"
 echo "VNC support       $vnc"
 if test "$vnc" = "yes" ; then
    echo "VNC SASL support  $vnc_sasl"
@@ -5359,6 +5542,7 @@ echo "gcov enabled      $gcov"
 echo "TPM support       $tpm"
 echo "libssh2 support   $libssh2"
 echo "TPM passthrough   $tpm_passthrough"
+echo "TPM emulator      $tpm_emulator"
 echo "QOM debugging     $qom_cast_debug"
 echo "Live block migration $live_block_migration"
 echo "lzo support       $lzo"
@@ -5370,6 +5554,7 @@ echo "jemalloc support  $jemalloc"
 echo "avx2 optimization $avx2_opt"
 echo "replication support $replication"
 echo "VxHS block device $vxhs"
+echo "capstone          $capstone"

 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -5418,6 +5603,7 @@ echo "mandir=$mandir" >> $config_host_mak
 echo "sysconfdir=$sysconfdir" >> $config_host_mak
 echo "qemu_confdir=$qemu_confdir" >> $config_host_mak
 echo "qemu_datadir=$qemu_datadir" >> $config_host_mak
+echo "qemu_firmwarepath=$firmwarepath" >> $config_host_mak
 echo "qemu_docdir=$qemu_docdir" >> $config_host_mak
 echo "qemu_moddir=$qemu_moddir" >> $config_host_mak
 if test "$mingw32" = "no" ; then
@@ -5429,6 +5615,9 @@ echo "extra_cxxflags=$EXTRA_CXXFLAGS" >> $config_host_mak
 echo "extra_ldflags=$EXTRA_LDFLAGS" >> $config_host_mak
 echo "qemu_localedir=$qemu_localedir" >> $config_host_mak
 echo "libs_softmmu=$libs_softmmu" >> $config_host_mak
+echo "GIT=$git" >> $config_host_mak
+echo "GIT_SUBMODULES=$git_submodules" >> $config_host_mak
+echo "GIT_UPDATE=$git_update" >> $config_host_mak

 echo "ARCH=$ARCH" >> $config_host_mak

@@ -5499,6 +5688,7 @@ if test "$slirp" = "yes" ; then
 fi
 if test "$vde" = "yes" ; then
  echo "CONFIG_VDE=y" >> $config_host_mak
+  echo "VDE_LIBS=$vde_libs" >> $config_host_mak
 fi
 if test "$netmap" = "yes" ; then
  echo "CONFIG_NETMAP=y" >> $config_host_mak
@@ -5514,6 +5704,11 @@ for drv in $audio_drv_list; do
    def=CONFIG_$(echo $drv | LC_ALL=C tr '[a-z]' '[A-Z]')
    echo "$def=y" >> $config_host_mak
 done
+echo "ALSA_LIBS=$alsa_libs" >> $config_host_mak
+echo "PULSE_LIBS=$pulse_libs" >> $config_host_mak
+echo "COREAUDIO_LIBS=$coreaudio_libs" >> $config_host_mak
+echo "DSOUND_LIBS=$dsound_libs" >> $config_host_mak
+echo "OSS_LIBS=$oss_libs" >> $config_host_mak
 if test "$audio_pt_int" = "yes" ; then
  echo "CONFIG_AUDIO_PT_INT=y" >> $config_host_mak
 fi
@@ -5534,6 +5729,10 @@ fi
 if test "$vnc_png" = "yes" ; then
  echo "CONFIG_VNC_PNG=y" >> $config_host_mak
 fi
+if test "$xkbcommon" = "yes" ; then
+  echo "XKBCOMMON_CFLAGS=$xkbcommon_cflags" >> $config_host_mak
+  echo "XKBCOMMON_LIBS=$xkbcommon_libs" >> $config_host_mak
+fi
 if test "$fnmatch" = "yes" ; then
  echo "CONFIG_FNMATCH=y" >> $config_host_mak
 fi
@@ -5558,6 +5757,7 @@ if test "$sdl" = "yes" ; then
  echo "CONFIG_SDL=y" >> $config_host_mak
  echo "CONFIG_SDLABI=$sdlabi" >> $config_host_mak
  echo "SDL_CFLAGS=$sdl_cflags" >> $config_host_mak
+  echo "SDL_LIBS=$sdl_libs" >> $config_host_mak
 fi
 if test "$cocoa" = "yes" ; then
  echo "CONFIG_COCOA=y" >> $config_host_mak
@@ -5634,6 +5834,9 @@ fi
 if test "$inotify1" = "yes" ; then
  echo "CONFIG_INOTIFY1=y" >> $config_host_mak
 fi
+if test "$sem_timedwait" = "yes" ; then
+  echo "CONFIG_SEM_TIMEDWAIT=y" >> $config_host_mak
+fi
 if test "$byteswap_h" = "yes" ; then
  echo "CONFIG_BYTESWAP_H=y" >> $config_host_mak
 fi
@@ -5647,6 +5850,7 @@ if test "$curl" = "yes" ; then
 fi
 if test "$brlapi" = "yes" ; then
  echo "CONFIG_BRLAPI=y" >> $config_host_mak
+  echo "BRLAPI_LIBS=$brlapi_libs" >> $config_host_mak
 fi
 if test "$bluez" = "yes" ; then
  echo "CONFIG_BLUEZ=y" >> $config_host_mak
@@ -5732,6 +5936,9 @@ fi
 if test "$virtfs" = "yes" ; then
  echo "CONFIG_VIRTFS=y" >> $config_host_mak
 fi
+if test "$mpath" = "yes" ; then
+  echo "CONFIG_MPATH=y" >> $config_host_mak
+fi
 if test "$vhost_scsi" = "yes" ; then
  echo "CONFIG_VHOST_SCSI=y" >> $config_host_mak
 fi
@@ -5781,14 +5988,20 @@ fi

 if test "$smartcard" = "yes" ; then
  echo "CONFIG_SMARTCARD=y" >> $config_host_mak
+  echo "SMARTCARD_CFLAGS=$libcacard_cflags" >> $config_host_mak
+  echo "SMARTCARD_LIBS=$libcacard_libs" >> $config_host_mak
 fi

 if test "$libusb" = "yes" ; then
  echo "CONFIG_USB_LIBUSB=y" >> $config_host_mak
+  echo "LIBUSB_CFLAGS=$libusb_cflags" >> $config_host_mak
+  echo "LIBUSB_LIBS=$libusb_libs" >> $config_host_mak
 fi

 if test "$usb_redir" = "yes" ; then
  echo "CONFIG_USB_REDIR=y" >> $config_host_mak
+  echo "USB_REDIR_CFLAGS=$usb_redir_cflags" >> $config_host_mak
+  echo "USB_REDIR_LIBS=$usb_redir_libs" >> $config_host_mak
 fi

 if test "$opengl" = "yes" ; then
@@ -5937,12 +6150,16 @@ if test "$live_block_migration" = "yes" ; then
  echo "CONFIG_LIVE_BLOCK_MIGRATION=y" >> $config_host_mak
 fi

-# TPM passthrough support?
 if test "$tpm" = "yes"; then
  echo 'CONFIG_TPM=$(CONFIG_SOFTMMU)' >> $config_host_mak
+  # TPM passthrough support?
  if test "$tpm_passthrough" = "yes"; then
    echo "CONFIG_TPM_PASSTHROUGH=y" >> $config_host_mak
  fi
+  # TPM emulator support?
+  if test "$tpm_emulator" = "yes"; then
+    echo "CONFIG_TPM_EMULATOR=y" >> $config_host_mak
+  fi
 fi

 echo "TRACE_BACKENDS=$trace_backends" >> $config_host_mak
@@ -5984,6 +6201,7 @@ echo "CONFIG_TRACE_FILE=$trace_file" >> $config_host_mak

 if test "$rdma" = "yes" ; then
  echo "CONFIG_RDMA=y" >> $config_host_mak
+  echo "RDMA_LIBS=$rdma_libs" >> $config_host_mak
 fi

 if test "$have_rtnetlink" = "yes" ; then
@@ -6013,6 +6231,9 @@ fi
 if test "$ivshmem" = "yes" ; then
  echo "CONFIG_IVSHMEM=y" >> $config_host_mak
 fi
+if test "$capstone" != "no" ; then
+  echo "CONFIG_CAPSTONE=y" >> $config_host_mak
+fi

 # Hold two types of flag:
 #   CONFIG_THREAD_SETNAME_BYTHREAD  - we've got a way of setting the name on
@@ -6068,6 +6289,7 @@ echo "CCAS=$ccas" >> $config_host_mak
 echo "CPP=$cpp" >> $config_host_mak
 echo "OBJCOPY=$objcopy" >> $config_host_mak
 echo "LD=$ld" >> $config_host_mak
+echo "RANLIB=$ranlib" >> $config_host_mak
 echo "NM=$nm" >> $config_host_mak
 echo "WINDRES=$windres" >> $config_host_mak
 echo "CFLAGS=$CFLAGS" >> $config_host_mak
@@ -6495,6 +6717,12 @@ done # for target in $targets
 if [ "$dtc_internal" = "yes" ]; then
  echo "config-host.h: subdir-dtc" >> $config_host_mak
 fi
+if [ "$capstone" = "git" -o "$capstone" = "internal" ]; then
+  echo "config-host.h: subdir-capstone" >> $config_host_mak
+fi
+if test -n "$LIBCAPSTONE"; then
+  echo "LIBCAPSTONE=$LIBCAPSTONE" >> $config_host_mak
+fi

 if test "$numa" = "yes"; then
  echo "CONFIG_NUMA=y" >> $config_host_mak
@@ -6505,8 +6733,8 @@ if test "$ccache_cpp2" = "yes"; then
 fi

 # build tree in object directory in case the source is not in the current directory
-DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests"
-DIRS="$DIRS docs docs/interop fsdev"
+DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32 tests/libqos tests/qapi-schema tests/tcg/xtensa tests/qemu-iotests tests/vm"
+DIRS="$DIRS docs docs/interop fsdev scsi"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas pc-bios/s390-ccw"
 DIRS="$DIRS roms/seabios roms/vgabios"
 DIRS="$DIRS qapi-generated"
@@ -6556,6 +6784,7 @@ for rom in seabios vgabios ; do
    echo "OBJCOPY=objcopy" >> $config_mak
    echo "IASL=$iasl" >> $config_mak
    echo "LD=$ld" >> $config_mak
+    echo "RANLIB=$ranlib" >> $config_mak
 done

 # set up tests data directory
--- a/contrib/libvhost-user/Makefile.objs
+++ b/contrib/libvhost-user/Makefile.objs
@@ -1 +1 @@
-libvhost-user-obj-y = libvhost-user.o
+libvhost-user-obj-y += libvhost-user.o libvhost-user-glib.o
--- a/contrib/libvhost-user/libvhost-user-glib.c
+++ b/contrib/libvhost-user/libvhost-user-glib.c
@@ -0,0 +1,154 @@
+/*
+ * Vhost User library
+ *
+ * Copyright (c) 2016 Nutanix Inc. All rights reserved.
+ * Copyright (c) 2017 Red Hat, Inc.
+ *
+ * Authors:
+ *  Marc-André Lureau <mlureau@redhat.com>
+ *  Felipe Franciosi <felipe@nutanix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "libvhost-user-glib.h"
+
+/* glib event loop integration for libvhost-user and misc callbacks */
+
+G_STATIC_ASSERT((int)G_IO_IN == (int)VU_WATCH_IN);
+G_STATIC_ASSERT((int)G_IO_OUT == (int)VU_WATCH_OUT);
+G_STATIC_ASSERT((int)G_IO_PRI == (int)VU_WATCH_PRI);
+G_STATIC_ASSERT((int)G_IO_ERR == (int)VU_WATCH_ERR);
+G_STATIC_ASSERT((int)G_IO_HUP == (int)VU_WATCH_HUP);
+
+typedef struct VugSrc {
+    GSource parent;
+    VuDev *dev;
+    GPollFD gfd;
+} VugSrc;
+
+static gboolean
+vug_src_prepare(GSource *gsrc, gint *timeout)
+{
+    g_assert(timeout);
+
+    *timeout = -1;
+    return FALSE;
+}
+
+static gboolean
+vug_src_check(GSource *gsrc)
+{
+    VugSrc *src = (VugSrc *)gsrc;
+
+    g_assert(src);
+
+    return src->gfd.revents & src->gfd.events;
+}
+
+static gboolean
+vug_src_dispatch(GSource *gsrc, GSourceFunc cb, gpointer data)
+{
+    VugSrc *src = (VugSrc *)gsrc;
+
+    g_assert(src);
+
+    ((vu_watch_cb)cb)(src->dev, src->gfd.revents, data);
+
+    return G_SOURCE_CONTINUE;
+}
+
+static GSourceFuncs vug_src_funcs = {
+    vug_src_prepare,
+    vug_src_check,
+    vug_src_dispatch,
+    NULL
+};
+
+static GSource *
+vug_source_new(VuDev *dev, int fd, GIOCondition cond,
+               vu_watch_cb vu_cb, gpointer data)
+{
+    GSource *gsrc;
+    VugSrc *src;
+    guint id;
+
+    g_assert(dev);
+    g_assert(fd >= 0);
+    g_assert(vu_cb);
+
+    gsrc = g_source_new(&vug_src_funcs, sizeof(VugSrc));
+    g_source_set_callback(gsrc, (GSourceFunc)vu_cb, data, NULL);
+    src = (VugSrc *)gsrc;
+    src->dev = dev;
+    src->gfd.fd = fd;
+    src->gfd.events = cond;
+
+    g_source_add_poll(gsrc, &src->gfd);
+    id = g_source_attach(gsrc, NULL);
+    g_assert(id);
+    g_source_unref(gsrc);
+
+    return gsrc;
+}
+
+static void
+set_watch(VuDev *vu_dev, int fd, int vu_evt, vu_watch_cb cb, void *pvt)
+{
+    GSource *src;
+    VugDev *dev;
+
+    g_assert(vu_dev);
+    g_assert(fd >= 0);
+    g_assert(cb);
+
+    dev = container_of(vu_dev, VugDev, parent);
+    src = vug_source_new(vu_dev, fd, vu_evt, cb, pvt);
+    g_hash_table_replace(dev->fdmap, GINT_TO_POINTER(fd), src);
+}
+
+static void
+remove_watch(VuDev *vu_dev, int fd)
+{
+    VugDev *dev;
+
+    g_assert(vu_dev);
+    g_assert(fd >= 0);
+
+    dev = container_of(vu_dev, VugDev, parent);
+    g_hash_table_remove(dev->fdmap, GINT_TO_POINTER(fd));
+}
+
+
+static void vug_watch(VuDev *dev, int condition, void *data)
+{
+    if (!vu_dispatch(dev)) {
+        dev->panic(dev, "Error processing vhost message");
+    }
+}
+
+void
+vug_init(VugDev *dev, int socket,
+         vu_panic_cb panic, const VuDevIface *iface)
+{
+    g_assert(dev);
+    g_assert(iface);
+
+    vu_init(&dev->parent, socket, panic, set_watch, remove_watch, iface);
+    dev->fdmap = g_hash_table_new_full(NULL, NULL, NULL,
+                                       (GDestroyNotify) g_source_destroy);
+
+    dev->src = vug_source_new(&dev->parent, socket, G_IO_IN, vug_watch, NULL);
+}
+
+void
+vug_deinit(VugDev *dev)
+{
+    g_assert(dev);
+
+    g_hash_table_unref(dev->fdmap);
+    g_source_unref(dev->src);
+}
--- a/contrib/libvhost-user/libvhost-user-glib.h
+++ b/contrib/libvhost-user/libvhost-user-glib.h
@@ -0,0 +1,32 @@
+/*
+ * Vhost User library
+ *
+ * Copyright (c) 2016 Nutanix Inc. All rights reserved.
+ * Copyright (c) 2017 Red Hat, Inc.
+ *
+ * Authors:
+ *  Marc-André Lureau <mlureau@redhat.com>
+ *  Felipe Franciosi <felipe@nutanix.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#ifndef LIBVHOST_USER_GLIB_H
+#define LIBVHOST_USER_GLIB_H
+
+#include <glib.h>
+#include "libvhost-user.h"
+
+typedef struct VugDev {
+    VuDev parent;
+
+    GHashTable *fdmap; /* fd -> gsource */
+    GSource *src;
+} VugDev;
+
+void vug_init(VugDev *dev, int socket,
+              vu_panic_cb panic, const VuDevIface *iface);
+void vug_deinit(VugDev *dev);
+
+#endif /* LIBVHOST_USER_GLIB_H */
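Review note: `VugDev` embeds `VuDev` as its `parent` member, which is what lets the callbacks in libvhost-user-glib.c (and `vus_panic_cb` in vhost-user-scsi.c) get from the `VuDev *` they receive back to the wrapping structure with `container_of`. The sketch below illustrates that pattern only. The struct bodies are reduced stand-ins, `lun_id` and `vus_from_vu` are hypothetical names, and the macro is a simplified version without the type check QEMU's own definition performs.

```c
#include <stddef.h>

/* Reduced stand-ins for the real types; only the embedding matters here. */
typedef struct VuDev {
    int sock;
} VuDev;

typedef struct VugDev {
    VuDev parent;   /* embedded, as in libvhost-user-glib.h */
    void *fdmap;    /* placeholder for the GHashTable pointer */
} VugDev;

typedef struct VusDev {
    VugDev parent;
    int lun_id;     /* hypothetical payload for illustration */
} VusDev;

/* Simplified container_of: subtract the member offset from the pointer. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Walk outwards twice: VuDev -> VugDev -> VusDev. */
static VusDev *vus_from_vu(VuDev *vu_dev)
{
    VugDev *gdev = container_of(vu_dev, VugDev, parent);

    return container_of(gdev, VusDev, parent);
}
```

Because the lookup is pure pointer arithmetic, it cannot fail at run time. That would explain why the new code no longer needs a NULL check after resolving the device, unlike the removed `vdev_scsi_find_by_vu()` table scan.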
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -13,14 +13,35 @@
 * later.  See the COPYING file in the top-level directory.
 */

-#include <qemu/osdep.h>
+/* this code avoids GLib dependency */
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <string.h>
+#include <assert.h>
+#include <inttypes.h>
+#include <sys/types.h>
+#include <sys/socket.h>
 #include <sys/eventfd.h>
+#include <sys/mman.h>
 #include <linux/vhost.h>

+#include "qemu/compiler.h"
 #include "qemu/atomic.h"

 #include "libvhost-user.h"

+/* usually provided by GLib */
+#ifndef MIN
+#define MIN(x, y) ({                            \
+            typeof(x) _min1 = (x);              \
+            typeof(y) _min2 = (y);              \
+            (void) (&_min1 == &_min2);          \
+            _min1 < _min2 ? _min1 : _min2; })
+#endif
+
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)

 /* The version of the protocol we support */
@@ -35,13 +56,10 @@
    } while (0)

 static const char *
-vu_request_to_string(int req)
+vu_request_to_string(unsigned int req)
 {
 #define REQ(req) [req] = #req
    static const char *vu_request_str[] = {
-        REQ(VHOST_USER_NONE),
-        REQ(VHOST_USER_GET_FEATURES),
-        REQ(VHOST_USER_SET_FEATURES),
        REQ(VHOST_USER_NONE),
        REQ(VHOST_USER_GET_FEATURES),
        REQ(VHOST_USER_SET_FEATURES),
@@ -62,7 +80,10 @@ vu_request_to_string(int req)
        REQ(VHOST_USER_GET_QUEUE_NUM),
        REQ(VHOST_USER_SET_VRING_ENABLE),
        REQ(VHOST_USER_SEND_RARP),
-        REQ(VHOST_USER_INPUT_GET_CONFIG),
+        REQ(VHOST_USER_NET_SET_MTU),
+        REQ(VHOST_USER_SET_SLAVE_REQ_FD),
+        REQ(VHOST_USER_IOTLB_MSG),
+        REQ(VHOST_USER_SET_VRING_ENDIAN),
        REQ(VHOST_USER_MAX),
    };
 #undef REQ
@@ -81,7 +102,9 @@ vu_panic(VuDev *dev, const char *msg, ...)
    va_list ap;

    va_start(ap, msg);
-    buf = g_strdup_vprintf(msg, ap);
+    if (vasprintf(&buf, msg, ap) < 0) {
+        buf = NULL;
+    }
    va_end(ap);

    dev->broken = true;
@@ -703,7 +726,8 @@ vu_set_vring_err_exec(VuDev *dev, VhostUserMsg *vmsg)
 static bool
 vu_get_protocol_features_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
-    uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD;
+    uint64_t features = 1ULL << VHOST_USER_PROTOCOL_F_LOG_SHMFD |
+                        1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ;

    if (dev->iface->get_protocol_features) {
        features |= dev->iface->get_protocol_features(dev);
@@ -756,6 +780,23 @@ vu_set_vring_enable_exec(VuDev *dev, VhostUserMsg *vmsg)
    return false;
 }

+static bool
+vu_set_slave_req_fd(VuDev *dev, VhostUserMsg *vmsg)
+{
+    if (vmsg->fd_num != 1) {
+        vu_panic(dev, "Invalid slave_req_fd message (%d fd's)", vmsg->fd_num);
+        return false;
+    }
+
+    if (dev->slave_fd != -1) {
+        close(dev->slave_fd);
+    }
+    dev->slave_fd = vmsg->fds[0];
+    DPRINT("Got slave_fd: %d\n", vmsg->fds[0]);
+
+    return false;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -819,6 +860,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
        return vu_get_queue_num_exec(dev, vmsg);
    case VHOST_USER_SET_VRING_ENABLE:
        return vu_set_vring_enable_exec(dev, vmsg);
+    case VHOST_USER_SET_SLAVE_REQ_FD:
+        return vu_set_slave_req_fd(dev, vmsg);
    case VHOST_USER_NONE:
        break;
    default:
@@ -853,7 +896,7 @@ vu_dispatch(VuDev *dev)
    success = true;

 end:
-    g_free(vmsg.data);
+    free(vmsg.data);
    return success;
 }

@@ -892,6 +935,10 @@ vu_deinit(VuDev *dev)


    vu_close_log(dev);
+    if (dev->slave_fd != -1) {
+        close(dev->slave_fd);
+        dev->slave_fd = -1;
+    }

    if (dev->sock != -1) {
        close(dev->sock);
@@ -922,6 +969,7 @@ vu_init(VuDev *dev,
    dev->remove_watch = remove_watch;
    dev->iface = iface;
    dev->log_call_fd = -1;
+    dev->slave_fd = -1;
    for (i = 0; i < VHOST_MAX_NR_VIRTQUEUE; i++) {
        dev->vq[i] = (VuVirtq) {
            .call_fd = -1, .kick_fd = -1, .err_fd = -1,
@@ -943,6 +991,12 @@ vu_queue_enabled(VuDev *dev, VuVirtq *vq)
    return vq->enable;
 }

+bool
+vu_queue_started(const VuDev *dev, const VuVirtq *vq)
+{
+    return vq->started;
+}
+
 static inline uint16_t
 vring_avail_flags(VuVirtq *vq)
 {
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -34,6 +34,10 @@ enum VhostUserProtocolFeature {
    VHOST_USER_PROTOCOL_F_MQ = 0,
    VHOST_USER_PROTOCOL_F_LOG_SHMFD = 1,
    VHOST_USER_PROTOCOL_F_RARP = 2,
+    VHOST_USER_PROTOCOL_F_REPLY_ACK = 3,
+    VHOST_USER_PROTOCOL_F_NET_MTU = 4,
+    VHOST_USER_PROTOCOL_F_SLAVE_REQ = 5,
+    VHOST_USER_PROTOCOL_F_CROSS_ENDIAN = 6,

    VHOST_USER_PROTOCOL_F_MAX
 };
@@ -61,7 +65,10 @@ typedef enum VhostUserRequest {
    VHOST_USER_GET_QUEUE_NUM = 17,
    VHOST_USER_SET_VRING_ENABLE = 18,
    VHOST_USER_SEND_RARP = 19,
-    VHOST_USER_INPUT_GET_CONFIG = 20,
+    VHOST_USER_NET_SET_MTU = 20,
+    VHOST_USER_SET_SLAVE_REQ_FD = 21,
+    VHOST_USER_IOTLB_MSG = 22,
+    VHOST_USER_SET_VRING_ENDIAN = 23,
    VHOST_USER_MAX
 } VhostUserRequest;

@@ -219,6 +226,7 @@ struct VuDev {
    VuDevRegion regions[VHOST_MEMORY_MAX_NREGIONS];
    VuVirtq vq[VHOST_MAX_NR_VIRTQUEUE];
    int log_call_fd;
+    int slave_fd;
    uint64_t log_size;
    uint8_t *log_table;
    uint64_t features;
@@ -334,6 +342,15 @@ void vu_queue_set_notification(VuDev *dev, VuVirtq *vq, int enable);
 */
 bool vu_queue_enabled(VuDev *dev, VuVirtq *vq);

+/**
+ * vu_queue_started:
+ * @dev: a VuDev context
+ * @vq: a VuVirtq queue
+ *
+ * Returns: whether the queue is started.
+ */
+bool vu_queue_started(const VuDev *dev, const VuVirtq *vq);
+
 /**
 * vu_queue_empty:
 * @dev: a VuDev context
@@ -358,7 +375,8 @@ void vu_queue_notify(VuDev *dev, VuVirtq *vq);
 * @vq: a VuVirtq queue
 * @sz: the size of struct to return (must be >= VuVirtqElement)
 *
- * Returns: a VuVirtqElement filled from the queue or NULL.
+ * Returns: a VuVirtqElement filled from the queue or NULL. The
+ * returned element must be free()-d by the caller.
 */
 void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz);

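Review note: the renumbered request IDs above index the designated-initializer string table in `vu_request_to_string()`, whose parameter this series changes to `unsigned int`. With an unsigned index, a single upper-bound comparison is enough to reject out-of-range values. The sketch below reproduces the idea with a subset of the enum. `sketch_request_name` is a hypothetical name, and the extra NULL check for unlisted slots is an assumption of this sketch rather than something shown in the diff.

```c
#include <stddef.h>

/* Subset of VhostUserRequest, using the numeric values from the hunk above. */
typedef enum SketchRequest {
    VHOST_USER_NONE = 0,
    VHOST_USER_SEND_RARP = 19,
    VHOST_USER_NET_SET_MTU = 20,
    VHOST_USER_SET_SLAVE_REQ_FD = 21,
    VHOST_USER_IOTLB_MSG = 22,
    VHOST_USER_SET_VRING_ENDIAN = 23,
    VHOST_USER_MAX
} SketchRequest;

static const char *sketch_request_name(unsigned int req)
{
#define REQ(req) [req] = #req
    /* Designated initializers leave unlisted slots as NULL. */
    static const char *names[] = {
        REQ(VHOST_USER_NONE),
        REQ(VHOST_USER_SEND_RARP),
        REQ(VHOST_USER_NET_SET_MTU),
        REQ(VHOST_USER_SET_SLAVE_REQ_FD),
        REQ(VHOST_USER_IOTLB_MSG),
        REQ(VHOST_USER_SET_VRING_ENDIAN),
        REQ(VHOST_USER_MAX),
    };
#undef REQ

    /* Unsigned index: one comparison covers what used to be negative inputs. */
    if (req < VHOST_USER_MAX && names[req] != NULL) {
        return names[req];
    }
    return "unknown";
}
```

Keeping the table keyed by the enum constants means a renumbering such as the one in this patch only has to be made in the header.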
--- a/contrib/vhost-user-scsi/vhost-user-scsi.c
+++ b/contrib/vhost-user-scsi/vhost-user-scsi.c
@@ -11,263 +11,33 @@
 */

 #include "qemu/osdep.h"
-#include "contrib/libvhost-user/libvhost-user.h"
-#include "hw/virtio/virtio-scsi.h"
+#include "contrib/libvhost-user/libvhost-user-glib.h"
+#include "standard-headers/linux/virtio_scsi.h"
 #include "iscsi/iscsi.h"
+#include "iscsi/scsi-lowlevel.h"

 #include <glib.h>

-/* Small compat shim from glib 2.32 */
-#ifndef G_SOURCE_CONTINUE
-#define G_SOURCE_CONTINUE TRUE
-#endif
-#ifndef G_SOURCE_REMOVE
-#define G_SOURCE_REMOVE FALSE
-#endif
-
-/* #define VUS_DEBUG 1 */
-
-/** Log helpers **/
-
-#define PPRE                                                          \
-    struct timespec ts;                                               \
-    char   timebuf[64];                                               \
-    struct tm tm;                                                     \
-    (void)clock_gettime(CLOCK_REALTIME, &ts);                         \
-    (void)strftime(timebuf, 64, "%Y%m%d %T", gmtime_r(&ts.tv_sec, &tm))
-
-#define PEXT(lvl, msg, ...) do {                                      \
-    PPRE;                                                             \
-    fprintf(stderr, "%s.%06ld " lvl ": %s:%s():%d: " msg "\n",        \
-            timebuf, ts.tv_nsec / 1000,                               \
-            __FILE__, __func__, __LINE__, ## __VA_ARGS__);            \
-} while (0)
-
-#define PNOR(lvl, msg, ...) do {                                      \
-    PPRE;                                                             \
-    fprintf(stderr, "%s.%06ld " lvl ": " msg "\n",                    \
-            timebuf, ts.tv_nsec / 1000, ## __VA_ARGS__);              \
-} while (0)
-
-#ifdef VUS_DEBUG
-#define PDBG(msg, ...) PEXT("DBG", msg, ## __VA_ARGS__)
-#define PERR(msg, ...) PEXT("ERR", msg, ## __VA_ARGS__)
-#define PLOG(msg, ...) PEXT("LOG", msg, ## __VA_ARGS__)
-#else
-#define PDBG(msg, ...) { }
-#define PERR(msg, ...) PNOR("ERR", msg, ## __VA_ARGS__)
-#define PLOG(msg, ...) PNOR("LOG", msg, ## __VA_ARGS__)
-#endif
-
-/** vhost-user-scsi specific definitions **/
-
- /* Only 1 LUN and device supported today */
-#define VUS_MAX_LUNS 1
-#define VUS_MAX_DEVS 1
-
 #define VUS_ISCSI_INITIATOR "iqn.2016-11.com.nutanix:vhost-user-scsi"

-typedef struct iscsi_lun {
+typedef struct VusIscsiLun {
    struct iscsi_context *iscsi_ctx;
    int iscsi_lun;
-} iscsi_lun_t;
+} VusIscsiLun;

-typedef struct vhost_scsi_dev {
-    VuDev vu_dev;
-    int server_sock;
+typedef struct VusDev {
+    VugDev parent;
+
+    VusIscsiLun lun;
    GMainLoop *loop;
-    GTree *fdmap;   /* fd -> gsource context id */
-    iscsi_lun_t luns[VUS_MAX_LUNS];
-} vhost_scsi_dev_t;
-
-static vhost_scsi_dev_t *vhost_scsi_devs[VUS_MAX_DEVS];
-
-/** glib event loop integration for libvhost-user and misc callbacks **/
-
-QEMU_BUILD_BUG_ON((int)G_IO_IN != (int)VU_WATCH_IN);
-QEMU_BUILD_BUG_ON((int)G_IO_OUT != (int)VU_WATCH_OUT);
-QEMU_BUILD_BUG_ON((int)G_IO_PRI != (int)VU_WATCH_PRI);
-QEMU_BUILD_BUG_ON((int)G_IO_ERR != (int)VU_WATCH_ERR);
-QEMU_BUILD_BUG_ON((int)G_IO_HUP != (int)VU_WATCH_HUP);
-
-typedef struct vus_gsrc {
-    GSource parent;
-    vhost_scsi_dev_t *vdev_scsi;
-    GPollFD gfd;
-    vu_watch_cb vu_cb;
-} vus_gsrc_t;
-
-static gint vus_fdmap_compare(gconstpointer a, gconstpointer b)
-{
-    return (b > a) - (b < a);
-}
-
-static gboolean vus_gsrc_prepare(GSource *src, gint *timeout)
-{
-    assert(timeout);
-
-    *timeout = -1;
-    return FALSE;
-}
-
-static gboolean vus_gsrc_check(GSource *src)
-{
-    vus_gsrc_t *vus_src = (vus_gsrc_t *)src;
-
-    assert(vus_src);
-
-    return vus_src->gfd.revents & vus_src->gfd.events;
-}
-
-static gboolean vus_gsrc_dispatch(GSource *src, GSourceFunc cb, gpointer data)
-{
-    vhost_scsi_dev_t *vdev_scsi;
-    vus_gsrc_t *vus_src = (vus_gsrc_t *)src;
-
-    assert(vus_src);
-    assert(!(vus_src->vu_cb && cb));
-
-    vdev_scsi = vus_src->vdev_scsi;
-
-    assert(vdev_scsi);
-
-    if (cb) {
-        return cb(data);
-    }
-    if (vus_src->vu_cb) {
-        vus_src->vu_cb(&vdev_scsi->vu_dev, vus_src->gfd.revents, data);
-    }
-    return G_SOURCE_CONTINUE;
-}
-
-static GSourceFuncs vus_gsrc_funcs = {
-    vus_gsrc_prepare,
-    vus_gsrc_check,
-    vus_gsrc_dispatch,
-    NULL
-};
-
-static int vus_gsrc_new(vhost_scsi_dev_t *vdev_scsi, int fd, GIOCondition cond,
-                        vu_watch_cb vu_cb, GSourceFunc gsrc_cb, gpointer data)
-{
-    GSource *vus_gsrc;
-    vus_gsrc_t *vus_src;
-    guint id;
-
-    assert(vdev_scsi);
-    assert(fd >= 0);
-    assert(vu_cb || gsrc_cb);
-    assert(!(vu_cb && gsrc_cb));
-
-    vus_gsrc = g_source_new(&vus_gsrc_funcs, sizeof(vus_gsrc_t));
-    if (!vus_gsrc) {
-        PERR("Error creating GSource for new watch");
-        return -1;
-    }
-    vus_src = (vus_gsrc_t *)vus_gsrc;
-
-    vus_src->vdev_scsi = vdev_scsi;
-    vus_src->gfd.fd = fd;
-    vus_src->gfd.events = cond;
-    vus_src->vu_cb = vu_cb;
-
-    g_source_add_poll(vus_gsrc, &vus_src->gfd);
-    g_source_set_callback(vus_gsrc, gsrc_cb, data, NULL);
-    id = g_source_attach(vus_gsrc, NULL);
-    assert(id);
-    g_source_unref(vus_gsrc);
-
-    g_tree_insert(vdev_scsi->fdmap, (gpointer)(uintptr_t)fd,
-                                    (gpointer)(uintptr_t)id);
-
-    return 0;
-}
-
-/* from libiscsi's scsi-lowlevel.h **
- *
- * nb. We can't directly include scsi-lowlevel.h due to a namespace conflict:
- *     QEMU's scsi.h also defines "SCSI_XFER_NONE".
- */
-
-#define SCSI_CDB_MAX_SIZE           16
-
-struct scsi_iovector {
-    struct scsi_iovec *iov;
-    int niov;
-    int nalloc;
-    size_t offset;
-    int consumed;
-};
-
-struct scsi_allocated_memory {
-    struct scsi_allocated_memory *next;
-    char buf[0];
-};
-
-struct scsi_data {
-    int            size;
-    unsigned char *data;
-};
-
-enum scsi_sense_key {
-    SCSI_SENSE_NO_SENSE            = 0x00,
-    SCSI_SENSE_RECOVERED_ERROR     = 0x01,
-    SCSI_SENSE_NOT_READY           = 0x02,
-    SCSI_SENSE_MEDIUM_ERROR        = 0x03,
-    SCSI_SENSE_HARDWARE_ERROR      = 0x04,
-    SCSI_SENSE_ILLEGAL_REQUEST     = 0x05,
-    SCSI_SENSE_UNIT_ATTENTION      = 0x06,
-    SCSI_SENSE_DATA_PROTECTION     = 0x07,
-    SCSI_SENSE_BLANK_CHECK         = 0x08,
-    SCSI_SENSE_VENDOR_SPECIFIC     = 0x09,
-    SCSI_SENSE_COPY_ABORTED        = 0x0a,
-    SCSI_SENSE_COMMAND_ABORTED     = 0x0b,
-    SCSI_SENSE_OBSOLETE_ERROR_CODE = 0x0c,
-    SCSI_SENSE_OVERFLOW_COMMAND    = 0x0d,
-    SCSI_SENSE_MISCOMPARE          = 0x0e
-};
-
-struct scsi_sense {
-    unsigned char       error_type;
-    enum scsi_sense_key key;
-    int                 ascq;
-    unsigned            sense_specific:1;
-    unsigned            ill_param_in_cdb:1;
-    unsigned            bit_pointer_valid:1;
-    unsigned char       bit_pointer;
-    uint16_t            field_pointer;
-};
-
-enum scsi_residual {
-    SCSI_RESIDUAL_NO_RESIDUAL = 0,
-    SCSI_RESIDUAL_UNDERFLOW,
-    SCSI_RESIDUAL_OVERFLOW
-};
-
-struct scsi_task {
-    int status;
-    int cdb_size;
-    int xfer_dir;
-    int expxferlen;
-    unsigned char cdb[SCSI_CDB_MAX_SIZE];
-    enum scsi_residual residual_status;
-    size_t residual;
-    struct scsi_sense sense;
-    struct scsi_data datain;
-    struct scsi_allocated_memory *mem;
-    void *ptr;
-
-    uint32_t itt;
-    uint32_t cmdsn;
-    uint32_t lun;
-
-    struct scsi_iovector iovector_in;
-    struct scsi_iovector iovector_out;
-};
+} VusDev;

 /** libiscsi integration **/

-static int iscsi_add_lun(iscsi_lun_t *lun, char *iscsi_uri)
+typedef struct virtio_scsi_cmd_req VirtIOSCSICmdReq;
+typedef struct virtio_scsi_cmd_resp VirtIOSCSICmdResp;
+
+static int vus_iscsi_add_lun(VusIscsiLun *lun, char *iscsi_uri)
 {
    struct iscsi_url *iscsi_url;
    struct iscsi_context *iscsi_ctx;
@@ -275,30 +45,32 @@ static int iscsi_add_lun(iscsi_lun_t *lun, char *iscsi_uri)

    assert(lun);
    assert(iscsi_uri);
+    assert(!lun->iscsi_ctx);

    iscsi_ctx = iscsi_create_context(VUS_ISCSI_INITIATOR);
    if (!iscsi_ctx) {
-        PERR("Unable to create iSCSI context");
+        g_warning("Unable to create iSCSI context");
        return -1;
    }

    iscsi_url = iscsi_parse_full_url(iscsi_ctx, iscsi_uri);
    if (!iscsi_url) {
-        PERR("Unable to parse iSCSI URL: %s", iscsi_get_error(iscsi_ctx));
+        g_warning("Unable to parse iSCSI URL: %s", iscsi_get_error(iscsi_ctx));
        goto fail;
    }

    iscsi_set_session_type(iscsi_ctx, ISCSI_SESSION_NORMAL);
    iscsi_set_header_digest(iscsi_ctx, ISCSI_HEADER_DIGEST_NONE_CRC32C);
    if (iscsi_full_connect_sync(iscsi_ctx, iscsi_url->portal, iscsi_url->lun)) {
-        PERR("Unable to login to iSCSI portal: %s", iscsi_get_error(iscsi_ctx));
+        g_warning("Unable to login to iSCSI portal: %s",
+                  iscsi_get_error(iscsi_ctx));
        goto fail;
    }

    lun->iscsi_ctx = iscsi_ctx;
    lun->iscsi_lun = iscsi_url->lun;

-    PDBG("Context %p created for lun 0: %s", iscsi_ctx, iscsi_uri);
+    g_debug("Context %p created for lun 0: %s", iscsi_ctx, iscsi_uri);

 out:
    if (iscsi_url) {
@@ -313,18 +85,14 @@ fail:
 }

 static struct scsi_task *scsi_task_new(int cdb_len, uint8_t *cdb, int dir,
-                                       int xfer_len) {
+                                       int xfer_len)
+{
    struct scsi_task *task;

    assert(cdb_len > 0);
    assert(cdb);

-    task = calloc(1, sizeof(struct scsi_task));
-    if (!task) {
-        PERR("Error allocating task: %s", strerror(errno));
-        return NULL;
-    }
-
+    task = g_new0(struct scsi_task, 1);
    memcpy(task->cdb, cdb, cdb_len);
    task->cdb_size = cdb_len;
    task->xfer_dir = dir;
@@ -344,7 +112,7 @@ static int get_cdb_len(uint8_t *cdb)
    case 4: return 16;
    case 5: return 12;
    }
-    PERR("Unable to determine cdb len (0x%02hhX)", cdb[0] >> 5);
+    g_warning("Unable to determine cdb len (0x%02hhX)", cdb[0] >> 5);
    return -1;
 }

@@ -352,7 +120,8 @@ static int handle_cmd_sync(struct iscsi_context *ctx,
                           VirtIOSCSICmdReq *req,
                           struct iovec *out, unsigned int out_len,
                           VirtIOSCSICmdResp *rsp,
-                           struct iovec *in, unsigned int in_len) {
+                           struct iovec *in, unsigned int in_len)
+{
    struct scsi_task *task;
    uint32_t dir;
    uint32_t len;
@@ -365,7 +134,7 @@ static int handle_cmd_sync(struct iscsi_context *ctx,

    if (!(!req->lun[1] && req->lun[2] == 0x40 && !req->lun[3])) {
        /* Ignore anything different than target=0, lun=0 */
-        PDBG("Ignoring unconnected lun (0x%hhX, 0x%hhX)",
+        g_debug("Ignoring unconnected lun (0x%hhX, 0x%hhX)",
             req->lun[1], req->lun[3]);
        rsp->status = SCSI_STATUS_CHECK_CONDITION;
        memset(rsp->sense, 0, sizeof(rsp->sense));
@@ -387,36 +156,32 @@ static int handle_cmd_sync(struct iscsi_context *ctx,
    if (!out_len && !in_len) {
        dir = SCSI_XFER_NONE;
    } else if (out_len) {
-        dir = SCSI_XFER_TO_DEV;
+        dir = SCSI_XFER_WRITE;
        for (i = 0; i < out_len; i++) {
            len += out[i].iov_len;
        }
    } else {
-        dir = SCSI_XFER_FROM_DEV;
+        dir = SCSI_XFER_READ;
        for (i = 0; i < in_len; i++) {
            len += in[i].iov_len;
        }
    }

    task = scsi_task_new(cdb_len, req->cdb, dir, len);
-    if (!task) {
-        PERR("Unable to create iscsi task");
-        return -1;
-    }

-    if (dir == SCSI_XFER_TO_DEV) {
+    if (dir == SCSI_XFER_WRITE) {
        task->iovector_out.iov = (struct scsi_iovec *)out;
        task->iovector_out.niov = out_len;
-    } else if (dir == SCSI_XFER_FROM_DEV) {
+    } else if (dir == SCSI_XFER_READ) {
        task->iovector_in.iov = (struct scsi_iovec *)in;
        task->iovector_in.niov = in_len;
    }

-    PDBG("Sending iscsi cmd (cdb_len=%d, dir=%d, task=%p)",
+    g_debug("Sending iscsi cmd (cdb_len=%d, dir=%d, task=%p)",
         cdb_len, dir, task);
    if (!iscsi_scsi_command_sync(ctx, 0, task, NULL)) {
-        PERR("Error serving SCSI command");
-        free(task);
+        g_warning("Error serving SCSI command");
+        g_free(task);
        return -1;
    }

@@ -431,9 +196,9 @@ static int handle_cmd_sync(struct iscsi_context *ctx,
        memcpy(rsp->sense, &task->datain.data[2], rsp->sense_len);
    }

-    free(task);
+    g_free(task);

-    PDBG("Filled in rsp: status=%hhX, resid=%u, response=%hhX, sense_len=%u",
+    g_debug("Filled in rsp: status=%hhX, resid=%u, response=%hhX, sense_len=%u",
         rsp->status, rsp->resid, rsp->response, rsp->sense_len);

    return 0;
@@ -441,116 +206,46 @@ static int handle_cmd_sync(struct iscsi_context *ctx,

 /** libvhost-user callbacks **/

-static vhost_scsi_dev_t *vdev_scsi_find_by_vu(VuDev *vu_dev);
-
 static void vus_panic_cb(VuDev *vu_dev, const char *buf)
 {
-    vhost_scsi_dev_t *vdev_scsi;
+    VugDev *gdev;
+    VusDev *vdev_scsi;

    assert(vu_dev);

-    vdev_scsi = vdev_scsi_find_by_vu(vu_dev);
-
+    gdev = container_of(vu_dev, VugDev, parent);
+    vdev_scsi = container_of(gdev, VusDev, parent);
    if (buf) {
-        PERR("vu_panic: %s", buf);
+        g_warning("vu_panic: %s", buf);
    }

-    if (vdev_scsi) {
-        assert(vdev_scsi->loop);
    g_main_loop_quit(vdev_scsi->loop);
-    }
-}
-
-static void vus_add_watch_cb(VuDev *vu_dev, int fd, int vu_evt, vu_watch_cb cb,
-                             void *pvt) {
-    vhost_scsi_dev_t *vdev_scsi;
-    guint id;
-
-    assert(vu_dev);
-    assert(fd >= 0);
-    assert(cb);
-
-    vdev_scsi = vdev_scsi_find_by_vu(vu_dev);
-    if (!vdev_scsi) {
-        vus_panic_cb(vu_dev, NULL);
-        return;
-    }
-
-    id = (guint)(uintptr_t)g_tree_lookup(vdev_scsi->fdmap,
-                                         (gpointer)(uintptr_t)fd);
-    if (id) {
-        GSource *vus_src = g_main_context_find_source_by_id(NULL, id);
-        assert(vus_src);
-        g_source_destroy(vus_src);
-        (void)g_tree_remove(vdev_scsi->fdmap, (gpointer)(uintptr_t)fd);
-    }
-
-    if (vus_gsrc_new(vdev_scsi, fd, vu_evt, cb, NULL, pvt)) {
-        vus_panic_cb(vu_dev, NULL);
-    }
-}
-
-static void vus_del_watch_cb(VuDev *vu_dev, int fd)
-{
-    vhost_scsi_dev_t *vdev_scsi;
-    guint id;
-
-    assert(vu_dev);
-    assert(fd >= 0);
-
-    vdev_scsi = vdev_scsi_find_by_vu(vu_dev);
-    if (!vdev_scsi) {
-        vus_panic_cb(vu_dev, NULL);
-        return;
-    }
-
-    id = (guint)(uintptr_t)g_tree_lookup(vdev_scsi->fdmap,
-                                         (gpointer)(uintptr_t)fd);
-    if (id) {
-        GSource *vus_src = g_main_context_find_source_by_id(NULL, id);
-        assert(vus_src);
-        g_source_destroy(vus_src);
-        (void)g_tree_remove(vdev_scsi->fdmap, (gpointer)(uintptr_t)fd);
-    }
-}
-
-static void vus_proc_ctl(VuDev *vu_dev, int idx)
-{
-    /* Control VQ not implemented */
-}
-
-static void vus_proc_evt(VuDev *vu_dev, int idx)
-{
-    /* Event VQ not implemented */
 }

 static void vus_proc_req(VuDev *vu_dev, int idx)
 {
-    vhost_scsi_dev_t *vdev_scsi;
+    VugDev *gdev;
+    VusDev *vdev_scsi;
    VuVirtq *vq;

    assert(vu_dev);

-    vdev_scsi = vdev_scsi_find_by_vu(vu_dev);
-    if (!vdev_scsi) {
-        vus_panic_cb(vu_dev, NULL);
-        return;
-    }
-
-    if ((idx < 0) || (idx >= VHOST_MAX_NR_VIRTQUEUE)) {
-        PERR("VQ Index out of range: %d", idx);
+    gdev = container_of(vu_dev, VugDev, parent);
+    vdev_scsi = container_of(gdev, VusDev, parent);
+    if (idx < 0 || idx >= VHOST_MAX_NR_VIRTQUEUE) {
+        g_warning("VQ Index out of range: %d", idx);
        vus_panic_cb(vu_dev, NULL);
        return;
    }

    vq = vu_get_queue(vu_dev, idx);
    if (!vq) {
-        PERR("Error fetching VQ (dev=%p, idx=%d)", vu_dev, idx);
+        g_warning("Error fetching VQ (dev=%p, idx=%d)", vu_dev, idx);
        vus_panic_cb(vu_dev, NULL);
        return;
    }

-    PDBG("Got kicked on vq[%d]@%p", idx, vq);
+    g_debug("Got kicked on vq[%d]@%p", idx, vq);

    while (1) {
        VuVirtqElement *elem;
@@ -559,29 +254,29 @@ static void vus_proc_req(VuDev *vu_dev, int idx)

        elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement));
        if (!elem) {
-            PDBG("No more elements pending on vq[%d]@%p", idx, vq);
+            g_debug("No more elements pending on vq[%d]@%p", idx, vq);
            break;
        }
-        PDBG("Popped elem@%p", elem);
+        g_debug("Popped elem@%p", elem);

-        assert(!((elem->out_num > 1) && (elem->in_num > 1)));
-        assert((elem->out_num > 0) && (elem->in_num > 0));
+        assert(!(elem->out_num > 1 && elem->in_num > 1));
+        assert(elem->out_num > 0 && elem->in_num > 0);

        if (elem->out_sg[0].iov_len < sizeof(VirtIOSCSICmdReq)) {
-            PERR("Invalid virtio-scsi req header");
+            g_warning("Invalid virtio-scsi req header");
            vus_panic_cb(vu_dev, NULL);
            break;
        }
        req = (VirtIOSCSICmdReq *)elem->out_sg[0].iov_base;

        if (elem->in_sg[0].iov_len < sizeof(VirtIOSCSICmdResp)) {
-            PERR("Invalid virtio-scsi rsp header");
+            g_warning("Invalid virtio-scsi rsp header");
            vus_panic_cb(vu_dev, NULL);
            break;
        }
        rsp = (VirtIOSCSICmdResp *)elem->in_sg[0].iov_base;

-        if (handle_cmd_sync(vdev_scsi->luns[0].iscsi_ctx,
+        if (handle_cmd_sync(vdev_scsi->lun.iscsi_ctx,
                            req, &elem->out_sg[1], elem->out_num - 1,
                            rsp, &elem->in_sg[1], elem->in_num - 1) != 0) {
            vus_panic_cb(vu_dev, NULL);
@@ -601,22 +296,17 @@ static void vus_queue_set_started(VuDev *vu_dev, int idx, bool started)

    assert(vu_dev);

-    if ((idx < 0) || (idx >= VHOST_MAX_NR_VIRTQUEUE)) {
-        PERR("VQ Index out of range: %d", idx);
+    if (idx < 0 || idx >= VHOST_MAX_NR_VIRTQUEUE) {
+        g_warning("VQ Index out of range: %d", idx);
        vus_panic_cb(vu_dev, NULL);
        return;
    }

    vq = vu_get_queue(vu_dev, idx);

-    switch (idx) {
-    case 0:
-        vu_set_queue_handler(vu_dev, vq, started ? vus_proc_ctl : NULL);
-        break;
-    case 1:
-        vu_set_queue_handler(vu_dev, vq, started ? vus_proc_evt : NULL);
-        break;
-    default:
+    if (idx == 0 || idx == 1) {
+        g_debug("queue %d unimplemented", idx);
+    } else {
        vu_set_queue_handler(vu_dev, vq, started ? vus_proc_req : NULL);
    }
 }
@@ -625,21 +315,6 @@ static const VuDevIface vus_iface = {
    .queue_set_started = vus_queue_set_started,
 };

-static gboolean vus_vhost_cb(gpointer data)
-{
-    VuDev *vu_dev = (VuDev *)data;
-
-    assert(vu_dev);
-
-    if (!vu_dispatch(vu_dev) != 0) {
-        PERR("Error processing vhost message");
-        vus_panic_cb(vu_dev, NULL);
-        return G_SOURCE_REMOVE;
-    }
-
-    return G_SOURCE_CONTINUE;
-}
-
 /** misc helpers **/

 static int unix_sock_new(char *unix_fn)
@@ -681,159 +356,22 @@ fail:

 /** vhost-user-scsi **/

-static vhost_scsi_dev_t *vdev_scsi_find_by_vu(VuDev *vu_dev)
-{
-    int i;
-
-    assert(vu_dev);
-
-    for (i = 0; i < VUS_MAX_DEVS; i++) {
-        if (&vhost_scsi_devs[i]->vu_dev == vu_dev) {
-            return vhost_scsi_devs[i];
-        }
-    }
-
-    PERR("Unknown VuDev %p", vu_dev);
-    return NULL;
-}
-
-static void vdev_scsi_deinit(vhost_scsi_dev_t *vdev_scsi)
-{
-    if (!vdev_scsi) {
-        return;
-    }
-
-    if (vdev_scsi->server_sock >= 0) {
-        struct sockaddr_storage ss;
-        socklen_t sslen = sizeof(ss);
-
-        if (getsockname(vdev_scsi->server_sock, (struct sockaddr *)&ss,
-                        &sslen) == 0) {
-            struct sockaddr_un *su = (struct sockaddr_un *)&ss;
-            (void)unlink(su->sun_path);
-        }
-
-        (void)close(vdev_scsi->server_sock);
-        vdev_scsi->server_sock = -1;
-    }
-
-    if (vdev_scsi->loop) {
-        g_main_loop_unref(vdev_scsi->loop);
-        vdev_scsi->loop = NULL;
-    }
-}
-
-static vhost_scsi_dev_t *vdev_scsi_new(char *unix_fn)
-{
-    vhost_scsi_dev_t *vdev_scsi = NULL;
-
-    assert(unix_fn);
-
-    vdev_scsi = calloc(1, sizeof(vhost_scsi_dev_t));
-    if (!vdev_scsi) {
-        PERR("calloc: %s", strerror(errno));
-        return NULL;
-    }
-
-    vdev_scsi->server_sock = unix_sock_new(unix_fn);
-    if (vdev_scsi->server_sock < 0) {
-        goto err;
-    }
-
-    vdev_scsi->loop = g_main_loop_new(NULL, FALSE);
-    if (!vdev_scsi->loop) {
-        PERR("Error creating glib event loop");
-        goto err;
-    }
-
-    vdev_scsi->fdmap = g_tree_new(vus_fdmap_compare);
-    if (!vdev_scsi->fdmap) {
-        PERR("Error creating glib tree for fdmap");
-        goto err;
-    }
-
-    return vdev_scsi;
-
-err:
-    vdev_scsi_deinit(vdev_scsi);
-    free(vdev_scsi);
-
-    return NULL;
-}
-
-static int vdev_scsi_add_iscsi_lun(vhost_scsi_dev_t *vdev_scsi,
-                                   char *iscsi_uri, uint32_t lun) {
-    assert(vdev_scsi);
-    assert(iscsi_uri);
-    assert(lun < VUS_MAX_LUNS);
-
-    if (vdev_scsi->luns[lun].iscsi_ctx) {
-        PERR("Lun %d already configured", lun);
-        return -1;
-    }
-
-    if (iscsi_add_lun(&vdev_scsi->luns[lun], iscsi_uri) != 0) {
-        return -1;
-    }
-
-    return 0;
-}
-
-static int vdev_scsi_run(vhost_scsi_dev_t *vdev_scsi)
-{
-    int cli_sock;
-    int ret = 0;
-
-    assert(vdev_scsi);
-    assert(vdev_scsi->server_sock >= 0);
-    assert(vdev_scsi->loop);
-
-    cli_sock = accept(vdev_scsi->server_sock, (void *)0, (void *)0);
-    if (cli_sock < 0) {
-        perror("accept");
-        return -1;
-    }
-
-    vu_init(&vdev_scsi->vu_dev,
-            cli_sock,
-            vus_panic_cb,
-            vus_add_watch_cb,
-            vus_del_watch_cb,
-            &vus_iface);
-
-    if (vus_gsrc_new(vdev_scsi, cli_sock, G_IO_IN, NULL, vus_vhost_cb,
-                     &vdev_scsi->vu_dev)) {
-        goto fail;
-    }
-
-    g_main_loop_run(vdev_scsi->loop);
-
-out:
-    vu_deinit(&vdev_scsi->vu_dev);
-
-    return ret;
-
-fail:
-    ret = -1;
-    goto out;
-}
-
 int main(int argc, char **argv)
 {
-    vhost_scsi_dev_t *vdev_scsi = NULL;
+    VusDev *vdev_scsi = NULL;
    char *unix_fn = NULL;
    char *iscsi_uri = NULL;
-    int opt, err = EXIT_SUCCESS;
+    int lsock = -1, csock = -1, opt, err = EXIT_SUCCESS;

    while ((opt = getopt(argc, argv, "u:i:")) != -1) {
        switch (opt) {
        case 'h':
            goto help;
        case 'u':
-            unix_fn = strdup(optarg);
+            unix_fn = g_strdup(optarg);
            break;
        case 'i':
-            iscsi_uri = strdup(optarg);
+            iscsi_uri = g_strdup(optarg);
            break;
        default:
            goto help;
@@ -843,31 +381,44 @@ int main(int argc, char **argv)
        goto help;
    }

-    vdev_scsi = vdev_scsi_new(unix_fn);
-    if (!vdev_scsi) {
-        goto err;
-    }
-    vhost_scsi_devs[0] = vdev_scsi;
-
-    if (vdev_scsi_add_iscsi_lun(vdev_scsi, iscsi_uri, 0) != 0) {
+    lsock = unix_sock_new(unix_fn);
+    if (lsock < 0) {
        goto err;
    }

-    if (vdev_scsi_run(vdev_scsi) != 0) {
+    csock = accept(lsock, NULL, NULL);
+    if (csock < 0) {
+        perror("accept");
        goto err;
    }

+    vdev_scsi = g_new0(VusDev, 1);
+    vdev_scsi->loop = g_main_loop_new(NULL, FALSE);
+
+    if (vus_iscsi_add_lun(&vdev_scsi->lun, iscsi_uri) != 0) {
+        goto err;
+    }
+
+    vug_init(&vdev_scsi->parent, csock, vus_panic_cb, &vus_iface);
+
+    g_main_loop_run(vdev_scsi->loop);
+
+    vug_deinit(&vdev_scsi->parent);
+
 out:
    if (vdev_scsi) {
-        vdev_scsi_deinit(vdev_scsi);
-        free(vdev_scsi);
+        g_main_loop_unref(vdev_scsi->loop);
+        g_free(vdev_scsi);
+        unlink(unix_fn);
    }
-    if (unix_fn) {
-        free(unix_fn);
+    if (csock >= 0) {
+        close(csock);
    }
-    if (iscsi_uri) {
-        free(iscsi_uri);
+    if (lsock >= 0) {
+        close(lsock);
    }
+    g_free(unix_fn);
+    g_free(iscsi_uri);

    return err;

--- a/cpus.c
+++ b/cpus.c
@@ -1307,6 +1307,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
    CPUState *cpu = arg;

    rcu_register_thread();
+    tcg_register_thread();

    qemu_mutex_lock_iothread();
    qemu_thread_get_self(cpu->thread);
@@ -1454,6 +1455,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
    g_assert(!use_icount);

    rcu_register_thread();
+    tcg_register_thread();

    qemu_mutex_lock_iothread();
    qemu_thread_get_self(cpu->thread);
@@ -1664,6 +1666,18 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
    char thread_name[VCPU_THREAD_NAME_SIZE];
    static QemuCond *single_tcg_halt_cond;
    static QemuThread *single_tcg_cpu_thread;
+    static int tcg_region_inited;
+
+    /*
+     * Initialize TCG regions--once. Now is a good time, because:
+     * (1) TCG's init context, prologue and target globals have been set up.
+     * (2) qemu_tcg_mttcg_enabled() works now (TCG init code runs before the
+     *     -accel flag is processed, so the check doesn't work then).
+     */
+    if (!tcg_region_inited) {
+        tcg_region_inited = 1;
+        tcg_region_init();
+    }

    if (qemu_tcg_mttcg_enabled() || !single_tcg_cpu_thread) {
        cpu->thread = g_malloc0(sizeof(QemuThread));
@@ -1764,8 +1778,9 @@ void qemu_init_vcpu(CPUState *cpu)
        /* If the target cpu hasn't set up any address spaces itself,
         * give it the default one.
         */
-        AddressSpace *as = address_space_init_shareable(cpu->memory,
-                                                        "cpu-memory");
+        AddressSpace *as = g_new0(AddressSpace, 1);
+
+        address_space_init(as, cpu->memory, "cpu-memory");
        cpu->num_ases = 1;
        cpu_address_space_init(cpu, as, 0);
    }
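Note on the one-time region setup in qemu_tcg_init_vcpu() above: the
function-local static flag makes tcg_region_init() run only on the first
call. A minimal standalone sketch of that guard follows. The names
demo_region_init(), demo_init_vcpu() and region_init_calls are hypothetical
and are not QEMU code. The sketch assumes calls are serialized; a plain
static flag does not protect against concurrent first calls.

```c
#include <assert.h>

/* Hypothetical counter so the effect of the guard can be observed. */
static int region_init_calls;

/* Hypothetical stand-in for an expensive one-time setup routine. */
static void demo_region_init(void)
{
    region_init_calls++;
}

/* Called once per vCPU; only the first call performs the setup. */
static void demo_init_vcpu(void)
{
    static int inited; /* zero-initialized, persists across calls */

    if (!inited) {
        inited = 1;
        demo_region_init();
    }
    /* Whatever the call count, the setup has run exactly once. */
    assert(region_init_calls == 1);
}
```

Calling demo_init_vcpu() any number of times leaves region_init_calls at 1.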
--- a/crypto/block-luks.c
+++ b/crypto/block-luks.c
@@ -846,8 +846,9 @@ qcrypto_block_luks_open(QCryptoBlock *block,
        }
    }

+    block->sector_size = QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
    block->payload_offset = luks->header.payload_offset *
-        QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
+        block->sector_size;

    luks->cipher_alg = cipheralg;
    luks->cipher_mode = ciphermode;
@@ -1240,8 +1241,9 @@ qcrypto_block_luks_create(QCryptoBlock *block,
                   QCRYPTO_BLOCK_LUKS_SECTOR_SIZE)) *
         QCRYPTO_BLOCK_LUKS_NUM_KEY_SLOTS);

+    block->sector_size = QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
    block->payload_offset = luks->header.payload_offset *
-        QCRYPTO_BLOCK_LUKS_SECTOR_SIZE;
+        block->sector_size;

    /* Reserve header space to match payload offset */
    initfunc(block, block->payload_offset, opaque, &local_err);
@@ -1397,29 +1399,33 @@ static void qcrypto_block_luks_cleanup(QCryptoBlock *block)

 static int
 qcrypto_block_luks_decrypt(QCryptoBlock *block,
-                           uint64_t startsector,
+                           uint64_t offset,
                           uint8_t *buf,
                           size_t len,
                           Error **errp)
 {
+    assert(QEMU_IS_ALIGNED(offset, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE));
+    assert(QEMU_IS_ALIGNED(len, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE));
    return qcrypto_block_decrypt_helper(block->cipher,
                                        block->niv, block->ivgen,
                                        QCRYPTO_BLOCK_LUKS_SECTOR_SIZE,
-                                        startsector, buf, len, errp);
+                                        offset, buf, len, errp);
 }


 static int
 qcrypto_block_luks_encrypt(QCryptoBlock *block,
-                           uint64_t startsector,
+                           uint64_t offset,
                           uint8_t *buf,
                           size_t len,
                           Error **errp)
 {
+    assert(QEMU_IS_ALIGNED(offset, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE));
+    assert(QEMU_IS_ALIGNED(len, QCRYPTO_BLOCK_LUKS_SECTOR_SIZE));
    return qcrypto_block_encrypt_helper(block->cipher,
                                        block->niv, block->ivgen,
                                        QCRYPTO_BLOCK_LUKS_SECTOR_SIZE,
-                                        startsector, buf, len, errp);
+                                        offset, buf, len, errp);
 }


--- a/crypto/block-qcow.c
+++ b/crypto/block-qcow.c
@@ -80,6 +80,7 @@ qcrypto_block_qcow_init(QCryptoBlock *block,
        goto fail;
    }

+    block->sector_size = QCRYPTO_BLOCK_QCOW_SECTOR_SIZE;
    block->payload_offset = 0;

    return 0;
@@ -142,29 +143,33 @@ qcrypto_block_qcow_cleanup(QCryptoBlock *block)

 static int
 qcrypto_block_qcow_decrypt(QCryptoBlock *block,
-                           uint64_t startsector,
+                           uint64_t offset,
                           uint8_t *buf,
                           size_t len,
                           Error **errp)
 {
+    assert(QEMU_IS_ALIGNED(offset, QCRYPTO_BLOCK_QCOW_SECTOR_SIZE));
+    assert(QEMU_IS_ALIGNED(len, QCRYPTO_BLOCK_QCOW_SECTOR_SIZE));
    return qcrypto_block_decrypt_helper(block->cipher,
                                        block->niv, block->ivgen,
                                        QCRYPTO_BLOCK_QCOW_SECTOR_SIZE,
-                                        startsector, buf, len, errp);
+                                        offset, buf, len, errp);
 }


 static int
 qcrypto_block_qcow_encrypt(QCryptoBlock *block,
-                           uint64_t startsector,
+                           uint64_t offset,
                           uint8_t *buf,
                           size_t len,
                           Error **errp)
 {
+    assert(QEMU_IS_ALIGNED(offset, QCRYPTO_BLOCK_QCOW_SECTOR_SIZE));
+    assert(QEMU_IS_ALIGNED(len, QCRYPTO_BLOCK_QCOW_SECTOR_SIZE));
    return qcrypto_block_encrypt_helper(block->cipher,
                                        block->niv, block->ivgen,
                                        QCRYPTO_BLOCK_QCOW_SECTOR_SIZE,
-                                        startsector, buf, len, errp);
+                                        offset, buf, len, errp);
 }


--- a/crypto/block.c
+++ b/crypto/block.c
@@ -127,22 +127,22 @@ QCryptoBlockInfo *qcrypto_block_get_info(QCryptoBlock *block,


 int qcrypto_block_decrypt(QCryptoBlock *block,
-                          uint64_t startsector,
+                          uint64_t offset,
                          uint8_t *buf,
                          size_t len,
                          Error **errp)
 {
-    return block->driver->decrypt(block, startsector, buf, len, errp);
+    return block->driver->decrypt(block, offset, buf, len, errp);
 }


 int qcrypto_block_encrypt(QCryptoBlock *block,
-                          uint64_t startsector,
+                          uint64_t offset,
                          uint8_t *buf,
                          size_t len,
                          Error **errp)
 {
-    return block->driver->encrypt(block, startsector, buf, len, errp);
+    return block->driver->encrypt(block, offset, buf, len, errp);
 }


@@ -170,6 +170,12 @@ uint64_t qcrypto_block_get_payload_offset(QCryptoBlock *block)
 }


+uint64_t qcrypto_block_get_sector_size(QCryptoBlock *block)
+{
+    return block->sector_size;
+}
+
+
 void qcrypto_block_free(QCryptoBlock *block)
 {
    if (!block) {
@@ -188,13 +194,17 @@ int qcrypto_block_decrypt_helper(QCryptoCipher *cipher,
                                 size_t niv,
                                 QCryptoIVGen *ivgen,
                                 int sectorsize,
-                                 uint64_t startsector,
+                                 uint64_t offset,
                                 uint8_t *buf,
                                 size_t len,
                                 Error **errp)
 {
    uint8_t *iv;
    int ret = -1;
+    uint64_t startsector = offset / sectorsize;
+
+    assert(QEMU_IS_ALIGNED(offset, sectorsize));
+    assert(QEMU_IS_ALIGNED(len, sectorsize));

    iv = niv ? g_new0(uint8_t, niv) : NULL;

@@ -237,13 +247,17 @@ int qcrypto_block_encrypt_helper(QCryptoCipher *cipher,
                                 size_t niv,
                                 QCryptoIVGen *ivgen,
                                 int sectorsize,
-                                 uint64_t startsector,
+                                 uint64_t offset,
                                 uint8_t *buf,
                                 size_t len,
                                 Error **errp)
 {
    uint8_t *iv;
    int ret = -1;
+    uint64_t startsector = offset / sectorsize;
+
+    assert(QEMU_IS_ALIGNED(offset, sectorsize));
+    assert(QEMU_IS_ALIGNED(len, sectorsize));

    iv = niv ? g_new0(uint8_t, niv) : NULL;

--- a/crypto/blockpriv.h
+++ b/crypto/blockpriv.h
@@ -36,6 +36,7 @@ struct QCryptoBlock {
    QCryptoHashAlgorithm kdfhash;
    size_t niv;
    uint64_t payload_offset; /* In bytes */
+    uint64_t sector_size; /* In bytes */
 };

 struct QCryptoBlockDriver {
@@ -81,7 +82,7 @@ int qcrypto_block_decrypt_helper(QCryptoCipher *cipher,
                                 size_t niv,
                                 QCryptoIVGen *ivgen,
                                 int sectorsize,
-                                 uint64_t startsector,
+                                 uint64_t offset,
                                 uint8_t *buf,
                                 size_t len,
                                 Error **errp);
@@ -90,7 +91,7 @@ int qcrypto_block_encrypt_helper(QCryptoCipher *cipher,
                                 size_t niv,
                                 QCryptoIVGen *ivgen,
                                 int sectorsize,
-                                 uint64_t startsector,
+                                 uint64_t offset,
                                 uint8_t *buf,
                                 size_t len,
                                 Error **errp);
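Note on the startsector to offset change in the crypto/block*.c hunks above:
callers now pass a byte offset, and the helpers derive the sector number
after asserting that offset and length are sector-aligned. A minimal
standalone sketch of that conversion follows. The names is_aligned(),
offset_to_sector() and sectors_in_len() are hypothetical; is_aligned() only
stands in for QEMU_IS_ALIGNED() and is not the real macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU_IS_ALIGNED(): true when value is a
 * multiple of a non-zero alignment. */
static bool is_aligned(uint64_t value, uint64_t alignment)
{
    return alignment != 0 && value % alignment == 0;
}

/* Convert a byte offset into the sector index of its first sector.
 * Unaligned offsets are a caller bug, so they trip the assertion. */
static uint64_t offset_to_sector(uint64_t offset, uint64_t sector_size)
{
    assert(is_aligned(offset, sector_size));
    return offset / sector_size;
}

/* Number of whole sectors covered by a buffer of len bytes. */
static uint64_t sectors_in_len(uint64_t len, uint64_t sector_size)
{
    assert(is_aligned(len, sector_size));
    return len / sector_size;
}
```

With a sector size of 512 (an example value only), offset 1024 maps to
sector 2 and a 4096-byte buffer spans 8 sectors.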
--- a/crypto/cipher.c
+++ b/crypto/cipher.c
@@ -164,11 +164,10 @@ QCryptoCipher *qcrypto_cipher_new(QCryptoCipherAlgorithm alg,
 {
    QCryptoCipher *cipher;
    void *ctx = NULL;
-    Error *err2 = NULL;
    QCryptoCipherDriver *drv = NULL;

 #ifdef CONFIG_AF_ALG
-    ctx = qcrypto_afalg_cipher_ctx_new(alg, mode, key, nkey, &err2);
+    ctx = qcrypto_afalg_cipher_ctx_new(alg, mode, key, nkey, NULL);
    if (ctx) {
        drv = &qcrypto_cipher_afalg_driver;
    }
@@ -177,12 +176,10 @@ QCryptoCipher *qcrypto_cipher_new(QCryptoCipherAlgorithm alg,
    if (!ctx) {
        ctx = qcrypto_cipher_ctx_new(alg, mode, key, nkey, errp);
        if (!ctx) {
-            error_free(err2);
            return NULL;
        }

        drv = &qcrypto_cipher_lib_driver;
-        error_free(err2);
    }

    cipher = g_new0(QCryptoCipher, 1);
--- a/crypto/hash.c
+++ b/crypto/hash.c
@@ -48,19 +48,16 @@ int qcrypto_hash_bytesv(QCryptoHashAlgorithm alg,
 {
 #ifdef CONFIG_AF_ALG
    int ret;
-
-    ret = qcrypto_hash_afalg_driver.hash_bytesv(alg, iov, niov,
-                                                result, resultlen,
-                                                errp);
-    if (ret == 0) {
-        return ret;
-    }
-
    /*
     * TODO:
     * Maybe we should treat some afalg errors as fatal
     */
-    error_free(*errp);
+    ret = qcrypto_hash_afalg_driver.hash_bytesv(alg, iov, niov,
+                                                result, resultlen,
+                                                NULL);
+    if (ret == 0) {
+        return ret;
+    }
 #endif

    return qcrypto_hash_lib_driver.hash_bytesv(alg, iov, niov,
--- a/crypto/hmac.c
+++ b/crypto/hmac.c
@@ -90,11 +90,10 @@ QCryptoHmac *qcrypto_hmac_new(QCryptoHashAlgorithm alg,
 {
    QCryptoHmac *hmac;
    void *ctx = NULL;
-    Error *err2 = NULL;
    QCryptoHmacDriver *drv = NULL;

 #ifdef CONFIG_AF_ALG
-    ctx = qcrypto_afalg_hmac_ctx_new(alg, key, nkey, &err2);
+    ctx = qcrypto_afalg_hmac_ctx_new(alg, key, nkey, NULL);
    if (ctx) {
        drv = &qcrypto_hmac_afalg_driver;
    }
@@ -107,7 +106,6 @@ QCryptoHmac *qcrypto_hmac_new(QCryptoHashAlgorithm alg,
        }

        drv = &qcrypto_hmac_lib_driver;
-        error_free(err2);
    }

    hmac = g_new0(QCryptoHmac, 1);
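Note on the errp handling in the cipher.c, hash.c and hmac.c hunks above:
the AF_ALG attempt now receives NULL as its error pointer, so its failure is
silently ignored and only the library fallback reports through errp. A
minimal standalone sketch of that pattern follows. DemoError, try_accel(),
try_lib() and pick_backend() are hypothetical and do not use QEMU's Error
API; the supported algorithm ranges are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error object; QEMU's real Error type is richer. */
typedef struct DemoError {
    const char *msg;
} DemoError;

static DemoError accel_unsupported = { "accel: unsupported algorithm" };
static DemoError lib_unsupported = { "lib: unsupported algorithm" };

/* Store an error only if the caller asked for one (errp may be NULL). */
static void demo_error_set(DemoError **errp, DemoError *err)
{
    if (errp) {
        *errp = err;
    }
}

/* Hypothetical accelerated backend: supports only algorithm 0. */
static int try_accel(int alg, DemoError **errp)
{
    if (alg != 0) {
        demo_error_set(errp, &accel_unsupported);
        return -1;
    }
    return 0;
}

/* Hypothetical library backend: supports algorithms 0 to 2. */
static int try_lib(int alg, DemoError **errp)
{
    if (alg < 0 || alg > 2) {
        demo_error_set(errp, &lib_unsupported);
        return -1;
    }
    return 0;
}

/* Returns 0 if the accelerated backend was used, 1 for the library
 * fallback, or -1 with *errp set when both backends fail. */
static int pick_backend(int alg, DemoError **errp)
{
    assert(errp == NULL || *errp == NULL); /* no stale error on entry */

    if (try_accel(alg, NULL) == 0) { /* a failure here is not reported */
        return 0;
    }
    return try_lib(alg, errp) == 0 ? 1 : -1;
}
```

Because the first attempt never writes to errp, there is no intermediate
error object to free before the fallback runs.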
--- a/default-configs/arm-softmmu.mak
+++ b/default-configs/arm-softmmu.mak
@@ -129,3 +129,6 @@ CONFIG_ACPI=y
 CONFIG_SMBIOS=y
 CONFIG_ASPEED_SOC=y
 CONFIG_GPIO_KEY=y
+CONFIG_MSF2=y
+
+CONFIG_FW_CFG_DMA=y
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -7,6 +7,7 @@ CONFIG_QXL=$(CONFIG_SPICE)
 CONFIG_VGA_ISA=y
 CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
+CONFIG_VMXNET3_PCI=y
 CONFIG_VIRTIO_VGA=y
 CONFIG_VMMOUSE=y
 CONFIG_IPMI=y
@@ -59,3 +60,4 @@ CONFIG_SMBIOS=y
 CONFIG_HYPERV_TESTDEV=$(CONFIG_KVM)
 CONFIG_PXB=y
 CONFIG_ACPI_VMGENID=y
+CONFIG_FW_CFG_DMA=y
--- a/default-configs/or1k-softmmu.mak
+++ b/default-configs/or1k-softmmu.mak
@@ -2,3 +2,4 @@

 CONFIG_SERIAL=y
 CONFIG_OPENCORES_ETH=y
+CONFIG_OMPIC=y
--- a/default-configs/pci.mak
+++ b/default-configs/pci.mak
@@ -22,7 +22,6 @@ CONFIG_MPTSAS_SCSI_PCI=y
 CONFIG_RTL8139_PCI=y
 CONFIG_E1000_PCI=y
 CONFIG_E1000E_PCI=y
-CONFIG_VMXNET3_PCI=y
 CONFIG_IDE_CORE=y
 CONFIG_IDE_QDEV=y
 CONFIG_IDE_PCI=y
--- a/default-configs/ppc-linux-user.mak
+++ b/default-configs/ppc-linux-user.mak
@@ -1,2 +1 @@
 # Default configuration for ppc-linux-user
-CONFIG_LIBDECNUMBER=y
--- a/default-configs/ppc-softmmu.mak
+++ b/default-configs/ppc-softmmu.mak
@@ -46,7 +46,6 @@ CONFIG_E500=y
 CONFIG_OPENPIC_KVM=$(call land,$(CONFIG_E500),$(CONFIG_KVM))
 CONFIG_PLATFORM_BUS=y
 CONFIG_ETSEC=y
-CONFIG_LIBDECNUMBER=y
 CONFIG_SM501=y
 # For PReP
 CONFIG_SERIAL_ISA=y
--- a/default-configs/ppc64-linux-user.mak
+++ b/default-configs/ppc64-linux-user.mak
@@ -1,2 +1 @@
 # Default configuration for ppc64-linux-user
-CONFIG_LIBDECNUMBER=y
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -51,7 +51,6 @@ CONFIG_E500=y
 CONFIG_OPENPIC_KVM=$(call land,$(CONFIG_E500),$(CONFIG_KVM))
 CONFIG_PLATFORM_BUS=y
 CONFIG_ETSEC=y
-CONFIG_LIBDECNUMBER=y
 CONFIG_SM501=y
 # For pSeries
 CONFIG_XICS=$(CONFIG_PSERIES)
--- a/default-configs/ppc64abi32-linux-user.mak
+++ b/default-configs/ppc64abi32-linux-user.mak
@@ -1,2 +1 @@
 # Default configuration for ppc64abi32-linux-user
-CONFIG_LIBDECNUMBER=y
--- a/default-configs/ppc64le-linux-user.mak
+++ b/default-configs/ppc64le-linux-user.mak
@@ -1,2 +1 @@
 # Default configuration for ppc64le-linux-user
-CONFIG_LIBDECNUMBER=y
--- a/default-configs/ppcemb-softmmu.mak
+++ b/default-configs/ppcemb-softmmu.mak
@@ -15,5 +15,4 @@ CONFIG_PTIMER=y
 CONFIG_I8259=y
 CONFIG_XILINX=y
 CONFIG_XILINX_ETHLITE=y
-CONFIG_LIBDECNUMBER=y
 CONFIG_SM501=y
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -7,6 +7,7 @@ CONFIG_QXL=$(CONFIG_SPICE)
 CONFIG_VGA_ISA=y
 CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
+CONFIG_VMXNET3_PCI=y
 CONFIG_VIRTIO_VGA=y
 CONFIG_VMMOUSE=y
 CONFIG_IPMI=y
@@ -59,3 +60,4 @@ CONFIG_SMBIOS=y
 CONFIG_HYPERV_TESTDEV=$(CONFIG_KVM)
 CONFIG_PXB=y
 CONFIG_ACPI_VMGENID=y
+CONFIG_FW_CFG_DMA=y
--- a/disas.c
+++ b/disas.c
@@ -6,6 +6,7 @@

 #include "cpu.h"
 #include "disas/disas.h"
+#include "disas/capstone.h"

 typedef struct CPUDebug {
    struct disassemble_info info;
@@ -171,15 +172,257 @@ static int print_insn_od_target(bfd_vma pc, disassemble_info *info)
    return print_insn_objdump(pc, info, "OBJD-T");
 }

-/* Disassemble this for me please... (debugging). 'flags' has the following
-   values:
-    i386 - 1 means 16 bit code, 2 means 64 bit code
-    ppc  - bits 0:15 specify (optionally) the machine instruction set;
-           bit 16 indicates little endian.
-    other targets - unused
- */
+#ifdef CONFIG_CAPSTONE
+/* Temporary storage for the capstone library.  This will be allocated via
+   malloc with a size private to the library; thus there's no reason not
+   to share this across calls and across host vs target disassembly.  */
+static __thread cs_insn *cap_insn;
+
+/* Initialize the Capstone library.  */
+/* ??? It would be nice to cache this.  We would need one handle for the
+   host and one for the target.  For most targets we can reset specific
+   parameters via cs_option(CS_OPT_MODE, new_mode), but we cannot change
+   CS_ARCH_* in this way.  Thus we would need to be able to close and
+   re-open the target handle with a different arch for the target in order
+   to handle AArch64 vs AArch32 mode switching.  */
+static cs_err cap_disas_start(disassemble_info *info, csh *handle)
+{
+    cs_mode cap_mode = info->cap_mode;
+    cs_err err;
+
+    cap_mode += (info->endian == BFD_ENDIAN_BIG ? CS_MODE_BIG_ENDIAN
+                 : CS_MODE_LITTLE_ENDIAN);
+
+    err = cs_open(info->cap_arch, cap_mode, handle);
+    if (err != CS_ERR_OK) {
+        return err;
+    }
+
+    /* ??? There probably ought to be a better place to put this.  */
+    if (info->cap_arch == CS_ARCH_X86) {
+        /* We don't care about errors (if for some reason the library
+           is compiled without AT&T syntax); the user will just have
+           to deal with the Intel syntax.  */
+        cs_option(*handle, CS_OPT_SYNTAX, CS_OPT_SYNTAX_ATT);
+    }
+
+    /* "Disassemble" unknown insns as ".byte W,X,Y,Z".  */
+    cs_option(*handle, CS_OPT_SKIPDATA, CS_OPT_ON);
+
+    /* Allocate temp space for cs_disasm_iter.  */
+    if (cap_insn == NULL) {
+        cap_insn = cs_malloc(*handle);
+        if (cap_insn == NULL) {
+            cs_close(handle);
+            return CS_ERR_MEM;
+        }
+    }
+    return CS_ERR_OK;
+}
+
+static void cap_dump_insn_units(disassemble_info *info, cs_insn *insn,
+                                int i, int n)
+{
+    fprintf_function print = info->fprintf_func;
+    FILE *stream = info->stream;
+
+    switch (info->cap_insn_unit) {
+    case 4:
+        if (info->endian == BFD_ENDIAN_BIG) {
+            for (; i < n; i += 4) {
+                print(stream, " %08x", ldl_be_p(insn->bytes + i));
+
+            }
+        } else {
+            for (; i < n; i += 4) {
+                print(stream, " %08x", ldl_le_p(insn->bytes + i));
+            }
+        }
+        break;
+
+    case 2:
+        if (info->endian == BFD_ENDIAN_BIG) {
+            for (; i < n; i += 2) {
+                print(stream, " %04x", lduw_be_p(insn->bytes + i));
+            }
+        } else {
+            for (; i < n; i += 2) {
+                print(stream, " %04x", lduw_le_p(insn->bytes + i));
+            }
+        }
+        break;
+
+    default:
+        for (; i < n; i++) {
+            print(stream, " %02x", insn->bytes[i]);
+        }
+        break;
+    }
+}
+
+static void cap_dump_insn(disassemble_info *info, cs_insn *insn)
+{
+    fprintf_function print = info->fprintf_func;
+    int i, n, split;
+
+    print(info->stream, "0x%08" PRIx64 ": ", insn->address);
+
+    n = insn->size;
+    split = info->cap_insn_split;
+
+    /* Dump the first SPLIT bytes of the instruction.  */
+    cap_dump_insn_units(info, insn, 0, MIN(n, split));
+
+    /* Add padding up to SPLIT so that mnemonics line up.  */
+    if (n < split) {
+        int width = (split - n) / info->cap_insn_unit;
+        width *= (2 * info->cap_insn_unit + 1);
+        print(info->stream, "%*s", width, "");
+    }
+
+    /* Print the actual instruction.  */
+    print(info->stream, "  %-8s %s\n", insn->mnemonic, insn->op_str);
+
+    /* Dump any remaining part of the insn on subsequent lines.  */
+    for (i = split; i < n; i += split) {
+        print(info->stream, "0x%08" PRIx64 ": ", insn->address + i);
+        cap_dump_insn_units(info, insn, i, MIN(n, i + split));
+        print(info->stream, "\n");
+    }
+}
+
+/* Disassemble SIZE bytes at PC for the target.  */
+static bool cap_disas_target(disassemble_info *info, uint64_t pc, size_t size)
+{
+    uint8_t cap_buf[1024];
+    csh handle;
+    cs_insn *insn;
+    size_t csize = 0;
+
+    if (cap_disas_start(info, &handle) != CS_ERR_OK) {
+        return false;
+    }
+    insn = cap_insn;
+
+    while (1) {
+        size_t tsize = MIN(sizeof(cap_buf) - csize, size);
+        const uint8_t *cbuf = cap_buf;
+
+        target_read_memory(pc + csize, cap_buf + csize, tsize, info);
+        csize += tsize;
+        size -= tsize;
+
+        while (cs_disasm_iter(handle, &cbuf, &csize, &pc, insn)) {
+            cap_dump_insn(info, insn);
+        }
+
+        /* If the target memory is not consumed, go back for more... */
+        if (size != 0) {
+            /* ... taking care to move any remaining fractional insn
+               to the beginning of the buffer.  */
+            if (csize != 0) {
+                memmove(cap_buf, cbuf, csize);
+            }
+            continue;
+        }
+
+        /* Since the target memory is consumed, we should not have
+           a remaining fractional insn.  */
+        if (csize != 0) {
+            (*info->fprintf_func)(info->stream,
+                "Disassembler disagrees with translator "
+                "over instruction decoding\n"
+                "Please report this to qemu-devel@nongnu.org\n");
+        }
+        break;
+    }
+
+    cs_close(&handle);
+    return true;
+}
+
+/* Disassemble SIZE bytes at CODE for the host.  */
+static bool cap_disas_host(disassemble_info *info, void *code, size_t size)
+{
+    csh handle;
+    const uint8_t *cbuf;
+    cs_insn *insn;
+    uint64_t pc;
+
+    if (cap_disas_start(info, &handle) != CS_ERR_OK) {
+        return false;
+    }
+    insn = cap_insn;
+
+    cbuf = code;
+    pc = (uintptr_t)code;
+
+    while (cs_disasm_iter(handle, &cbuf, &size, &pc, insn)) {
+        cap_dump_insn(info, insn);
+    }
+    if (size != 0) {
+        (*info->fprintf_func)(info->stream,
+            "Disassembler disagrees with TCG over instruction encoding\n"
+            "Please report this to qemu-devel@nongnu.org\n");
+    }
+
+    cs_close(&handle);
+    return true;
+}
+
+#if !defined(CONFIG_USER_ONLY)
+/* Disassemble COUNT insns at PC for the target.  */
+static bool cap_disas_monitor(disassemble_info *info, uint64_t pc, int count)
+{
+    uint8_t cap_buf[32];
+    csh handle;
+    cs_insn *insn;
+    size_t csize = 0;
+
+    if (cap_disas_start(info, &handle) != CS_ERR_OK) {
+        return false;
+    }
+    insn = cap_insn;
+
+    while (1) {
+        /* We want to read memory for one insn, but generically we do not
+           know how much memory that is.  We have a small buffer which is
+           known to be sufficient for all supported targets.  Try to not
+           read beyond the page, Just In Case.  For even more simplicity,
+           ignore the actual target page size and use a 1k boundary.  If
+           that turns out to be insufficient, we'll come back around the
+           loop and read more.  */
+        uint64_t epc = QEMU_ALIGN_UP(pc + csize + 1, 1024);
+        size_t tsize = MIN(sizeof(cap_buf) - csize, epc - pc);
+        const uint8_t *cbuf = cap_buf;
+
+        /* Make certain that we can make progress.  */
+        assert(tsize != 0);
+        info->read_memory_func(pc, cap_buf + csize, tsize, info);
+        csize += tsize;
+
+        if (cs_disasm_iter(handle, &cbuf, &csize, &pc, insn)) {
+            cap_dump_insn(info, insn);
+            if (--count <= 0) {
+                break;
+            }
+        }
+        memmove(cap_buf, cbuf, csize);
+    }
+
+    cs_close(&handle);
+    return true;
+}
+#endif /* !CONFIG_USER_ONLY */
+#else
+# define cap_disas_target(i, p, s)  false
+# define cap_disas_host(i, p, s)  false
+# define cap_disas_monitor(i, p, c)  false
+#endif /* CONFIG_CAPSTONE */
+
+/* Disassemble this for me please... (debugging).  */
 void target_disas(FILE *out, CPUState *cpu, target_ulong code,
-                  target_ulong size, int flags)
+                  target_ulong size)
 {
    CPUClass *cc = CPU_GET_CLASS(cpu);
    target_ulong pc;
@@ -190,10 +433,13 @@ void target_disas(FILE *out, CPUState *cpu, target_ulong code,

    s.cpu = cpu;
    s.info.read_memory_func = target_read_memory;
-    s.info.read_memory_inner_func = NULL;
    s.info.buffer_vma = code;
    s.info.buffer_length = size;
    s.info.print_address_func = generic_print_address;
+    s.info.cap_arch = -1;
+    s.info.cap_mode = 0;
+    s.info.cap_insn_unit = 4;
+    s.info.cap_insn_split = 4;

 #ifdef TARGET_WORDS_BIGENDIAN
    s.info.endian = BFD_ENDIAN_BIG;
@@ -205,32 +451,10 @@ void target_disas(FILE *out, CPUState *cpu, target_ulong code,
        cc->disas_set_info(cpu, &s.info);
    }

-#if defined(TARGET_I386)
-    if (flags == 2) {
-        s.info.mach = bfd_mach_x86_64;
-    } else if (flags == 1) {
-        s.info.mach = bfd_mach_i386_i8086;
-    } else {
-        s.info.mach = bfd_mach_i386_i386;
+    if (s.info.cap_arch >= 0 && cap_disas_target(&s.info, code, size)) {
+        return;
    }
-    s.info.print_insn = print_insn_i386;
-#elif defined(TARGET_PPC)
-    if ((flags >> 16) & 1) {
-        s.info.endian = BFD_ENDIAN_LITTLE;
-    }
-    if (flags & 0xFFFF) {
-        /* If we have a precise definition of the instruction set, use it. */
-        s.info.mach = flags & 0xFFFF;
-    } else {
-#ifdef TARGET_PPC64
-        s.info.mach = bfd_mach_ppc64;
-#else
-        s.info.mach = bfd_mach_ppc;
-#endif
-    }
-    s.info.disassembler_options = (char *)"any";
-    s.info.print_insn = print_insn_ppc;
-#endif
+
    if (s.info.print_insn == NULL) {
        s.info.print_insn = print_insn_od_target;
    }
@@ -238,18 +462,6 @@ void target_disas(FILE *out, CPUState *cpu, target_ulong code,
    for (pc = code; size > 0; pc += count, size -= count) {
 	fprintf(out, "0x" TARGET_FMT_lx ":  ", pc);
 	count = s.info.print_insn(pc, &s.info);
-#if 0
-        {
-            int i;
-            uint8_t b;
-            fprintf(out, " {");
-            for(i = 0; i < count; i++) {
-                target_read_memory(pc + i, &b, 1, &s.info);
-                fprintf(out, " %02x", b);
-            }
-            fprintf(out, " }");
-        }
-#endif
 	fprintf(out, "\n");
 	if (count < 0)
 	    break;
@@ -277,6 +489,10 @@ void disas(FILE *out, void *code, unsigned long size)
    s.info.buffer = code;
    s.info.buffer_vma = (uintptr_t)code;
    s.info.buffer_length = size;
+    s.info.cap_arch = -1;
+    s.info.cap_mode = 0;
+    s.info.cap_insn_unit = 4;
+    s.info.cap_insn_split = 4;

 #ifdef HOST_WORDS_BIGENDIAN
    s.info.endian = BFD_ENDIAN_BIG;
@@ -288,14 +504,27 @@ void disas(FILE *out, void *code, unsigned long size)
 #elif defined(__i386__)
    s.info.mach = bfd_mach_i386_i386;
    print_insn = print_insn_i386;
+    s.info.cap_arch = CS_ARCH_X86;
+    s.info.cap_mode = CS_MODE_32;
+    s.info.cap_insn_unit = 1;
+    s.info.cap_insn_split = 8;
 #elif defined(__x86_64__)
    s.info.mach = bfd_mach_x86_64;
    print_insn = print_insn_i386;
+    s.info.cap_arch = CS_ARCH_X86;
+    s.info.cap_mode = CS_MODE_64;
+    s.info.cap_insn_unit = 1;
+    s.info.cap_insn_split = 8;
 #elif defined(_ARCH_PPC)
    s.info.disassembler_options = (char *)"any";
    print_insn = print_insn_ppc;
+    s.info.cap_arch = CS_ARCH_PPC;
+# ifdef _ARCH_PPC64
+    s.info.cap_mode = CS_MODE_64;
+# endif
 #elif defined(__aarch64__) && defined(CONFIG_ARM_A64_DIS)
    print_insn = print_insn_arm_a64;
+    s.info.cap_arch = CS_ARCH_ARM64;
 #elif defined(__alpha__)
    print_insn = print_insn_alpha;
 #elif defined(__sparc__)
@@ -303,6 +532,8 @@ void disas(FILE *out, void *code, unsigned long size)
    s.info.mach = bfd_mach_sparc_v9b;
 #elif defined(__arm__)
    print_insn = print_insn_arm;
+    s.info.cap_arch = CS_ARCH_ARM;
+    /* TCG only generates code for arm mode.  */
 #elif defined(__MIPSEB__)
    print_insn = print_insn_big_mips;
 #elif defined(__MIPSEL__)
@@ -314,6 +545,11 @@ void disas(FILE *out, void *code, unsigned long size)
 #elif defined(__hppa__)
    print_insn = print_insn_hppa;
 #endif
+
+    if (s.info.cap_arch >= 0 && cap_disas_host(&s.info, code, size)) {
+        return;
+    }
+
    if (print_insn == NULL) {
        print_insn = print_insn_od_host;
    }
@@ -346,26 +582,17 @@ const char *lookup_symbol(target_ulong orig_addr)

 #include "monitor/monitor.h"

-static int monitor_disas_is_physical;
-
 static int
-monitor_read_memory (bfd_vma memaddr, bfd_byte *myaddr, int length,
+physical_read_memory(bfd_vma memaddr, bfd_byte *myaddr, int length,
                     struct disassemble_info *info)
 {
-    CPUDebug *s = container_of(info, CPUDebug, info);
-
-    if (monitor_disas_is_physical) {
    cpu_physical_memory_read(memaddr, myaddr, length);
-    } else {
-        cpu_memory_rw_debug(s->cpu, memaddr, myaddr, length, 0);
-    }
    return 0;
 }

-/* Disassembler for the monitor.
-   See target_disas for a description of flags. */
+/* Disassembler for the monitor.  */
 void monitor_disas(Monitor *mon, CPUState *cpu,
-                   target_ulong pc, int nb_insn, int is_physical, int flags)
+                   target_ulong pc, int nb_insn, int is_physical)
 {
    CPUClass *cc = CPU_GET_CLASS(cpu);
    int count, i;
@@ -374,11 +601,14 @@ void monitor_disas(Monitor *mon, CPUState *cpu,
    INIT_DISASSEMBLE_INFO(s.info, (FILE *)mon, monitor_fprintf);

    s.cpu = cpu;
-    monitor_disas_is_physical = is_physical;
-    s.info.read_memory_func = monitor_read_memory;
+    s.info.read_memory_func
+        = (is_physical ? physical_read_memory : target_read_memory);
    s.info.print_address_func = generic_print_address;
-
    s.info.buffer_vma = pc;
+    s.info.cap_arch = -1;
+    s.info.cap_mode = 0;
+    s.info.cap_insn_unit = 4;
+    s.info.cap_insn_split = 4;

 #ifdef TARGET_WORDS_BIGENDIAN
    s.info.endian = BFD_ENDIAN_BIG;
@@ -390,31 +620,10 @@ void monitor_disas(Monitor *mon, CPUState *cpu,
        cc->disas_set_info(cpu, &s.info);
    }

-#if defined(TARGET_I386)
-    if (flags == 2) {
-        s.info.mach = bfd_mach_x86_64;
-    } else if (flags == 1) {
-        s.info.mach = bfd_mach_i386_i8086;
-    } else {
-        s.info.mach = bfd_mach_i386_i386;
+    if (s.info.cap_arch >= 0 && cap_disas_monitor(&s.info, pc, nb_insn)) {
+        return;
    }
-    s.info.print_insn = print_insn_i386;
-#elif defined(TARGET_PPC)
-    if (flags & 0xFFFF) {
-        /* If we have a precise definition of the instruction set, use it. */
-        s.info.mach = flags & 0xFFFF;
-    } else {
-#ifdef TARGET_PPC64
-        s.info.mach = bfd_mach_ppc64;
-#else
-        s.info.mach = bfd_mach_ppc;
-#endif
-    }
-    if ((flags >> 16) & 1) {
-        s.info.endian = BFD_ENDIAN_LITTLE;
-    }
-    s.info.print_insn = print_insn_ppc;
-#endif
+
    if (!s.info.print_insn) {
        monitor_printf(mon, "0x" TARGET_FMT_lx
                       ": Asm output not supported on this arch\n", pc);
--- a/disas/arm.c
+++ b/disas/arm.c
@@ -70,6 +70,17 @@ static void floatformat_to_double (unsigned char *data, double *dest)
    *dest = u.f;
 }

+static int arm_read_memory(bfd_vma memaddr, bfd_byte *b, int length,
+                           struct disassemble_info *info)
+{
+    assert((info->flags & INSN_ARM_BE32) == 0 || length == 2 || length == 4);
+
+    if ((info->flags & INSN_ARM_BE32) != 0 && length == 2) {
+        memaddr ^= 2;
+    }
+    return info->read_memory_func(memaddr, b, length, info);
+}
+
 /* End of qemu specific additions.  */

 struct opcode32
@@ -3810,7 +3821,7 @@ find_ifthen_state (bfd_vma pc, struct disassemble_info *info,
 	  return;
 	}
      addr -= 2;
-      status = info->read_memory_func (addr, (bfd_byte *)b, 2, info);
+      status = arm_read_memory (addr, (bfd_byte *)b, 2, info);
      if (status)
 	return;

@@ -3882,7 +3893,7 @@ print_insn_arm (bfd_vma pc, struct disassemble_info *info)
      info->bytes_per_chunk = size;
      printer = print_insn_data;

-      status = info->read_memory_func (pc, (bfd_byte *)b, size, info);
+      status = arm_read_memory (pc, (bfd_byte *)b, size, info);
      given = 0;
      if (little)
 	for (i = size - 1; i >= 0; i--)
@@ -3899,7 +3910,7 @@ print_insn_arm (bfd_vma pc, struct disassemble_info *info)
      info->bytes_per_chunk = 4;
      size = 4;

-      status = info->read_memory_func (pc, (bfd_byte *)b, 4, info);
+      status = arm_read_memory (pc, (bfd_byte *)b, 4, info);
      if (little)
 	given = (b[0]) | (b[1] << 8) | (b[2] << 16) | ((unsigned)b[3] << 24);
      else
@@ -3915,7 +3926,7 @@ print_insn_arm (bfd_vma pc, struct disassemble_info *info)
      info->bytes_per_chunk = 2;
      size = 2;

-      status = info->read_memory_func (pc, (bfd_byte *)b, 2, info);
+      status = arm_read_memory (pc, (bfd_byte *)b, 2, info);
      if (little)
 	given = (b[0]) | (b[1] << 8);
      else
@@ -3929,7 +3940,7 @@ print_insn_arm (bfd_vma pc, struct disassemble_info *info)
 	      || (given & 0xF800) == 0xF000
 	      || (given & 0xF800) == 0xE800)
 	    {
-	      status = info->read_memory_func (pc + 2, (bfd_byte *)b, 2, info);
+	      status = arm_read_memory (pc + 2, (bfd_byte *)b, 2, info);
 	      if (little)
 		given = (b[0]) | (b[1] << 8) | (given << 16);
 	      else
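Review note (not part of the patch): the new arm_read_memory() wrapper flips bit 1 of the address for 2-byte reads when INSN_ARM_BE32 is set, and asserts that BE32 reads are 2 or 4 bytes long. The standalone sketch below isolates only that address adjustment. The flag value SKETCH_BE32 and the function name are made up for illustration and do not exist in QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for INSN_ARM_BE32. */
#define SKETCH_BE32 0x1u

/*
 * Return the address a read of 'length' bytes should actually use.
 * In BE32 mode, halfword reads land on the other halfword of the
 * containing 32-bit word; word reads are left untouched.
 */
static uint64_t sketch_be32_addr(uint64_t memaddr, int length, unsigned flags)
{
    /* Same precondition as the patch: BE32 reads are 2 or 4 bytes. */
    assert(!(flags & SKETCH_BE32) || length == 2 || length == 4);

    if ((flags & SKETCH_BE32) && length == 2) {
        memaddr ^= 2;
    }
    return memaddr;
}
```

Because the adjustment is an XOR, applying it twice gives back the original address, which is why every Thumb halfword fetch in print_insn_arm() can go through the same wrapper.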
--- a/docs/devel/atomics.txt
+++ b/docs/devel/atomics.txt
@@ -63,11 +63,23 @@ operations:
    typeof(*ptr) atomic_fetch_sub(ptr, val)
    typeof(*ptr) atomic_fetch_and(ptr, val)
    typeof(*ptr) atomic_fetch_or(ptr, val)
+    typeof(*ptr) atomic_fetch_xor(ptr, val)
+    typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
    typeof(*ptr) atomic_xchg(ptr, val)
    typeof(*ptr) atomic_cmpxchg(ptr, old, new)

 all of which return the old value of *ptr.  These operations are
-polymorphic; they operate on any type that is as wide as an int.
+polymorphic; they operate on any type that is as wide as a pointer.
+
+Similar operations return the new value of *ptr:
+
+    typeof(*ptr) atomic_inc_fetch(ptr)
+    typeof(*ptr) atomic_dec_fetch(ptr)
+    typeof(*ptr) atomic_add_fetch(ptr, val)
+    typeof(*ptr) atomic_sub_fetch(ptr, val)
+    typeof(*ptr) atomic_and_fetch(ptr, val)
+    typeof(*ptr) atomic_or_fetch(ptr, val)
+    typeof(*ptr) atomic_xor_fetch(ptr, val)

 Sequentially consistent loads and stores can be done using:

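Review note (not part of the patch): the documentation hunk above distinguishes operations that return the old value of *ptr (atomic_fetch_*) from the newly listed ones that return the new value (atomic_*_fetch). The sketch below shows the difference with the GCC/Clang __atomic builtins directly rather than QEMU's macros. The sketch_ function names and the choice of sequentially consistent ordering are assumptions made for illustration.

```c
#include <stdint.h>

/* Returns the value *ptr held before the addition (fetch-then-op). */
static uint32_t sketch_fetch_add(uint32_t *ptr, uint32_t val)
{
    return __atomic_fetch_add(ptr, val, __ATOMIC_SEQ_CST);
}

/* Returns the value *ptr holds after the addition (op-then-fetch). */
static uint32_t sketch_add_fetch(uint32_t *ptr, uint32_t val)
{
    return __atomic_add_fetch(ptr, val, __ATOMIC_SEQ_CST);
}
```

Starting from 5, sketch_fetch_add(&x, 3) returns 5 and leaves x at 8, while a following sketch_add_fetch(&x, 2) returns 10.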
--- a/docs/devel/loads-stores.rst
+++ b/docs/devel/loads-stores.rst
@@ -0,0 +1,396 @@
+..
+   Copyright (c) 2017 Linaro Limited
+   Written by Peter Maydell
+
+===================
+Load and Store APIs
+===================
+
+QEMU internally has multiple families of functions for performing
+loads and stores. This document attempts to enumerate them all
+and indicate when to use them. It does not provide detailed
+documentation of each API -- for that you should look at the
+documentation comments in the relevant header files.
+
+
+``ld*_p and st*_p``
+~~~~~~~~~~~~~~~~~~~
+
+These functions operate on a host pointer, and should be used
+when you already have a pointer into host memory (corresponding
+to guest ram or a local buffer). They deal with doing accesses
+with the desired endianness and with correctly handling
+potentially unaligned pointer values.
+
+Function names follow the pattern:
+
+load: ``ld{type}{sign}{size}_{endian}_p(ptr)``
+
+store: ``st{type}{size}_{endian}_p(ptr, val)``
+
+``type``
+ - (empty) : integer access
+ - ``f`` : float access
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes (including floats and doubles)
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``he`` : host endian
+ - ``be`` : big endian
+ - ``le`` : little endian
+
+The ``_{endian}`` infix is omitted for target-endian accesses.
+
+The target endian accessors are only available to source
+files which are built per-target.
+
+Regexes for git grep
+ - ``\<ldf\?[us]\?[bwlq]\(_[hbl]e\)\?_p\>``
+ - ``\<stf\?[bwlq]\(_[hbl]e\)\?_p\>``
+
+``cpu_{ld,st}_*``
+~~~~~~~~~~~~~~~~~
+
+These functions operate on a guest virtual address. Be aware
+that these functions may cause a guest CPU exception to be
+taken (e.g. for an alignment fault or MMU fault) which will
+result in guest CPU state being updated and control longjumping
+out of the function call. They should therefore only be used
+in code that is implementing emulation of the target CPU.
+
+These functions may throw an exception (longjmp() back out
+to the top level TCG loop). This means they must only be used
+from helper functions where the translator has saved all
+necessary CPU state before generating the helper function call.
+It's usually better to use the ``_ra`` variants described below
+from helper functions, but these functions are the right choice
+for calls made from hooks like the CPU do_interrupt hook or
+when you know for certain that the translator had to save all
+the CPU state that ``cpu_restore_state()`` would restore anyway.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{sign}{size}_{mmusuffix}(env, ptr)``
+
+store: ``cpu_st{size}_{mmusuffix}(env, ptr, val)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``mmusuffix`` is one of the generic suffixes ``data`` or ``code``, or
+(for softmmu configs) a target-specific MMU mode suffix as defined
+in the target's ``cpu.h``.
+
+Regexes for git grep
+ - ``\<cpu_ld[us]\?[bwlq]_[a-zA-Z0-9]\+\>``
+ - ``\<cpu_st[bwlq]_[a-zA-Z0-9]\+\>``
+
+``cpu_{ld,st}_*_ra``
+~~~~~~~~~~~~~~~~~~~~
+
+These functions work like the ``cpu_{ld,st}_*`` functions except
+that they also take a ``retaddr`` argument. This extra argument
+allows for correct unwinding of any exception that is taken,
+and should generally be the result of GETPC() called directly
+from the top level HELPER(foo) function (i.e. the return address
+in the generated code).
+
+These are generally the preferred way to do accesses by guest
+virtual address from helper functions; see the documentation
+of the non-``_ra`` variants for when those would be better.
+
+Calling these functions with a ``retaddr`` argument of 0 is
+equivalent to calling the non-``_ra`` version of the function.
+
+Function names follow the pattern:
+
+load: ``cpu_ld{sign}{size}_{mmusuffix}_ra(env, ptr, retaddr)``
+
+store: ``cpu_st{size}_{mmusuffix}_ra(env, ptr, val, retaddr)``
+
+Regexes for git grep
+ - ``\<cpu_ld[us]\?[bwlq]_[a-zA-Z0-9]\+_ra\>``
+ - ``\<cpu_st[bwlq]_[a-zA-Z0-9]\+_ra\>``
+
+``helper_*_{ld,st}*mmu``
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions are intended primarily to be called by the code
+generated by the TCG backend. They may also be called by target
+CPU helper function code. Like the ``cpu_{ld,st}_*_ra`` functions
+they perform accesses by guest virtual address; the difference is
+that these functions allow you to specify an ``opindex`` parameter
+which encodes (among other things) the mmu index to use for the
+access. This is necessary if your helper needs to make an access
+via a specific mmu index (for instance, an "always as non-privileged"
+access) rather than using the default mmu index for the current state
+of the guest CPU.
+
+The ``opindex`` parameter should be created by calling ``make_memop_idx()``.
+
+The ``retaddr`` parameter should be the result of GETPC() called directly
+from the top level HELPER(foo) function (or 0 if no guest CPU state
+unwinding is required).
+
+**TODO** The names of these functions are a bit odd for historical
+reasons because they were originally expected to be called only from
+within generated code. We should rename them to bring them
+more in line with the other memory access functions.
+
+load: ``helper_{endian}_ld{sign}{size}_mmu(env, addr, opindex, retaddr)``
+
+load (code): ``helper_{endian}_ld{sign}{size}_cmmu(env, addr, opindex, retaddr)``
+
+store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+ - ``s`` : signed
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+ - ``ret`` : target endianness
+
+Regexes for git grep
+ - ``\<helper_\(le\|be\|ret\)_ld[us]\?[bwlq]_c\?mmu\>``
+ - ``\<helper_\(le\|be\|ret\)_st[bwlq]_mmu\>``
+
+``address_space_*``
+~~~~~~~~~~~~~~~~~~~
+
+These functions are the primary ones to use when emulating CPU
+or device memory accesses. They take an AddressSpace, which is the
+way QEMU defines the view of memory that a device or CPU has.
+(They generally correspond to being the "master" end of a hardware bus
+or bus fabric.)
+
+Each CPU has an AddressSpace. Some kinds of CPU have more than
+one AddressSpace (for instance ARM guest CPUs have an AddressSpace
+for the Secure world and one for NonSecure if they implement TrustZone).
+Devices which can do DMA-type operations should generally have an
+AddressSpace. There is also a "system address space" which typically
+has all the devices and memory that all CPUs can see. (Some older
+device models use the "system address space" rather than properly
+modelling that they have an AddressSpace of their own.)
+
+Functions are provided for doing byte-buffer reads and writes,
+and also for doing one-data-item loads and stores.
+
+In all cases the caller provides a MemTxAttrs to specify bus
+transaction attributes, and can check whether the memory transaction
+succeeded using a MemTxResult return code.
+
+``address_space_read(address_space, addr, attrs, buf, len)``
+
+``address_space_write(address_space, addr, attrs, buf, len)``
+
+``address_space_rw(address_space, addr, attrs, buf, len, is_write)``
+
+``address_space_ld{sign}{size}_{endian}(address_space, addr, attrs, txresult)``
+
+``address_space_st{size}_{endian}(address_space, addr, val, attrs, txresult)``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+
+(No signed load operations are provided.)
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+
+The ``_{endian}`` suffix is omitted for byte accesses.
+
+Regexes for git grep
+ - ``\<address_space_\(read\|write\|rw\)\>``
+ - ``\<address_space_ldu\?[bwql]\(_[lb]e\)\?\>``
+ - ``\<address_space_st[bwql]\(_[lb]e\)\?\>``
+
+``{ld,st}*_phys``
+~~~~~~~~~~~~~~~~~
+
+These are functions which are identical to
+``address_space_{ld,st}*``, except that they always pass
+``MEMTXATTRS_UNSPECIFIED`` for the transaction attributes, and ignore
+whether the transaction succeeded or failed.
+
+The fact that they ignore whether the transaction succeeded means
+they should not be used in new code, unless you know for certain
+that your code will only be used in a context where the CPU or
+device doing the access has no way to report such an error.
+
+load: ``ld{sign}{size}_{endian}_phys``
+
+store: ``st{size}_{endian}_phys``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+
+(No signed load operations are provided.)
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+
+The ``_{endian}_`` infix is omitted for byte accesses.
+
+Regexes for git grep
+ - ``\<ldu\?[bwlq]\(_[bl]e\)\?_phys\>``
+ - ``\<st[bwlq]\(_[bl]e\)\?_phys\>``
+
+``cpu_physical_memory_*``
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These are convenience functions which are identical to
+``address_space_*`` but operate specifically on the system address space,
+always pass a ``MEMTXATTRS_UNSPECIFIED`` set of memory attributes and
+ignore whether the memory transaction succeeded or failed.
+For new code they are better avoided:
+
+* there is likely to be behaviour you need to model correctly for a
+  failed read or write operation
+* a device should usually perform operations on its own AddressSpace
+  rather than using the system address space
+
+``cpu_physical_memory_read``
+
+``cpu_physical_memory_write``
+
+``cpu_physical_memory_rw``
+
+Regexes for git grep
+ - ``\<cpu_physical_memory_\(read\|write\|rw\)\>``
+
+``cpu_physical_memory_write_rom``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This function performs a write by physical address like
+``address_space_write``, except that if the write is to a ROM then
+the ROM contents will be modified, even though a write by the guest
+CPU to the ROM would be ignored.
+
+Note that unlike ``cpu_physical_memory_write()`` this function takes
+an AddressSpace argument, but unlike ``address_space_write()`` this
+function does not take a ``MemTxAttrs`` or return a ``MemTxResult``.
+
+**TODO**: we should probably clean up this inconsistency and
+turn the function into ``address_space_write_rom`` with an API
+matching ``address_space_write``.
+
+``cpu_physical_memory_write_rom``
+
+
+``cpu_memory_rw_debug``
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Access CPU memory by virtual address for debug purposes.
+
+This function is intended for use by the GDB stub and similar code.
+It takes a virtual address, converts it to a physical address via
+an MMU lookup using the current settings of the specified CPU,
+and then performs the access (using ``address_space_rw`` for
+reads or ``cpu_physical_memory_write_rom`` for writes).
+This means that if the access is a write to a ROM then this
+function will modify the contents (whereas a normal guest CPU access
+would ignore the write attempt).
+
+``cpu_memory_rw_debug``
+
+``dma_memory_*``
+~~~~~~~~~~~~~~~~
+
+These behave like ``address_space_*``, except that they perform a DMA
+barrier operation first.
+
+**TODO**: We should provide guidance on when you need the DMA
+barrier operation and when it's OK to use ``address_space_*``, and
+make sure our existing code is doing things correctly.
+
+``dma_memory_read``
+
+``dma_memory_write``
+
+``dma_memory_rw``
+
+Regexes for git grep
+ - ``\<dma_memory_\(read\|write\|rw\)\>``
+
+``pci_dma_*`` and ``{ld,st}*_pci_dma``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions are specifically for PCI device models which need to
+perform accesses where the PCI device is a bus master. You pass them a
+``PCIDevice *`` and they will do ``dma_memory_*`` operations on the
+correct address space for that device.
+
+``pci_dma_read``
+
+``pci_dma_write``
+
+``pci_dma_rw``
+
+load: ``ld{sign}{size}_{endian}_pci_dma``
+
+store: ``st{size}_{endian}_pci_dma``
+
+``sign``
+ - (empty) : for 32 or 64 bit sizes
+ - ``u`` : unsigned
+
+(No signed load operations are provided.)
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``endian``
+ - ``le`` : little endian
+ - ``be`` : big endian
+
+The ``_{endian}_`` infix is omitted for byte accesses.
+
+Regexes for git grep
+ - ``\<pci_dma_\(read\|write\|rw\)\>``
+ - ``\<ldu\?[bwlq]\(_[bl]e\)\?_pci_dma\>``
+ - ``\<st[bwlq]\(_[bl]e\)\?_pci_dma\>``
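Review note (not part of the patch): the ``ld*_p and st*_p`` section of the new document says these accessors take a host pointer, apply the requested endianness, and tolerate unaligned pointers. The sketch below shows one common way to get that behaviour for a 32-bit little-endian load and a 32-bit big-endian store. The sketch_ names are hypothetical and this is not QEMU's implementation.

```c
#include <stdint.h>
#include <string.h>

/* 32-bit little-endian load from a possibly unaligned host pointer. */
static uint32_t sketch_ldl_le_p(const void *ptr)
{
    uint8_t b[4];

    /* memcpy avoids an unaligned dereference of a uint32_t pointer. */
    memcpy(b, ptr, sizeof(b));
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* 32-bit big-endian store to a possibly unaligned host pointer. */
static void sketch_stl_be_p(void *ptr, uint32_t val)
{
    uint8_t b[4] = {
        (uint8_t)(val >> 24), (uint8_t)(val >> 16),
        (uint8_t)(val >> 8), (uint8_t)val,
    };

    memcpy(ptr, b, sizeof(b));
}
```

Assembling the value byte by byte keeps the result independent of host endianness, which is the property the ``_le``/``_be`` infixes promise.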
--- a/docs/devel/migration.txt
+++ b/docs/devel/migration.txt
@@ -202,7 +202,7 @@ The functions to do that are inside a vmstate definition, and are called:

  This function is called after we load the state of one device.

- void (*pre_save)(void *opaque);
+- int (*pre_save)(void *opaque);

  This function is called before we save the state of one device.

--- a/docs/interop/pr-helper.rst
+++ b/docs/interop/pr-helper.rst
@@ -0,0 +1,83 @@
+..
+
+======================================
+Persistent reservation helper protocol
+======================================
+
+QEMU's SCSI passthrough devices, ``scsi-block`` and ``scsi-generic``,
+can delegate implementation of persistent reservations to an external
+(and typically privileged) program.  Persistent reservations allow
+restricting access to block devices to specific initiators in a shared
+storage setup.
+
+For a more detailed reference please refer to the SCSI Primary
+Commands standard, specifically the section on Reservations and the
+"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands.
+
+This document describes the socket protocol used between QEMU's
+``pr-manager-helper`` object and the external program.
+
+.. contents::
+
+Connection and initialization
+-----------------------------
+
+All data transmitted on the socket is big-endian.
+
+After QEMU connects to the helper program's socket, the helper starts a simple
+feature negotiation process by writing four bytes corresponding to
+the features it exposes (``supported_features``).  QEMU reads it,
+then writes four bytes corresponding to the desired features of the
+helper program (``requested_features``).
+
+If a bit is 1 in ``requested_features`` and 0 in ``supported_features``,
+the corresponding feature is not supported by the helper and the connection
+is closed.  On the other hand, it is acceptable for a bit to be 0 in
+``requested_features`` and 1 in ``supported_features``; in this case,
+the helper will not enable the feature.
+
+Right now no feature is defined, so the two parties always write four
+zero bytes.
+
+Command format
+--------------
+
+It is invalid to send multiple commands concurrently on the same
+socket.  It is however possible to connect multiple sockets to the
+helper and send multiple commands to the helper for one or more
+file descriptors.
+
+A command consists of a request and a response.  A request consists
+of a 16-byte SCSI CDB.  A file descriptor must be passed to the helper
+together with the SCSI CDB using ancillary data.
+
+The CDB has the following limitations:
+
+- the command (stored in the first byte) must be one of 0x5E
+  (PERSISTENT RESERVE IN) or 0x5F (PERSISTENT RESERVE OUT).
+
+- the allocation length (stored in bytes 7-8 of the CDB for PERSISTENT
+  RESERVE IN) or parameter list length (stored in bytes 5-8 of the CDB
+  for PERSISTENT RESERVE OUT) is limited to 8 KiB.
+
+For PERSISTENT RESERVE OUT, the parameter list is sent right after the
+CDB.  The length of the parameter list is taken from the CDB itself.
+
+The helper's reply has the following structure:
+
+- 4 bytes for the SCSI status
+
+- 4 bytes for the payload size (nonzero only for PERSISTENT RESERVE IN
+  and only if the SCSI status is 0x00, i.e. GOOD)
+
+- 96 bytes for the SCSI sense data
+
+- if the size is nonzero, the payload follows
+
+The sense data is always sent to keep the protocol simple, even though
+it is only valid if the SCSI status is CHECK CONDITION (0x02).
+
+The payload size is always less than or equal to the allocation length
+specified in the CDB for the PERSISTENT RESERVE IN command.
+
+If the protocol is violated, the helper closes the socket.
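Review note (not part of the patch): the protocol text above fixes two things that are easy to get wrong on the QEMU side, namely the feature check (a requested bit must also be supported) and the layout of the reply header (4-byte status, 4-byte payload size, 96 bytes of sense data, all integers big-endian). The sketch below decodes both from raw bytes. Every identifier in it is hypothetical and it performs no socket I/O.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SENSE_LEN      96
#define SKETCH_REPLY_HDR_LEN  (4 + 4 + SKETCH_SENSE_LEN)

/* Decoded form of the fixed-size part of a helper reply. */
struct sketch_pr_reply {
    uint32_t scsi_status;
    uint32_t payload_size;
    uint8_t sense[SKETCH_SENSE_LEN];
};

/* Decode a big-endian 32-bit integer from four bytes. */
static uint32_t sketch_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Negotiation succeeds only if no requested bit is unsupported. */
static bool sketch_features_ok(uint32_t supported, uint32_t requested)
{
    return (requested & ~supported) == 0;
}

/* Split the 104-byte reply header into status, size and sense data. */
static void sketch_parse_reply(const uint8_t hdr[SKETCH_REPLY_HDR_LEN],
                               struct sketch_pr_reply *out)
{
    out->scsi_status = sketch_be32(hdr);
    out->payload_size = sketch_be32(hdr + 4);
    memcpy(out->sense, hdr + 8, SKETCH_SENSE_LEN);
}
```

A real client would then read payload_size further bytes, after checking that the size does not exceed the allocation length it put in the CDB.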
--- a/docs/memory-hotplug.txt
+++ b/docs/memory-hotplug.txt
@@ -24,7 +24,7 @@ Where,

 For example, the following command-line:

- qemu [...] 1G,slots=3,maxmem=4G
+ qemu [...] -m 1G,slots=3,maxmem=4G

 Creates a guest with 1GB of memory and three hotpluggable memory slots.
 The hotpluggable memory slots are empty when the guest is booted, so all
--- a/docs/pr-manager.rst
+++ b/docs/pr-manager.rst
@@ -0,0 +1,111 @@
+======================================
+Persistent reservation managers
+======================================
+
+SCSI persistent reservations allow restricting access to block devices
+to specific initiators in a shared storage setup.  When implementing
+clustering of virtual machines, it is a common requirement for virtual
+machines to send persistent reservation SCSI commands.  However,
+the operating system does not let unprivileged programs send these
+commands, because incorrect usage can disrupt regular operation of the
+storage fabric.
+
+For this reason, QEMU's SCSI passthrough devices, ``scsi-block``
+and ``scsi-generic`` (both are only available on Linux) can delegate
+implementation of persistent reservations to a separate object,
+the "persistent reservation manager".  Only PERSISTENT RESERVE OUT and
+PERSISTENT RESERVE IN commands are passed to the persistent reservation
+manager object; other commands are processed by QEMU as usual.
+
+-----------------------------------------
+Defining a persistent reservation manager
+-----------------------------------------
+
+A persistent reservation manager is an instance of a subclass of the
+"pr-manager" QOM class.
+
+Right now only one subclass is defined, ``pr-manager-helper``, which
+forwards the commands to an external privileged helper program
+over Unix sockets.  The helper program only allows sending persistent
+reservation commands to devices for which QEMU has a file descriptor,
+so that QEMU will not be able to effect persistent reservations
+unless it has access to both the socket and the device.
+
+``pr-manager-helper`` has a single string property, ``path``, which
+accepts the path to the helper program's Unix socket.  For example,
+the following command line defines a ``pr-manager-helper`` object and
+attaches it to a SCSI passthrough device::
+
+      $ qemu-system-x86_64 \
+          -device virtio-scsi \
+          -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock \
+          -drive if=none,id=hd,driver=raw,file.filename=/dev/sdb,file.pr-manager=helper0 \
+          -device scsi-block,drive=hd
+
+Alternatively, using ``-blockdev``::
+
+      $ qemu-system-x86_64 \
+          -device virtio-scsi \
+          -object pr-manager-helper,id=helper0,path=/var/run/qemu-pr-helper.sock \
+          -blockdev node-name=hd,driver=raw,file.driver=host_device,file.filename=/dev/sdb,file.pr-manager=helper0 \
+          -device scsi-block,drive=hd
+
+----------------------------------
+Invoking :program:`qemu-pr-helper`
+----------------------------------
+
+QEMU provides an implementation of the persistent reservation helper,
+called :program:`qemu-pr-helper`.  The helper should be started as a
+system service and supports the following options:
+
+-d, --daemon              run in the background
+-q, --quiet               decrease verbosity
+-v, --verbose             increase verbosity
+-f, --pidfile=path        PID file when running as a daemon
+-k, --socket=path         path to the socket
+-T, --trace=trace-opts    tracing options
+
+By default, the socket and PID file are placed in the runtime state
+directory, for example :file:`/var/run/qemu-pr-helper.sock` and
+:file:`/var/run/qemu-pr-helper.pid`.  The PID file is not created
+unless :option:`-d` is passed too.
+
+:program:`qemu-pr-helper` can also use the systemd socket activation
+protocol.  In this case, the systemd socket unit should specify a
+Unix stream socket, like this::
+
+    [Socket]
+    ListenStream=/var/run/qemu-pr-helper.sock
+
+After connecting to the socket, :program:`qemu-pr-helper` can optionally drop
+root privileges, except for those capabilities that are needed for
+its operation.  To do this, add the following options:
+
+-u, --user=user           user to drop privileges to
+-g, --group=group         group to drop privileges to
+
+---------------------------------------------
+Multipath devices and persistent reservations
+---------------------------------------------
+
+Proper support of persistent reservation for multipath devices requires
+communication with the multipath daemon, so that the reservation is
+registered and applied when a path is newly discovered or becomes online
+again.  :program:`qemu-pr-helper` can do this if the ``libmpathpersist``
+library was available on the system at build time.
+
+As of August 2017, a reservation key must be specified in ``multipath.conf``
+for ``multipathd`` to check for persistent reservation for newly
+discovered paths or reinstated paths.  The attribute can be added
+to the ``defaults`` section or the ``multipaths`` section; for example::
+
+    multipaths {
+        multipath {
+            wwid   XXXXXXXXXXXXXXXX
+            alias      yellow
+            reservation_key  0x123abc
+        }
+    }
+
+Linking :program:`qemu-pr-helper` to ``libmpathpersist`` does not impede
+its usage on regular SCSI devices.
--- a/docs/qdev-device-use.txt
+++ b/docs/qdev-device-use.txt
@@ -366,17 +366,9 @@ bus=PCI-BUS,addr=DEVFN to control the PCI device address, as usual.
 === Host Device Assignment ===

 QEMU supports assigning host PCI devices (qemu-kvm only at this time)
-and host USB devices.
+and host USB devices.  PCI devices can only be assigned with -device:

-The old way to assign a host PCI device is
-
-    -pcidevice host=ADDR,dma=none,id=ID
-
-The new way is
-
-    -device pci-assign,host=ADDR,iommu=IOMMU,id=ID
-
-The old dma=none becomes iommu=off with -device.
+    -device vfio-pci,host=ADDR,id=ID

 The old way to assign a host USB device is
