Creating QEMU images for pytest

Today I’m creating QEMU images and configuring them using cloud-init, all to create a virtual machine image to run the pytest suite for “faas”, my implementation of a function-as-a-service platform.

Cloud-init and Ubuntu cloud image

I’m going to use Ubuntu cloud images for our virtual machines. They come with cloud-init already installed, and skip any desktop installation steps, perfect for the server application I’m developing.

You can download the release at https://cloud-images.ubuntu.com/. I’m using the “noble” release, and I’m running these VMs on an M1 Mac so I’ll select an arm64 architecture.

curl -fsSL -o ubuntu.img "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-arm64.img"

I noticed that if you boot up the VM without cloud-init configuration files, the initialization hangs after “Started systemd-timedated.service” at “Job systemd-networkd-wait-online.service/start running”. Not sure why, but let’s define the cloud-init files and bootstrap the VM.

There are two files to create:

user-data: “cloud-config” style YAML that declaratively configures the VM.
meta-data: per-instance details for configuration (instance-id, local-hostname)

Cloud-init has templating built in: key-values from a JSON object in an instance-data file are interpolated into a Jinja-templated user-data file to produce the final user-data configuration. I don’t have a complex configuration for this image, nor do I plan to generate variations on this user-data configuration, so I won’t template the file.

I’ll start by mentioning the meta-data file. This is required by cloud-init, and is usually fetched from a cloud-provider’s metadata system that manages on-platform VMs. Since we’re running this locally, I manually define it.

instance-id: faasvm-test-runner
local-hostname: faasvm

The instance-id is required. Cloud-init uses it to know if the virtual machine has been initialized already. If the instance-id matches what is stored on the VM’s disk at /var/lib/cloud/data/instance-id, cloud-init skips running per-instance modules from the user-data config (identified by “Module Frequency” in the docs).
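
A simplified model of that check can be sketched in Python. This is only my mental model, not cloud-init’s actual implementation: the function name is hypothetical, and only the cached-file idea comes from the description above.

```python
from pathlib import Path


def is_new_instance(instance_id: str, cache_file: Path) -> bool:
    """Return True when per-instance modules should run (hypothetical helper)."""
    if not cache_file.exists():
        # Nothing cached yet: treat this as the first boot of the image.
        return True
    cached = cache_file.read_text().strip()
    # A different id means the disk is being booted as a "new" instance.
    return cached != instance_id
```

On a real VM the cache file would be /var/lib/cloud/data/instance-id, so changing instance-id in meta-data is the lever for forcing per-instance modules to run again.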

Let’s turn to the user-data configuration. Cloud-init supports a lot of different user-data formats; we’re using the “cloud-config” format. The file MUST start with #cloud-config to identify the format to cloud-init’s parser.

#cloud-config
hostname: faasvm
package_update: true
package_upgrade: false
write_files:
  - path: /usr/local/bin/setup.sh
    permissions: '0755'
    content: |
      #!/bin/bash
      set -euo pipefail
      export DEBIAN_FRONTEND=noninteractive
      apt-get update
      apt-get install -y --no-install-recommends ca-certificates curl git \
        openssh-client python3 python3-pip python3-venv rsync
      curl -LsSf https://astral.sh/uv/install.sh | sh
      install -m 0755 -o root -g root /root/.local/bin/uv /usr/local/bin/uv
runcmd:
  - /usr/local/bin/setup.sh
  - rm -f /usr/local/bin/setup.sh
power_state:
  mode: poweroff
  delay: now
users:
  - name: faas_user
    sudo: ['ALL=(root) NOPASSWD: /sbin/poweroff']
    ssh_authorized_keys:

The VM is simple. It’s meant to run the pytest suite and that’s it, so I make sure to install uv (the package manager for Python) and its dependencies. I’ll also be using rsync to copy the project source code and test files into the VM, and ssh to send commands to the VM.

I’d like to call out a couple cloud-init modules: “power_state” and “users”.

“power_state” runs after all other modules have finished and handles shutdown and reboot. I’m initializing a virtual machine image so I can create overlay images for pytest runs, so I want the VM to exit immediately when it’s done initializing, signaling that the base image is ready. So I set it to “poweroff”, “now”, which will cause QEMU’s emulator process to exit without delay when cloud-init finishes.

“users” gives me an opportunity to define permissions for the VM user that I’ll run the tests as. I haven’t discussed the security policy or threat model for this project, but I’ll avoid giving the user membership in the “sudo” group, except for the ability to trigger a poweroff. Without that sudoers rule, running poweroff over SSH will provoke a password prompt for the password-less user. Finally, ssh_authorized_keys lets me specify the SSH public key I’ll use to connect to the running instance. I generate the keypair separately from the VM, and append the public key to the end of the user-data config when I create seed.iso with cloud-localds (touched on later). The instance-data file format I mentioned earlier is a great way to interpolate public keys into your cloud-init config, but sometimes messy is fun 😄.
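
Appending the key only works because ssh_authorized_keys is the last line of the file. Here is a small Python sketch of the same trick with a couple of sanity checks. The function name is hypothetical and this is not how I actually build the seed; it just spells out the assumptions the shell one-liner relies on.

```python
def append_ssh_key(user_data: str, public_key: str) -> str:
    """Append public_key as a YAML list item under a trailing ssh_authorized_keys."""
    if not user_data.startswith("#cloud-config"):
        raise ValueError("user-data must start with #cloud-config")
    lines = user_data.rstrip("\n").split("\n")
    last = lines[-1]
    if last.strip() != "ssh_authorized_keys:":
        raise ValueError("expected ssh_authorized_keys: as the last line")
    indent = len(last) - len(last.lstrip(" "))
    # List items sit two spaces deeper than the key they belong to.
    lines.append(" " * (indent + 2) + "- " + public_key.strip())
    return "\n".join(lines) + "\n"
```

If another key is ever added after ssh_authorized_keys in the config, the sketch fails loudly instead of producing broken YAML.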

qcow2 format

The Ubuntu image is in qcow2 format. That stands for “QEMU Copy-on-Write 2”, where copy-on-write means storage is not reserved ahead of time and is only allocated when needed. You can observe this by examining our Ubuntu image:

$ qemu-img info vm/ubuntu.img
image: vm/ubuntu.img
file format: qcow2
virtual size: 23.5 GiB (25232932864 bytes)
disk size: 591 MiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: vm/ubuntu.img
    protocol type: file
    file length: 591 MiB (620102144 bytes)
    disk size: 591 MiB

It has a virtual size and a disk size. The virtual size was set by running qemu-img resize ubuntu.img +20G on Canonical’s release distribution, but the image only takes 591MiB on-disk. So we can set total limits on VM storage use without paying the cost of pre-allocation on the host, which is not the case for raw disk images like .iso.

The old qcow v1 format is deprecated. qcow2 introduced a new header format and better snapshot features. There was an extension to qcow2 some years after it was introduced (originally called qcow3, identified with compat: 1.1 in the image info) that added optional header extensions, enabling compression, encryption, improved clustering, and easier snapshots.

qemu-img create

We have our base Ubuntu image and we’ve allowed it to grow to 23.5GiB of disk space on the host (the virtual size above), which will be more than enough for our test runner. Now it’s a question of cleanliness, or more precisely, organization and reproducibility. Our tests may have side-effects in cases of errors and normal development, which means the state of the Ubuntu virtual machine will be changed such that our tests operate within a different environment on each run. For example, faasd changes OS networking settings in order to route connections to containers. If the underlying OS settings are different between test runs, we introduce a source of non-determinism to our test suite, which may cause false-positives, false-negatives, or flaky tests.

How can we address this? Our goal is to run our test suite in the same environment each time, and build a set of assertions on top of this stable foundation to verify program correctness. One solution is to have a clean image, our ubuntu.img release from Canonical, and simply copy it (cp ubuntu.img test_run.img) before each test. Voilà, a fresh OS for each run. However, revisiting our distinction between virtual size and disk size, we’re copying the bytes on disk, so our disk usage will double from 591MiB to more than 1GiB. Plus we have to pay the time cost (IOPS) to write 591MiB to disk. It’s better than paying the pre-allocation cost for raw image formats, but can we do better?

Looking closely, there is no difference between the base image and the derived image. The name of our image format, copy-on-write, gives us a clue. What if we didn’t have to copy the whole image, and only copied the blocks we write to? Then our new image would only contain the blocks that differ from a read-only base image. These space-saving images are often called overlay images, and they’re enabled by the qcow2 format. Let’s take a look at how.
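
To make the idea concrete, here is a toy model of an overlay in Python. It is a deliberately simplified sketch, not how QEMU implements qcow2 (there are no lookup tables or refcounts here), and the class and method names are my own.

```python
CLUSTER_SIZE = 65536  # matches cluster_size in the qemu-img output above


class Image:
    """A toy image: a sparse map of cluster index -> cluster bytes."""

    def __init__(self, backing=None):
        self.clusters = {}
        self.backing = backing  # read-only image underneath, if any

    def read(self, index: int) -> bytes:
        if index in self.clusters:
            return self.clusters[index]
        if self.backing is not None:
            return self.backing.read(index)  # fall through to the base
        return bytes(CLUSTER_SIZE)  # unallocated clusters read as zeros

    def write(self, index: int, offset: int, data: bytes) -> None:
        if offset + len(data) > CLUSTER_SIZE:
            raise ValueError("write crosses a cluster boundary")
        # Copy-on-write: the first write copies the cluster from below,
        # then patches the copy. The backing image is never modified.
        cluster = bytearray(self.read(index))
        cluster[offset:offset + len(data)] = data
        self.clusters[index] = bytes(cluster)

    def allocated_bytes(self) -> int:
        return len(self.clusters) * CLUSTER_SIZE


base = Image()
base.write(0, 0, b"base")
overlay = Image(backing=base)
overlay.write(0, 4, b"-new")
print(overlay.read(0)[:8])        # b'base-new'
print(base.read(0)[:4])           # b'base'
print(overlay.allocated_bytes())  # 65536
```

The overlay only pays for the clusters it has written to, which is the whole trick.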

qcow2 header

Each qcow2 file has a clearly defined header format. For each overlay image to only contain the differences between itself and a base image (called a backing image by QEMU), it stores the path to the backing image in the file.

Bytes 8-15 describe the offset into the image where the backing image name is stored, and bytes 16-19 store the size of the backing file name in bytes. The theoretical maximum size of the filename is the maximum value of an unsigned 32-bit integer, which corresponds to a 4+ billion character filename, but the qcow2 specification limits this to 1023 bytes.
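
Those two fields are enough to pull the backing file name out of an image by hand. A minimal Python sketch, assuming the big-endian layout of the qcow2 header (magic in bytes 0-3, version in bytes 4-7); the function name is hypothetical.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"


def read_backing_file(path):
    """Return the backing file name stored in a qcow2 image, or None."""
    with open(path, "rb") as f:
        # >  big-endian, 4s magic, I version, Q backing offset, I backing size
        magic, _version, offset, size = struct.unpack(">4sIQI", f.read(20))
        if magic != QCOW2_MAGIC:
            raise ValueError("not a qcow2 image")
        if offset == 0:
            return None  # an offset of 0 means there is no backing file
        if size > 1023:
            raise ValueError("backing file name longer than 1023 bytes")
        f.seek(offset)
        # The name is stored without a terminating NUL byte.
        return f.read(size).decode("utf-8")
```

Running read_backing_file on an overlay should return the same string that qemu-img info reports as “backing file”.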

Let’s take a look at the qemu-img info for an overlay image. Here’s the command to make an overlay image from a backing image:

$ qemu-img create -f qcow2 -F qcow2 -b ubuntu.img test_run.img
$ qemu-img info test_run.img
image: test_run.img
file format: qcow2
virtual size: 23.5 GiB (25232932864 bytes)
disk size: 196 KiB
cluster_size: 65536
backing file: ubuntu.img
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: test_run.img
    protocol type: file
    file length: 192 KiB (197120 bytes)
    disk size: 196 KiB

The overlay has two extra key values relative to our base ubuntu image: “backing file”, and “backing file format”. Also, even though this image is practically a copy, the disk size is dramatically lower than the base image: 196KiB vs. 591MiB, or more than 3000x smaller than a naive copy would be (note the virtual size is the same).
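
The numbers are easy to check. A quick bit of arithmetic in Python, using only the sizes reported by qemu-img info above (the helper name is my own):

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def size_ratio(larger: int, smaller: int) -> float:
    """How many times bigger `larger` is than `smaller`."""
    return larger / smaller


virtual_size = 25232932864      # bytes, "virtual size" of both images
base_disk_size = 591 * MIB      # "disk size" of ubuntu.img
overlay_disk_size = 196 * KIB   # "disk size" of the fresh overlay

print(virtual_size / GIB)                                     # 23.5
print(round(size_ratio(base_disk_size, overlay_disk_size)))   # 3088
```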

This comes with a tradeoff though: the overlay image needs a stable reference to the base image. If we rename our base image

mv ubuntu.img foo.img

And try to run our overlay image

qemu-system-aarch64 -bios "$(brew --prefix qemu)/share/qemu/edk2-aarch64-code.fd" -accel hvf -cpu host -machine virt,highmem=on -smp 4 -m 4096 -display none -serial stdio -drive if=virtio,format=qcow2,file=test_run.img -nic user,model=virtio-net-pci

qemu-system-aarch64: -drive if=virtio,format=qcow2,file=test_run.img: Could not open backing file: Could not open 'ubuntu.img': No such file or directory

We get an error that QEMU can’t find the backing file. So, anytime we make overlay images, we need the base image to be accessible by QEMU to run them.

cloud-localds and seed images

The overlay image is backed by the Ubuntu release image, and I’ll apply the project’s specific configuration to this overlay. I can then re-use the initialized test runner overlay image for test run instance images we can discard or debug.
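
Creating those throwaway per-run images is a natural job for the test harness. A minimal sketch in Python, assuming qemu-img is on the PATH; the function names and the run-1.img file name are hypothetical, and I use -F for the backing format, which is the flag documented in the qemu-img manual.

```python
import subprocess


def overlay_create_cmd(base: str, overlay: str) -> list:
    """Build the qemu-img command that creates `overlay` on top of `base`."""
    return [
        "qemu-img", "create",
        "-f", "qcow2",   # format of the new overlay
        "-F", "qcow2",   # format of the backing image
        "-b", base,      # the backing image itself
        overlay,
    ]


def create_overlay(base: str, overlay: str) -> None:
    # check=True raises CalledProcessError if qemu-img exits non-zero.
    subprocess.run(overlay_create_cmd(base, overlay), check=True)


print(" ".join(overlay_create_cmd("test_runner.img", "run-1.img")))
# qemu-img create -f qcow2 -F qcow2 -b test_runner.img run-1.img
```

Deleting run-1.img after the tests throws away every change the run made, while test_runner.img stays untouched.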

How do I initialize the overlay image? I’ve already described the desired configuration of the VM in the user-data section above, but I haven’t applied it to any image. In order to do that, I have to make the user-data and meta-data files available to cloud-init at boot-up time. This strategy has two steps. First, store the configuration files inside a separate raw image. Second, attach that raw image when you boot the VM for the first time using QEMU. Cloud-init should then read the files, perform initialization, and shut down the VM, leaving it in the desired state.

The raw image containing our user-data and meta-data files is typically called a seed image. This image contains static read-only files and is not meant to be written to, so there is no point in using a qcow2 format.

cloud-localds is a convenient utility for making seed images for cloud-init. It’s distributed and maintained by Canonical. The utility is simple, mostly handling different permutations of options: what disk format for the image, the filesystem choice, or specifying a remote metadata provider. My need is simple and the defaults work well, so I will focus on the core functionality.

cloud-localds creates a temporary directory and copies over the user-data and meta-data files. It then uses genisoimage to create our raw image using the following command:

# "$img" is the final image file, it will contain the files specified by
# the ${files[@]} array
genisoimage \
  -output "$img" \
  -volid cidata \
  -joliet -rock \
  "${files[@]}" # ("$tmp_dir/user-data", "$tmp_dir/meta-data")

The key option is -volid cidata, which sets the ISO volume label to cidata. Cloud-init uses this label to find the configuration files at boot time: it enumerates the block devices and checks their labels. ISO9660 images store the label in the primary volume descriptor, which starts at sector 16 of the image (sectors are fixed data segments 2048 bytes in length). -joliet and -rock are extensions to the ISO9660 format that are not enabled by default for compatibility reasons, even though they’re both over thirty years old at this point. They are not strictly necessary for this use-case, but nothing is lost by enabling the flags.

(Fun fact, the -rock extension is called “Rock Ridge Interchange Protocol”, and is named after the town in Mel Brooks’ “Blazing Saddles”)

Now that I’ve explained how the cloud-localds command works, I’ll run it to create the seed.iso image. cloud-localds only runs on Linux and I’m on macOS, so I will run it in a Docker container and mount the needed files.

# by default docker mounts non-existent paths as directories, so create
# an empty file to force a file mount of our final image
$ touch seed.iso 
$ docker run \
    -v ./seed.iso:/seed.iso \
    -v ./user-data:/user-data \
    -v ./meta-data:/meta-data \
    ubuntu:latest sh -c "
      set -e
      cp /user-data /user-data-copy
      echo '      - $(cat ssh-key.pub)' >> /user-data-copy
      apt-get update && apt-get install -y -qq cloud-image-utils
      cloud-localds /seed.iso /user-data-copy /meta-data
    "
# notice my quick-and-easy way of injecting the ssh public key

Now take a look at the volume label of the seed.iso.

# the label is ASCII text stored in sector 16, where a sector is 2048
# bytes in length
$ hexdump -C -s $((16*2048)) -n 256 seed.iso
00008000  01 43 44 30 30 31 01 00  4c 49 4e 55 58 20 20 20  |.CD001..LINUX   |
00008010  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00008020  20 20 20 20 20 20 20 20  63 69 64 61 74 61 20 20  |        cidata  |
00008030  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
00008040  20 20 20 20 20 20 20 20  00 00 00 00 00 00 00 00  |        ........|
00008050  b7 00 00 00 00 00 00 b7  00 00 00 00 00 00 00 00  |................|
00008060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00008070  00 00 00 00 00 00 00 00  01 00 00 01 01 00 00 01  |................|
00008080  00 08 08 00 0a 00 00 00  00 00 00 0a 14 00 00 00  |................|
00008090  00 00 00 00 00 00 00 16  00 00 00 00 22 00 1c 00  |............"...|
000080a0  00 00 00 00 00 1c 00 08  00 00 00 00 08 00 46 01  |..............F.|
000080b0  01 00 00 00 00 02 00 00  01 00 00 01 01 00 20 20  |..............  |
000080c0  20 20 20 20 20 20 20 20  20 20 20 20 20 20 20 20  |                |
*
00008100

cidata is present! Now cloud-init will be able to identify our device and find the user-data and meta-data within.
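
The same lookup can be done programmatically. A short Python sketch that reads the label straight out of the primary volume descriptor; the offsets match the hexdump above (type byte and “CD001” at the start of sector 16, the label 40 bytes in), and the function name is hypothetical.

```python
SECTOR_SIZE = 2048
PVD_SECTOR = 16  # the primary volume descriptor lives in sector 16


def read_volume_label(path) -> str:
    """Return the ISO9660 volume label of the image at `path`."""
    with open(path, "rb") as f:
        f.seek(PVD_SECTOR * SECTOR_SIZE)
        pvd = f.read(SECTOR_SIZE)
    # Byte 0 is the descriptor type (1 = primary), bytes 1-5 spell "CD001".
    if len(pvd) < 72 or pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("no ISO9660 primary volume descriptor found")
    # Bytes 40-71 hold the volume identifier, padded with spaces.
    return pvd[40:72].decode("ascii").rstrip(" ")
```

For the seed image built above, read_volume_label("seed.iso") should return the string cidata.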

qemu-system-aarch64

Now I have a seed image containing our cloud-init configuration, and the overlay image backed by the Ubuntu image. It’s time to initialize the operating system and produce our base image for test runs. The hard part has been done, now it’s time to boot the VM and let cloud-init do its thing.

The following command starts the QEMU emulator process:

qemu-system-aarch64 \
    -accel hvf \
    -cpu host \
    -machine virt \
    -smp 4 \
    -m 4G \
    -drive file=seed.iso,if=none,format=raw,readonly=on,id=cidata \
    -device virtio-blk-pci,drive=cidata \
    -drive file=test_runner.img,if=none,format=qcow2,id=hd0 \
    -device virtio-blk-pci,drive=hd0 \
    -device virtio-net-pci,netdev=net0 \
    -netdev user,id=net0 \
    -nographic \
    -bios "$(brew --prefix qemu)/share/qemu/edk2-aarch64-code.fd"

Let’s break this command down line-by-line. First, the qemu-system-aarch64 command starts the 64-bit ARM system emulator. My M1 Mac has an ARM processor, so selecting the same instruction set gives me a chance at running code in the VM at near-native performance, as QEMU won’t have to translate each instruction coming from the VM (the guest) to the proper instruction set on my Mac (the host).

To that end, the option -accel hvf causes QEMU to use Apple’s own Hypervisor framework and take advantage of hardware-based virtualization, which is much faster than a software-based one. And -cpu host exposes CPU details from the host to the guest, which means guest software can take advantage of all ISA features from the host if it’s already been optimized with them in mind.

Next, the -machine virt flag indicates the guest OS will be run as a generic virtual machine whose emulated physical characteristics are up to the user to specify. Otherwise, QEMU will limit the hardware to what’s possible on the specific machine platform chosen, for example, emulating a RaspberryPi’s hardware exactly. This flag sets the option highmem=on by default, which allows you to grant more RAM to the virtual machine by expanding the memory address-space to >4G.

So let’s specify some of those physical characteristics. First, -smp 4 grants 4 CPU cores, and -m 4G grants 4 gibibytes of RAM, to the virtual machine. Next, attach the drives. -drive file=seed.iso,if=none,format=raw,readonly=on,id=cidata is the “Backend” specification for a drive, and -device virtio-blk-pci,drive=cidata is the “Frontend” specification of how that drive is presented to the guest OS. In this case, our seed.iso is a raw image that should only be read, and we expose it to the guest through a “virtio-blk-pci” interface. VirtIO is a high-performance para-virtualized device interface that informs device drivers they’re being virtualized, giving them tools to cooperate with the hypervisor for better performance. I also specify -drive file=test_runner.img,if=none,format=qcow2,id=hd0, the OS image, and attach it in the same way with -device virtio-blk-pci,drive=hd0. Same interface, different backing format. Remember, raw images are better for performance, but qcow2 allows virtual machine disk size to grow on-demand, which is better for experimentation.

Next, networking. -device virtio-net-pci,netdev=net0 is the frontend: the virtual network card presented to the guest. -netdev user,id=net0 is the backend it attaches to, which is QEMU’s user-mode networking feature. QEMU creates a network for the guest and internally manages a NAT server for it. This adds overhead. QEMU now has to intercept packets, rewrite headers, and manage a routing table, all in userspace. If I wanted to improve performance, I would use TAP/bridge networking, but that feature is limited on macOS. (A TAP device is a virtual ethernet cable, while a bridge is the layer 2 switch between that virtual cable and the host’s local area network).

We’re getting close to the end. -nographic is a short-hand for -display none and -serial mon:stdio. -display none does not mount a graphical device in the virtual machine, while -serial mon:stdio multiplexes two serial interfaces of data: stdin/stdout from the guest console, and the QEMU monitor console. The monitor console is a serial interface to QEMU that lets you send commands to control and inspect the virtual machine. We’ll touch more on that in the debugging section.

Finally, we specify the firmware: -bios "$(brew --prefix qemu)/share/qemu/edk2-aarch64-code.fd". This is an EDK2 UEFI firmware build for 64-bit ARM, bundled with QEMU’s Homebrew release. It takes care of booting the guest OS. Without it, the virtual machine wouldn’t know where to start, or what hardware it has available.
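
For driving this from a test harness, the same flags can be assembled in Python. A minimal sketch: the function name and parameters are my own for illustration, while the flag values are the ones explained above. The optional hostfwd rule forwards a host port to the guest’s SSH port.

```python
from typing import List, Optional


def qemu_args(
    image: str,
    firmware: str,
    seed_iso: Optional[str] = None,
    ssh_port: Optional[int] = None,
) -> List[str]:
    """Build the qemu-system-aarch64 argument list described above."""
    netdev = "user,id=net0"
    if ssh_port is not None:
        # Forward host port `ssh_port` to port 22 inside the guest.
        netdev += f",hostfwd=tcp::{ssh_port}-:22"
    args = [
        "qemu-system-aarch64",
        "-accel", "hvf", "-cpu", "host", "-machine", "virt",
        "-smp", "4", "-m", "4G", "-nographic",
        "-bios", firmware,
        "-drive", f"file={image},if=none,format=qcow2,id=hd0",
        "-device", "virtio-blk-pci,drive=hd0",
        "-netdev", netdev,
        "-device", "virtio-net-pci,netdev=net0",
    ]
    if seed_iso is not None:
        # The seed image is only needed on the first, initializing boot.
        args += [
            "-drive", f"file={seed_iso},if=none,format=raw,readonly=on,id=cidata",
            "-device", "virtio-blk-pci,drive=cidata",
        ]
    return args
```

The list can be handed to subprocess.Popen as-is, which avoids shell quoting problems with the long -drive values.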

How to debug QEMU

Try to get logs and block the VM from shutting itself down:

-D qemu-log.txt -no-shutdown

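Enable the QEMU monitor on a Unix socket (server,nowait makes QEMU create the socket itself and keep booting instead of waiting for a client to connect):

qemu-system-aarch64 \
  -serial stdio \
  -monitor unix:qemu-monitor.sock,server,nowait \
  ...
nc -U qemu-monitor.sock
# run "info status" to see current VM state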

threat model

A Function-as-a-Service runs code on behalf of customers, and must control access to the provider’s own system, while monitoring its behavior for reliability and security.

driving pytest over ssh

This one’s easy. Start the virtual machine.

$ qemu-system-aarch64 \
    -bios "$(brew --prefix qemu)/share/qemu/edk2-aarch64-code.fd" \
    -accel hvf \
    -cpu host \
    -machine virt \
    -smp 4 \
    -m 4G \
    -nographic \
    -drive if=virtio,format=qcow2,file=test_runner.img \
    -nic user,model=virtio-net-pci,hostfwd=tcp::2222-:22 \
    &>/dev/null \
    & # run in background

Define the ssh command once, so that both rsync and the test run can use it.

$ ssh="ssh \
    -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null \
    -o ConnectTimeout=1 \
    -o ConnectionAttempts=1 \
    -o LogLevel=quiet \
    -p 2222 \
    -i ./ssh-key"

Send your software and test files into the virtual machine user’s home directory.

$ rsync \
    -e "$ssh" \
    -qa \
    --include "pyproject.toml" \
    --include "pytest.toml" \
    --include "uv.lock" \
    --include "src/***" \
    --include "tests/***" \
    --exclude "*" \
    "$ROOT/" faas_user@localhost:~

And run pytest.

$ $ssh faas_user@localhost 'uv run pytest && sudo poweroff'

Then just wait for the background QEMU process to exit.

$ wait
Booting vm
checking for port readiness...Done (0m0.120s)
checking for ssh availability......Done (0m11.687s)
Running pytest...Done (0m2.946s)
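
The “checking for port readiness” step can be sketched in Python as a small polling loop. This is a hedged illustration rather than the script that produced the output above; the function name and the default timings are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.1) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` passes."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With user-mode networking, QEMU itself listens on the forwarded host port, so wait_for_port("localhost", 2222) can succeed before sshd is up in the guest. That is presumably why the output shows a separate, much slower check for ssh availability.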