devlog Porting FreeBSD to QEMU MicroVM

I take notes haphazardly while doing development; when real debugging is happening I fill folded pieces of paper with the names of functions and variables as they pop into notice and fade away again. During the VPP port I thought it might be nice both to have better notes and to share this process more.

I am constantly on the edge about talking about what I am doing. I've internalised that "research shows that people who share their plans are less likely to complete them" and I'm aware that people jump at news rather than actual facts. I don't want anyone to get too excited about in-progress things, but I also don't want to keep everything to myself.

Reading about people doing ports of Linux and bringing up new stuff in OpenBSD via undeadly.org was a big draw to me when I was coming up. I would attribute people writing about their doings as one of the biggest factors leading to the work I currently do.

This is an experiment in sharing something close to development notes, but written with an audience in mind. I do have other notes like this for other projects. I think an ideal devlog is like Joey Hess's old daily devblog, but for me the balance is probably closer to writing as I go and publishing at milestones.


I read Rob's article on quiz, an excellent tool for speeding up Linux kernel development, and learned about QEMU MicroVM. It seemed like a great way to combine a crazy idea with a useful idea.

QEMU MicroVM offers a QEMU-based implementation of something like the Amazon Firecracker microvm that Colin Percival ported FreeBSD to over the last few years.

Colin has written a lot about the port to Firecracker and given a presentation a few times. These cover some of the technical description of Firecracker, how it works and how it relates to 'normal' machines. Colin's 2023 BSDCan talk is the basis for pretty much everything I know about Firecracker.

I have been thinking about an idea that requires FreeBSD booting super quickly for a couple of months, and MicroVM has been the focus of that speculation for the last month. I keep coming up with more reasons to do some fun-looking hacking.

Running MicroVM

While I tried to get Rob's stuff going on Ubuntu (and then failed to install Debian at all), maybe a good way to start is to just wade in.

qemu-system-x86_64 requires quite a few arguments to run a MicroVM machine; I find it easiest to bundle these into a script that I can pass the kernel and disk to:

#!/bin/sh

kernel=$1
disk=$2

memory=512m
cores=4
netif=tap0

qemu-system-x86_64 -M microvm                                   \
    -cpu max                                                    \
    -m ${memory}                                                \
    -smp ${cores}                                               \
    -kernel ${kernel}                                           \
    -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/vda"     \
    -nodefaults                                                 \
    -no-user-config                                             \
    -nographic                                                  \
    -serial stdio                                               \
    -drive id=test,file=${disk},format=raw,if=none              \
    -device virtio-blk-device,drive=test

Building a Firecracker FreeBSD kernel is straightforward; from a FreeBSD tree run:

$ make -j 16 -s buildkernel KERNCONF=FIRECRACKER TARGET=amd64

That will give you a kernel to feed to QEMU like so:

[tj@computer] $ sh ~/code/scripts/qemu/microvm.sh /usr/obj/usr/home/tj/code/freebsd/worktrees/microvm/amd64.amd64/sys/FIRECRACKER/kernel ~/vms/test.raw
SeaBIOS (version rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org)
Booting from ROM..GDB: no debug ports present
KDB: debugger backends: ddb
KDB: current backend: ddb
---<<BOOT>>---
Copyright (c) 1992-2024 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 15.0-CURRENT #0 thj/microvm-n269722-8ceac8e13dcc: Fri Apr 26 17:55:03 BST 2024
tj@displacementactivity:/usr/obj/usr/home/tj/code/freebsd/worktrees/microvm/amd64.amd64/sys/FIRECRACKER amd64
FreeBSD clang version 18.1.3 (https://github.com/llvm/llvm-project.git llvmorg-18.1.3-0-gc13b7485b879)
WARNING: WITNESS option enabled, expect reduced performance.
CPU: QEMU TCG CPU version 2.5+ (K8-class CPU)
Origin="AuthenticAMD"  Id=0x663  Family=0x6  Model=0x6  Stepping=3
Features=0x1fc3fbfd<FPU,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,ACPI,MMX,FXSR,SSE,S>
Features2=0xfed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F>
AMD Features=0xec500800<SYSCALL,NX,MMX+,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
AMD Features2=0x177<LAHF,CMP,SVM,CR8,ABM,SSE4A,Prefetch>
Structured Extended Features=0x21dc43a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,MPX,RDSEED,ADX,SMAP,<b22>,CLFLUSH>
Structured Extended Features2=0x8041021c<UMIP,PKU,OSPKE,VAES,LA57,RDPID>
Structured Extended Features3=0x10<FSRM>
XSAVE Features=0x5<XSAVEOPT,XINUSE>
AMD Extended Feature Extensions ID EBX=0x204<XSaveErPtr,WBNOINVD>
SVM: NP,NAsids=16
Hypervisor: Origin = "TCGTCGTCGTCG"
real memory  = 536858624 (511 MB)
avail memory = 492081152 (469 MB)
MPTable: <BOCHSCPU 0.1         >
Event timer "LAPIC" quality 600
panic: TSC not initialized
cpuid = 0
time = 1
KDB: stack backtrace:
db_fetch_ksymtab() at db_fetch_ksymtab+0x17b/frame 0xffffffff81404dc0
vpanic() at vpanic+0x135/frame 0xffffffff81404ef0
panic() at panic+0x43/frame 0xffffffff81404f50
lapic_init() at lapic_init+0x4a4/frame 0xffffffff81404f70
mptable_pci_host_res_init() at mptable_pci_host_res_init+0x856/frame 0xffffffff81404f90
lapic_ipi_free() at lapic_ipi_free+0x152f/frame 0xffffffff81404fa0
mi_startup() at mi_startup+0x1c8/frame 0xffffffff81404ff0
KDB: enter: panic
[ thread pid 0 tid 0 ]
Stopped at      kdb_enter+0x33: movq    $0,0x968d12(%rip)
db>

Incredible progress! I was expecting this to be much more of a fight to get anything from the Firecracker image. Colin spoke about binary searching by looking at system load in the virtual machine host dashboard, but here we have serial output immediately. To check that this wasn't a fluke I also tried a generic kernel:

[tj@computer] $ sh ~/code/scripts/qemu/microvm.sh /usr/obj/usr/home/tj/code/freebsd/worktrees/microvm/amd64.amd64/sys/GENERIC/kernel ~/vms/test.raw
SeaBIOS (version rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org)
Booting from ROM..

It just sat there. Good news, we have a solid starting point with the Firecracker kernel.

First steps

The panic message in our first boot is "TSC not initialized", a quick "grep -R" away in local_apic.c:547.

#ifdef SMP
#define LOOPS   1000
        /*
         * Calibrate the busy loop waiting for IPI ack in xAPIC mode.
         * lapic_ipi_wait_mult contains the number of iterations which
         * approximately delay execution for 1 microsecond (the
         * argument to lapic_ipi_wait() is in microseconds).
         *
         * We assume that TSC is present and already measured.
         * Possible TSC frequency jumps are irrelevant to the
         * calibration loop below, the CPU clock management code is
         * not yet started, and we do not enter sleep states.
         */
        KASSERT((cpu_feature & CPUID_TSC) != 0 && tsc_freq != 0,
            ("TSC not initialized"));
        if (!x2apic_mode) {
                r = rdtsc();

Commenting out the KASSERT lets us advance as far as configuring uart0:

...
Event timer "LAPIC" quality 600
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and th.
ioapic0: Assuming intbase of 0
ioapic0 <Version 2.0> irqs 0-23
random: entropy device external interface
aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256>
cpu0
isa0: <ISA bus>
orm0: <ISA Option ROM> at iomem 0xef000-0xeffff pnpid ORM0000 on isa0
uart0: <16550 or compatible> at port 0x3f8 irq 4 flags 0x10 on isa0


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x20
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff808f5190
stack pointer           = 0x28:0xffffffff81404e20
frame pointer           = 0x28:0xffffffff81404e40
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 0 (swapper)
rdi: 0000000000000020 rsi: 0000000000000000 rdx: 00000000000003f9
rcx: 0000000000000000  r8: 0000000000000000  r9: 0000000000000001
rax: ffffffff80e49040 rbx: 0000000000000020 rbp: ffffffff81404e40
r10: 0000000000010000 r11: 0000000000000001 r12: fffff800024f4e64
r13: 0000000000000001 r14: 00000000000000c8 r15: fffff800025b9858
trap number             = 12
panic: page fault
cpuid = 0
time = 1
KDB: stack backtrace:
db_fetch_ksymtab() at db_fetch_ksymtab+0x17b/frame 0xffffffff81404af0
vpanic() at vpanic+0x135/frame 0xffffffff81404c20
panic() at panic+0x43/frame 0xffffffff81404c80
trap() at trap+0xd2b/frame 0xffffffff81404ce0
trap() at trap+0xdd0/frame 0xffffffff81404d50
calltrap() at calltrap+0x8/frame 0xffffffff81404d50
--- trap 0xc, rip = 0xffffffff808f5190, rsp = 0xffffffff81404e20, rbp = 0xffffffff81404e40 ---
pvclock_get_timecount() at pvclock_get_timecount+0x10/frame 0xffffffff81404e40
xen_delay() at xen_delay+0x1d/frame 0xffffffff81404e60
ns8250_bus_attach() at ns8250_bus_attach+0x3e2/frame 0xffffffff81404e90
uart_bus_attach() at uart_bus_attach+0x184/frame 0xffffffff81404ed0
device_attach() at device_attach+0x3aa/frame 0xffffffff81404f10
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff81404f40
isa_probe_children() at isa_probe_children+0x23a/frame 0xffffffff81404fa0
mi_startup() at mi_startup+0x1c8/frame 0xffffffff81404ff0
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x33: movq    $0,0x968d12(%rip)
db>

But then we take a page fault in pvclock_get_timecount(), so it looks like we can't just skip past that first KASSERT. What should we be doing the first time around?

I forced on bootverbose in init_main.c to see if that would give me any more clues. Our error is coming from having a TSC, but none of the TSC reads being valid. Along with there being a TSC we need another variable set to tell us how to read the TSC for calibration.

The source of our error is really early in machine-dependent code, but it is caught much later. This means we can print how things are later on in boot, but not when they fail.

print_hypervisor_info:2639 vm_guest 0x1 tsc_freq 0x0 tsc 0x10 hv_high 0x40000001 cpu_high 0xd

I added a print in identcpu.c at print_hypervisor_info as that was easy to see in the message buffer.

static int
tsc_freq_cpuid_vm(void)
{
        u_int regs[4];

        if (vm_guest == VM_GUEST_NO)
                return (false);
        if (hv_high < 0x40000010)
                return (false);
        do_cpuid(0x40000010, regs);
        tsc_freq = (uint64_t)(regs[0]) * 1000;
        tsc_early_calib_exact = 1;
        return (true);
}

...

/*
 * Calculate TSC frequency using information from the CPUID leaf 0x15 'Time
 * Stamp Counter and Nominal Core Crystal Clock'.  If leaf 0x15 is not
 * functional, as it is on Skylake/Kabylake, try 0x16 'Processor Frequency
 * Information'.  Leaf 0x16 is described in the SDM as informational only, but
 * we can use this value until late calibration is complete.
 */
static bool
tsc_freq_cpuid(uint64_t *res)
{
        u_int regs[4];

        if (cpu_high < 0x15)
                return (false);
        do_cpuid(0x15, regs);
        if (regs[0] != 0 && regs[1] != 0 && regs[2] != 0) {
                *res = (uint64_t)regs[2] * regs[1] / regs[0];
                return (true);
        }
...

The calibration functions look like this: tsc_freq_cpuid_vm checks vm_guest (0x1 for us, or VM_GUEST_VM) and then the value of hv_high, while the default tsc_freq_cpuid looks at cpu_high.

With hv_high as 0x40000001 and cpu_high as 0x0D we are failing the checks in both of the obvious tsc_freq functions. Why?
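
To make the failure concrete, here is a minimal user-space sketch of the two guards with the printed values plugged in. The helper names (hv_leaf_available, cpu_leaf_available, khz_to_hz) are made up for illustration and are not kernel functions. Reading eax as kHz is an assumption based on the multiply by 1000 in tsc_freq_cpuid_vm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Leaf the hypervisor must advertise for tsc_freq_cpuid_vm() to proceed. */
#define HV_TSC_LEAF     0x40000010u
/* Leaf the CPU must advertise for tsc_freq_cpuid() to proceed. */
#define CPU_TSC_LEAF    0x15u

/* Hypothetical helper: mirrors the hv_high guard in tsc_freq_cpuid_vm(). */
static bool
hv_leaf_available(uint32_t hv_high)
{
        return (hv_high >= HV_TSC_LEAF);
}

/* Hypothetical helper: mirrors the cpu_high guard in tsc_freq_cpuid(). */
static bool
cpu_leaf_available(uint32_t cpu_high)
{
        return (cpu_high >= CPU_TSC_LEAF);
}

/* Assumption: eax of leaf 0x40000010 is in kHz, so scale it to Hz. */
static uint64_t
khz_to_hz(uint32_t eax)
{
        return ((uint64_t)eax * 1000);
}
```

With hv_high = 0x40000001 and cpu_high = 0xd both helpers return false, so neither of these two functions reaches its do_cpuid call and tsc_freq is left at 0, which is what the print showed.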

To figure that out we need to look at where both of these values come from.

do_cpuid(0, regs);
cpu_high = regs[0];     /* eax */

hv_high is set in identify_hypervisor_cpuid_base in sys/x86/x86/identcpu.c with this block:

/*
 * If this is the first entry or we found a
 * specific hypervisor, record the base, high value,
 * and vendor identifier.
 */
if (vm_guest != prev_vm_guest || leaf == 0x40000000) {
        hv_base = leaf;
        hv_high = regs[0];
        ((u_int *)&hv_vendor)[0] = regs[1];
        ((u_int *)&hv_vendor)[1] = regs[2];
        ((u_int *)&hv_vendor)[2] = regs[3];
        hv_vendor[12] = '\0';

        /*
         * If we found a specific hypervisor, then
         * we are finished.
         */
        if (vm_guest != VM_GUEST_VM &&
            /*
             * Xen and other hypervisors can expose the
             * HyperV signature in addition to the
             * native one in order to support Viridian
             * extensions for Windows guests.
             *
             * Do the full cpuid scan if HyperV is
             * detected, as the native hypervisor is
             * preferred.
             */
            vm_guest != VM_GUEST_HV)
                break;
}

This comment about VM_GUEST_VM is suspect. We are running as VM_GUEST_VM, and the comment implies that this value is an early detection result but an error to actually try and use. Other comments in identify_hypervisor_cpuid_base point to lkml threads and a VMware knowledge base article (KB1009458):

Testing the CPUID hypervisor present bit

Intel and AMD CPUs have reserved bit 31 of ECX of CPUID leaf 0x1 as the
hypervisor present bit. This bit allows hypervisors to indicate their presence
to the guest operating system. Hypervisors set this bit and physical CPUs (all
existing and future CPUs) set this bit to zero. Guest operating systems can
test bit 31 to detect if they are running inside a virtual machine.

Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF for
software use. Hypervisors can use these leaves to provide an interface to pass
information from the hypervisor to the guest operating system running inside a
virtual machine. The hypervisor bit indicates the presence of a hypervisor and
that it is safe to test these additional software leaves. VMware defines the
0x40000000 leaf as the hypervisor CPUID information leaf. Code running on a
VMware hypervisor can test the CPUID information leaf for the hypervisor
signature. VMware stores the string "VMwareVMware" in EBX, ECX, EDX of CPUID
leaf 0x40000000.

Testing the virtual BIOS DMI information and the hypervisor port

Apart from the CPUID-based method for VMware virtual machine detection,
VMware also provides a fallback mechanism for the following reasons:

This CPUID-based technique will not work for guest code running at CPL3
when VT/AMD-V is not available or not enabled.  The hypervisor present bit
and hypervisor information leaf are only defined for products based on
VMware hardware version 7.

That is really helpful. We aren't running under a real hypervisor; the Hypervisor: Origin = "TCGTCGTCGTCG" line in the message buffer indicates that we are running on QEMU's Tiny Code Generator.
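
The two checks the KB article describes can be sketched as pure functions over register values, which keeps the example portable (no cpuid instruction is executed). The names HV_PRESENT_BIT, hypervisor_present and hv_signature are hypothetical, and the byte order assumes the usual x86 convention of least significant byte first within each register.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 0x1, ECX bit 31: the "hypervisor present" bit. */
#define HV_PRESENT_BIT  (1u << 31)

/* Hypothetical helper: test the hypervisor present bit in leaf 0x1 ECX. */
static bool
hypervisor_present(uint32_t leaf1_ecx)
{
        return ((leaf1_ecx & HV_PRESENT_BIT) != 0);
}

/*
 * Hypothetical helper: unpack the 12-byte signature that leaf 0x40000000
 * returns in EBX, ECX, EDX (in that order). out needs room for 13 bytes.
 */
static void
hv_signature(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
        const uint32_t regs[3] = { ebx, ecx, edx };

        for (int i = 0; i < 12; i++)
                out[i] = (char)((regs[i / 4] >> (8 * (i % 4))) & 0xff);
        out[12] = '\0';
}
```

Feeding in the register values 0x54474354, 0x43544743 and 0x47435447 yields "TCGTCGTCGTCG", the same string as the Hypervisor: Origin line in the message buffer.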

This is a fork where I could look at a bunch of stuff: QEMU, the Linux kernel, or a running system on Linux with KVM. I don't really want to do any of these, but I already have a clone of QEMU so I'll start by looking at how MicroVM is implemented.

QEMU Microvm

Microvm is implemented in qemu/hw/i386/ by microvm.c, microvm-dt.c (for fw_cfg) and acpi-microvm.c (not sure about that one).

Reading through these didn't help.

Neither did searching the Linux GitHub repo.

Maybe I can get an Ubuntu kernel to boot to failure. Based on these instructions I grabbed the linked kernel and threw it into my script, which didn't work. Not sure whether the enable-kvm option was the issue, I booted a test machine into Ubuntu and ran through their entire example as is, which worked fine.

Then I tried my kernel in their script and their kernel in my script and neither was happy enough to give me any output. Of these three options none have really gotten me anywhere.

Going back to tsc_freq_cpuid_vm it does this:

        if (hv_high < 0x40000010)
                return (false);
        do_cpuid(0x40000010, regs);

Skipping this check gets us to the same pvclock_get_timecount fault as when we skipped the assert entirely. This is trying to read a memory address that isn't there because we aren't Xen, and yes we aren't Xen. Why do we think we are Xen?

isxen (pv.c) runs early on and detects if we are running under Xen. If we are, then the init_ops table is swapped from the default one to the Xen one (xen_pvh_init_ops), somewhere (hammer_time_xen).

Time for a break to try and blinken some lights, which turned out to be doing some recycling instead.

On these breaks I have found it helps a lot to leave a note somewhere; on return I can try and build and the note will piss off the compiler.

/usr/home/tj/code/freebsd/worktrees/microvm/sys/dev/xen/timer/xen_timer.c:165:2: error: use of undeclared identifier 'look'
  165 |         look here buddy
      |         ^
1 error generated.

I wasn't sure how we were getting to hammer_time_xen, but during the break I figured it is probably being driven by the kernel config.

# Xen HVM Guest Optimizations
# NOTE: XENHVM depends on xenpci and xentimer.
# They must be added or removed together.
# NOTE: These are present in FIRECRACKER because the PVH boot method
# originates from Xen; once that code is untangled these can be removed.
options         XENHVM                  # Xen HVM kernel infrastructure
device          xenpci                  # Xen HVM Hypervisor services driver
device         xentimer                # Xen x86 PV timer device

The Firecracker kernel config comes with lots of potentially helpful comments, and in the Xen section we have xentimer. Commenting this out and fixing up some parts of the Xen init_ops lets us advance to a fascinating panic:

Statistical lapic calibration failed!  Clocks might be ticking at variable rates.                                     
Falling back to slow lapic calibration.                                                                               
lapic: Divisor 2, Frequency 104591 Hz                                                                                 
Timecounters tick every 10.000 msec                                                                                   
lo0: bpf attached                                                                                                     
vlan: initialized, using hash tables with chaining                                                                    
IPsec: Initialized Security Association Processing.                                                                   
tcp_init: net.inet.tcp.tcbhashsize auto tuned to 4096      
random: unblocking device.                                                                                            
panic: deadlres_td_on_lock: possible deadlock detected for 0xffffffff80f145e0 (swapper), blocked for 139455 ticks

cpuid = 0                                                                                                             
time = 102624                                                                                                         
KDB: stack backtrace:                                                                                                 
db_fetch_ksymtab() at db_fetch_ksymtab+0x17b/frame 0xfffffe004602cd20
vpanic() at vpanic+0x135/frame 0xfffffe004602ce50
panic() at panic+0x43/frame 0xfffffe004602ceb0
profclock() at profclock+0x5fa/frame 0xfffffe004602cef0
fork_exit() at fork_exit+0x82/frame 0xfffffe004602cf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe004602cf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

And on some other attempts after a long time we get to mount root:

isa_probe_children: probing PnP devices
Device configuration finished.
procfs registered
Timecounter "TSC" frequency 543000 Hz quality 800
Statistical lapic calibration failed!  Clocks might be ticking at variable rates.
Falling back to slow lapic calibration.
lapic: Divisor 2, Frequency 108474 Hz
Timecounters tick every 10.000 msec
lo0: bpf attached
vlan: initialized, using hash tables with chaining
IPsec: Initialized Security Association Processing.
tcp_init: net.inet.tcp.tcbhashsize auto tuned to 4096
random: unblocking device.
WARNING: WITNESS option enabled, expect reduced performance.

Loader variables:

Manual root filesystem specification:
  <fstype>:<device> [options]
      Mount <device> using filesystem <fstype>
      and with the specified (optional) option list.

    eg. ufs:/dev/da0s1a
        zfs:zroot/ROOT/default
        cd9660:/dev/cd0 ro
          (which is equivalent to: mount -t cd9660 -o ro /dev/cd0 /)

  ?               List valid disk boot devices
  .               Yield 1 second (for background tasks)
  <empty line>    Abort manual input

mountroot>

Mount root

Mountroot is the first 'success' in doing a port: there is enough of a system (even if inconsistently) to add disks, even if there aren't drivers for the required things. Next steps are to:

  • figure out why this is slow (it is meant to be fast after all)
  • look at the lapic calibration complaints
  • add virtio devices

I think before we can add Virtio devices we will need to figure out why the MPTable isn't telling us about the 4 cores that QEMU should be passing through.

real memory  = 536858624 (511 MB)                          
Physical memory chunk(s):                                  
0x0000000000001000 - 0x000000000009efff, 647168 bytes (158 pages)                
0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
0x0000000001602000 - 0x000000001edf5fff, 494878720 bytes (120820 pages)
0x000000001fc00000 - 0x000000001fd6cfff, 1495040 bytes (365 pages)
avail memory = 492081152 (469 MB)                          
MPTable: <SMP: Added CPU 0 (BSP)                  
BOCHSCPU 0.1         >                                     
Event timer "LAPIC" quality 600                      
LAPIC: ipi_wait() us multiplier 1 (r 1645530 tsc 543000)   
Pentium Pro MTRR support enabled
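
The LAPIC line above carries enough numbers to redo the arithmetic that the calibration comment in local_apic.c describes (busy-loop iterations per microsecond). This is a sketch under assumptions, not the kernel's code: ipi_wait_mult is a hypothetical name, and the floor of 1 is a guess chosen to match the logged multiplier.

```c
#include <stdint.h>

#define LOOPS   1000

/*
 * Iterations of the busy loop per microsecond, given the TSC frequency
 * in Hz and the number of TSC ticks (r) that LOOPS iterations took.
 * Assumption: the result is never allowed to drop below 1.
 */
static uint64_t
ipi_wait_mult(uint64_t tsc_freq, uint64_t r)
{
        uint64_t num = tsc_freq * LOOPS;        /* iterations * ticks/second */
        uint64_t den = r * 1000000;             /* ticks * microseconds/second */

        if (den == 0 || num < den)
                return (1);
        return (num / den);
}
```

Plugging in tsc = 543000 and r = 1645530 gives a numerator far smaller than the denominator, so the result bottoms out at 1. A TSC frequency of 543000 Hz is suspiciously low given that real TSCs tick at GHz rates, which fits with the clocks being one of the things to look at next.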

Where are my cpus?

The QEMU bits of my launch script right now look like this:

memory=512m
cores=4
netif=tap0
cpu="max"

qemu-system-x86_64 -M microvm                                   \
        -cpu ${cpu}                                             \
        -m ${memory}                                            \
        -smp ${cores}                                           \
        -kernel ${kernel}                                       \
        -nodefaults                                             \
        -no-user-config                                         \
        -nographic                                              \
        -serial stdio                                           \
        -drive id=test,file=${disk},format=raw,if=none          \
        -device virtio-blk-device,drive=test

This requests that four CPUs/cores be created, but in the last message buffer output we see only one CPU, CPU 0 (BSP).

I'm going to start by finding my missing CPUs. In the last dmesg above we have:

MPTable: <SMP: Added CPU 0 (BSP)                  
BOCHSCPU 0.1         >

The "MPTable: <" string is printed by sys/x86/x86/mptable.c.

There are a couple of functions that call mptable_walk_table with different callbacks; sticking a print in one of those:

static void
mptable_probe_cpus_handler(u_char *entry, void *arg)
{
printf("%s:%d %d\n", __func__, __LINE__, *entry);
        proc_entry_ptr proc;

        switch (*entry) {
        case MPCT_ENTRY_PROCESSOR:

can give us an idea of the entries it carries.

---<<BOOT>>---                                                 
MP Configuration Table version 1.4 found at 0xffffffff800f4480 
APIC: Using the MPTable enumerator.                            
mptable_probe_cpus_handler:531 0                               
mptable_probe_cpus_handler:531 1                               
mptable_probe_cpus_handler:531 2                               
mptable_probe_cpus_handler:531 3                               
mptable_probe_cpus_handler:531 3                               
mptable_probe_cpus_handler:531 3                               
mptable_probe_cpus_handler:531 3
...

And when matched to the header we see we only have 1 MPCT_ENTRY_PROCESSOR.

/* Base table entries */

#define MPCT_ENTRY_PROCESSOR    0
#define MPCT_ENTRY_BUS          1
#define MPCT_ENTRY_IOAPIC       2
#define MPCT_ENTRY_INT          3
#define MPCT_ENTRY_LOCAL_INT    4
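
The walk can be reproduced outside the kernel to count processor entries in a base table. This is a sketch, not the code in mptable.c: mpct_entry_size and mpct_count_cpus are hypothetical names, and the entry sizes (20 bytes for a processor entry, 8 bytes for every other type) are taken from the MP specification 1.4 as an assumption about the layout QEMU's firmware produces.

```c
#include <stddef.h>
#include <stdint.h>

#define MPCT_ENTRY_PROCESSOR    0
#define MPCT_ENTRY_LOCAL_INT    4

/* Assumed sizes: processor entries are 20 bytes, all other types 8. */
static size_t
mpct_entry_size(uint8_t type)
{
        return (type == MPCT_ENTRY_PROCESSOR ? 20 : 8);
}

/*
 * Walk up to `count` base-table entries in `table` (len bytes long) and
 * return how many are processor entries. The first byte of each entry
 * is its type. Stops early on an unknown type or a truncated entry.
 */
static int
mpct_count_cpus(const uint8_t *table, size_t len, int count)
{
        size_t off = 0;
        int cpus = 0;

        for (int i = 0; i < count && off < len; i++) {
                uint8_t type = table[off];

                if (type > MPCT_ENTRY_LOCAL_INT)
                        break;
                if (off + mpct_entry_size(type) > len)
                        break;
                if (type == MPCT_ENTRY_PROCESSOR)
                        cpus++;
                off += mpct_entry_size(type);
        }
        return (cpus);
}
```

A table whose entry types run 0, 1, 2, 3, 3, 3, 3 like the trace above counts as a single CPU, whatever -smp asked for.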

Firecracker uses MPTable, I know this from Colin's BSDCan talk, but this isn't matching to what I thought I was telling QEMU to do.

I'm not sure what is up with this; I think I need another target to compare against.

Where's my disks?

From Colin's port we learn that Firecracker gets its virtio configuration from the Linux command line. I don't see any evidence of this in the message buffer right now. At the end of the blog posts there is some help I am certainly missing:

You'll probably also want to build a disk image so that FreeBSD has
something to boot from; place vfs.root.mountfrom=ufs:/dev/vtbd0 into
Firecracker's boot_args to tell FreeBSD to use the disk you attach (aka.
the first Virtio block device) as the root disk.

As I understand it, virtio_blk needs to be configured to attach because MicroVM, like Firecracker, has no dynamic configuration for it to probe from. So let's add this to the launch script:

bootargs="vfs.root.mountfrom=ufs:/dev/vtbd0"
...
-smp ${cores}                                           \
-kernel ${kernel}                                       \
-append "${bootargs}"                                   \
-nodefaults                                             \
-no-user-config                                         \

Not quite enough. I'm not sure how boot args get from the QEMU command line into the kernel or where they would be stored, and generally when doing silly things like this I'd prefer to learn and leave reading git logs as the last option. The order of preference kind of goes like this:

  • reading code
  • reading a book (lol)
  • searching the internet
  • reading git logs

I know from loader we can set -s, -d and -v for single, debug and verbose and I've already hijacked verbose to always be on because we don't have a loader to set it. I will look there.

From looking at various RB_ flags I stumbled onto boot_parse_cmdline, a variation of which is called from pv.c (all the other calls are on fun RISC architectures):

} else {
        /* Parse the extra boot information given by Xen */
        if (start_info->cmdline_paddr != 0)
                boot_parse_cmdline_delim(
                    (char *)(start_info->cmdline_paddr + KERNBASE),
                    ", \t\n");
        kmdp = NULL;
        strlcpy(bootmethod, "PVH", sizeof(bootmethod));
}

boothowto |= boot_env_to_howto();

With boothowto and bootverbose I can see things that pick up a command line, but I'm having a hard time finding a reference to one somewhere I can print it. I still don't have prints working from the Xen init functions.

init_dynamic_kenv_from is called from the init_dynamic_kenv SYSINIT once we have a functioning VM system. This isn't getting anywhere.

Fine, a quick search of the commit log reveals:

commit 0e1f5ab7db2cd2837f97f169122897b19c185dbd
Author: Colin Percival <cperciva@FreeBSD.org>
Date:   Fri Aug 12 17:54:26 2022 -0700

    virtio_mmio: Support command-line parameters

    The Virtio MMIO bus driver was added in 2014 with support for devices
    exposed via FDT; in 2018 support was added to discover Virtio MMIO
    devices via ACPI tables, as in QEMU.  The Firecracker VMM eschews both
    FDT and ACPI, instead presenting device information via kernel command
    line arguments of the form virtio_mmio.device=<parameters>.

    These command line parameters get converted into kernel environment
    variables; this adds support for parsing those variables and attaching
    virtio_mmio children to nexus.

    There is a case to be made that it would be cleaner to have a new
    "cmdlinebus" attached to nexus and virtio_mmio children attached to
    that.  A future commit might do that.

    Discussed with: imp, jrtc27
    Sponsored by:   https://patreon.com/cperciva
    Differential Revision:  https://reviews.freebsd.org/D36189

It adds this identify method:

static void
vtmmio_cmdline_identify(driver_t *driver, device_t parent)
{
        size_t n;
        char name[] = "virtio_mmio.device_XXXX";
        char * val;

        /* First variable just has its own name. */
        if ((val = kern_getenv("virtio_mmio.device")) == NULL)
                return;
        parsearg(driver, parent, val);
        freeenv(val);

        /* The rest have _%zu suffixes. */
        for (n = 1; n <= 9999; n++) {
                sprintf(name, "virtio_mmio.device_%zu", n);
                if ((val = kern_getenv(name)) == NULL)
                        return;
                parsearg(driver, parent, val);
                freeenv(val);
        }
}

While we can trace that it is called, it uses kern_getenv to look for compatible devices, and that doesn't help me check what the hell the command line is.
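
For reference, this is roughly what a value of one of those variables would need to be parsed into. The sketch assumes the form Linux documents for its virtio_mmio.device= parameter, <size>@<baseaddr>:<irq>[:<id>]; the kernel's parsearg is not shown above, so vtmmio_parse, struct vtmmio_arg and the sample value are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

struct vtmmio_arg {
        uint64_t size;          /* length of the MMIO window in bytes */
        uint64_t base;          /* physical base address */
        unsigned irq;           /* interrupt line */
};

/*
 * Parse "<size>[K|M|G]@<base>:<irq>", for example "4K@0xd0000000:5".
 * Returns 0 on success and -1 on malformed input. An optional trailing
 * ":<id>" is accepted and ignored.
 */
static int
vtmmio_parse(const char *s, struct vtmmio_arg *out)
{
        char *end;

        out->size = strtoull(s, &end, 0);
        if (end == s)
                return (-1);
        switch (*end) {
        case 'G': case 'g':
                out->size <<= 10;
                /* FALLTHROUGH */
        case 'M': case 'm':
                out->size <<= 10;
                /* FALLTHROUGH */
        case 'K': case 'k':
                out->size <<= 10;
                end++;
                break;
        default:
                break;
        }
        if (*end++ != '@')
                return (-1);
        s = end;
        out->base = strtoull(s, &end, 0);
        if (end == s || *end++ != ':')
                return (-1);
        s = end;
        out->irq = (unsigned)strtoul(s, &end, 0);
        if (end == s || (*end != '\0' && *end != ':'))
                return (-1);
        return (0);
}
```

Parsing "4K@0xd0000000:5" fills in a 4096-byte window at 0xd0000000 on IRQ 5. None of that helps until a virtio_mmio.device variable actually shows up in the kernel environment.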

Cheated for nothing - don't worry the rules are made up.

I grabbed a copy of the Xen extra boot information into a temp variable, waited for a panic to ddb, printed it, and got -dsr, which were the test flags I passed to QEMU when I started poking at this.

I took a stab in the dark and set instead:

bootargs="vfs.root.mountfrom=ufs:/dev/vtbd0\t\nvirtio_mmio.device"

And I got a different mountroot prompt!

Trying to mount root from ufs:/dev/vtbd0\t\nvirtio_mmio.device []...
mountroot: waiting for device /dev/vtbd0\t\nvirtio_mmio.device...
Mounting from ufs:/dev/vtbd0\t\nvirtio_mmio.device failed with error 19.

Loader variables:
  vfs.root.mountfrom=ufs:/dev/vtbd0\t\nvirtio_mmio.device

Manual root filesystem specification:
  <fstype>:<device> [options]

Kicking that into ddb and boom, there is the -append arg from QEMU.

db> x/s *cmdline
kernbase+0x560: vfs.root.mountfrom=ufs:/dev/vtbd0\t\nvirtio_mmio.device

So! However QEMU is passing the virtio device configuration through to the VM, it isn't via this command line; there must be some other way. This effort has confirmed that the bootargs are making it through to the kernel, but we aren't getting everything we need.
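
One side note on that test string: the mountroot output shows the \t\n arriving as literal backslash characters, so the kernel saw a single argument rather than two. A small sketch of splitting on the same delimiter set that pv.c passes to boot_parse_cmdline_delim shows the difference. count_args is a hypothetical helper, and strtok stands in for whatever the kernel uses internally.

```c
#include <string.h>

/*
 * Count the tokens in cmdline using the delimiter set ", \t\n".
 * The buffer is modified in place, so pass a writable copy.
 */
static int
count_args(char *cmdline)
{
        int n = 0;

        for (char *tok = strtok(cmdline, ", \t\n"); tok != NULL;
            tok = strtok(NULL, ", \t\n"))
                n++;
        return (n);
}
```

A real space between the two settings gives two tokens. A backslash followed by the letter t is not a delimiter, so the string from the launch script stays as one token and ends up whole in vfs.root.mountfrom.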

I guess I need to look at what QEMU is doing and what a successful Linux boot is like.

Looking at qemu

While donating platelets I had a read of the QEMU MicroVM implementation and the short git log for the file and hit this commit:

commit f6f7e2d88d0b29d8b6e1a12a5f3f9f31faff9846
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Tue Sep 15 14:09:00 2020 +0200

    microvm/acpi: disable virtio-mmio cmdline hack

    ... in case we are using ACPI.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Sergio Lopez <slp@redhat.com>
    Message-id: 20200915120909.20838-13-kraxel@redhat.com

The commit before this one gives context too:

commit 67eb6a4007fd8f9073020e506453ff5b7c25cb34
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Tue Sep 15 14:08:59 2020 +0200

    microvm/acpi: use seabios with acpi=on

    With acpi=off continue to use qboot.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Sergio Lopez <slp@redhat.com>
    Message-id: 20200915120909.20838-12-kraxel@redhat.com

Despite MicroVM saying it doesn't support ACPI, there is a lot of ACPI-specific code in the driver. My devices weren't in the kernel command line because the VM was booting with ACPI. It is also using a different firmware, one that supports ACPI.
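For context, the Linux virtio_mmio.device parameter has the form <size>@<baseaddr>:<irq>[:<id>]. Below is a minimal sketch for checking whether a command line string (for example one copied out of ddb) carries any such entries. The function name is made up for illustration, and the sample size, address and IRQ are placeholders, not values seen on this VM.

```shell
#!/bin/sh
# Print the size, base address and IRQ of every virtio_mmio.device=
# entry found in a kernel command line string.
# Relies on plain word splitting, so it assumes space-separated
# arguments without shell glob characters in them.
virtio_mmio_entries() {
    for arg in $1; do
        case "$arg" in
        virtio_mmio.device=*)
            spec=${arg#virtio_mmio.device=}   # <size>@<baseaddr>:<irq>[:<id>]
            size=${spec%%@*}
            rest=${spec#*@}
            base=${rest%%:*}
            irq=${rest#*:}
            irq=${irq%%:*}                    # drop the optional :<id>
            echo "size=$size base=$base irq=$irq"
            ;;
        esac
    done
}

# Placeholder command line; prints one line for the single entry.
virtio_mmio_entries "vfs.root.mountfrom=ufs:/dev/vtbd0 virtio_mmio.device=512@0xfeb00e00:12"
```

With ACPI enabled the function prints nothing for the command line seen in ddb, which matches what the commit message describes.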

For MicroVM a tiny fast boot firmware called "qboot" was written, but in my testing, whenever I locked everything up, I saw plenty of "SeaBIOS" messages. I knew that wasn't what the documentation promised as the default, but I was also getting boot messages most of the time.

I think the default BIOS with ACPI is SeaBIOS, and qboot otherwise.

I added -machine acpi=off to my boot script flags and this time got to:

Mounting from ufs:/dev/vtbd0 failed with error 2; retrying for 2 more seconds
Attempted recovery for standard superblock: failed
Attempted extraction of recovery data from standard superblock: failed
Attempt to find boot zone recovery data.
Finding an alternate superblock failed.
Check for only non-critical errors in standard superblock
Failed, superblock has critical errors
Attempted recovery for standard superblock: failed
Attempted extraction of recovery data from standard superblock: failed
Attempt to find boot zone recovery data.
Finding an alternate superblock failed.
Check for only non-critical errors in standard superblock

The mountroot errors just looped until I killed the VM. The errors make perfect sense: I have no idea what test.raw is, and there is a good chance it is just an empty file. I gave QEMU fbsd14.raw from my vms directory, but that also didn't boot, so there are things to do.
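To make the acpi=off change concrete, here is a rough sketch of what a boot script along these lines could look like. It only prints the command rather than running it. The function name, the kernel path, the drive id and the memory and CPU counts are assumptions (the last two guessed from the boot output), and the remaining flags follow the example in the QEMU MicroVM documentation rather than my actual script.

```shell
#!/bin/sh
# Print (not run) a QEMU MicroVM command line with ACPI disabled.
# $1: kernel path, $2: raw disk image, $3: kernel bootargs.
microvm_cmd() {
    kernel=$1
    image=$2
    bootargs=$3
    # printf repeats the format for each argument, giving one
    # space-separated command line.
    printf '%s ' \
        qemu-system-x86_64 \
        -M microvm \
        -machine acpi=off \
        -m 512m -smp 4 \
        -nodefaults -no-user-config -nographic \
        -serial stdio \
        -kernel "$kernel" \
        -append "\"$bootargs\"" \
        -drive "id=root,file=$image,format=raw,if=none" \
        -device virtio-blk-device,drive=root
    printf '\n'
}

# Placeholder paths; swap in the real kernel and image.
microvm_cmd ./kernel ~/vms/test.raw "vfs.root.mountfrom=ufs:/dev/vtbd0"
```

Keeping the image as a parameter makes it easy to swap test.raw for fbsd14.raw without touching the rest of the flags.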

Booting with the new firmware we also have an MPTable with multiple CPUs:

avail memory = 491782144 (469 MB)
MPTable: <mptable_setup_cpus_handler:552 0
SMP: Added CPU 0 (BSP)
mptable_setup_cpus_handler:552 0
SMP: Added CPU 1 (AP)
mptable_setup_cpus_handler:552 0
SMP: Added CPU 2 (AP)
mptable_setup_cpus_handler:552 0
SMP: Added CPU 3 (AP)

...

cpu0 BSP:                                                            
     ID: 0x00000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
SMP: AP CPU #1 Launched!                                             
cpu1 AP:                                                             
     ID: 0x01000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
SMP: AP CPU #3 Launched!                                             
cpu3 AP:                                                             
     ID: 0x03000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
SMP: AP CPU #2 Launched!                                             
cpu2 AP:                                                             
     ID: 0x02000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400

Generally the console seems to be much more responsive. Instead of the error we had before:

Timecounter "TSC" frequency 543000 Hz quality 800
Statistical lapic calibration failed!  Clocks might be ticking at variable rates.
Falling back to slow lapic calibration.
lapic: Divisor 2, Frequency 108474 Hz

we now have:

Timecounter "TSC" frequency 543000 Hz quality -100

Mounting a disk

Time to check what test.raw is:

[tj@displacementactivity] $ sudo mdconfig -a ~/vms/test.raw
Password:
md0
[tj@displacementactivity] $ gpart show md0
=>       34  104857526  md0  GPT  (50G)
         34        122    1  freebsd-boot  (61K)
        156      66584    2  efi  (33M)
      66740    2097152    3  freebsd-swap  (1.0G)
    2163892  102693452    4  freebsd-ufs  (49G)
  104857344        216       - free -  (108K)

Which suggests that the boot script needs to know the correct partition:

bootargs="vfs.root.mountfrom=ufs:/dev/vtbd0p4"

Which gets us all the way to a login prompt! Just as I am ready to declare victory I log in and the system silently hangs, but it works on a second try.

If we don't think about how much time was spent debugging the wrong QEMU BIOS firmware due to poor documentation, then that was quite an easy process.
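The partition lookup above can be scripted. This is a small sketch, assuming the gpart show column layout shown earlier (start, size, index, type); both function names are made up for illustration.

```shell
#!/bin/sh
# Read `gpart show` output on stdin and print the index of the
# first freebsd-ufs partition (third column of its row).
ufs_partition_index() {
    awk '$4 == "freebsd-ufs" { print $3; exit }'
}

# Build the loader variable for a disk device and partition index.
mountfrom() {
    printf 'vfs.root.mountfrom=ufs:/dev/%sp%s\n' "$1" "$2"
}

# Sample rows copied from the gpart output above.
idx=$(ufs_partition_index <<'EOF'
=>       34  104857526  md0  GPT  (50G)
         34        122    1  freebsd-boot  (61K)
    2163892  102693452    4  freebsd-ufs  (49G)
EOF
)
bootargs=$(mountfrom vtbd0 "$idx")
echo "$bootargs"
```

On a real system the heredoc would be replaced by piping gpart show md0 into ufs_partition_index. Note that the guest sees the disk as vtbd0 while the host md device is md0, so the device name has to be passed in separately.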

From my minimal testing there are a few things to look at next:

  • reboot and shutdown don't
  • network devices appear and configure, but don't send packets
  • it just hangs sometimes
  • boot takes much longer than I would like (100ms max please)

Still more to do.