Realmode bhyve

I have been poking around bhyve, seeing what is up, and I came across this article about writing a Linux kvm driver from scratch. The article includes a minimal program to run as a first test in the kvm driver:

; Output to port 0x3f8
mov dx, 0x3f8

; Store the address of the message in bx, so we can increment it
mov bx, message

loop:
    ; Load a byte from `bx` into the `al` register
    mov al, [bx]

    ; Jump to the `hlt` instruction if we encountered the NUL terminator
    cmp al, 0
    je end

    ; Output to the serial port
    out dx, al
    ; Increment `bx` by one byte to point to the next character
    inc bx

    jmp loop

end:
    hlt

message:
    db "Hello, KVM!", 0

That seems fun, a nice small example of getting some code running. I don't really want to write my own bhyve, I like the one we have, but it might be nice to try and get this running.

I assembled the example:

nasm -fbin nello.S -o nello

And looked around to see how to load a bios in bhyve. bhyve(8) has some examples at the end, it looks like the -l flag can be used to set a bootrom (bios) like so:

$ sudo bhyve -l bootrom,./nello nello

vm exit[0]
        reason          VMX
        rip             0x000000000000fff0
        inst_length     3
        status          0
        exit_reason     48 (EPT violation)
        qualification   0x0000000000000784
        inst_type               0
        inst_error              0

Well that didn't work. I poked a bit in bhyve, but it wasn't clear what to do about an EPT violation. The examples also mentioned using /usr/local/share/uefi-firmware/BHYVE_UEFI_CODE.fd; I opted for the CSM version:

$ sudo bhyve -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI_CSM.fd hello

I had a poke around the CSM bootrom and while it is always fun to use hexdump, it really didn't help me understand what was wrong with my example assembly.

I tried with BHYVE_UEFI_CSM.fd and guess what I got:

vm exit[0]
        reason          VMX
        rip             0x000000000000fff0
        inst_length     3
        status          0
        exit_reason     48 (EPT violation)
        qualification   0x0000000000000784
        inst_type               0
        inst_error              0

The same trap!
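As a rough aid for reading that dump, here is a small Python sketch that decodes the low bits of the exit qualification. The bit layout is my reading of the Intel SDM's EPT violation table (bits 0-2 for the access type, bits 3-5 for the permissions the page had, bit 7 for "guest linear address valid"), and the function and table names are made up for illustration. Bits 6 and 9 and up are deliberately ignored.

```python
# Decode the low bits of a VMX EPT-violation exit qualification.
# Bit meanings are an assumption taken from the Intel SDM; bits 6 and 9+
# are not interpreted here.
ACCESS_BITS = {0: "data read", 1: "data write", 2: "instruction fetch"}
PERM_BITS = {3: "readable", 4: "writable", 5: "executable"}


def decode_ept_qualification(qual):
    """Return the access type and page permissions encoded in qual."""
    access = [name for bit, name in ACCESS_BITS.items() if (qual >> bit) & 1]
    perms = [name for bit, name in PERM_BITS.items() if (qual >> bit) & 1]
    return {
        "access": access,
        "page_permissions": perms,
        "linear_address_valid": bool((qual >> 7) & 1),
    }


info = decode_ept_qualification(0x784)  # value from the vm exit dump
print(info)
```

If that reading is right, 0x784 is an instruction fetch from a guest physical page with no permissions at all, which would fit a vCPU trying to execute from an address where nothing has been mapped.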

I think that means I need to figure out the minimal viable bhyve command that will run a known good bootrom before I try running that example. The last example in bhyve(8) is:

Run a UEFI virtual machine with a VARS file to save EFI variables.  Note
that bhyve will write guest modifications to the given VARS file.  Be
sure to create a per-guest copy of the template VARS file from /usr.

      bhyve -c 2 -m 4g -w -H \
        -s 0,hostbridge \
        -s 31,lpc -l com1,stdio \
        -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI_CODE.fd,BHYVE_UEFI_VARS.fd
         uefivm

-w waits for the debugger and -H emulates halt to save power, no need for those. So I tried:

bhyve -s 0,hostbridge -s 31,lpc -l com1,stdio -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI_CSM.fd hello

And that worked:

Boot Failed. CDROM 0
Boot Failed. Harddisk 1
UEFI Interactive Shell v2.1
EDK II
UEFI v2.40 (BHYVE, 0x00010000)
Error. No mapping found
Press ESC in 1 seconds to skip startup.nsh or any other key to continue.

Now to try my bios:

$ sudo bhyve -s 31,lpc -l com1,stdio  -l bootrom,./nello hello
bhyve: ROM size 65552 is not a multiple of the page size
Device emulation initialization error: No such file or directory

32 (the raw unpadded 16 bit program size) is also not a multiple of the page size, so I padded out the example using TIMES 4096 - ($ - $$) db 0, taken from a bootsector nasm example.

That did not succeed either.
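The size check bhyve is complaining about is easy to reproduce. A minimal sketch, assuming a 4096 byte page (the helper name is mine, not bhyve's):

```python
PAGE_SIZE = 4096  # assumed x86 page size


def rom_padding_needed(size, page_size=PAGE_SIZE):
    """Bytes of padding needed to round size up to a whole number of pages."""
    return -size % page_size


# The raw program, the rejected image from the error message, and a 64k image.
for size in (32, 65552, 65536):
    print(size, size % PAGE_SIZE, rom_padding_needed(size))
```

65552 is 16 bytes over 64k, so that image had picked up 16 stray bytes somewhere; a clean 65536 byte image needs no padding at all.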

Fine, whatever, I will use gdb to look at what is going on. bhyve supports the -G flag to integrate with gdb. I added

-G wlocalhost:1234

to the bhyve command asking bhyve to wait for gdb to attach and continue listening on localhost port 1234.

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x000000000000fff0 in ?? ()
(gdb) x/32i 0x000000000000fff0
=> 0xfff0:      add    %al,(%rax)
   0xfff2:      add    %al,(%rax)
   ...
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) x/32x 0x000000000000fff0
0xfff0: 0x00000000      0x00000000      0x00000000      0x00000000
0x10000:        0x00000000      0x00000000      0x00000000      0x00000000
...
0x10060:        0x00000000      0x00000000      0x00000000      0x00000000
(gdb) x/32x 0x0
0x0:    0x00000000      0x00000000      0x00000000      0x00000000
0x10:   0x00000000      0x00000000      0x00000000      0x00000000
0x20:   0x00000000      0x00000000      0x00000000      0x00000000
0x30:   0x00000000      0x00000000      0x00000000      0x00000000

Connecting and poking around shows the obvious places are all zeros (or sometimes all 1s).

gdb has a 'find' command for searching memory, our example is pretty distinctive so it should find it.

It didn't work for me this time.

Stepping immediately just starts the program; for nello we are stopped with rip as 0x000000000000ffef.

0x000000000000ffef in ?? ()
(gdb) x/64x $rip
0xffef: 0x960000ff      0x00ffff00      0x00000200      0x46f00000
0xffff: 0x00000000      0x00000000      0x00000000      0x00000000

Disassembly time. FreeBSD's llvm-objdump doesn't have support for 16 bit x86 (fair), so I grabbed binutils and used a command like this:

x86_64-unknown-freebsd15.0-objdump -b binary -m i386 -D -Maddr16,data16 -Mintel nello

Working from objdump I tweaked some offsets to get bytes into the correct places with padding, but there wasn't an obvious clue what was up. I couldn't associate the memory I could read in gdb to anything from my binary.

$ hexdump -C nello
00000000  ba f8 03 bb 11 00 8a 07  3c 00 74 04 ee 43 eb f6  |........<.t..C..|
00000010  f4 48 65 6c 6c 6f 2c 20  62 68 79 76 65 21 00 90  |.Hello, bhyve!..|
00000020  90 90 90 90 90 90 90 90  90 90 90 90 90 90 90 90  |................|
*
0000fff0  e9 0d 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00010000

I turned to qemu to see if that helped:

$ qemu-system-i386 -bios nello -S -s -nographic

(gdb) target remote localhost:1234
Remote debugging using localhost:1234
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x0000fff0 in ?? ()
(gdb) x/32xb 0xffff0000
0xffff0000:     0xba    0xf8    0x03    0xbb    0x11    0x00    0x8a    0x07
0xffff0008:     0x3c    0x00    0x74    0x04    0xee    0x43    0xeb    0xf6
0xffff0010:     0xf4    0x48    0x65    0x6c    0x6c    0x6f    0x2c    0x20
0xffff0018:     0x62    0x68    0x79    0x76    0x65    0x21    0x00    0x90
(gdb) x/16xb 0xfffffff0
0xfffffff0:     0xe9    0x0d    0x00    0x00    0x00    0x00    0x00    0x00
0xfffffff8:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
(gdb) c
Continuing.

That all looks good, it matches up with our hexdump of the bios example. If I hit ^C then we stop at 0x00000011.

^C
Program received signal SIGINT, Interrupt.
0x00000011 in ?? ()

If we recall that we are running in 16 bit mode in the last 64k of the 32 bit address space and convert that offset into the memory dumps, we find that the byte just before it, at offset 0x10, is 0xf4: the x86 halt instruction.

"HLT causes the 80386 to stop execution. Following a halt, execution can
only be resumed by the receipt of an enabled interrupt or by a reset of
the computer."

- Programming the 80386
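The offset juggling can be written down explicitly. This is a sketch under the assumption that the 64k image is mapped so that it ends right at the 4GiB boundary, which is what the qemu memory dump above suggests; the helper names are hypothetical.

```python
ROM_SIZE = 0x10000           # 64 KiB image, as in the hexdump
TOP_OF_4G = 0x1_0000_0000
ROM_BASE = TOP_OF_4G - ROM_SIZE  # assumed mapping: image ends at 4 GiB

RESET_CS_BASE = 0xFFFF0000   # hidden CS base after reset
RESET_IP = 0xFFF0            # IP after reset


def rom_to_guest(offset):
    """Guest physical address of a byte at the given file offset."""
    return ROM_BASE + offset


def guest_to_rom(addr):
    """File offset of a guest physical address inside the ROM window."""
    return addr - ROM_BASE


print(hex(RESET_CS_BASE + RESET_IP))  # where the first fetch happens
print(hex(rom_to_guest(0x10)))        # the hlt byte in the qemu dump
print(hex(guest_to_rom(0xFFFFFFF0)))  # reset vector as a file offset
```

gdb only shows the 16 bit IP (0x11, the byte after the hlt), so the CS base has to be added by hand to find the matching line in the memory dump.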

So we did what we wanted to and stopped, but qemu gave us no output. I think that has confirmed that the bios image is now correct if not functional. So either we are running fine in bhyve and just not getting output, or there is something else up.

In the example minimal Linux hypervisor they just did a straight printf for an IO vmexit. Let's catch the vmexit handlers in bhyve and see what is up:

diff --git a/usr.sbin/bhyve/amd64/vmexit.c b/usr.sbin/bhyve/amd64/vmexit.c
index e0b9aec2d17a..e1669c2b5051 100644
--- a/usr.sbin/bhyve/amd64/vmexit.c
+++ b/usr.sbin/bhyve/amd64/vmexit.c
@@ -72,6 +72,7 @@ vm_inject_fault(struct vcpu *vcpu, int vector, int errcode_valid,
 static int
 vmexit_inout(struct vmctx *ctx, struct vcpu *vcpu, struct vm_run *vmrun)
 {
+fprintf(stderr, "%s:%d\n", __func__, __LINE__);
        struct vm_exit *vme;
        int error;
        int bytes, port, in;

I reconfigured my test script to output serial to /dev/nmdm0A so I would get printfs from bhyve, but nothing.

Our assembly doesn't do what we think it does.

Adding port configuration from this Stack Overflow answer and the OSDev wiki led my modified bhyve to print on calls to vmexit_inout.

$ sudo sh ./run.sh nello
outputting serial to /dev/nmdm0B
waiting for gdb
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75
vmexit_inout:75

Those 10 vmexit_inout lines match up perfectly with the configuration and test example. This is an excellent debugging sign.

With an extreme amount of further faffing I discovered that the loop in the example I started from was not making it to the first print statement. I confirmed this by stripping away all of the configuration and just spitting out some characters explicitly.

In the hexdump nasm was loading the wrong address into bx, but even with the correct address in bx I got no output. As I only wanted to say hello from real mode, I'm done. Debugging segments (segments, not even once) in a pre bios environment where you can't single step just isn't my idea of fun.

The example I started from was run from the base address; by writing their own kvm driver they were able to configure the instruction pointer and segments to look sensible. Me, an idiot, decided to work with the brain melting x86 hardware as it is.

Most of my fighting here was because gdb connecting to the bhyve stub isn't able to read guest memory in the bios region. Neither qemu nor bhyve let me single step instructions, which just makes debugging here tedious.

OS Dev wiki is a great resource, but it is very annoying to have lots of "you shouldn't do this" everywhere when you push their 'perfect path'. I just want to know what I need to know.

If you want to play with real mode in bhyve you can start from this, minimal, working example:

; A 64k bios for bhyve which does nothing at all
bits 16
PORT equ 0x3f8

%macro outb 1
        mov al, %1
        out dx, al
%endmacro

start:
        mov dx, PORT                    ; store the port

        outb 0x0a                       ; print a message
        outb 'b'
        outb 'h'
        outb 'y'
        outb 'v'
        outb 'e'
        outb '!'
        outb 0x0a
end:
    hlt                                 ; hang around

TIMES 0xFFF0 - ($ - $$) db 0            ; pad out to reset vector
; The cpu is going to start from 0xFFF0 with CS set to 0xF000; basically we
; are going to start at 0xFFFFFFF0 with only 16 bytes to play with, but we can
; jump to the start of the 64k segment reasonably easily.
jmp start

TIMES 0x10000 - ($ - $$) db 0           ; pad out to 64k
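To convince yourself that the jump at the reset vector really lands on start, the three bytes e9 0d 00 from the earlier hexdump at 0xfff0 can be decoded by hand. A sketch in Python, with a hypothetical helper that only understands the 16 bit near jump encoding (opcode 0xE9 followed by a little endian rel16):

```python
def decode_jmp_rel16(code, ip):
    """Decode a 16 bit near jump (0xE9 rel16) located at ip, return target IP."""
    if code[0] != 0xE9:
        raise ValueError("not a near jmp rel16")
    rel = int.from_bytes(code[1:3], "little", signed=True)
    # The displacement is relative to the next instruction and IP wraps at 64k.
    return (ip + 3 + rel) & 0xFFFF


reset_vector = bytes.fromhex("e90d00")  # bytes at 0xfff0 in the hexdump
target = decode_jmp_rel16(reset_vector, 0xFFF0)
print(hex(target))
```

The displacement walks off the end of the 64k segment and wraps around to offset 0, which is where start lives.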

Hopefully that ending isn't too negative. I had a lot of fun doing this, I just don't want to do any more of it.

Kernel debugging QEMU

Debugging the FreeBSD kernel in QEMU is really straightforward. Get yourself some kernel symbols (either from the build dir or from /usr/lib/debug/kernel*), load them into gdb, kick off QEMU with -s -S and connect.

$ qemu-system-x86_64 ... -s -S ...

The -S flag causes QEMU to wait for the gdbstub before starting execution rather than letting the CPUs run free immediately, so we need to connect with gdb to get going.

$ gdb kernel.debug
(gdb) target remote localhost:1234
(gdb) c

Usage is documented a little in the manual (as much as anything in gdb is ever documented).

I got the steps from a FreeBSD 10 guide and other than the symbols file changing in between everything is about the same. It is nice when things don't change.

gdb is nice and all, but can we use a debugger that ships with FreeBSD?

The lldb documentation suggests that we can do something like this:

(lldb) platform list
Available platforms:
host: Local FreeBSD user platform plug-in.
remote-freebsd: Remote FreeBSD user platform plug-in.
remote-gdb-server: A platform that uses the GDB remote protocol as the communication transport.
qemu-user: Platform for debugging binaries under user mode qemu
(lldb) platform select remote-gdb-server
  Platform: remote-gdb-server
 Connected: no
(lldb) platform connect connect://localhost:1234
  Platform: remote-gdb-server
  Hostname: (null)
 Connected: yes

It will say everything is connected, but when we try to continue we get an error.

(lldb) c
error: invalid target, create a target using the 'target create' command

From a Stack Overflow question I found the gdb-remote command, which does what the documentation doesn't:

(lldb) gdb-remote localhost:1234
Process 1 stopped
* thread #1, stop reason = signal SIGTRAP
    frame #0: 0x000000000000fff0
->  0xfff0: addb   %al, (%rax)
    0xfff2: addb   %al, (%rax)
    0xfff4: addb   %al, (%rax)
    0xfff6: addb   %al, (%rax)
(lldb) c
Process 1 resuming
Process 1 stopped
* thread #4, stop reason = signal SIGINT
    frame #0: 0xffffffff805ee459
->  0xffffffff805ee459: incl   %eax
    0xffffffff805ee45b: cmpl   0x812097(%rip), %eax
    0xffffffff805ee461: jl     0xffffffff805ee450
    0xffffffff805ee463: nopw   %cs:(%rax,%rax)
(lldb) list udp_output
error: Could not find function named: "udp_output"

Things are confused.

(lldb) image list
[  0] 7A0DEA14 0x0000000000200000 /sbin/init
      /usr/lib/debug/sbin/init.debug

(lldb) f
error: Command requires a process which is currently stopped.
(lldb) c
error: Process is running.  Use 'process interrupt' to pause execution.

Out of the box like this I think lldb needs some more configuration. Frustratingly as always the documentation around debugging is seriously lacking.

gdb isn't perfect either:

End of the file was already reached, use "list ." to list the current location again
(gdb) list .

Fatal signal: Segmentation fault
----- Backtrace -----
0x1350771 ???
0x14717f6 ???
0x1471f7f ???
0x82fbf457f handle_signal
        /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_sig.c:300
0x82fbf3b3a thr_sighandler
        /home/pkgbuild/worktrees/main/lib/libthr/thread/thr_sig.c:243
0x8230ef2d2 ???
0x173da31 ???
0x137ac37 ???
0x1384a35 ???
0x17cb202 ???
0x1470faa ???
0x1471335 ???
0x14709a2 ???
0x8262127c2 ???
0x1471d8d ???
0x14705fc ???
0x17ff86f ???
0x1cab092 ???
0x1caab64 ???
0x1576a59 ???
0x1573c40 ???
0x1250760 ???
0x83000d839 __libc_start1
        /home/pkgbuild/worktrees/main/lib/libc/csu/libc_start1.c:157
0x125064f ???
---------------------
A fatal error internal to GDB has been detected, further
debugging is not possible.  GDB will now terminate.

This is a bug, please report it.  For instructions, see:
<https://www.gnu.org/software/gdb/bugs/>.

zsh: segmentation fault (core dumped)  gdb

https://people.freebsd.org/~avg/kyivbsd/KyivBSD2010.pdf

a links post?
https://www.unitedbsd.com/blog/775-remote-debugging-the-running-openbsd-kernel
https://forums.freebsd.org/threads/setting-up-a-bhyve-vm-for-kernel-development-and-debugging.90935/

devlog Porting FreeBSD to QEMU MicroVM

I haphazardly take notes while doing development; when real debugging is happening I fill folded pieces of paper with the names of functions and variables as they pop into notice and fade away again. I thought during the VPP port it might be nice both to have better notes and to share this process more.

I am constantly on the edge about talking about what I am doing. I've internalised that "research shows that people who share their plans are less likely to complete them" and I'm aware that people jump at news rather than actual facts. I don't want anyone to get too excited about in-progress things, but I also don't want to keep everything to myself.

Reading about people doing ports of Linux and bringing up new stuff in OpenBSD via undeadly.org was a big draw to me when I was coming up. I would attribute people writing about their doings as one of the biggest factors leading to the work I currently do.

This is an experiment in sharing something close to development notes, but written with an audience in mind. I do have other notes like this for other projects. I think an ideal devlog is like Joey Hess's old daily devblog, but for me the balance is probably closer to writing as I go and publishing at milestones.


I read Rob's article on quiz, an excellent tool for speeding up Linux kernel development, and learned about QEMU MicroVM. It seemed like a great way to combine a crazy idea with a useful idea.

QEMU MicroVM offers a QEMU based implementation of something like the Amazon Firecracker microvm that Colin Percival ported FreeBSD to over the last few years.

Colin has written a lot about the port to Firecracker and given a presentation a few times. The talk covers some of the technical description of Firecracker: how it works and how it relates to 'normal' machines. Colin's 2023 BSDCan talk is the basis for pretty much everything I know about Firecracker.

I have been thinking about an idea that requires FreeBSD booting super quickly for a couple of months and MicroVM has been the focus of that speculation for the last month. I keep coming up with more reasons to do some fun looking hacking.

Running MicroVM

While I tried to get Rob's stuff going on Ubuntu (and then failed to install Debian at all), maybe a good way to start is to just wade in.

qemu-system-x86_64 requires quite a few arguments to run a MicroVM machine, I find it easiest to bundle these into a script I can pass in the kernel and disk to:

#!/bin/sh

kernel=$1
disk=$2

memory=512m
cores=4
netif=tap0

qemu-system-x86_64 -M microvm                                   \
    -cpu max                                                    \
    -m ${memory}                                                \
    -smp ${cores}                                               \
    -kernel ${kernel}                                           \
    -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/vda"     \
    -nodefaults                                                 \
    -no-user-config                                             \
    -nographic                                                  \
    -serial stdio                                               \
    -drive id=test,file=${disk},format=raw,if=none              \
    -device virtio-blk-device,drive=test

Building a Firecracker FreeBSD kernel is straightforward; from a FreeBSD tree run:

$ make -j 16 -s buildkernel KERNCONF=FIRECRACKER TARGET=amd64

That will give you a kernel to feed to QEMU like so:

[tj@computer] $ sh ~/code/scripts/qemu/microvm.sh /usr/obj/usr/home/tj/code/freebsd/worktrees/microvm/amd64.amd64/sys/FIRECRACKER/kernel ~/vms/test.raw
SeaBIOS (version rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org)
Booting from ROM..GDB: no debug ports present
KDB: debugger backends: ddb
KDB: current backend: ddb
---<<BOOT>>---
Copyright (c) 1992-2024 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 15.0-CURRENT #0 thj/microvm-n269722-8ceac8e13dcc: Fri Apr 26 17:55:03 BST 2024
tj@displacementactivity:/usr/obj/usr/home/tj/code/freebsd/worktrees/microvm/amd64.amd64/sys/FIRECRACKER amd64
FreeBSD clang version 18.1.3 (https://github.com/llvm/llvm-project.git llvmorg-18.1.3-0-gc13b7485b879)
WARNING: WITNESS option enabled, expect reduced performance.
CPU: QEMU TCG CPU version 2.5+ (K8-class CPU)
Origin="AuthenticAMD"  Id=0x663  Family=0x6  Model=0x6  Stepping=3
Features=0x1fc3fbfd<FPU,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,ACPI,MMX,FXSR,SSE,S>
Features2=0xfed8320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F>
AMD Features=0xec500800<SYSCALL,NX,MMX+,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
AMD Features2=0x177<LAHF,CMP,SVM,CR8,ABM,SSE4A,Prefetch>
Structured Extended Features=0x21dc43a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,ERMS,MPX,RDSEED,ADX,SMAP,<b22>,CLFLUSH>
Structured Extended Features2=0x8041021c<UMIP,PKU,OSPKE,VAES,LA57,RDPID>
Structured Extended Features3=0x10<FSRM>
XSAVE Features=0x5<XSAVEOPT,XINUSE>
AMD Extended Feature Extensions ID EBX=0x204<XSaveErPtr,WBNOINVD>
SVM: NP,NAsids=16
Hypervisor: Origin = "TCGTCGTCGTCG"
real memory  = 536858624 (511 MB)
avail memory = 492081152 (469 MB)
MPTable: <BOCHSCPU 0.1         >
Event timer "LAPIC" quality 600
panic: TSC not initialized
cpuid = 0
time = 1
KDB: stack backtrace:
db_fetch_ksymtab() at db_fetch_ksymtab+0x17b/frame 0xffffffff81404dc0
vpanic() at vpanic+0x135/frame 0xffffffff81404ef0
panic() at panic+0x43/frame 0xffffffff81404f50
lapic_init() at lapic_init+0x4a4/frame 0xffffffff81404f70
mptable_pci_host_res_init() at mptable_pci_host_res_init+0x856/frame 0xffffffff81404f90
lapic_ipi_free() at lapic_ipi_free+0x152f/frame 0xffffffff81404fa0
mi_startup() at mi_startup+0x1c8/frame 0xffffffff81404ff0
KDB: enter: panic
[ thread pid 0 tid 0 ]
Stopped at      kdb_enter+0x33: movq    $0,0x968d12(%rip)
db>

Incredible progress! I was expecting this to be much more of a fight to get anything from the Firecracker image. Colin spoke about binary searching by looking at system load in the virtual machine host dashboard, but here we have serial output immediately. To check that this wasn't a fluke I also tried a generic kernel:

[tj@computer] $ sh ~/code/scripts/qemu/microvm.sh /usr/obj/usr/home/tj/code/freebsd/worktrees/microvm/amd64.amd64/sys/GENERIC/kernel ~/vms/test.raw
SeaBIOS (version rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org)
Booting from ROM..

It just sat there. Good news, we have a solid starting point with the Firecracker kernel.

First steps

The panic message in our first boot is "TSC not initialized", a quick grep -R away in local_apic.c:547.

#ifdef SMP
#define LOOPS   1000
        /*
         * Calibrate the busy loop waiting for IPI ack in xAPIC mode.
         * lapic_ipi_wait_mult contains the number of iterations which
         * approximately delay execution for 1 microsecond (the
         * argument to lapic_ipi_wait() is in microseconds).
         *
         * We assume that TSC is present and already measured.
         * Possible TSC frequency jumps are irrelevant to the
         * calibration loop below, the CPU clock management code is
         * not yet started, and we do not enter sleep states.
         */
        KASSERT((cpu_feature & CPUID_TSC) != 0 && tsc_freq != 0,
            ("TSC not initialized"));
        if (!x2apic_mode) {
                r = rdtsc();

Commenting out the KASSERT lets us advance as far as configuring uart0:

...
Event timer "LAPIC" quality 600
random: registering fast source Intel Secure Key RNG
random: fast provider: "Intel Secure Key RNG"
arc4random: WARNING: initial seeding bypassed the cryptographic random device because it was not yet seeded and th.
ioapic0: Assuming intbase of 0
ioapic0 <Version 2.0> irqs 0-23
random: entropy device external interface
aesni0: <AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256>
cpu0
isa0: <ISA bus>
orm0: <ISA Option ROM> at iomem 0xef000-0xeffff pnpid ORM0000 on isa0
uart0: <16550 or compatible> at port 0x3f8 irq 4 flags 0x10 on isa0


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x20
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff808f5190
stack pointer           = 0x28:0xffffffff81404e20
frame pointer           = 0x28:0xffffffff81404e40
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, IOPL = 0
current process         = 0 (swapper)
rdi: 0000000000000020 rsi: 0000000000000000 rdx: 00000000000003f9
rcx: 0000000000000000  r8: 0000000000000000  r9: 0000000000000001
rax: ffffffff80e49040 rbx: 0000000000000020 rbp: ffffffff81404e40
r10: 0000000000010000 r11: 0000000000000001 r12: fffff800024f4e64
r13: 0000000000000001 r14: 00000000000000c8 r15: fffff800025b9858
trap number             = 12
panic: page fault
cpuid = 0
time = 1
KDB: stack backtrace:
db_fetch_ksymtab() at db_fetch_ksymtab+0x17b/frame 0xffffffff81404af0
vpanic() at vpanic+0x135/frame 0xffffffff81404c20
panic() at panic+0x43/frame 0xffffffff81404c80
trap() at trap+0xd2b/frame 0xffffffff81404ce0
trap() at trap+0xdd0/frame 0xffffffff81404d50
calltrap() at calltrap+0x8/frame 0xffffffff81404d50
--- trap 0xc, rip = 0xffffffff808f5190, rsp = 0xffffffff81404e20, rbp = 0xffffffff81404e40 ---
pvclock_get_timecount() at pvclock_get_timecount+0x10/frame 0xffffffff81404e40
xen_delay() at xen_delay+0x1d/frame 0xffffffff81404e60
ns8250_bus_attach() at ns8250_bus_attach+0x3e2/frame 0xffffffff81404e90
uart_bus_attach() at uart_bus_attach+0x184/frame 0xffffffff81404ed0
device_attach() at device_attach+0x3aa/frame 0xffffffff81404f10
device_probe_and_attach() at device_probe_and_attach+0x70/frame 0xffffffff81404f40
isa_probe_children() at isa_probe_children+0x23a/frame 0xffffffff81404fa0
mi_startup() at mi_startup+0x1c8/frame 0xffffffff81404ff0
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x33: movq    $0,0x968d12(%rip)
db>

But then we take a page fault in pvclock_get_timecount(), so it looks like we can't just skip past that first KASSERT. What should we be doing the first time around?

I forced on bootverbose in init_main.c to see if that would give me any more clues. Our error is coming from having a TSC, but none of the TSC reads being valid. Along with there being a TSC we need another variable set to tell us how to read the TSC for calibration.

The source of our error is really early in machine dependent code, but it is caught much later. This means we can print how things are later on in boot, but not when they fail.

print_hypervisor_info:2639 vm_guest 0x1 tsc_freq 0x0 tsc 0x10 hv_high 0x40000001 cpu_high 0xd

I added a print in identcpu.c at print_hypervisor_info as that was easy to see in the message buffer.

static int
tsc_freq_cpuid_vm(void)
{
        u_int regs[4];

        if (vm_guest == VM_GUEST_NO)
                return (false);
        if (hv_high < 0x40000010)
                return (false);
        do_cpuid(0x40000010, regs);
        tsc_freq = (uint64_t)(regs[0]) * 1000;
        tsc_early_calib_exact = 1;
        return (true);
}

...

/*
 * Calculate TSC frequency using information from the CPUID leaf 0x15 'Time
 * Stamp Counter and Nominal Core Crystal Clock'.  If leaf 0x15 is not
 * functional, as it is on Skylake/Kabylake, try 0x16 'Processor Frequency
 * Information'.  Leaf 0x16 is described in the SDM as informational only, but
 * we can use this value until late calibration is complete.
 */
static bool
tsc_freq_cpuid(uint64_t *res)
{
        u_int regs[4];

        if (cpu_high < 0x15)
                return (false);
        do_cpuid(0x15, regs);
        if (regs[0] != 0 && regs[1] != 0 && regs[2] != 0) {
                *res = (uint64_t)regs[2] * regs[1] / regs[0];
                return (true);
        }
...

The calibration functions look like this: tsc_freq_cpuid_vm checks vm_guest (0x1 for us, or VM_GUEST_VM) and then the value of hv_high, while the default tsc_freq_cpuid checks cpu_high.

With hv_high as 0x40000001 and cpu_high as 0x0D we are failing the checks in both of the obvious tsc_freq functions. Why?
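Plugging the printed values into those gates makes the failure concrete. This Python sketch only mirrors the early-return checks and the leaf 0x15 arithmetic quoted above; it is not the kernel code, the function names are invented, VM_GUEST_NO being 0 is an assumption, and the leaf 0x15 register values in the last call are made up purely to show the formula.

```python
VM_GUEST_NO = 0  # assumption: vm_guest printed as 0x1 is VM_GUEST_VM


def vm_leaf_usable(vm_guest, hv_high):
    """Mirror of the early returns in tsc_freq_cpuid_vm."""
    if vm_guest == VM_GUEST_NO:
        return False
    if hv_high < 0x40000010:
        return False
    return True


def leaf_0x15_usable(cpu_high):
    """Mirror of the first check in tsc_freq_cpuid."""
    return cpu_high >= 0x15


def tsc_freq_from_leaf_0x15(eax, ebx, ecx):
    """Crystal frequency (ecx) scaled by the ebx/eax ratio, or None."""
    if eax != 0 and ebx != 0 and ecx != 0:
        return ecx * ebx // eax
    return None


# Values from the print_hypervisor_info line.
print(vm_leaf_usable(vm_guest=0x1, hv_high=0x40000001))
print(leaf_0x15_usable(cpu_high=0xD))

# Hypothetical leaf 0x15 contents: 24 MHz crystal, ratio 200/2.
print(tsc_freq_from_leaf_0x15(2, 200, 24_000_000))
```

Both gates come back false for our values, so neither path ever gets as far as setting tsc_freq.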

To figure that out we need to look at where both of these values come from.

do_cpuid(0, regs);
cpu_high = regs[0];     /* eax */

hv_high is set in identify_hypervisor_cpuid_base in sys/x86/x86/identcpu.c with this block:

/*
 * If this is the first entry or we found a
 * specific hypervisor, record the base, high value,
 * and vendor identifier.
 */
if (vm_guest != prev_vm_guest || leaf == 0x40000000) {
        hv_base = leaf;
        hv_high = regs[0];
        ((u_int *)&hv_vendor)[0] = regs[1];
        ((u_int *)&hv_vendor)[1] = regs[2];
        ((u_int *)&hv_vendor)[2] = regs[3];
        hv_vendor[12] = '\0';

        /*
         * If we found a specific hypervisor, then
         * we are finished.
         */
        if (vm_guest != VM_GUEST_VM &&
            /*
             * Xen and other hypervisors can expose the
             * HyperV signature in addition to the
             * native one in order to support Viridian
             * extensions for Windows guests.
             *
             * Do the full cpuid scan if HyperV is
             * detected, as the native hypervisor is
             * preferred.
             */
            vm_guest != VM_GUEST_HV)
                break;
}

This comment about VM_GUEST_VM is suspect, we are running as VM_GUEST_VM and this comment implies that it is an early detection, but an error to actually try and use. Other comments in identify_hypervisor_cpuid_base point to lkml threads and a vmware knowledge base (KB1009458):

Testing the CPUID hypervisor present bit

Intel and AMD CPUs have reserved bit 31 of ECX of CPUID leaf 0x1 as the
hypervisor present bit. This bit allows hypervisors to indicate their presence
to the guest operating system. Hypervisors set this bit and physical CPUs (all
existing and future CPUs) set this bit to zero. Guest operating systems can
test bit 31 to detect if they are running inside a virtual machine.

Intel and AMD have also reserved CPUID leaves 0x40000000 - 0x400000FF for
software use. Hypervisors can use these leaves to provide an interface to pass
information from the hypervisor to the guest operating system running inside a
virtual machine. The hypervisor bit indicates the presence of a hypervisor and
that it is safe to test these additional software leaves. VMware defines the
0x40000000 leaf as the hypervisor CPUID information leaf. Code running on a
VMware hypervisor can test the CPUID information leaf for the hypervisor
signature. VMware stores the string "VMwareVMware" in EBX, ECX, EDX of CPUID
leaf 0x40000000.

Testing the virtual BIOS DMI information and the hypervisor port

Apart from the CPUID-based method for VMware virtual machine detection,
VMware also provides a fallback mechanism for the following reasons:

This CPUID-based technique will not work for guest code running at CPL3
when VT/AMD-V is not available or not enabled.  The hypervisor present bit
and hypervisor information leaf are only defined for products based on
VMware hardware version 7.

That is really helpful, we aren't running under a real hypervisor, the Hypervisor: Origin = "TCGTCGTCGTCG" in the message buffer indicates that we are running on QEMU's Tiny Code Generator.
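Both halves of that detection can be checked against the boot log. The sketch below assumes that the Features2 value FreeBSD prints is CPUID leaf 0x1 ECX, and shows how a 12 byte vendor string is spread across EBX, ECX and EDX in little endian order, the same way the kernel block above copies regs[1..3] into hv_vendor. The helper names are mine.

```python
import struct

CPUID2_HV = 1 << 31  # hypervisor present bit in CPUID.1:ECX


def hypervisor_present(cpuid1_ecx):
    """True if bit 31 of CPUID leaf 0x1 ECX is set."""
    return bool(cpuid1_ecx & CPUID2_HV)


def vendor_from_regs(ebx, ecx, edx):
    """Reassemble the 12 character vendor string from leaf 0x40000000."""
    return struct.pack("<III", ebx, ecx, edx).decode("ascii")


def regs_from_vendor(vendor):
    """Inverse of vendor_from_regs, for a 12 character ASCII string."""
    return struct.unpack("<III", vendor.encode("ascii"))


features2 = 0xFED8320B  # "Features2" from the boot log above
regs = regs_from_vendor("TCGTCGTCGTCG")

print(hypervisor_present(features2))
print([hex(r) for r in regs])
print(vendor_from_regs(*regs))
```

So the hypervisor bit is set even though the truncated Features2 line doesn't show it, and the signature is QEMU's TCG one rather than anything the kernel has a specific case for.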

This is a fork where I could look at a bunch of stuff, QEMU, the Linux kernel or a running system on Linux with KVM. I don't really want to do any of these, but I already have a clone of QEMU so I'll start by looking at how MicroVM is implemented.

QEMU Microvm

Microvm is implemented in qemu/hw/i386/ by microvm.c, microvm-dt.c (for fw_cfg) and acpi-microvm.c (not sure about that one).

Reading those didn't help, and neither did searching the Linux GitHub repo.

Maybe I can get an Ubuntu kernel to boot to failure. Based on these instructions I grabbed the linked kernel and threw it into my script, which didn't work. Not sure if the enable-kvm option was the issue, I booted a test machine into Ubuntu and ran through their entire example as is, which worked fine.

Then I tried my kernel in their script and their kernel in my script and neither was happy enough to give me any output. Of these three options none has really gotten me anywhere.

Going back to tsc_freq_cpuid_vm it does this:

        if (hv_high < 0x40000010)
                return (false);
        do_cpuid(0x40000010, regs);

Skipping this check gets us to the same pvclock_get_timecount fault as when we skipped the assert entirely. This is trying to read a memory address that isn't there because we aren't Xen, and yes we aren't Xen. Why do we think we are Xen?
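For context, the check being skipped guards the read of leaf 0x40000010. As far as I understand the VMware-style timing leaf convention, EAX there carries the TSC frequency in kHz. A minimal sketch of that logic follows, with the CPUID results passed in as plain arguments. The name tsc_freq_from_timing_leaf and the two macros are invented for this example and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HV_LEAF_BASE    0x40000000u     /* hypervisor information leaf */
#define HV_LEAF_TIMING  0x40000010u     /* timing leaf (assumed convention) */

/*
 * hv_high is the highest hypervisor leaf, as reported in EAX of
 * HV_LEAF_BASE. eax_khz is EAX of HV_LEAF_TIMING, assumed to be the
 * TSC frequency in kHz. Returns false when the leaf is out of range
 * or reports nothing useful.
 */
bool
tsc_freq_from_timing_leaf(uint32_t hv_high, uint32_t eax_khz,
    uint64_t *freq_hz)
{
        assert(freq_hz != NULL);
        if (hv_high < HV_LEAF_TIMING)
                return (false);
        if (eax_khz == 0)
                return (false);
        *freq_hz = (uint64_t)eax_khz * 1000;
        return (true);
}
```

A hypervisor that only advertises leaves up to 0x40000001 fails the first comparison, which is the early return seen in tsc_freq_cpuid_vm.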

isxen (pv.c) runs early on and detects if we are running under Xen. If we are, then the init_ops table is swapped from the default one to the Xen one ( xen_pvh_init_ops ), somewhere ( hammer_time_xen ).

Time for a break to try and blinken some lights, which turned out to be doing some recycling instead.

On these breaks I have found it helps a lot to leave a note somewhere in the source; on return I can try a build and the note will piss off the compiler.

/usr/home/tj/code/freebsd/worktrees/microvm/sys/dev/xen/timer/xen_timer.c:165:2: error: use of undeclared identifier 'look'
  165 |         look here buddy
      |         ^
1 error generated.

I wasn't sure how we were getting to hammer_time_xen , but during the break I figured it is probably being driven by the kernel config.

# Xen HVM Guest Optimizations
# NOTE: XENHVM depends on xenpci and xentimer.
# They must be added or removed together.
# NOTE: These are present in FIRECRACKER because the PVH boot method
# originates from Xen; once that code is untangled these can be removed.
options         XENHVM                  # Xen HVM kernel infrastructure
device          xenpci                  # Xen HVM Hypervisor services driver
device          xentimer                # Xen x86 PV timer device

The Firecracker kernel config comes with lots of potentially helpful comments, and in the Xen section we have xentimer . Commenting this out and fixing up some parts of the Xen init_ops lets us advance to a fascinating panic:

Statistical lapic calibration failed!  Clocks might be ticking at variable rates.                                     
Falling back to slow lapic calibration.                                                                               
lapic: Divisor 2, Frequency 104591 Hz                                                                                 
Timecounters tick every 10.000 msec                                                                                   
lo0: bpf attached                                                                                                     
vlan: initialized, using hash tables with chaining                                                                    
IPsec: Initialized Security Association Processing.                                                                   
tcp_init: net.inet.tcp.tcbhashsize auto tuned to 4096      
random: unblocking device.                                                                                            
panic: deadlres_td_on_lock: possible deadlock detected for 0xffffffff80f145e0 (swapper), blocked for 139455 ticks

cpuid = 0                                                                                                             
time = 102624                                                                                                         
KDB: stack backtrace:                                                                                                 
db_fetch_ksymtab() at db_fetch_ksymtab+0x17b/frame 0xfffffe004602cd20
vpanic() at vpanic+0x135/frame 0xfffffe004602ce50
panic() at panic+0x43/frame 0xfffffe004602ceb0
profclock() at profclock+0x5fa/frame 0xfffffe004602cef0
fork_exit() at fork_exit+0x82/frame 0xfffffe004602cf30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe004602cf30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

And on some other attempts after a long time we get to mount root:

isa_probe_children: probing PnP devices
Device configuration finished.
procfs registered
Timecounter "TSC" frequency 543000 Hz quality 800
Statistical lapic calibration failed!  Clocks might be ticking at variable rates.
Falling back to slow lapic calibration.
lapic: Divisor 2, Frequency 108474 Hz
Timecounters tick every 10.000 msec
lo0: bpf attached
vlan: initialized, using hash tables with chaining
IPsec: Initialized Security Association Processing.
tcp_init: net.inet.tcp.tcbhashsize auto tuned to 4096
random: unblocking device.
WARNING: WITNESS option enabled, expect reduced performance.

Loader variables:

Manual root filesystem specification:
  <fstype>:<device> [options]
      Mount <device> using filesystem <fstype>
      and with the specified (optional) option list.

    eg. ufs:/dev/da0s1a
        zfs:zroot/ROOT/default
        cd9660:/dev/cd0 ro
          (which is equivalent to: mount -t cd9660 -o ro /dev/cd0 /)

  ?               List valid disk boot devices
  .               Yield 1 second (for background tasks)
  <empty line>    Abort manual input

mountroot>

Mount root

Mountroot is the first 'success' in doing a port: there is enough of a system (even if inconsistently) to add disks, even if there aren't drivers for the required things yet. Next steps are to:

  • figure out why this is slow (it is meant to be fast after all)
  • look at the lapic calibration complaints
  • add virtio devices

I think before we can add Virtio devices we will need to figure out why the MPTable isn't telling us about the 4 cores that QEMU should be passing through.

real memory  = 536858624 (511 MB)                          
Physical memory chunk(s):                                  
0x0000000000001000 - 0x000000000009efff, 647168 bytes (158 pages)                
0x0000000000100000 - 0x00000000001fffff, 1048576 bytes (256 pages)
0x0000000001602000 - 0x000000001edf5fff, 494878720 bytes (120820 pages)
0x000000001fc00000 - 0x000000001fd6cfff, 1495040 bytes (365 pages)
avail memory = 492081152 (469 MB)                          
MPTable: <SMP: Added CPU 0 (BSP)                  
BOCHSCPU 0.1         >                                     
Event timer "LAPIC" quality 600                      
LAPIC: ipi_wait() us multiplier 1 (r 1645530 tsc 543000)   
Pentium Pro MTRR support enabled

Where are my cpus?

The QEMU bits of my launch script right now look like this:

memory=512m
cores=4
netif=tap0
cpu="max"

qemu-system-x86_64 -M microvm                                   \
        -cpu ${cpu}                                             \
        -m ${memory}                                            \
        -smp ${cores}                                           \
        -kernel ${kernel}                                       \
        -nodefaults                                             \
        -no-user-config                                         \
        -nographic                                              \
        -serial stdio                                           \
        -drive id=test,file=${disk},format=raw,if=none          \
        -device virtio-blk-device,drive=test

That requests four cpus/cores, but in the last message buffer output we see only one CPU, CPU 0 (BSP) .

I'm going to start by finding my missing cpus. In the last dmesg above we have:

MPTable: <SMP: Added CPU 0 (BSP)                  
BOCHSCPU 0.1         >

The MPTable: < string is printed by sys/x86/x86/mptable.c.

There are a couple of functions that call mptable_walk_table with different callbacks; sticking a print in one of those:

static void
mptable_probe_cpus_handler(u_char *entry, void *arg)
{
printf("%s:%d %d\n", __func__, __LINE__, *entry);
        proc_entry_ptr proc;

        switch (*entry) {
        case MPCT_ENTRY_PROCESSOR:

Can give us an idea of the entries it carries.

---<<BOOT>>---                                                 
MP Configuration Table version 1.4 found at 0xffffffff800f4480 
APIC: Using the MPTable enumerator.                            
mptable_probe_cpus_handler:531 0                               
mptable_probe_cpus_handler:531 1                               
mptable_probe_cpus_handler:531 2                               
mptable_probe_cpus_handler:531 3                               
mptable_probe_cpus_handler:531 3                               
mptable_probe_cpus_handler:531 3                               
mptable_probe_cpus_handler:531 3
...

Matching those values against the header, we see we only have one MPCT_ENTRY_PROCESSOR .

/* Base table entries */

#define MPCT_ENTRY_PROCESSOR    0
#define MPCT_ENTRY_BUS          1
#define MPCT_ENTRY_IOAPIC       2
#define MPCT_ENTRY_INT          3
#define MPCT_ENTRY_LOCAL_INT    4
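The walk itself is simple once you know the entry sizes. In the MP specification 1.4 a processor entry is 20 bytes and the other four base entry types are 8 bytes each. The following is a sketch of counting entries by type over a raw buffer. mpct_count_entries and MPCT_NTYPES are hypothetical names for this example; this is not the kernel's mptable_walk_table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPCT_ENTRY_PROCESSOR    0
#define MPCT_ENTRY_BUS          1
#define MPCT_ENTRY_IOAPIC       2
#define MPCT_ENTRY_INT          3
#define MPCT_ENTRY_LOCAL_INT    4
#define MPCT_NTYPES             5       /* hypothetical, for array sizing */

/* Entry sizes in bytes, indexed by entry type (MP specification 1.4). */
static const size_t mpct_entry_size[MPCT_NTYPES] = { 20, 8, 8, 8, 8 };

/*
 * Walk len bytes of base table entries and count each type.
 * Returns the number of entries seen, stopping early at an unknown
 * type or a truncated entry.
 */
size_t
mpct_count_entries(const uint8_t *entries, size_t len,
    unsigned counts[MPCT_NTYPES])
{
        size_t off = 0, n = 0;

        assert(entries != NULL && counts != NULL);
        for (int i = 0; i < MPCT_NTYPES; i++)
                counts[i] = 0;
        while (off < len) {
                uint8_t type = entries[off];

                if (type >= MPCT_NTYPES ||
                    len - off < mpct_entry_size[type])
                        break;
                counts[type]++;
                n++;
                off += mpct_entry_size[type];
        }
        return (n);
}
```

A table like the one printed above would come back with counts[MPCT_ENTRY_PROCESSOR] equal to 1, which is a quicker way to spot the missing CPUs than reading the raw type values.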

Firecracker uses MPTable, I know this from Colin's BSDCan talk, but this doesn't match what I thought I was telling QEMU to do.

I'm not sure what is up with this. I think I need another target to compare against.

Where's my disks?

From Colin's port we learn that Firecracker gets its virtio configuration from the Linux command line. I don't see any evidence of this in the message buffer right now. At the end of the blog posts there is some help I am certainly missing:

You'll probably also want to build a disk image so that FreeBSD has
something to boot from; place vfs.root.mountfrom=ufs:/dev/vtbd0 into
Firecracker's boot_args to tell FreeBSD to use the disk you attach (aka.
the first Virtio block device) as the root disk.

As I understand it, virtio_blk needs to be told where to attach because MicroVM, like Firecracker, has no dynamic configuration for it to probe. So let's add this to the launch script:

bootargs="vfs.root.mountfrom=ufs:/dev/vtbd0"
...
-smp ${cores}                                           \
-kernel ${kernel}                                       \
-append ${bootargs}                                     \
-nodefaults                                             \
-no-user-config                                         \

Not quite enough. I'm not sure how boot args get from the QEMU command line into the kernel, or where they would be stored. Generally when doing silly things like this I'd prefer to learn, and leave reading git logs as the last option. My order of preference goes like this:

  • reading code
  • reading a book (lol)
  • searching the internet
  • reading git logs

I know from loader we can set -s , -d and -v for single, debug and verbose, and I've already hijacked verbose to always be on because we don't have a loader to set it. I will look there.

From looking at various RB_ flags I stumbled onto boot_parse_cmdline , a variation of which is called from pv.c (all the other calls are on fun RISC architectures):

} else {
        /* Parse the extra boot information given by Xen */
        if (start_info->cmdline_paddr != 0)
                boot_parse_cmdline_delim(
                    (char *)(start_info->cmdline_paddr + KERNBASE),
                    ", \t\n");
        kmdp = NULL;
        strlcpy(bootmethod, "PVH", sizeof(bootmethod));
}

boothowto |= boot_env_to_howto();
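The delimiter string in that call is worth a closer look. Below is a small userland sketch of splitting a command line on the same ", \t\n" set. It is not the kernel's boot_parse_cmdline_delim, and split_cmdline is a name made up here; it only shows what counts as a separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split cmdline in place on any character in delims, storing up to max
 * token pointers in tokens. Returns the number of tokens found.
 */
size_t
split_cmdline(char *cmdline, const char *delims, char *tokens[], size_t max)
{
        size_t n = 0;
        char *p = cmdline;

        assert(cmdline != NULL && delims != NULL && tokens != NULL);
        while (n < max) {
                p += strspn(p, delims);         /* skip leading delimiters */
                if (*p == '\0')
                        break;
                tokens[n++] = p;
                p += strcspn(p, delims);        /* find the end of the token */
                if (*p == '\0')
                        break;
                *p++ = '\0';                    /* terminate and move on */
        }
        return (n);
}
```

A real tab or newline character splits the string, while the two-character sequence of a backslash followed by t is not in the delimiter set and stays part of the token.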

With boothowto and bootverbose I can see things that pick up a command line, but I'm having a hard time getting a reference to one somewhere I can print it. I still don't have prints working from the Xen init functions.

init_dynamic_kenv_from is called from the init_dynamic_kenv SYSINIT once we have a functioning VM system. This isn't getting anywhere.

Fine, a quick search of the commit log reveals:

commit 0e1f5ab7db2cd2837f97f169122897b19c185dbd
Author: Colin Percival <cperciva@FreeBSD.org>
Date:   Fri Aug 12 17:54:26 2022 -0700

    virtio_mmio: Support command-line parameters

    The Virtio MMIO bus driver was added in 2014 with support for devices
    exposed via FDT; in 2018 support was added to discover Virtio MMIO
    devices via ACPI tables, as in QEMU.  The Firecracker VMM eschews both
    FDT and ACPI, instead presenting device information via kernel command
    line arguments of the form virtio_mmio.device=<parameters>.

    These command line parameters get converted into kernel environment
    variables; this adds support for parsing those variables and attaching
    virtio_mmio children to nexus.

    There is a case to be made that it would be cleaner to have a new
    "cmdlinebus" attached to nexus and virtio_mmio children attached to
    that.  A future commit might do that.

    Discussed with: imp, jrtc27
    Sponsored by:   https://patreon.com/cperciva
    Differential Revision:  https://reviews.freebsd.org/D36189

It adds this identify method:

static void
vtmmio_cmdline_identify(driver_t *driver, device_t parent)
{
        size_t n;
        char name[] = "virtio_mmio.device_XXXX";
        char * val;

        /* First variable just has its own name. */
        if ((val = kern_getenv("virtio_mmio.device")) == NULL)
                return;
        parsearg(driver, parent, val);
        freeenv(val);

        /* The rest have _%zu suffixes. */
        for (n = 1; n <= 9999; n++) {
                sprintf(name, "virtio_mmio.device_%zu", n);
                if ((val = kern_getenv(name)) == NULL)
                        return;
                parsearg(driver, parent, val);
                freeenv(val);
        }
}

While we can trace that it is called, it uses kern_getenv to look for compatible devices, and that doesn't help me check what the hell the command line is.
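To make the contents of those variables less abstract, here is a sketch of parsing one parameter string. It assumes the format documented for Linux's virtio_mmio.device parameter, <size>@<baseaddr>:<irq>[:<id>] with K, M and G size suffixes, and that FreeBSD accepts the same form; I have not checked it against the kernel's parsearg. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct vtmmio_arg {
        uint64_t        size;   /* size of the MMIO region in bytes */
        uint64_t        base;   /* physical base address */
        unsigned long   irq;    /* interrupt number */
        long            id;     /* optional device id, -1 when absent */
};

/* Parse "<size>@<baseaddr>:<irq>[:<id>]"; returns false on bad input. */
bool
vtmmio_parse_arg(const char *arg, struct vtmmio_arg *out)
{
        char *end;
        unsigned shift = 0;

        assert(arg != NULL && out != NULL);

        out->size = strtoull(arg, &end, 0);
        if (end == arg)
                return (false);
        switch (*end) {
        case 'K': case 'k': shift = 10; break;
        case 'M': case 'm': shift = 20; break;
        case 'G': case 'g': shift = 30; break;
        }
        if (shift != 0) {
                out->size <<= shift;
                end++;
        }
        if (*end != '@')
                return (false);

        arg = end + 1;
        out->base = strtoull(arg, &end, 0);
        if (end == arg || *end != ':')
                return (false);

        arg = end + 1;
        out->irq = strtoul(arg, &end, 0);
        if (end == arg)
                return (false);

        out->id = -1;
        if (*end == ':') {
                arg = end + 1;
                out->id = strtol(arg, &end, 0);
                if (end == arg)
                        return (false);
        }
        return (*end == '\0');
}
```

Under these assumptions a string such as 4K@0xd0000000:5 would describe a 4096-byte region at 0xd0000000 using interrupt 5.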

Cheated for nothing - don't worry the rules are made up.

I grabbed a copy of the Xen extra boot information into a temp variable, waited for a panic to drop into ddb, printed it, and got -dsr , which were the test flags I passed to QEMU when I started poking at this.

I took a stab in the dark and set instead:

bootargs="vfs.root.mountfrom=ufs:/dev/vtbd0\t\nvirtio_mmio.device"

And I got a different mountroot prompt!

Trying to mount root from ufs:/dev/vtbd0\t\nvirtio_mmio.device []...
mountroot: waiting for device /dev/vtbd0\t\nvirtio_mmio.device...
Mounting from ufs:/dev/vtbd0\t\nvirtio_mmio.device failed with error 19.

Loader variables:
  vfs.root.mountfrom=ufs:/dev/vtbd0\t\nvirtio_mmio.device

Manual root filesystem specification:
  <fstype>:<device> [options]

Kicking that into ddb and boom, there is the -append arg from QEMU.

db> x/s *cmdline
kernbase+0x560: vfs.root.mountfrom=ufs:/dev/vtbd0\t\nvirtio_mmio.device

So! However QEMU is passing the virtio device configuration through to the VM, it isn't via this command line; there must be some other way. This effort has confirmed that the bootargs are making it through to the kernel, but we aren't getting everything we need.

I guess I need to look at what QEMU is doing and what a successful Linux boot is like.

Looking at QEMU

While donating platelets I had a read of the QEMU MicroVM implementation and the short git log for the file and hit this commit:

commit f6f7e2d88d0b29d8b6e1a12a5f3f9f31faff9846
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Tue Sep 15 14:09:00 2020 +0200

    microvm/acpi: disable virtio-mmio cmdline hack

    ... in case we are using ACPI.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Sergio Lopez <slp@redhat.com>
    Message-id: 20200915120909.20838-13-kraxel@redhat.com

The commit before this one gives context too:

commit 67eb6a4007fd8f9073020e506453ff5b7c25cb34
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Tue Sep 15 14:08:59 2020 +0200

    microvm/acpi: use seabios with acpi=on

    With acpi=off continue to use qboot.

    Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
    Reviewed-by: Igor Mammedov <imammedo@redhat.com>
    Reviewed-by: Sergio Lopez <slp@redhat.com>
    Message-id: 20200915120909.20838-12-kraxel@redhat.com

Despite MicroVM saying it doesn't support ACPI there is a lot of ACPI specific code in the driver. My devices weren't in the kernel command line because the VM was booting with ACPI. It is also using a different firmware because it supports ACPI.

For MicroVM a tiny fast boot firmware called "qboot" was written, but in my testing when I have locked everything up I have seen plenty of "SeaBIOS" messages. I knew that wasn't what the documentation promised as the default, but I was also getting boot messages most of the time.

I think the default BIOS with ACPI is SeaBIOS, and qboot otherwise.

I added -machine acpi=off to my boot script flags and this time got to:

Mounting from ufs:/dev/vtbd0 failed with error 2; retrying for 2 more seconds
Attempted recovery for standard superblock: failed
Attempted extraction of recovery data from standard superblock: failed
Attempt to find boot zone recovery data.
Finding an alternate superblock failed.
Check for only non-critical errors in standard superblock
Failed, superblock has critical errors
Attempted recovery for standard superblock: failed
Attempted extraction of recovery data from standard superblock: failed
Attempt to find boot zone recovery data.
Finding an alternate superblock failed.
Check for only non-critical errors in standard superblock

The mountroot errors just looped until I killed the VM. The errors make perfect sense: I have no idea what test.raw is, and there is a good chance it is just an empty file. I gave QEMU fbsd14.raw from my vms directory, but that also didn't boot, so there are things to do.

Booting with the new firmware we also have an MPTable with multiple CPUs:

avail memory = 491782144 (469 MB)
MPTable: <mptable_setup_cpus_handler:552 0
SMP: Added CPU 0 (BSP)
mptable_setup_cpus_handler:552 0
SMP: Added CPU 1 (AP)
mptable_setup_cpus_handler:552 0
SMP: Added CPU 2 (AP)
mptable_setup_cpus_handler:552 0
SMP: Added CPU 3 (AP)

...

cpu0 BSP:                                                            
     ID: 0x00000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
SMP: AP CPU #1 Launched!                                             
cpu1 AP:                                                             
     ID: 0x01000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
SMP: AP CPU #3 Launched!                                             
cpu3 AP:                                                             
     ID: 0x03000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
SMP: AP CPU #2 Launched!                                             
cpu2 AP:                                                             
     ID: 0x02000000   VER: 0x00050014 LDR: 0x00000000 DFR: 0xffffffff
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000001ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
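Those register dumps can be decoded by hand. A minimal sketch, assuming the xAPIC register layout from the Intel SDM: APIC ID in bits 24-31 of the ID register, version and max LVT entry in the version register, and vector, delivery mode and mask bit in each LVT entry. The names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct lvt_fields {
        uint8_t vector;         /* bits 0-7 */
        uint8_t delivery_mode;  /* bits 8-10: 0 fixed, 4 NMI, 7 ExtINT */
        bool    masked;         /* bit 16 */
};

/* xAPIC ID register: the APIC ID lives in bits 24-31. */
uint8_t
lapic_id(uint32_t id_reg)
{
        return ((uint8_t)(id_reg >> 24));
}

/* Version register: version number in bits 0-7. */
uint8_t
lapic_version(uint32_t ver_reg)
{
        return ((uint8_t)(ver_reg & 0xff));
}

/* Version register: index of the highest LVT entry in bits 16-23. */
uint8_t
lapic_max_lvt(uint32_t ver_reg)
{
        return ((uint8_t)((ver_reg >> 16) & 0xff));
}

/* Pull the common fields out of one local vector table entry. */
void
lapic_decode_lvt(uint32_t reg, struct lvt_fields *out)
{
        assert(out != NULL);
        out->vector = (uint8_t)(reg & 0xff);
        out->delivery_mode = (uint8_t)((reg >> 8) & 0x7);
        out->masked = (reg & (1u << 16)) != 0;
}
```

With the values above, ID 0x03000000 decodes to APIC ID 3, VER 0x00050014 to version 0x14 with a max LVT entry of 5, and timer 0x000100ef to vector 0xef with the mask bit set.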

Generally the console seems to be much more responsive, and rather than the calibration error we had before:

Timecounter "TSC" frequency 543000 Hz quality 800
Statistical lapic calibration failed!  Clocks might be ticking at variable rates.
Falling back to slow lapic calibration.
lapic: Divisor 2, Frequency 108474 Hz

We now have

Timecounter "TSC" frequency 543000 Hz quality -100

Mounting a disk

Time to check what test.raw is:

[tj@displacementactivity] $ sudo mdconfig -a ~/vms/test.raw
Password:
md0
[tj@displacementactivity] $ gpart show md0
=>       34  104857526  md0  GPT  (50G)
         34        122    1  freebsd-boot  (61K)
        156      66584    2  efi  (33M)
      66740    2097152    3  freebsd-swap  (1.0G)
    2163892  102693452    4  freebsd-ufs  (49G)
  104857344        216       - free -  (108K)

Which suggests that the boot script needs to know the correct partition:

bootargs="vfs.root.mountfrom=ufs:/dev/vtbd0p4"

Which gets us all the way to a login prompt! As I am ready to declare victory I log in and the system silently hangs, but it works on a second try.

If we don't think about how much time was spent debugging the wrong QEMU BIOS firmware due to poor documentation then that was quite an easy process.

From my minimal testing there are a few things to look at next:

  • reboot and shutdown don't
  • network devices appear and configure, but don't send packets
  • it just hangs sometimes
  • boot takes much longer than I would like (100ms max please)

Still more to do.

Thoughts on retro computing

I got an Atari Portfolio speaking to me yesterday, but actually doing anything with this machine will probably require buying expensive unreliable old accessories.

This is what happened with the 68k Powerbook I acquired last year too. That is a cool computer, but it needs a new power supply and it could really do with some more RAM. RAM which of course is impossible to buy and impossible to make due to a weird connector no one can find.

I really like computers, old ones and new ones alike. I love the weird portable machines that were being made when everyone decided they didn't only need a personal computer, but an on the person computer.

But I don't have nostalgia for these computers.

Many many years ago, in the #hackrf IRC channel I was told off for telling someone about getting an im-me as a radio toy. The im-me is a wonderful hack, but the thing about new functionality hacked into repurposed devices is that it is incredibly non-democratic in how it is available.

The device itself is normally no longer being made, and as soon as media coverage lands for the cool hack eBay scalpers shoot up prices, taking the 'cheap' repurposed gadget out of the hands of those that really need it.

Instead of the im-me you could pick up a super cheap TI dev board with the same radio, or the YardStickOne , wonderful open hardware supported by a company run by great people.

I am drawn to these old, portable, person-sized computers. They reflect the desire to bring the power and joy of computers with us everywhere we go and, well, they are really cool.

They are complex, simple, understandable machines; when they were made they were both the height of technology and the trailing edge.

The 80C88 wasn't winning any contests for performance in 1989. I have ideas in the works that build on using these little computers, but if anything I'm loath to invest further in them.

How can I really justify investing in keeping old computers going when for similar amounts of money and time I can build new computers which are just as weird, but understandable and expandable with tools and parts I can actually get?

Where the Book 8088 and the Pocket386 fit into this world of new old Commodore spins I have no idea.

netmap on cxgbe interfaces

Last year my testbed machine got the luxury of 100Gbit Chelsio T6 network interfaces with a pair of T62100-LP-CR cards. I have been using these recently for VPP development on FreeBSD and while they work well I have been having a lot of trouble doing any packet generation with them.

I got pretty miserable performance using netmap's pkt-gen on FreeBSD, and when I tried Linux the comparative oddness of the T6 drivers seemed to trip up the TRex packet generator suite.

From reading Chelsio's own documentation it seemed I was missing something: they have benchmarks using pkt-gen showing link saturation for small packets at 40Gbit/s on their T5 cards. What was up?

When I run a pkt-gen (lives in tools/tools/netmap ) benchmark I get a little over 4 Gbit/s of traffic:

$ sudo ./pkt-gen -f tx -i cc0 -S 00:07:43:74:3f:e9 -D 00:07:43:74:3f:e1
...
250.554583 main_thread [2713] 9.012 Mpps (9.587 Mpkts 4.326 Gbps in 1063799 usec) 512.00 avg_batch 0 min_space

~9 Mpps isn't that great, I paid for 100 Gbit and 4 just isn't enough.

FreeBSD has a zero-copy userspace networking framework called netmap . It cleverly hooks into drivers early in their packet processing pipeline, and if there is a netmap application running it steals the packets so the host doesn't see them. Aiding with support, if the app doesn't want a given packet it can be given back to the driver and normal processing can occur.

Drivers can be natively supported with netmap or if support isn't available an emulated mode driver can be used.

Eventually after looking really really hard at the Chelsio examples I saw that the network interface names were different from mine. They were prefixed with a 'v', 'vcxl' rather than 'cxl'.

A bunch of googling later I read the top of cxgbe(4) correctly:

The cxgbe driver uses different names for devices based on the associated
ASIC:

      ASIC    Port Name    Parent Device    Virtual Interface
      T4      cxgbe        t4nex            vcxgbe
      T5      cxl          t5nex            vcxl
      T6      cc           t6nex            vcc

To get a virtual interface we need to set the hw.cxgbe.num_vis loader tunable to a value greater than 1. I added the following to my /boot/loader.conf :

# Add a virtual port for cxgbe                                     
hw.cxgbe.num_vis="2"            # Number of VIs per port           
hw.cxgbe.nnmrxq_vi="8"          # Number of netmap RX queues per VI
hw.cxgbe.nnmtxq_vi="8"          # Number of netmap TX queues per VI
hw.cxgbe.nrxq_vi="8"            # Number of RX queues per VI       
hw.cxgbe.ntxq_vi="8"            # Number of TX queues per VI

On a reboot I have a shiny new vcc0 device, letting pkt-gen run there I get much nicer results:

$ sudo ./pkt-gen -f tx -i vcc0 -S 00:07:43:74:3f:e9 -D 00:07:43:74:3f:e1 
...
329.706767 main_thread [2713] 68.770 Mpps (73.060 Mpkts 33.010 Gbps in 1062380 usec) 480.61 avg_batch 99999 min_space

~70 Mpps and 33 Gbps of 64 byte packets is pretty great, and with an imix I can saturate the 100Gbps with pkt-gen .

I didn't find anything about this differentiator by searching, and pkt-gen doesn't mention it. There are some dmesg lines hinting that cc0 falls back to an emulated adapter, but if you don't know, you don't know.
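The Mpps and Gbps figures that pkt-gen prints are related by simple arithmetic. The sketch below assumes 60-byte frames as handed to the NIC, which is what the reported numbers work out to (64 bytes on the wire once the 4-byte FCS is added), and 20 more bytes per frame of preamble and inter-frame gap when computing line rate. The function and macro names are made up for illustration.

```c
#include <assert.h>

#define ETH_FCS_BYTES           4       /* CRC appended by the NIC */
#define ETH_WIRE_OVERHEAD       20      /* preamble + SFD (8), inter-frame gap (12) */

/* Throughput in Gbit/s counting only the bytes handed to the NIC. */
double
frame_gbps(double mpps, unsigned frame_bytes)
{
        assert(mpps >= 0.0);
        return (mpps * 1e6 * frame_bytes * 8.0 / 1e9);
}

/*
 * Maximum packet rate in Mpps for a link of link_gbps, for frames of
 * frame_bytes before the FCS, counting FCS and per-frame wire overhead.
 */
double
line_rate_mpps(double link_gbps, unsigned frame_bytes)
{
        unsigned wire;

        assert(frame_bytes > 0);
        wire = frame_bytes + ETH_FCS_BYTES + ETH_WIRE_OVERHEAD;
        return (link_gbps * 1e9 / (wire * 8.0) / 1e6);
}
```

By this arithmetic 9.012 Mpps is about 4.33 Gbps and 68.770 Mpps about 33.01 Gbps, matching the pkt-gen output, while line rate at 100 Gbit/s for minimum-size frames would be roughly 148.8 Mpps.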

238.184192 [1135] generic_netmap_attach     Emulated adapter for cc0 created (prev was NULL)
238.192821 [ 320] generic_netmap_register   Emulated adapter for cc0 activated
238.200244 [1889] netmap_interp_ringid      invalid ring id 1
238.206046 [ 295] generic_netmap_unregister Emulated adapter for cc0 deactivated
