跳转至

QEMU 启动流程分析

主要贡献者

对一个软件源码建立最直观印象的方法,就是观测它的初始化流程。QEMU 是一个 C 为主体的项目,所以理所当然我们要从 main 函数开始追踪。

以 qemu-system-riscv64 为例,我们使用如下命令启动:

qemu-system-riscv64 -M virt -s -S -nographic

这个命令让 QEMU 创建了一个 virt 机器模型,以 -nographic 模式运行,并开启了 gdbstub 远程调试功能。

其中,-s 会让 QEMU 在 TCP 1234 端口等待 gdb 连接,-S 会让 guest 在 gdb 明确发出继续命令前保持暂停;对于 virt(以及 sifive_u)这类机器,不指定 -bios 时默认等同于 -bios default,会自动加载随 QEMU 发布的默认 OpenSBI 固件。你可以使用任意支持 RISC-V 的 gdb(例如 riscv64-elf-gdb)连接该 gdbstub。

Note

我们先观察 virt Machine 是如何创建的,然后再探讨一下 QEMU 是如何加载 OpenSBI 的二进制程序到 virt Machine 的内存里面的,最后再看 QEMU 是如何初始化 vCPU 指向第一条 Guest 指令的。

把以上流程梳理清楚,我们就对 QEMU 的工作原理有更多的了解了,其他 Machine 的初始化和准备工作,基本类似。

所以本小节,讲述的是一个方法论,让你快速分析不同的 Machine,或者对于本训练营没有覆盖到知识点,你也有足够的手段去研究它。

概览

  • 从 main 入口追踪 QEMU 启动主流程
  • virt Machine 初始化与类型注册路径
  • 加载 OpenSBI 与客户机程序到内存
  • vCPU 第一条指令的启动路径
  • 主循环与 IO 线程的职责划分

virt Machine 初始化

最方便的办法,是通过 gdb 来反向定位源码。QEMU 源码中充斥各种 C 宏魔法,对于刚接触 QEMU 的人来说,看起来会有一些吃力,使用 gdb 调试,可以更有针对性分析我们关注的部分,抽丝剥茧,层层拨开 QEMU 源码的神秘面纱。

virt Machine 是一个 QEMU 的 QOM Object(下一章节会讲),我们可以在它的 class 初始化或者对象实例化的源码位置,打一个断点,来观察调用栈。这里我们先搜索一下 virt Machine 源码里关于 typeinfo 相关的代码:

static const TypeInfo virt_machine_typeinfo = {
    .name       = MACHINE_TYPE_NAME("virt"),
    .parent     = TYPE_MACHINE,
    .class_init = virt_machine_class_init,
    .instance_init = virt_machine_instance_init,
    .instance_size = sizeof(RISCVVirtState),
    .interfaces = (const InterfaceInfo[]) {
         { TYPE_HOTPLUG_HANDLER },
         { }
    },
};

static void virt_machine_init_register_types(void)
{
    type_register_static(&virt_machine_typeinfo);
}

type_init(virt_machine_init_register_types)

所以,我们可以给 virt_machine_class_init()virt_machine_instance_init() 分别下两个断点。

使用下面的命令用 gdb 调试 QEMU :

gdb qemu-system-riscv64 -ex "set args -M virt -nographic"
(gdb) b virt_machine_class_init
(gdb) b virt_machine_instance_init
(gdb) r
...

首先我们跟踪到的,是 virt_machine_class_init 函数,看一下调用栈:

(gdb) bt
#0  virt_machine_class_init (oc=0x5555572012d0, data=0x0) at ../hw/riscv/virt.c:1916
#1  0x0000555555dc0d11 in type_initialize (ti=0x555556ffab50) at ../qom/object.c:1123
#2  object_class_foreach_tramp (key=<optimized out>, value=0x555556ffab50, opaque=0x7fffffffd8a0) at ../qom/object.c:1110
#3  0x00007ffff6f7380b in g_hash_table_foreach () from /usr/lib/libglib-2.0.so.0
#4  0x0000555555dc15f1 in object_class_foreach (fn=0x555555dbf370 <object_class_get_list_tramp>, implements_type=<optimized out>, include_abstract=false,
    opaque=0x7fffffffd898) at ../qom/object.c:87
#5  object_class_get_list (implements_type=implements_type@entry=0x555555fa02e2 "machine", include_abstract=include_abstract@entry=false)
    at ../qom/object.c:1189
#6  0x0000555555b22cc1 in select_machine (errp=0x7fffffffd930, qdict=0x555557009e70) at ../system/vl.c:1675
#7  qemu_create_machine (qdict=0x555557009e70) at ../system/vl.c:2187
#8  qemu_init (argc=<optimized out>, argv=0x7fffffffdb98) at ../system/vl.c:3759
#9  0x00005555558cbff9 in main (argc=<optimized out>, argv=<optimized out>) at ../system/main.c:71

我们再看看 virt_machine_instance_init 的调用栈:

(gdb) c
Continuing.

Thread 1 "qemu-system-ris" hit Breakpoint 2, virt_machine_instance_init (obj=0x55555720cca0) at /home/zevorn/qemu/include/hw/riscv/virt.h:35
35      DECLARE_INSTANCE_CHECKER(RISCVVirtState, RISCV_VIRT_MACHINE,
(gdb) bt
#0  virt_machine_instance_init (obj=0x55555720cca0) at /home/zevorn/qemu/include/hw/riscv/virt.h:35
#1  0x0000555555dc1c4c in object_init_with_type (obj=0x55555720cca0, ti=0x555556ffab50) at ../qom/object.c:428
#2  object_initialize_with_type (obj=0x55555720cca0, size=<optimized out>, type=0x555556ffab50) at ../qom/object.c:570
#3  0x0000555555dc1ed7 in object_new_with_type (type=0x555556ffab50) at ../qom/object.c:774
#4  0x0000555555dc1f48 in object_new_with_class (klass=klass@entry=0x5555572012d0) at ../qom/object.c:782
#5  0x0000555555b248a8 in qemu_create_machine (qdict=0x555557009e70) at ../system/vl.c:2190
#6  qemu_init (argc=<optimized out>, argv=0x7fffffffdb98) at ../system/vl.c:3759
#7  0x00005555558cbff9 in main (argc=<optimized out>, argv=<optimized out>) at ../system/main.c:71
(gdb)

这里可以看到,main 函数首先调用了 qemu_init 函数,然后再调用 qemu_create_machine 函数来创建 machine:

main()  qemu_init()  qemu_create_machine()

然后先创建 class,然后再实例化 QOM 对象,这和前面讲解 QOM 给出的流程一致。

另外我们还比较关注这两个函数内部的实现,这时候可以进一步阅读代码:

static void virt_machine_class_init(ObjectClass *oc, const void *data)
{
    MachineClass *mc = MACHINE_CLASS(oc);
    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);

    mc->desc = "RISC-V VirtIO board";
    mc->init = virt_machine_init;
    mc->max_cpus = VIRT_CPUS_MAX;
    mc->default_cpu_type = TYPE_RISCV_CPU_BASE;
    mc->block_default_type = IF_VIRTIO;
    mc->no_cdrom = 1;
    mc->pci_allow_0_address = true;
    mc->possible_cpu_arch_ids = riscv_numa_possible_cpu_arch_ids;
    mc->cpu_index_to_instance_props = riscv_numa_cpu_index_to_props;
    mc->get_default_cpu_node_id = riscv_numa_get_default_cpu_node_id;
    mc->numa_mem_supported = true;
    /* platform instead of architectural choice */
    mc->cpu_cluster_has_numa_boundary = true;
    mc->default_ram_id = "riscv_virt_board.ram";
    assert(!mc->get_hotplug_handler);
    mc->get_hotplug_handler = virt_machine_get_hotplug_handler;

    hc->plug = virt_machine_device_plug_cb;

    machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE);
    machine_class_allow_dynamic_sysbus_dev(mc, TYPE_UEFI_VARS_SYSBUS);
#ifdef CONFIG_TPM
    machine_class_allow_dynamic_sysbus_dev(mc, TYPE_TPM_TIS_SYSBUS);
#endif

    /* class property intialization */
    object_class_property_add_bool(oc, "aclint", virt_get_aclint,
                                   virt_set_aclint);
    object_class_property_set_description(oc, "aclint",
                                          "(TCG only) Set on/off to "
                                          "enable/disable emulating "
                                          "ACLINT devices");
    ...
}

static void virt_machine_instance_init(Object *obj)
{
    RISCVVirtState *s = RISCV_VIRT_MACHINE(obj);

    virt_flash_create(s);

    s->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
    s->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
    s->acpi = ON_OFF_AUTO_AUTO;
    s->iommu_sys = ON_OFF_AUTO_AUTO;
}

可以看到,instance 里面的逻辑不多,我们重点看 class 初始化的函数,里面有一个函数的注册很值得关注,virt_machine_init,其他代码更像是配置了一些属性,我们把解析放在代码后面讲:

static void virt_machine_init(MachineState *machine)
{
    RISCVVirtState *s = RISCV_VIRT_MACHINE(machine);
    MemoryRegion *system_memory = get_system_memory();
    MemoryRegion *mask_rom = g_new(MemoryRegion, 1);
    DeviceState *mmio_irqchip, *virtio_irqchip, *pcie_irqchip;
    int i, base_hartid, hart_count;
    int socket_count = riscv_socket_count(machine);

    s->memmap = virt_memmap;

    /* Check socket count limit */
    if (VIRT_SOCKETS_MAX < socket_count) {
        error_report("number of sockets/nodes should be less than %d",
            VIRT_SOCKETS_MAX);
        exit(1);
    }

    if (!virt_aclint_allowed() && s->have_aclint) {
        error_report("'aclint' is only available with TCG acceleration");
        exit(1);
    }

    /* Initialize sockets */
    mmio_irqchip = virtio_irqchip = pcie_irqchip = NULL;
    for (i = 0; i < socket_count; i++) {
        g_autofree char *soc_name = g_strdup_printf("soc%d", i);

        if (!riscv_socket_check_hartids(machine, i)) {
            error_report("discontinuous hartids in socket%d", i);
            exit(1);
        }

        base_hartid = riscv_socket_first_hartid(machine, i);
        if (base_hartid < 0) {
            error_report("can't find hartid base for socket%d", i);
            exit(1);
        }

        hart_count = riscv_socket_hart_count(machine, i);
        if (hart_count < 0) {
            error_report("can't find hart count for socket%d", i);
            exit(1);
        }

        /* cpu object property initialization */
        object_initialize_child(OBJECT(machine), soc_name, &s->soc[i],
                                TYPE_RISCV_HART_ARRAY);
        object_property_set_str(OBJECT(&s->soc[i]), "cpu-type",
                                machine->cpu_type, &error_abort);
        ...

        if (virt_aclint_allowed() && s->have_aclint) {
            if (s->aia_type == VIRT_AIA_TYPE_APLIC_IMSIC) {
                /* Per-socket ACLINT MTIMER */
                riscv_aclint_mtimer_create(s->memmap[VIRT_CLINT].base +
                            i * RISCV_ACLINT_DEFAULT_MTIMER_SIZE,
                            ...);
            } else {
                /* Per-socket ACLINT MSWI, MTIMER, and SSWI */
                riscv_aclint_swi_create(s->memmap[VIRT_CLINT].base +
                            i * s->memmap[VIRT_CLINT].size,
                        base_hartid, hart_count, false);
                riscv_aclint_mtimer_create(s->memmap[VIRT_CLINT].base +
                            i * s->memmap[VIRT_CLINT].size +
                            RISCV_ACLINT_SWI_SIZE,
                            ...);
                riscv_aclint_swi_create(s->memmap[VIRT_ACLINT_SSWI].base +
                            i * s->memmap[VIRT_ACLINT_SSWI].size,
                        base_hartid, hart_count, true);
            }
        } else if (tcg_enabled()) {
            /* Per-socket SiFive CLINT */
        }

        /* Per-socket interrupt controller */
        if (s->aia_type == VIRT_AIA_TYPE_NONE) {
            s->irqchip[i] = virt_create_plic(s->memmap, i,
                                             base_hartid, hart_count);
        } else {
            s->irqchip[i] = virt_create_aia(s->aia_type, s->aia_guests,
                                            s->memmap, i, base_hartid,
                                            hart_count);
        }

        /* Try to use different IRQCHIP instance based device type */
        if (i == 0) {
            mmio_irqchip = s->irqchip[i];
            virtio_irqchip = s->irqchip[i];
            pcie_irqchip = s->irqchip[i];
        }
        ...
    }


    /* register system main memory (actual RAM) */
    memory_region_add_subregion(system_memory, s->memmap[VIRT_DRAM].base,
                                machine->ram);

    /* boot rom */
    memory_region_init_rom(mask_rom, NULL, "riscv_virt_board.mrom",
                           s->memmap[VIRT_MROM].size, &error_fatal);
    memory_region_add_subregion(system_memory, s->memmap[VIRT_MROM].base,
                                mask_rom);

    /* other subsystem initialization */
    ...

    /* machine done */
    s->machine_done.notify = virt_machine_done;
    qemu_add_machine_init_done_notifier(&s->machine_done);
}

别看这个函数很长,实际上它做的事情很简单:

virt_machine_init()
  → riscv_socket_count()       // 获取 socket 数量
  → object_initialize_child()  // 创建每个 socket 的 CPU/SoC 对象
  → memory_region_init_*()     // 初始化/映射 MemoryRegion
  → create_fdt()               // 生成 DTB(未指定 -dtb 时)

CPU socket 主要是可以按照簇(cluster)来创建多组 CPU 核心,然后按照地址空间初始化各种设备, 最后为 virt 在内存中生成一个设备树,方便运行 linux kernel。

那么,是在什么位置加载的 OpenSBI 二进制程序呢?

加载客户机程序

有一点可以明确:用于承载 OpenSBI 的内存区域(MemoryRegion,通常是 DRAM 或 ROM)需要先被创建并初始化好,才能把固件/镜像加载进去。按照这个思路,我们可以找到如下代码(实际上 QEMU 是在整个 virt Machine 就绪以后,才开始加载客户机程序,它被安排在 machine_done 中实现):

static void virt_machine_done(Notifier *notifier, void *data)
{
    ...
    firmware_end_addr = riscv_find_and_load_firmware(machine, firmware_name,
                                                     &start_addr, NULL);
    ...
}

target_ulong riscv_find_and_load_firmware(MachineState *machine,
                                          const char *default_machine_firmware,
                                          hwaddr *firmware_load_addr,
                                          symbol_fn_t sym_cb)
{
    char *firmware_filename;
    target_ulong firmware_end_addr = *firmware_load_addr;

    firmware_filename = riscv_find_firmware(machine->firmware,
                                            default_machine_firmware);

    if (firmware_filename) {
        /* If not "none" load the firmware */
        firmware_end_addr = riscv_load_firmware(firmware_filename,
                                                firmware_load_addr, sym_cb);
        g_free(firmware_filename);
    }

    return firmware_end_addr;
}

char *riscv_find_firmware(const char *firmware_filename,
                          const char *default_machine_firmware)
{
    char *filename = NULL;

    if ((!firmware_filename) || (!strcmp(firmware_filename, "default"))) {
        /*
         * The user didn't specify -bios, or has specified "-bios default".
         * That means we are going to load the OpenSBI binary included in
         * the QEMU source.
         */
        filename = riscv_find_bios(default_machine_firmware);
    } else if (strcmp(firmware_filename, "none")) {
        filename = riscv_find_bios(firmware_filename);
    }

    return filename;
}

可以看到,如果我们不指定 -bios,或者显式使用 -bios default,那么 QEMU 将加载随 QEMU 发布的默认 OpenSBI 固件;如果显式使用 -bios none,则不会自动加载任何固件。

vCPU 执行的第一条指令

如果你使用 riscv64-elf-gdb 远程连接上 QEMU 以后,你会发现:

riscv64-elf-gdb -ex "target remote localhost:1234"
0x0000000000001000 in ?? ()
(gdb) display /i $pc
1: x/i $pc
=> 0x1000:      auipc   t0,0x

你会发现,vCPU 的第一条指令在 0x1000 这个地址。再往下单步几次,你会发现来到 0x80000000 这个地址,这是 OpenSBI 的第一条指令:

(gdb) si
0x0000000000001004 in ?? ()
1: x/i $pc
=> 0x1004:      addi    a2,t0,40
(gdb)
0x0000000000001008 in ?? ()
1: x/i $pc
=> 0x1008:      csrr    a0,mhartid
(gdb)
0x000000000000100c in ?? ()
1: x/i $pc
=> 0x100c:      ld      a1,32(t0)
(gdb)
0x0000000000001010 in ?? ()
1: x/i $pc
=> 0x1010:      ld      t0,24(t0)
(gdb)
0x0000000000001014 in ?? ()
1: x/i $pc
=> 0x1014:      jr      t0
(gdb)
0x0000000080000000 in ?? ()
1: x/i $pc
=> 0x80000000:  add     s0,a0,zero
(gdb)
0x0000000080000004 in ?? ()
1: x/i $pc
=> 0x80000004:  add     s1,a1,zero

此处对应源码初始化的位置在:

static void riscv_cpu_reset_hold(Object *obj, ResetType type)
{
    ...
#ifndef CONFIG_USER_ONLY
    env->misa_mxl = mcc->def->misa_mxl_max;
    env->priv = PRV_M;
    env->mstatus &= ~(MSTATUS_MIE | MSTATUS_MPRV);
    if (env->misa_mxl > MXL_RV32) {
        /*
         * The reset status of SXL/UXL is undefined, but mstatus is WARL
         * and we must ensure that the value after init is valid for read.
         */
        ...
    }
    env->mcause = 0;
    env->miclaim = MIP_SGEIP;
    env->pc = env->resetvec;  /* resetvec is set in resetvec_cb */
    env->bins = 0;
    env->two_stage_lookup = false;

    ...
}
可以看到 env->pc 被赋值为 env->resetvec,而 resetvec 的默认值是:

//cpu_bits.h
/* Default Reset Vector address */
#define DEFAULT_RSTVEC      0x1000

到这里,我相信你已经掌握了一些分析 QEMU 初始化流程的方法。

Task

尝试分析 virt Machine 的串口设备模型是如何初始化的,请结合 memory region 分析。

主循环与 IO 线程

完成初始化后,QEMU 会进入主循环,入口在 system/runstate.cqemu_main_loop(),它不断调用 main_loop_wait() 处理事件,直到收到退出条件:

/* system/runstate.c */
int qemu_main_loop(void)
{
    int status = EXIT_SUCCESS;

    while (!main_loop_should_exit(&status)) {
        main_loop_wait(false);
    }

    return status;
}

主循环里,main_loop_wait() 会根据定时器截止时间计算 poll 超时,等待宿主事件后再运行已到期的定时器:

/* util/main-loop.c */
void main_loop_wait(int nonblocking)
{
    int64_t timeout_ns;

    timeout_ns = qemu_soonest_timeout(timeout_ns,
                                      timerlistgroup_deadline_ns(
                                          &main_loop_tlg));

    ret = os_host_main_loop_wait(timeout_ns);
    qemu_clock_run_all_timers();
}

如果你把它理解成“QEMU 的事件泵”,就很容易把这段逻辑和设备的定时器、BH(bottom half)以及 monitor 事件联系起来。

IO-Thread 则用于把 I/O 事件从主线程拆出去,它是一个 QOM 对象(TYPE_IOTHREAD),内部维护自己的 AioContextGMainContext。核心运行循环在 iothread_run()

/* iothread.c */
static void *iothread_run(void *opaque)
{
    IOThread *iothread = opaque;

    while (iothread->running) {
        aio_poll(iothread->ctx, true);
        if (iothread->running && qatomic_read(&iothread->run_gcontext)) {
            g_main_loop_run(iothread->main_loop);
        }
    }

    return NULL;
}

这意味着:主线程负责全局事件与定时器调度,而 I/O 密集的设备(例如 block、virtio)可以绑定到指定的 IO-Thread,在它自己的 AioContext 中处理事件。这样既能降低主线程负载,又能让 I/O 路径有更稳定的时序与并发控制。