Index
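The ch2 professional stage adds a new topic on GPGPU architecture and modeling (2 articles), which is the key lab project of this training camp. The lab implements a RISC-V GPGPU device prototype on top of QEMU; prototype development is now complete.
Simplified architecture diagram (modeled on the Vortex design):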
+----------------------------------------------------------------------------------------+
| Guest OS (RISC-V or Other Arch) GPGPU App/Kernel --> GPGPU Driver (MMIO + DMA) |
+--------------------------------------+-------------------------------------------------+
| PCIe
=======================================|==================================================
QEMU | Device Model
=======================================|==================================================
+--------------------------------------v-------------------------------------------------+
| PCIe Frontend (gpgpu.c) |
| |
| BAR0 (CTRL 1MB) BAR2 (VRAM 64MB) BAR4 (DOORBELL 64KB) |
| +----------------------+ +----------------------+ +----------------------+ |
| | Kernel Dispatch | | | | DMA Engine | |
| | kernel_addr/args | | BAR map (PCIe window)+--+ | src/dst/size/ctrl | |
| | grid_dim (X,Y,Z) | | | | | MSI-X (4 vectors) | |
| | block_dim (X,Y,Z) | | | | | IRQ enable/pending | |
| | Global Control | +----------------------+ | +----------------------+ |
| | IRQ Status | | |
| +----------+-----------+ | |
| | dispatch | map |
+------------+------------------------------------------+--------------------------------+
| |
+------------v------------------------------------------+--------------------------------+
| SIMT Backend (gpgpu_core.c) | |
| | |
| +----------------------+ | |
| | VRAM (64MB) | <-- PCIe BAR2 maps here +----+ |
| | GPU Local Memory | |
| +----------^-----------+ |
| | ld/st |
| |
| Grid --> Block(0,0) Block(1,0) Block(2,0) ... |
| | |
| v |
| +--- Block ------------------------------------------------------------+ |
| | | |
| | +--- Warp 0 --------+ +--- Warp 1 --------+ +--- Warp 2 --+ | |
| | | Lane 0 .. Lane 31 | | Lane 0 .. Lane 31 | | Lane 0..31 | | |
| | | +----+ +----+ | | +----+ +----+ | | +----+ | | |
| | | | PC | | PC | | | | PC | | PC | | | | PC | | ... | |
| | | | x0 | | x0 | | | | x0 | | x0 | | | | x0 | | | |
| | | |... | |... | | | |... | |... | | | |... | | | |
| | | |x31 | |x31 | | | |x31 | |x31 | | | |x31 | | | |
| | | +----+ +----+ | | +----+ +----+ | | +----+ | | |
| | | active_mask (32b) | | active_mask (32b) | | active_mask | | |
| | +-------------------+ +-------------------+ +-------------+ | |
| | | |
| | barrier / sync mhartid = [block|warp|thread] | |
| +----------------------------------------------------------------------+ |
| |
+----------------------------------------------------------------------------------------+
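The diagram labels each lane's hart ID as mhartid = [block|warp|thread]. A minimal sketch of such a packing follows. The field order and the 32 lanes per warp come from the diagram; the bit widths (5 bits for the lane, 8 bits for the warp, the remainder for the block) and the hartid_* helper names are assumptions made for illustration, not the prototype's actual encoding.

```c
#include <stdint.h>

/* Assumed field widths: the diagram only gives the order [block|warp|thread]. */
#define LANE_BITS 5u /* 32 lanes per warp, as drawn in the diagram */
#define WARP_BITS 8u /* hypothetical: up to 256 warps per block */

/* Pack block, warp and lane indices into one 32-bit hart ID. */
static inline uint32_t hartid_pack(uint32_t block, uint32_t warp, uint32_t lane)
{
    return (block << (WARP_BITS + LANE_BITS)) | (warp << LANE_BITS) | lane;
}

/* Extract the lane (thread-in-warp) index from the low bits. */
static inline uint32_t hartid_lane(uint32_t id)
{
    return id & ((1u << LANE_BITS) - 1u);
}

/* Extract the warp index from the middle field. */
static inline uint32_t hartid_warp(uint32_t id)
{
    return (id >> LANE_BITS) & ((1u << WARP_BITS) - 1u);
}

/* Extract the block index from the remaining high bits. */
static inline uint32_t hartid_block(uint32_t id)
{
    return id >> (WARP_BITS + LANE_BITS);
}
```

With an encoding like this, a kernel could read mhartid once and derive its block, warp, and lane indices with shifts and masks, with no extra CSRs needed.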
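Key features:
- SIMT execution model: supports the Thread/Block/Grid thread hierarchy and warp scheduling
- PCIe device implementation: attaches as a standard PCIe device, with support for BAR/MMIO, DMA, and MSI-X
- QTest test framework: integrates QEMU's QTest infrastructure for automated device-level testing
- Layered frontend/backend architecture: the PCIe frontend handles the command queue and register interaction, while the cmodel backend executes kernel computation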
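To make the SIMT model more concrete, here is a minimal sketch of how a block could be split into 32-lane warps, each with its own active mask, and how one instruction could be applied to the active lanes only. The per-lane PC and x0..x31 registers and the 32-bit active_mask follow the diagram; the type and function names (Lane, Warp, warps_per_block, initial_mask, warp_addi) are hypothetical and not taken from gpgpu_core.c.

```c
#include <stdint.h>

#define WARP_SIZE 32u

/* One lane: its own PC and RISC-V integer register file (x0..x31). */
typedef struct {
    uint32_t pc;
    uint32_t x[32];
} Lane;

/* One warp: 32 lanes plus a bitmask of the lanes that currently execute. */
typedef struct {
    Lane lane[WARP_SIZE];
    uint32_t active_mask;
} Warp;

/* Number of warps needed to cover a block of bx*by*bz threads. */
static uint32_t warps_per_block(uint32_t bx, uint32_t by, uint32_t bz)
{
    uint32_t threads = bx * by * bz;
    return (threads + WARP_SIZE - 1u) / WARP_SIZE;
}

/* Initial mask of warp `warp`: the last warp may be only partially filled. */
static uint32_t initial_mask(uint32_t threads, uint32_t warp)
{
    uint32_t first = warp * WARP_SIZE;
    if (first >= threads) {
        return 0u;
    }
    uint32_t n = threads - first;
    return n >= WARP_SIZE ? 0xFFFFFFFFu : ((1u << n) - 1u);
}

/* Execute `addi rd, rs1, imm` on every active lane; inactive lanes are skipped. */
static void warp_addi(Warp *w, unsigned rd, unsigned rs1, int32_t imm)
{
    for (uint32_t i = 0; i < WARP_SIZE; i++) {
        if (!(w->active_mask & (1u << i))) {
            continue;
        }
        Lane *l = &w->lane[i];
        if (rd != 0) { /* x0 is hard-wired to zero */
            l->x[rd] = l->x[rs1] + (uint32_t)imm;
        }
        l->pc += 4;
    }
}
```

For example, a 40-thread block needs two warps under this scheme: the first starts with all 32 lanes active, the second with only its low 8 lanes active.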
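Assessment:
- A GPGPU test suite built on the QTest framework verifies functional completeness; each participant's score is computed from the number of tests passed
- Open task: design a simple AI software stack (programming model + driver) for this GPGPU, in a CUDA-like style
- Open task: integrate Vortex's simx directly into QEMU and adapt its AI software stack to ArceOS/rCore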
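As a starting point for the CUDA-style open task, the sketch below shows what a driver-side launch helper might look like. It writes the kernel address, argument pointer, grid dimensions, and block dimensions into the BAR0 control registers named in the diagram, then triggers a dispatch. All register offsets, the REG_DISPATCH trigger, and the names dim3_t, reg_write32, and gpgpu_launch are hypothetical; the real register map must be taken from the device model.

```c
#include <stdint.h>

/* Hypothetical BAR0 register offsets (bytes); not the prototype's real map. */
enum {
    REG_KERNEL_ADDR = 0x00, /* 64-bit, low word first */
    REG_KERNEL_ARGS = 0x08, /* 64-bit, low word first */
    REG_GRID_X      = 0x10,
    REG_GRID_Y      = 0x14,
    REG_GRID_Z      = 0x18,
    REG_BLOCK_X     = 0x1c,
    REG_BLOCK_Y     = 0x20,
    REG_BLOCK_Z     = 0x24,
    REG_DISPATCH    = 0x28  /* assumed: writing 1 starts the kernel */
};

/* CUDA-like 3D dimension triple. */
typedef struct {
    uint32_t x, y, z;
} dim3_t;

/* 32-bit MMIO write into a mapped BAR0 window. */
static void reg_write32(volatile uint32_t *bar0, uint32_t off, uint32_t val)
{
    bar0[off / 4u] = val;
}

/* Program the dispatch registers, then trigger the launch. */
static void gpgpu_launch(volatile uint32_t *bar0, uint64_t kernel_addr,
                         uint64_t args_addr, dim3_t grid, dim3_t block)
{
    reg_write32(bar0, REG_KERNEL_ADDR, (uint32_t)kernel_addr);
    reg_write32(bar0, REG_KERNEL_ADDR + 4u, (uint32_t)(kernel_addr >> 32));
    reg_write32(bar0, REG_KERNEL_ARGS, (uint32_t)args_addr);
    reg_write32(bar0, REG_KERNEL_ARGS + 4u, (uint32_t)(args_addr >> 32));
    reg_write32(bar0, REG_GRID_X, grid.x);
    reg_write32(bar0, REG_GRID_Y, grid.y);
    reg_write32(bar0, REG_GRID_Z, grid.z);
    reg_write32(bar0, REG_BLOCK_X, block.x);
    reg_write32(bar0, REG_BLOCK_Y, block.y);
    reg_write32(bar0, REG_BLOCK_Z, block.z);
    reg_write32(bar0, REG_DISPATCH, 1u); /* written last, after all parameters */
}
```

In a real driver, bar0 would point at the mapped BAR0 region, and completion would be signaled through the IRQ status register or MSI-X. Writing the dispatch trigger last ensures the device sees a complete set of launch parameters.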