drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

2023-02-20-Helios-aarch64.md (24851B)


  1. ---
  2. title: Porting Helios to aarch64 for my FOSDEM talk, part one
  3. date: 2023-02-20
  4. ---
  5. [Helios] is a microkernel written in the [Hare] programming language, and the
  6. subject of a talk I did at FOSDEM earlier this month. You can watch the talk
  7. here if you like:
  8. [Helios]: https://sr.ht/~sircmpwn/helios
  9. [Hare]: https://harelang.org
  10. <iframe title="FOSDEM 2023: Introducing the Helios microkernel" src="https://spacepub.space/videos/embed/f6435a6c-34e0-4602-ad5d-f791643111ab" allowfullscreen="" sandbox="allow-same-origin allow-scripts allow-popups" width="560" height="315" frameborder="0"></iframe>
  11. A while ago I promised someone that I would not do any talks on Helios until I
  12. could present them from Helios itself, and at FOSDEM I made good on that
  13. promise: my talk was presented from a Raspberry Pi 4 running Helios. The kernel
  14. was originally designed for x86\_64 (though we were careful to avoid painting
  15. ourselves into any corners so that we could port it to more architectures later
  16. on), and I initially planned to write an Intel HD Graphics driver so that I
  17. could drive the projector from my laptop. But, after a few days spent trying to
  18. comprehend the IHD manuals, I decided it would be *much* easier to port the
  19. entire system to aarch64 and write a driver for the much-simpler RPi GPU
  20. instead. 42 days later the port was complete, and a week or so after that I
  21. successfully presented the talk at FOSDEM. In a series of blog posts, I will
  22. take a look at those 42 days of work and explain how the aarch64 port works.
  23. Today's post focuses on the bootloader.
  24. The Helios boot-up process is:
  25. 1. Bootloader starts up and loads the kernel, then jumps to it
  26. 2. The kernel configures the system and loads the init process
  27. 3. Kernel provides runtime services to init (and any subsequent processes)
  28. In theory, the port to aarch64 would address these steps in order, but in
  29. practice step (2) relies heavily on the runtime services provided by step (3),
  30. so much of the work was ordered 1, 3, 2. This blog post focuses on part 1; I'll
  31. cover parts 2 and 3 and all of the fun problems they caused in later posts.
  32. In any case, the bootloader was the first step. Some basic changes to the build
  33. system established boot/+aarch64 as the aarch64 bootloader, and a simple
  34. qemu-specific ARM kernel was prepared which just gave a little "hello world" to
  35. demonstrate the multi-arch build system was working as intended. More build
  36. system refinements would come later, but it's off to the races from here.
  37. Targeting qemu's aarch64 virt platform was useful for most of the initial
  38. debugging and bring-up (and is generally useful at all times, as a much easier
  39. platform to debug than real hardware); the first tests on real hardware came
  40. much later.
  41. Booting up is a sore point on most systems. It involves a lot of arch-specific
  42. procedures, but also generally calls for custom binary formats and annoying
  43. things like disk drivers &mdash; which don't belong in a microkernel. So the
  44. Helios bootloaders are separated from the kernel proper, which is a simple ELF
  45. executable. The bootloader loads this ELF file into memory, configures a few
  46. simple things, then passes some information along to the kernel entry point. The
  47. bootloader's memory and other resources are hereafter abandoned and are later
  48. reclaimed for general use.
  49. On aarch64 the boot story is pretty abysmal, and I wanted to avoid adding the
  50. SoC-specific complexity which is endemic to the platform. Thus, two solutions
  51. are called for: [EFI] and [device trees]. At the bootloader level, EFI is the
  52. more important concern. For qemu-virt and Raspberry Pi, [edk2] is the
  53. free-software implementation of choice when it comes to EFI. The first order of
  54. business is producing an executable which can be loaded by EFI, which is, rather
  55. unfortunately, based on the Windows [COFF/PE32+] format. I took inspiration from
  56. Linux and made a disgusting EFI stub solution, which involves hand-writing a
  57. PE32+ header in assembly and doing some truly horrifying things with binutils to
  58. massage everything into order. Much of the header is lifted from Linux:
  59. [EFI]: https://uefi.org/specifications
  60. [device trees]: https://www.devicetree.org/specifications/
  61. [edk2]: https://github.com/tianocore/edk2
  62. [COFF/PE32+]: https://learn.microsoft.com/en-us/windows/win32/debug/pe-format
  63. ```
  64. .section .text.head
  65. .global base
  66. base:
  67. .L_head:
  68. /* DOS header */
  69. .ascii "MZ"
  70. .skip 58
  71. .short .Lpe_header - .L_head
  72. .align 4
  73. .Lpe_header:
  74. .ascii "PE\0\0"
  75. .short 0xAA64 /* Machine = AARCH64 */
  76. .short 2 /* NumberOfSections */
  77. .long 0 /* TimeDateStamp */
  78. .long 0 /* PointerToSymbolTable */
  79. .long 0 /* NumberOfSymbols */
  80. .short .Lsection_table - .Loptional_header /* SizeOfOptionalHeader */
  81. /* Characteristics:
  82. * IMAGE_FILE_EXECUTABLE_IMAGE |
  83. * IMAGE_FILE_LINE_NUMS_STRIPPED |
  84. * IMAGE_FILE_DEBUG_STRIPPED */
  85. .short 0x206
  86. .Loptional_header:
  87. .short 0x20b /* Magic = PE32+ (64-bit) */
  88. .byte 0x02 /* MajorLinkerVersion */
  89. .byte 0x14 /* MinorLinkerVersion */
  90. .long _data - .Lefi_header_end /* SizeOfCode */
  91. .long __pecoff_data_size /* SizeOfInitializedData */
  92. .long 0 /* SizeOfUninitializedData */
  93. .long _start - .L_head /* AddressOfEntryPoint */
  94. .long .Lefi_header_end - .L_head /* BaseOfCode */
  95. .Lextra_header:
  96. .quad 0 /* ImageBase */
  97. .long 4096 /* SectionAlignment */
  98. .long 512 /* FileAlignment */
  99. .short 0 /* MajorOperatingSystemVersion */
  100. .short 0 /* MinorOperatingSystemVersion */
  101. .short 0 /* MajorImageVersion */
  102. .short 0 /* MinorImageVersion */
  103. .short 0 /* MajorSubsystemVersion */
  104. .short 0 /* MinorSubsystemVersion */
  105. .long 0 /* Reserved */
  106. .long _end - .L_head /* SizeOfImage */
  107. .long .Lefi_header_end - .L_head /* SizeOfHeaders */
  108. .long 0 /* CheckSum */
  109. .short 10 /* Subsystem = EFI application */
  110. .short 0 /* DLLCharacteristics */
  111. .quad 0 /* SizeOfStackReserve */
  112. .quad 0 /* SizeOfStackCommit */
  113. .quad 0 /* SizeOfHeapReserve */
  114. .quad 0 /* SizeOfHeapCommit */
  115. .long 0 /* LoaderFlags */
  116. .long 6 /* NumberOfRvaAndSizes */
  117. .quad 0 /* Export table */
  118. .quad 0 /* Import table */
  119. .quad 0 /* Resource table */
  120. .quad 0 /* Exception table */
  121. .quad 0 /* Certificate table */
  122. .quad 0 /* Base relocation table */
  123. .Lsection_table:
  124. .ascii ".text\0\0\0" /* Name */
  125. .long _etext - .Lefi_header_end /* VirtualSize */
  126. .long .Lefi_header_end - .L_head /* VirtualAddress */
  127. .long _etext - .Lefi_header_end /* SizeOfRawData */
  128. .long .Lefi_header_end - .L_head /* PointerToRawData */
  129. .long 0 /* PointerToRelocations */
  130. .long 0 /* PointerToLinenumbers */
  131. .short 0 /* NumberOfRelocations */
  132. .short 0 /* NumberOfLinenumbers */
  133. /* IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_EXECUTE */
  134. .long 0x60000020
  135. .ascii ".data\0\0\0" /* Name */
  136. .long __pecoff_data_size /* VirtualSize */
  137. .long _data - .L_head /* VirtualAddress */
  138. .long __pecoff_data_rawsize /* SizeOfRawData */
  139. .long _data - .L_head /* PointerToRawData */
  140. .long 0 /* PointerToRelocations */
  141. .long 0 /* PointerToLinenumbers */
  142. .short 0 /* NumberOfRelocations */
  143. .short 0 /* NumberOfLinenumbers */
  144. /* IMAGE_SCN_CNT_INITIALIZED_DATA | IMAGE_SCN_MEM_READ | IMAGE_SCN_MEM_WRITE */
  145. .long 0xc0000040
  146. .balign 0x10000
  147. .Lefi_header_end:
  148. .global _start
  149. _start:
  150. stp x0, x1, [sp, -16]!
  151. adrp x0, base
  152. add x0, x0, #:lo12:base
  153. adrp x1, _DYNAMIC
  154. add x1, x1, #:lo12:_DYNAMIC
  155. bl relocate
  156. cmp w0, #0
  157. bne 0f
  158. ldp x0, x1, [sp], 16
  159. b bmain
  160. 0:
  161. /* relocation failed */
  162. add sp, sp, -16
  163. ret
  164. ```
  165. The specific details about how any of this works are complex and unpleasant;
  166. I'll refer you to the spec if you're curious, and offer a general suggestion
  167. that cargo-culting my work here would be a lot easier than understanding it
  168. should you need to build something similar.[^1]
  169. [^1]: A cursory review of this code while writing this blog post draws my
  170. attention to a few things that ought to be improved as well.
  171. Note the entry point for later; we store two arguments from EFI (x0 and x1) on
  172. the stack and eventually branch to bmain.
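
For readers who want to sanity-check a stub like the one above, here is a minimal sketch of the checks a loader makes before trusting the image. It is written in C rather than Hare so that it compiles on an ordinary host, and it is not taken from Helios or edk2: the function name `check_efi_header` and its error codes are hypothetical. The offsets come from the PE format: the DOS header stores the offset of the PE signature at byte 0x3c, the COFF header follows the 4-byte signature, and the optional header starts 24 bytes after the signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read little-endian integers without assuming alignment. */
static uint16_t rd16(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns 0 if `img` starts with a DOS header that points at a PE32+
 * header for AArch64, or a negative value naming the first failed check.
 * (Hypothetical helper, for illustration only.)
 */
int check_efi_header(const uint8_t *img, size_t len) {
    assert(img != NULL);
    if (len < 0x40 || memcmp(img, "MZ", 2) != 0)
        return -1;                      /* no DOS magic */
    uint32_t pe = rd32(img + 0x3c);     /* offset of "PE\0\0" */
    if (pe > len || len - pe < 26)
        return -2;                      /* headers run off the end */
    if (memcmp(img + pe, "PE\0\0", 4) != 0)
        return -3;                      /* no PE signature */
    if (rd16(img + pe + 4) != 0xAA64)
        return -4;                      /* Machine is not AARCH64 */
    if (rd16(img + pe + 24) != 0x20b)
        return -5;                      /* optional header is not PE32+ */
    return 0;
}
```

Running a check of this kind against the objcopy output is a cheap way to catch a misplaced header before involving firmware at all.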
  173. This file is assisted by the linker script:
  174. ```
  175. ENTRY(_start)
  176. OUTPUT_FORMAT(elf64-littleaarch64)
  177. SECTIONS {
  178. /DISCARD/ : {
  179. *(.rel.reloc)
  180. *(.eh_frame)
  181. *(.note.GNU-stack)
  182. *(.interp)
  183. *(.dynsym .dynstr .hash .gnu.hash)
  184. }
  185. . = 0xffff800000000000;
  186. .text.head : {
  187. _head = .;
  188. KEEP(*(.text.head))
  189. }
  190. .text : ALIGN(64K) {
  191. _text = .;
  192. KEEP(*(.text))
  193. *(.text.*)
  194. . = ALIGN(16);
  195. *(.got)
  196. }
  197. . = ALIGN(64K);
  198. _etext = .;
  199. .dynamic : {
  200. *(.dynamic)
  201. }
  202. .data : ALIGN(64K) {
  203. _data = .;
  204. KEEP(*(.data))
  205. *(.data.*)
  206. /* Reserve page tables */
  207. . = ALIGN(4K);
  208. L0 = .;
  209. . += 512 * 8;
  210. L1_ident = .;
  211. . += 512 * 8;
  212. L1_devident = .;
  213. . += 512 * 8;
  214. L1_kernel = .;
  215. . += 512 * 8;
  216. L2_kernel = .;
  217. . += 512 * 8;
  218. L3_kernel = .;
  219. . += 512 * 8;
  220. }
  221. .rela.text : {
  222. *(.rela.text)
  223. *(.rela.text*)
  224. }
  225. .rela.dyn : {
  226. *(.rela.dyn)
  227. }
  228. .rela.plt : {
  229. *(.rela.plt)
  230. }
  231. .rela.got : {
  232. *(.rela.got)
  233. }
  234. .rela.data : {
  235. *(.rela.data)
  236. *(.rela.data*)
  237. }
  238. .pecoff_edata_padding : {
  239. BYTE(0);
  240. . = ALIGN(512);
  241. }
  242. __pecoff_data_rawsize = ABSOLUTE(. - _data);
  243. _edata = .;
  244. .bss : ALIGN(4K) {
  245. KEEP(*(.bss))
  246. *(.bss.*)
  247. *(.dynbss)
  248. }
  249. . = ALIGN(64K);
  250. __pecoff_data_size = ABSOLUTE(. - _data);
  251. _end = .;
  252. }
  253. ```
  254. Items of note here are the careful treatment of relocation sections
  255. (cargo-culted from earlier work on RISC-V with Hare; not actually necessary as
  256. qbe generates PIC for aarch64)[^2] and the extra symbols used to gather
  257. information for the PE32+ header. Padding is also added in the required places,
  258. and static aarch64 page tables are defined for later use.
  259. [^2]: PIC stands for "position independent code". EFI can load executables at
  260. any location in memory and the code needs to be prepared to deal with that;
  261. PIC is the tool we use for this purpose.
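
The `relocate` routine called from `_start` is not shown in this post. As a rough illustration of what such a routine typically does, here is a sketch in C (chosen so it can be compiled on a normal host). It is an assumption-laden example, not the Helios implementation: the names `apply_relocs`, `elf_dyn` and `elf_rela` are hypothetical, and it assumes the image was linked at address 0 so that addresses in the dynamic section double as offsets from the load address. It walks the dynamic section to find the RELA table, then applies each `R_AARCH64_RELATIVE` entry by storing load address plus addend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ELF64 structures, declared locally so the sketch has no platform dependency. */
typedef struct { int64_t tag; uint64_t val; } elf_dyn;
typedef struct { uint64_t offset; uint64_t info; int64_t addend; } elf_rela;

enum { TAG_NULL = 0, TAG_RELA = 7, TAG_RELASZ = 8, TAG_RELAENT = 9 };
enum { RELOC_NONE = 0, RELOC_RELATIVE = 1027 }; /* R_AARCH64_RELATIVE */

/*
 * Applies relative relocations to an image loaded at `base`.
 * Returns 0 on success, -1 on an entry this sketch does not handle.
 */
int apply_relocs(uint8_t *base, const elf_dyn *dyn) {
    uint64_t rela = 0, size = 0, ent = 0;
    assert(base != NULL && dyn != NULL);
    for (; dyn->tag != TAG_NULL; dyn++) {
        switch (dyn->tag) {
        case TAG_RELA:    rela = dyn->val; break;
        case TAG_RELASZ:  size = dyn->val; break;
        case TAG_RELAENT: ent = dyn->val; break;
        default: break;
        }
    }
    if (size == 0)
        return 0;               /* nothing to relocate */
    if (ent < sizeof(elf_rela))
        return -1;              /* malformed entry size */
    for (uint64_t off = 0; off + ent <= size; off += ent) {
        elf_rela r;
        memcpy(&r, base + rela + off, sizeof r);
        uint32_t type = (uint32_t)(r.info & 0xffffffffu);
        if (type == RELOC_NONE)
            continue;
        if (type != RELOC_RELATIVE)
            return -1;
        /* The patched value is the load address plus the stored addend. */
        uint64_t value = (uint64_t)(uintptr_t)base + (uint64_t)r.addend;
        memcpy(base + r.offset, &value, sizeof value);
    }
    return 0;
}
```

When the compiler emits fully position-independent code, the table is empty and the function returns immediately, which matches the observation above that the relocation sections were not strictly needed.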
  262. This is built as a shared object, and the Makefile ~~mutilates~~ reformats the
  263. resulting ELF file to produce a PE32+ executable:
  264. ```
  265. $(BOOT)/bootaa64.so: $(BOOT_OBJS) $(BOOT)/link.ld
  266. $(LD) -Bsymbolic -shared --no-undefined \
  267. -T $(BOOT)/link.ld \
  268. $(BOOT_OBJS) \
  269. -o $@
  270. $(BOOT)/bootaa64.efi: $(BOOT)/bootaa64.so
  271. $(OBJCOPY) -Obinary \
  272. -j .text.head -j .text -j .dynamic -j .data \
  273. -j .pecoff_edata_padding \
  274. -j .dynstr -j .dynsym \
  275. -j .rel -j .rel.* -j .rel* \
  276. -j .rela -j .rela.* -j .rela* \
  277. $< $@
  278. ```
  279. With all of this mess sorted, and the PE32+ entry point branching to bmain, we
  280. can finally enter some Hare code:
  281. ```
  282. export fn bmain(
  283. image_handle: efi::HANDLE,
  284. systab: *efi::SYSTEM_TABLE,
  285. ) efi::STATUS = {
  286. // ...
  287. };
  288. ```
  289. Getting just this far took 3 full days of work.
  290. Initially, the Hare code incorporated a lot of proof-of-concept work from Alexey
  291. Yerin's "carrot" kernel prototype for RISC-V, which also booted via EFI.
  292. Following the early bringing-up of the bootloader environment, this was
  293. refactored into a more robust and general-purpose EFI support layer for Helios,
  294. which will be applicable to future ports. You can review the EFI support
  295. module's haredocs [here](https://mirror.drewdevault.com/efi.html). The purpose
  296. of this module is to provide an idiomatic Hare-oriented interface to the EFI
  297. boot services, which the bootloader makes use of mainly to read files from the
  298. boot media and examine the system's memory map.
  299. Let's take a look at the first few lines of bmain:
  300. ```
  301. efi::init(image_handle, systab)!;
  302. const eficons = eficons_init(systab);
  303. log::setcons(&eficons);
  304. log::printfln("Booting Helios aarch64 via EFI");
  305. if (readel() == el::EL3) {
  306. log::printfln("Booting from EL3 is not supported");
  307. return efi::STATUS::LOAD_ERROR;
  308. };
  309. let mem = allocator { ... };
  310. init_mmap(&mem);
  311. init_pagetables();
  312. ```
  313. Significant build system overhauls were required such that Hare modules from
  314. the kernel like log (and, later, other modules like elf) could be incorporated
  315. into the bootloader, simplifying the process of implementing more complex
  316. bootloaders. The first call of note here is init\_mmap, which scans the EFI
  317. memory map and prepares a simple high-watermark allocator to be used by the
  318. bootloader to allocate memory for the kernel image and other items of interest.
  319. It's quite simple: it just finds the largest area of general-purpose memory and
  320. sets up an allocator with it:
  321. ```
  322. // Loads the memory map from EFI and initializes a page allocator using the
  323. // largest area of physical memory.
  324. fn init_mmap(mem: *allocator) void = {
  325. const iter = efi::iter_mmap()!;
  326. let maxphys: uintptr = 0, maxpages = 0u64;
  327. for (true) {
  328. const desc = match (efi::mmap_next(&iter)) {
  329. case let desc: *efi::MEMORY_DESCRIPTOR =>
  330. yield desc;
  331. case void =>
  332. break;
  333. };
  334. if (desc.DescriptorType != efi::MEMORY_TYPE::CONVENTIONAL) {
  335. continue;
  336. };
  337. if (desc.NumberOfPages > maxpages) {
  338. maxphys = desc.PhysicalStart;
  339. maxpages = desc.NumberOfPages;
  340. };
  341. };
  342. assert(maxphys != 0, "No suitable memory area found for kernel loader");
  343. assert(maxpages <= types::UINT_MAX);
  344. pagealloc_init(mem, maxphys, maxpages: uint);
  345. };
  346. ```
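
The allocator behind `pagealloc_init` is not shown in this post. A high-watermark allocator is simple enough to sketch in a few lines; the version below is in C so that it can be compiled and tested on a host, and the names `page_alloc`, `page_alloc_init` and `page_alloc_next` are hypothetical rather than the Helios API. It assumes 4 KiB pages, which is the unit EFI uses for `NumberOfPages`.

```c
#include <assert.h>
#include <stdint.h>

#define PAGESIZE 4096u

/* A region of physical memory handed out one page at a time and never freed. */
struct page_alloc {
    uintptr_t next; /* address of the next unallocated page (the watermark) */
    uintptr_t end;  /* first address past the region */
};

void page_alloc_init(struct page_alloc *mem, uintptr_t base, uint64_t npages) {
    assert(base % PAGESIZE == 0);
    mem->next = base;
    mem->end = base + (uintptr_t)(npages * PAGESIZE);
}

/* Returns the next page, or 0 once the region is exhausted. */
uintptr_t page_alloc_next(struct page_alloc *mem) {
    if (mem->end - mem->next < PAGESIZE)
        return 0;
    uintptr_t page = mem->next;
    mem->next += PAGESIZE; /* raise the watermark */
    return page;
}
```

One property worth noting: consecutive calls return physically adjacent pages, so a caller that allocates several pages in a row ends up with one contiguous buffer.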
  347. init\_pagetables is next. This populates the page tables reserved by the linker
  348. with the desired higher-half memory map, illustrated in the comments shown here:
  349. ```
  350. fn init_pagetables() void = {
  351. // 0xFFFF0000xxxxxxxx - 0xFFFF0200xxxxxxxx: identity map
  352. // 0xFFFF0200xxxxxxxx - 0xFFFF0400xxxxxxxx: identity map (dev)
  353. // 0xFFFF8000xxxxxxxx - 0xFFFF8000xxxxxxxx: kernel image
  354. //
  355. // L0[0x000] => L1_ident
  356. // L0[0x004] => L1_devident
  357. // L1_ident[*] => 1 GiB identity mappings
  358. // L0[0x100] => L1_kernel
  359. // L1_kernel[0] => L2_kernel
  360. // L2_kernel[0] => L3_kernel
  361. // L3_kernel[0] => 4 KiB kernel pages
  362. L0[0x000] = PT_TABLE | &L1_ident: uintptr | PT_AF;
  363. L0[0x004] = PT_TABLE | &L1_devident: uintptr | PT_AF;
  364. L0[0x100] = PT_TABLE | &L1_kernel: uintptr | PT_AF;
  365. L1_kernel[0] = PT_TABLE | &L2_kernel: uintptr | PT_AF;
  366. L2_kernel[0] = PT_TABLE | &L3_kernel: uintptr | PT_AF;
  367. for (let i = 0u64; i < len(L1_ident): u64; i += 1) {
  368. L1_ident[i] = PT_BLOCK | (i * 0x40000000): uintptr |
  369. PT_NORMAL | PT_AF | PT_ISM | PT_RW;
  370. };
  371. for (let i = 0u64; i < len(L1_devident): u64; i += 1) {
  372. L1_devident[i] = PT_BLOCK | (i * 0x40000000): uintptr |
  373. PT_DEVICE | PT_AF | PT_ISM | PT_RW;
  374. };
  375. };
  376. ```
  377. In short, we want three larger memory regions to be available: an identity map,
  378. where physical memory addresses correlate 1:1 with virtual memory, an identity
  379. map configured for device MMIO (e.g. with caching disabled), and an area to load
  380. the kernel image. The first two are straightforward: they use uniform 1 GiB
  381. mappings to populate their respective page tables. The latter is slightly more
  382. complex: ultimately the kernel is loaded in 4 KiB pages, so we need to set up
  383. intermediate page tables for that purpose.
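
The table indices in the comments above follow directly from the address layout. With a 4 KiB granule and 48-bit addresses, each of the four levels consumes 9 bits above a 12-bit page offset. The following C sketch (hypothetical helper names, written for a host compiler rather than the bootloader) computes the index at each level and the amount of address space one entry covers:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Index into the level-`level` table (0 = top, 3 = leaf) for a virtual
 * address, assuming a 4 KiB granule and 48-bit virtual addresses.
 */
unsigned pt_index(uint64_t virt, unsigned level) {
    assert(level <= 3);
    unsigned shift = 12 + 9 * (3 - level);
    return (unsigned)((virt >> shift) & 0x1ff);
}

/* Bytes of address space covered by a single entry at the given level. */
uint64_t pt_entry_span(unsigned level) {
    assert(level <= 3);
    return (uint64_t)1 << (12 + 9 * (3 - level));
}
```

For example, `pt_index(0xffff800000000000, 0)` is 0x100 and `pt_index(0xffff020000000000, 0)` is 0x004, matching the L0 slots used for the kernel and device mappings, and `pt_entry_span(1)` is 0x40000000, the 1 GiB step used in the two loops.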
  384. We cannot actually enable these page tables until we're finished making use of
  385. the EFI boot services &mdash; the EFI specification requires us to preserve the
  386. online memory map at this stage of affairs. However, this does lay the
  387. groundwork for the kernel loader: we have an allocator to provide pages of
  388. memory, and page tables to set up virtual memory mappings that can be activated
  389. once we're done with EFI. bmain thus proceeds with loading the kernel:
  390. ```
  391. const kernel = match (efi::open("\\helios", efi::FILE_MODE::READ)) {
  392. case let file: *efi::FILE_PROTOCOL =>
  393. yield file;
  394. case let err: efi::error =>
  395. log::printfln("Error: no kernel found at /helios");
  396. return err: efi::STATUS;
  397. };
  398. log::printfln("Load kernel /helios");
  399. const kentry = match (load(&mem, kernel)) {
  400. case let err: efi::error =>
  401. return err: efi::STATUS;
  402. case let entry: uintptr =>
  403. yield entry: *kentry;
  404. };
  405. efi::close(kernel)!;
  406. ```
  407. The loader itself (the "load" function here) is a relatively straightforward ELF
  408. loader; if you've seen one you've seen them all. Nevertheless, you may browse it
  409. [online][0] if you so wish. The only item of note here is the function used for
  410. mapping kernel pages:
  411. [0]: https://git.sr.ht/~sircmpwn/helios/tree/02d0490487c7a0fb4b0367b95819e808b98f87fb/item/boot/%2Baarch64/loader.ha
  412. ```
  413. // Maps a physical page into the kernel's virtual address space.
  414. fn kmmap(virt: uintptr, phys: uintptr, flags: uintptr) void = {
  415. assert(virt & ~0x1ff000 == 0xffff800000000000: uintptr);
  416. const offs = (virt >> 12) & 0x1ff;
  417. L3_kernel[offs] = PT_PAGE | PT_NORMAL | PT_AF | PT_ISM | phys | flags;
  418. };
  419. ```
  420. The assertion enforces a constraint which is implemented by our kernel linker
  421. script, namely that all loadable kernel program headers are located within the
  422. kernel's reserved address space. With this constraint in place, the
  423. implementation is simpler than many mmap implementations; we can assume that
  424. L3\_kernel is the correct page table and just load it up with the desired
  425. physical address and mapping flags.
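
To make the constraint concrete: the mask in the assertion leaves only bits 12 through 20 free, so a single L3 table covers 512 pages of 4 KiB, or 2 MiB, starting at the kernel base. The check also rejects addresses that are not page-aligned. Here is a small C sketch of the same test (the function names are hypothetical, and C is used only so the example can be run on a host):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_BASE   0xffff800000000000ull
#define L3_INDEX_MASK 0x1ff000ull /* bits 12..20: the 9-bit L3 index */

/* True if `virt` is page-aligned and inside the 2 MiB window of one L3 table. */
bool in_kernel_window(uint64_t virt) {
    return (virt & ~L3_INDEX_MASK) == KERNEL_BASE;
}

/* Slot in the L3 table for an address inside the window. */
unsigned l3_slot(uint64_t virt) {
    assert(in_kernel_window(virt));
    return (unsigned)((virt >> 12) & 0x1ff);
}
```

The last valid page is at `KERNEL_BASE + 0x1ff000` (slot 511); `KERNEL_BASE + 0x200000` already falls outside the window.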
  426. Following the kernel loader, the bootloader addresses other items of interest,
  427. such as loading the device tree and boot modules &mdash; which includes, for
  428. instance, the init process image and an initramfs. It also allocates & populates
  429. data structures with information which will be of later use to the kernel,
  430. including the memory map. This code is relatively straightforward and not
  431. particularly interesting; most of these processes take advantage of the same
  432. straightforward Hare function:
  433. ```
  434. // Loads a file into contiguous pages of memory and returns its physical
  435. // address.
  436. fn load_file(
  437. mem: *allocator,
  438. file: *efi::FILE_PROTOCOL,
  439. ) (uintptr | efi::error) = {
  440. const info = efi::file_info(file)?;
  441. const fsize = info.FileSize: size;
  442. let npage = fsize / PAGESIZE;
  443. if (fsize % PAGESIZE != 0) {
  444. npage += 1;
  445. };
  446. let base: uintptr = 0;
  447. for (let i = 0z; i < npage; i += 1) {
  448. const phys = pagealloc(mem);
  449. if (base == 0) {
  450. base = phys;
  451. };
  452. const nbyte = if ((i + 1) * PAGESIZE > fsize) {
  453. yield fsize % PAGESIZE;
  454. } else {
  455. yield PAGESIZE;
  456. };
  457. let dest = (phys: *[*]u8)[..nbyte];
  458. const n = efi::read(file, dest)?;
  459. assert(n == nbyte);
  460. };
  461. return base;
  462. };
  463. ```
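
The page arithmetic in `load_file` can be checked in isolation. The C sketch below (hypothetical names, host-side only) computes the same two quantities: the number of pages, rounded up, and the number of bytes to read into page `i`. It phrases the second as "whatever remains, capped at one page", which should give the same result as the modulo form for every file size.

```c
#include <assert.h>
#include <stddef.h>

#define PAGESIZE 4096u

/* Number of pages needed to hold `fsize` bytes, rounding up. */
size_t pages_for(size_t fsize) {
    return fsize / PAGESIZE + (fsize % PAGESIZE != 0);
}

/* Bytes to read into page `i` (0-based) of a file of `fsize` bytes. */
size_t chunk_len(size_t i, size_t fsize) {
    assert(i < pages_for(fsize));
    size_t remaining = fsize - i * PAGESIZE;
    return remaining < PAGESIZE ? remaining : PAGESIZE;
}
```

A 4097-byte file, for instance, needs two pages, with 4096 bytes read into the first and 1 byte into the second.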
  464. It is not necessary to map these into virtual memory anywhere; the kernel later
  465. uses the identity-mapped physical memory region in the higher half to read
  466. them. Tasks of interest resume at the end of bmain:
  467. ```
  468. efi::exit_boot_services();
  469. init_mmu();
  470. enter_kernel(kentry, ctx);
  471. ```
  472. Once we exit boot services, we are free to configure the MMU according to our
  473. desired specifications and make good use of all of the work done earlier to
  474. prepare a kernel memory map. Thus, init\_mmu:
  475. ```
  476. // Initializes the ARM MMU to our desired specifications. This should take place
  477. // *after* EFI boot services have exited because we're going to mess up the MMU
  478. // configuration that it depends on.
  479. fn init_mmu() void = {
  480. // Disable MMU
  481. const sctlr_el1 = rdsctlr_el1();
  482. wrsctlr_el1(sctlr_el1 & ~SCTLR_EL1_M);
  483. // Configure MAIR
  484. const mair: u64 =
  485. (0xFF << 0) | // Attr0: Normal memory; IWBWA, OWBWA, NTR
  486. (0x00 << 8); // Attr1: Device memory; nGnRnE, OSH
  487. wrmair_el1(mair);
  488. const tsz: u64 = 64 - 48;
  489. const ips = rdtcr_el1() & TCR_EL1_IPS_MASK;
  490. const tcr_el1: u64 =
  491. TCR_EL1_IPS_42B_4T | // 4 TiB IPS
  492. TCR_EL1_TG1_4K | // Higher half: 4K granule size
  493. TCR_EL1_SH1_IS | // Higher half: inner shareable
  494. TCR_EL1_ORGN1_WB | // Higher half: outer write-back
  495. TCR_EL1_IRGN1_WB | // Higher half: inner write-back
  496. (tsz << TCR_EL1_T1SZ) | // Higher half: 48 bits
  497. TCR_EL1_TG0_4K | // Lower half: 4K granule size
  498. TCR_EL1_SH0_IS | // Lower half: inner shareable
  499. TCR_EL1_ORGN0_WB | // Lower half: outer write-back
  500. TCR_EL1_IRGN0_WB | // Lower half: inner write-back
  501. (tsz << TCR_EL1_T0SZ); // Lower half: 48 bits
  502. wrtcr_el1(tcr_el1);
  503. // Load page tables
  504. wrttbr0_el1(&L0[0]: uintptr);
  505. wrttbr1_el1(&L0[0]: uintptr);
  506. invlall();
  507. // Enable MMU
  508. const sctlr_el1: u64 =
  509. SCTLR_EL1_M | // Enable MMU
  510. SCTLR_EL1_C | // Enable cache
  511. SCTLR_EL1_I | // Enable instruction cache
  512. SCTLR_EL1_SPAN | // SPAN?
  513. SCTLR_EL1_NTLSMD | // NTLSMD?
  514. SCTLR_EL1_LSMAOE | // LSMAOE?
  515. SCTLR_EL1_TSCXT | // TSCXT?
  516. SCTLR_EL1_ITD; // ITD?
  517. wrsctlr_el1(sctlr_el1);
  518. };
  519. ```
  520. There are a lot of bits here! Figuring out which ones to enable or disable was a
  521. project in and of itself. One of the major challenges, funnily enough, was
  522. finding the correct ARM manual to reference to understand all of these
  523. registers. I'll save you some time and [link to it][1] directly, should you ever
  524. find yourself writing similar code. Some question marks in comments towards the
  525. end point out some flags that I'm still not sure about. The ARM CPU is *very*
  526. configurable and identifying the configuration that produces the desired
  527. behavior for a general-purpose kernel requires some effort.
  528. [1]: https://mirror.drewdevault.com/ARMARM.pdf
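
Two of the simpler encodings in `init_mmu` can be illustrated on their own. MAIR_EL1 holds eight 8-bit attribute slots, and a page table entry selects one of them through its AttrIndx field in bits 4:2; TxSZ is 64 minus the number of virtual address bits. The C sketch below uses hypothetical helper names and is meant for a host compiler. It assumes, without confirmation from the source, that `PT_NORMAL` and `PT_DEVICE` select slots 0 and 1 respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Places an 8-bit memory attribute into slot `idx` (0..7) of a MAIR value. */
uint64_t mair_attr(unsigned idx, uint8_t attr) {
    assert(idx < 8);
    return (uint64_t)attr << (8 * idx);
}

/* AttrIndx field of a stage 1 block or page descriptor (bits 4:2). */
uint64_t pte_attr_index(unsigned idx) {
    assert(idx < 8);
    return (uint64_t)idx << 2;
}

/* TxSZ value for a region of `va_bits` virtual address bits. */
uint64_t txsz(unsigned va_bits) {
    assert(va_bits > 0 && va_bits <= 48);
    return 64 - va_bits;
}
```

With these, `mair_attr(0, 0xFF) | mair_attr(1, 0x00)` reproduces the MAIR value written above (0xFF), and `txsz(48)` gives the 16 that is shifted into both T0SZ and T1SZ.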
  529. After this function completes, the MMU is initialized and we are up and running
  530. with the kernel memory map we prepared earlier; the kernel is loaded in the
  531. higher half and the MMU is prepared to service it. So, we can jump to the kernel
  532. via enter\_kernel:
  533. ```
  534. @noreturn fn enter_kernel(entry: *kentry, ctx: *bootctx) void = {
  535. const el = readel();
  536. switch (el) {
  537. case el::EL0 =>
  538. abort("Bootloader running in EL0, breaks EFI invariant");
  539. case el::EL1 =>
  540. // Can boot immediately
  541. entry(ctx);
  542. case el::EL2 =>
  543. // Boot from EL2 => EL1
  544. //
  545. // This is the bare minimum necessary to get to EL1. Future
  546. // improvements might be called for here if anyone wants to
  547. // implement hardware virtualization on aarch64. Good luck to
  548. // this future hacker.
  549. // Enable EL1 access to the physical counter register
  550. const cnt = rdcnthctl_el2();
  551. wrcnthctl_el2(cnt | 0b11);
  552. // Enable aarch64 in EL1 & SWIO, disable most other EL2 things
  553. // Note: I bet someday I'll return to this line because of
  554. // Problems
  555. const hcr: u64 = (1 << 1) | (1 << 31);
  556. wrhcr_el2(hcr);
  557. // Set up SPSR for EL1
  558. // XXX: Magic constant I have not bothered to understand
  559. wrspsr_el2(0x3c4);
  560. enter_el1(ctx, entry);
  561. case el::EL3 =>
  562. // Not supported, tested earlier on
  563. abort("Unsupported boot configuration");
  564. };
  565. };
  566. ```
  567. Here we see the detritus from one of many battles I fought to port this kernel:
  568. the EL2 => EL1 transition. aarch64 has several "exception levels", which are
  569. semantically similar to the x86\_64 concept of protection rings. EL0 is used for
  570. userspace code, which is not applicable under these circumstances; an assertion
  571. sanity-checks this invariant. EL1 is the simplest case: this is used for normal
  572. kernel code and in this situation we can jump directly to the kernel. The EL2
  573. case is used for hypervisor code, and this presented me with a challenge. When I
  574. tested my bootloader in qemu-virt, it worked initially, but on real hardware it
  575. failed. After much wailing and gnashing of teeth, the cause was found to be that
  576. our bootloader was started in EL2 on real hardware, and EL1 on qemu-virt. qemu
  577. can be configured to boot in EL2, which was crucial in debugging this problem,
  578. via -M virt,virtualization=on. From this environment I was able to identify a
  579. few important steps to drop to EL1 and into the kernel, though from the comments
  580. you can probably ascertain that this process was not well-understood. I do have
  581. a better understanding of it now than I did when this code was written, but the
  582. code is still serviceable and I see no reason to change it at this stage.
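
For what it's worth, the magic constant written to SPSR_EL2 decodes cleanly. In the AArch64 SPSR layout, bit 0 selects the stack pointer (0 for SP_EL0, 1 for SP_ELx), bits 3:2 hold the target exception level, bit 4 is clear for AArch64 state, and bits 9:6 are the D, A, I and F exception masks. The C sketch below (hypothetical names, host-side only) encodes and decodes those fields:

```c
#include <assert.h>
#include <stdint.h>

struct spsr_fields {
    unsigned el;      /* target exception level, bits 3:2 */
    unsigned sp_elx;  /* bit 0: 0 = use SP_EL0, 1 = use SP_ELx */
    unsigned aarch32; /* bit 4: 0 = AArch64 */
    unsigned daif;    /* bits 9:6: D, A, I, F exception masks */
};

struct spsr_fields spsr_decode(uint64_t spsr) {
    struct spsr_fields f;
    f.el = (unsigned)((spsr >> 2) & 0x3);
    f.sp_elx = (unsigned)(spsr & 0x1);
    f.aarch32 = (unsigned)((spsr >> 4) & 0x1);
    f.daif = (unsigned)((spsr >> 6) & 0xf);
    return f;
}

/* Builds an AArch64 SPSR value from its fields. */
uint64_t spsr_encode(unsigned el, unsigned sp_elx, unsigned daif) {
    assert(el <= 3 && sp_elx <= 1 && daif <= 0xf);
    return ((uint64_t)daif << 6) | ((uint64_t)el << 2) | sp_elx;
}
```

So 0x3c4 is `spsr_encode(1, 0, 0xf)`: return to EL1 in AArch64 state, using SP_EL0, with debug, SError, IRQ and FIQ all masked. Likewise, the two HCR_EL2 bits set in the listing are bit 31 (RW, which makes EL1 run in AArch64 state) and bit 1 (SWIO), and the low two bits of CNTHCTL_EL2 grant EL1 access to the physical counter and timer.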
  583. At this point, 14 days into the port, I successfully reached kmain on qemu-virt.
  584. Some initial kernel porting work was done after this, but when I was prepared to
  585. test it on real hardware I ran into this EL2 problem &mdash; the first kmain on
  586. real hardware ran at T+18.
  587. That sums it up for the aarch64 EFI bootloader work. 24 days later the kernel
  588. and userspace ports would be complete, and a couple of weeks after that it was
  589. running on stage at FOSDEM. The next post will cover the kernel port (maybe more
  590. than one post will be required, we'll see), and the final post will address the
  591. userspace port and the inner workings of the slidedeck demo that was shown on
  592. stage. Look forward to it, and thanks for reading!