
drewdevault.com

[mirror] blog and personal website of Drew DeVault
git clone https://hacktivis.me/git/mirror/drewdevault.com.git
commit: f2a63f8b8455572d2fe4fc67b63daa7424afd9e5
parent ffea51faf570a6f6a4022db6c003d3817804af0f
Author: Drew DeVault <sir@cmpwn.com>
Date:   Wed,  7 Sep 2022 14:11:48 +0200

Kernel hacking with Hare, part 1

Diffstat:

A content/blog/Kernel-hacking-with-Hare-part-1.md | 292 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 292 insertions(+), 0 deletions(-)

diff --git a/content/blog/Kernel-hacking-with-Hare-part-1.md b/content/blog/Kernel-hacking-with-Hare-part-1.md
@@ -0,0 +1,292 @@
+---
+title: Notes from kernel hacking in Hare, part 1
+date: 2022-09-07
+---
+
+One of the goals for the [Hare][0] programming language is to be able to write
+kernels, such as my [Helios][1] project. Kernels are complex beasts which exist
+in a somewhat unique problem space and have constraints that many userspace
+programs are not accustomed to. To illustrate this, I'm going to highlight a
+scenario where Hare's low-level types and manual memory management approach
+shines to enable a difficult use-case.
+
+[0]: https://harelang.org/
+[1]: https://git.sr.ht/~sircmpwn/helios
+
+Helios is a micro-kernel. During system initialization, its job is to load the
+initial task into memory, prepare the initial set of kernel objects for its use,
+provide it with information about the system, then jump to userspace and fuck
+off until someone needs it again. I'm going to focus on the "providing
+information" step here.
+
+The information the kernel needs to provide includes details about the
+capabilities that init has access to (such as working with I/O ports),
+information about system memory, the address of the framebuffer, and so on. This
+information is provided to init in the bootinfo structure, which is mapped into
+its address space, and passed to init via a register which points to this
+structure.[^1]
+
+[^1]: %rdi, if you were curious. Helios uses the System-V ABI, where %rdi is
+    used as the first parameter to a function call. This isn't exactly a function
+    call but the precedent is useful.
+
+```hare
+// The bootinfo structure.
+export type bootinfo = struct {
+	argv: str,
+
+	// Capability ranges
+	memory: cap_range,
+	devmem: cap_range,
+	userimage: cap_range,
+	stack: cap_range,
+	bootinfo: cap_range,
+	unused: cap_range,
+
+	// Other details
+	arch: *arch_bootinfo,
+	ipcbuf: uintptr,
+	modules: []module_desc,
+	memory_info: []memory_desc,
+	devmem_info: []memory_desc,
+	tls_base: uintptr,
+	tls_size: size,
+};
+```
+
+Parts of this structure are static (such as the capability number ranges for
+each capability assigned to init), and others are dynamic - such as structures
+describing the memory layout (N items where N is the number of memory regions),
+or the kernel command line. But, we're in a kernel -- dynamically allocating
+data is not so straightforward, especially for units smaller than a page\![^2]
+Moreover, the data structures allocated here need to be visible to userspace,
+and kernel memory is typically not available to userspace. A further
+complication is the three different address spaces we're working with here: a
+bootinfo object has a physical memory address, a kernel address, and a userspace
+address &mdash; three addresses to refer to a single object in different
+contexts.
+
+[^2]: 4096 bytes.
+
+Here's an example of what the code shown in this article is going to produce:
+
+![A 64 by 64 grid of cells representing a page of physical memory. The first set
+of cells are colored blue; the next set green; then purple; the remainder are
+brown.](https://l.sr.ht/FpGq.png)
+
+This is a single page of physical memory which has been allocated for the
+bootinfo data, where each cell is a byte. The bootinfo structure itself comes
+first, in blue. Following this is an arch-specific bootinfo structure, in green:
+
+```hare
+// x86_64-specific boot information
+export type arch_bootinfo = struct {
+	// Page table capabilities
+	pdpt: cap_range,
+	pd: cap_range,
+	pt: cap_range,
+
+	// vbe_mode_info physical address from multiboot (or zero)
+	vbe_mode_info: uintptr,
+};
+```
+
+After this, in purple, is the kernel command line. These three structures are
+always consistently allocated for any boot configuration, so the code which
+sets up the bootinfo page (the code we're going to read now) always provisions
+them. Following these three items is a large area of free space (indicated in
+brown) which will be used to populate further dynamically allocated bootinfo
+structures, such as descriptions of physical memory regions.
+
+The code to set this up is `bootinfo_init`, which is responsible for allocating
+a suitable page, filling in the bootinfo structure, and preparing a vector to
+dynamically allocate additional data on this page. It also sets up the arch
+bootinfo and argv, so the page looks like this diagram when the function
+returns. And here it is, in its full glory:
+
+```hare
+// Initializes the bootinfo context.
+export fn bootinfo_init(heap: *heap, argv: str) bootinfo_ctx = {
+	let cslot = caps::cslot { ... };
+	let page = objects::init(ctype::PAGE, &cslot, &heap.memory)!;
+	let phys = objects::page_phys(page);
+	let info = mem::phys_tokernel(phys): *bootinfo;
+
+	const bisz = size(bootinfo);
+	let bootvec = (info: *[*]u8)[bisz..arch::PAGESIZE][..0];
+
+	let ctx = bootinfo_ctx {
+		page = cslot,
+		info = info,
+		arch = null: *arch_bootinfo, // Fixed up below
+		bootvec = bootvec,
+	};
+
+	let (vec, user) = mkbootvec(&ctx, size(arch_bootinfo), size(uintptr));
+	ctx.arch = vec: *[*]u8: *arch_bootinfo;
+	info.arch = user: *arch_bootinfo;
+
+	let (vec, user) = mkbootvec(&ctx, len(argv), 1);
+	vec[..] = strings::toutf8(argv)[..];
+	info.argv = *(&types::string {
+		data = user: *[*]u8,
+		length = len(argv),
+		capacity = len(argv),
+	}: *str);
+
+	return ctx;
+};
+```
+
+The first four lines are fairly straightforward. Helios uses capability-based
+security, similar in design to [seL4][seL4]. All kernel objects &mdash; such as
+pages of physical memory &mdash; are utilized through the capability system. The
+first two lines set aside a slot to store the page capability in, then allocate
+a page using that slot. The next two lines grab the page's physical address and
+use `mem::phys_tokernel` to convert it to an address in the kernel's virtual
+address space, so that the kernel can write data to this page.
+
+[seL4]: https://sel4.systems/
+
+The next two lines are where it starts to get a little bit interesting:
+
+```hare
+const bisz = size(bootinfo);
+let bootvec = (info: *[*]u8)[bisz..arch::PAGESIZE][..0];
+```
+
+This casts the "info" variable (of type \*bootinfo) to a pointer to an
+*unbounded* array of bytes (\*\[\*\]u8). This is a little bit dangerous! Hare's
+arrays are bounds tested by default and using an unbounded type disables this
+safety feature. We want to get a bounded slice again soon, which is what the
+first slicing operator here does: `[bisz..arch::PAGESIZE]`. This obtains
+a *bounded* slice of bytes which starts from the end of the bootinfo structure
+and continues to the end of the page.
+
+The last expression, another slicing expression, is a little bit unusual. A
+slice type in Hare has the following internal representation:
+
+```hare
+type slice = struct {
+	data: nullable *void,
+	length: size,
+	capacity: size,
+};
+```
+
+When you slice an unbounded array, you get a slice whose length and capacity
+fields are equal to the length of the slicing operation, in this case
+`arch::PAGESIZE - bisz`. But when you slice a *bounded* slice, the length field
+takes on the length of the slicing expression but the capacity field is
+calculated from the original slice. So by slicing our new bounded slice to the
+0th index (\[..0\]), we obtain the following slice:
+
+```hare
+slice {
+	data = &(info: *[*]bootinfo)[1]: *[*]u8,
+	length = 0,
+	capacity = arch::PAGESIZE - bisz,
+};
+```
+
+In plain English, this is a slice whose base address is the address following
+the bootinfo structure and whose capacity is the remainder of the free space on
+its page, with a length of zero. This is something we can use <span
+class="rainbow">static append</span> with\![^3]
+
+[^3]: Thanks to [Rahul of W3Bits](https://w3bits.com/rainbow-text/) for this CSS.
+
+<style>
+.rainbow {
+	font-weight: bold;
+	background-image: linear-gradient(to left, violet, indigo, blue, green, yellow, orange, red);
+	background-clip: text;
+	background-size: 800% 800%;
+	animation: rainbow 8s ease infinite;
+	-webkit-text-fill-color: transparent;
+}
+
+@keyframes rainbow {
+	0%{background-position:0% 50%}
+	50%{background-position:100% 25%}
+	100%{background-position:0% 50%}
+}
+</style>
+
+```hare
+// Allocates a buffer in the bootinfo vector, returning the kernel vector and a
+// pointer to the structure in the init vspace.
+fn mkbootvec(info: *bootinfo_ctx, sz: size, al: size) ([]u8, uintptr) = {
+	const prevlen = len(info.bootvec);
+	let padding = 0z;
+	if (prevlen % al != 0) {
+		padding = al - prevlen % al;
+	};
+	static append(info.bootvec, [0...], sz + padding);
+	const vec = info.bootvec[prevlen + padding..];
+	return (vec, INIT_BOOTINFO_ADDR + size(bootinfo): uintptr + prevlen: uintptr);
+};
+```
+
+In Hare, slices can be dynamically grown and shrunk using the *append*,
+*insert*, and *delete* keywords. This is pretty useful, but not applicable for
+our kernel &mdash; remember, no dynamic memory allocation. Attempting to use
+append in Helios would cause a linking error because the necessary runtime code
+is absent from the kernel's Hare runtime. However, you can also *statically*
+append to a slice, as shown here. So long as the slice has a sufficient capacity
+to store the appended data, a static append or insert will succeed. If not, an
+assertion is thrown at runtime, much like a normal bounds test.
+
+This function makes good use of it to dynamically allocate memory from the
+bootinfo page. Given a desired size and alignment, it statically appends a
+suitable number of zeroes to the page, takes a slice of the new data, and
+returns both that slice (in the kernel's address space) and the address that
+data will have in the user address space. If we return to the earlier function,
+we can see how this is used to allocate space for the arch\_bootinfo structure:
+
+```hare
+let (vec, user) = mkbootvec(&ctx, size(arch_bootinfo), size(uintptr));
+ctx.arch = vec: *[*]u8: *arch_bootinfo;
+info.arch = user: *arch_bootinfo;
+```
+
+The "ctx" variable is used by the kernel to keep track of its state while
+setting up the init task, and we stash the kernel's pointer to this data
+structure in there, and the user's pointer in the bootinfo structure itself.
+
+This is also used to place argv into the bootinfo page:
+
+```hare
+let (vec, user) = mkbootvec(&ctx, len(argv), 1);
+vec[..] = strings::toutf8(argv)[..];
+info.argv = *(&types::string {
+	data = user: *[*]u8,
+	length = len(argv),
+	capacity = len(argv),
+}: *str);
+```
+
+Here we allocate a vector whose length is the length of the argument string,
+with an alignment of one, and then copy argv into it as a UTF-8 slice. Slice
+copy expressions like this one are a type-safe and memory-safe way to memcpy in
+Hare. Then we do something a bit more interesting.
+
+Like slices, strings have an internal representation in Hare which includes a
+data pointer, length, and capacity. The types module provides a struct with this
+representation so that you can do low-level string manipulation in Hare should
+the task call for it. Hare's syntax allows us to take the address of a literal
+value, such as a types::string struct, using the & operator. Then we cast it to
+a pointer to a string and dereference it. Ta-da! We set the bootinfo argv field
+to a str value which uses the user address of the argument vector.
+
+Some use-cases call for this level of fine control over the precise behavior of
+your program. Hare's goal is to accommodate this need with little fanfare. Here
+we've drawn well outside of the lines of Hare's safety features, but sometimes
+it's useful and necessary to do so. And Hare provides us with the tools to get
+the safety harness back on quickly, such as we saw with the construction of the
+bootvec slice. This code is pretty weird but to an experienced Hare programmer
+(which, I must admit, the world has very few of) it should make sense.
+
+I hope you found this interesting! I'm going back to kernel hacking. Next up is
+loading the userspace ELF image into its address space. I had this working
+before but decided to rewrite it. Wish me good luck!