commit: 6d075a727b0748af0d8952ced9ebdf10713972b6
parent 8d8096d33335eeb63c62ccecb58e2e09da3b6afd
Author: Drew DeVault <sir@cmpwn.com>
Date: Sun, 2 Oct 2022 23:33:32 +0200
Kernel hacking part 2
Diffstat:
1 file changed, 492 insertions(+), 0 deletions(-)
diff --git a/content/blog/Kernel-hacking-with-Hare-part-2.md b/content/blog/Kernel-hacking-with-Hare-part-2.md
@@ -0,0 +1,492 @@
+---
+title: "Notes from kernel hacking in Hare, part 2: multi-threading"
+date: 2022-10-02
+---
+
+I have long promised that Hare would not have multi-threading, and it seems that
+I have broken that promise. However, I have remained true to the
+not-invented-here approach which is typical of my style by introducing it only
+after designing an entire kernel to implement it on top of.[^1]
+
+[^1]: Jokes aside, for those curious about multi-threading and Hare: our
+ official stance is not actually as strict as "no threads, period", though in
+ practice for many people it might amount to that. There is nothing stopping
+ you from linking to pthreads or calling clone(2) to spin up threads in a Hare
+ program, but the standard library explicitly provides no multi-threading
+ support, synchronization primitives, or re-entrancy guarantees. That's not to
+ say, however, that one could not build their own Hare standard library which
+ does offer these features — and, in fact, that is exactly what the
+ Vulcan test framework for Helios provides in its Hare libraries.
+
+For some background, [Helios][0] is a micro-kernel written in Hare. In addition
+to the kernel itself, the [Vulcan][1] system is a small userspace designed to
+test it.
+
+[0]: https://git.sr.ht/~sircmpwn/helios
+[1]: https://git.sr.ht/~sircmpwn/helios/tree/master/item/vulcan
+
+![A picture of a laptop running Helios and showing the results of the Vulcan test suite](https://l.sr.ht/JIya.jpg)
+
+While I don't anticipate multi-threaded processes playing a huge role in the
+complete Ares system in the future, they do have a place. In the long term, I
+would like to be able to provide an implementation of pthreads for porting
+existing software to the system. A more immediate concern is how to test the
+various kernel primitives provided by Helios, such as those which facilitate
+inter-process communication (IPC). It's much easier to test these with threads
+than with processes, since spawning threads does not require spinning up a new
+address space.
+
+```hare
+@test fn notification::wait() void = {
+ const note = helios::newnote()!;
+ defer helios::destroy(note)!;
+
+ const thread = threads::new(¬ification_wait, note)!;
+ threads::start(thread)!;
+ defer threads::join(thread)!;
+
+ helios::signal(note)!;
+};
+
+fn notification_wait(note: u64) void = {
+ const note = note: helios::cap;
+ helios::wait(note)!;
+};
+```
+
+So how does it work? Let's split this up into two domains: kernelspace and
+userspace.
+
+### Threads in the kernel
+
+The basic primitive for threads and processes in Helios is a "task", which is
+simply an object which receives some CPU time. A task has a capability space (so
+it can invoke operations against kernel objects), an virtual address space (so
+it has somewhere to map the process image and memory), and some state, such as
+the values of its CPU registers. The task-related structures are:
+
+```hare
+// A task capability.
+export type task = struct {
+ caps::capability,
+ state: uintptr,
+ @offset(caps::LINK_OFFS) link: caps::link,
+};
+
+// Scheduling status of a task.
+export type task_status = enum {
+ ACTIVE,
+ BLOCKED, // XXX: Can a task be both blocked and suspended?
+ SUSPENDED,
+};
+
+// State for a task.
+export type taskstate = struct {
+ regs: arch::state,
+ cspace: caps::cslot,
+ vspace: caps::cslot,
+ ipc_buffer: uintptr,
+ status: task_status,
+ // XXX: This is a virtual address, should be physical
+ next: nullable *taskstate,
+ prev: nullable *taskstate,
+};
+```
+
+Here's a footnote to explain some off-topic curiosities about this code: [^2]
+
+[^2]: Capabilities are essentially references to kernel objects. The kernel
+ object for a task is the taskstate struct, and there can be many task
+ capabilities which refer to this. Any task which possesses a task capability in
+ its capability space can invoke operations against this task, such as reading
+ or writing its registers.<br /><br />
+ The link field is used to create a linked list of capabilities across the
+ system. It has a doubly linked list for the next and previous capability, and
+ a link to its parent capability, such as the memory capability from which the
+ task state was allocated. The list is organized such that copies of the same
+ capability are always adjacent to one another, and children always follow
+ their parents.
+ <br /><br />
+ The answer to the XXX comment in task\_status is yes, by the way. Something to
+ fix later.
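+
+To make the shape of that list a little more concrete, here is a hedged sketch
+of roughly what such a link structure could look like; the field names and
+types are assumptions on my part rather than the actual Helios definitions:
+
+```hare
+// A sketch only: each capability links to its neighbors in the global
+// capability list, and to the capability it was derived from.
+export type link = struct {
+	next: nullable *capability,
+	prev: nullable *capability,
+	parent: nullable *capability,
+};
+```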
+
+The most interesting part of this structure is arch::state, which stores the
+task's CPU registers. On x86\_64,[^3] this structure is defined as follows:
+
+[^3]: Only x86\_64 is supported for now, but a RISC-V port is in-progress and I
+ intend to do arm64 in the future.
+
+```hare
+export type state = struct {
+ fs: u64,
+ fsbase: u64,
+
+ r15: u64,
+ r14: u64,
+ r13: u64,
+ r12: u64,
+ r11: u64,
+ r10: u64,
+ r9: u64,
+ r8: u64,
+ rbp: u64,
+ rdi: u64,
+ rsi: u64,
+ rdx: u64,
+ rcx: u64,
+ rbx: u64,
+ rax: u64,
+
+ intno: u64,
+ errcode: u64,
+
+ rip: u64,
+ cs: u64,
+ rflags: u64,
+ rsp: u64,
+ ss: u64,
+};
+```
+
+This structure is organized in part according to hardware constraints and in
+part at the discretion of the kernel implementer. The last five fields, from
+%rip to %ss, are constrained by the hardware. When an interrupt occurs, the CPU
+pushes each of these registers to the stack, in this order, then transfers
+control to the system interrupt handler. The next two fields, the interrupt
+number and error code, serve a special purpose within our interrupt
+implementation, and the remainder are ordered arbitrarily.
+
+In order to switch between two tasks, we need to save all of this state
+somewhere, then load the same state for another task when returning from the
+kernel to userspace. The save/restore process is handled in the interrupt
+handler, in assembly:
+
+```
+.global isr_common
+isr_common:
+ _swapgs
+ push %rax
+ push %rbx
+ push %rcx
+ push %rdx
+ push %rsi
+ push %rdi
+ push %rbp
+ push %r8
+ push %r9
+ push %r10
+ push %r11
+ push %r12
+ push %r13
+ push %r14
+ push %r15
+
+	// Note: fsbase is handled elsewhere; $0 is pushed as a placeholder for it
+ push $0
+ push %fs
+
+ cld
+
+ mov %rsp, %rdi
+ mov $_kernel_stack_top, %rsp
+ call arch.isr_handler
+_isr_exit:
+ mov %rax, %rsp
+
+	// Note: fsbase is handled elsewhere; the first pop %r15 below merely
+	// discards its placeholder slot
+	pop %fs
+	pop %r15
+
+ pop %r15
+ pop %r14
+ pop %r13
+ pop %r12
+ pop %r11
+ pop %r10
+ pop %r9
+ pop %r8
+ pop %rbp
+ pop %rdi
+ pop %rsi
+ pop %rdx
+ pop %rcx
+ pop %rbx
+ pop %rax
+
+ _swapgs
+
+ // Clean up error code and interrupt #
+ add $16, %rsp
+
+ iretq
+```
+
+I'm not going to go into too much detail on interrupts for this post (maybe in a
+later post), but what's important here is the chain of push/pop instructions.
+This automatically saves the CPU state for each task when entering the kernel.
+The syscall handler has something similar.
+
+This raises a question: where's the stack?
+
+Helios has a single kernel stack,[^4] whose address, $\_kernel\_stack\_top, is
+loaded into %rsp in this code. This is different from systems like Linux,
+which have one kernel stack per thread; the rationale behind this design choice
+is out of scope for this post.[^5] However, the "stack" being pushed to here is
+not, in fact, a traditional stack.
+
+[^4]: For now; in the future it will have one stack per CPU.
+[^5]: Man, I could just go on and on and on.
+
+x86\_64 has an interesting feature wherein an interrupt can be configured to use
+a special "interrupt stack". The [task state segment][tss] is a bit of a
+historical artifact which is of little interest to Helios, but in long mode
+(64-bit mode) it serves a new purpose: to provide a list of addresses where
+up to seven interrupt stacks are stored. The [interrupt descriptor table][idt]
+includes a 3-bit "IST" field which, when nonzero, instructs the CPU to set the
+stack pointer to the corresponding address in the TSS when that interrupt fires.
+Helios sets all of these to one, then does something interesting:
+
+[tss]: https://wiki.osdev.org/Task_State_Segment
+[idt]: https://wiki.osdev.org/Interrupt_Descriptor_Table#Structure_on_x86-64
+
+```hare
+// Stores a pointer to the current state context.
+export let context: **state = null: **state;
+
+fn init_tss(i: size) void = {
+ cpus[i].tstate = taskstate { ... };
+ context = &cpus[i].tstate.ist[0]: **state;
+};
+
+// ...
+
+export fn save() void = {
+ // On x86_64, most registers are saved and restored by the ISR or
+ // syscall service routines.
+ let active = *context: *[*]state;
+ let regs = &active[-1];
+
+ regs.fsbase = rdmsr(0xC0000100);
+};
+
+export fn restore(regs: *state) void = {
+ wrmsr(0xC0000100, regs.fsbase);
+
+ const regs = regs: *[*]state;
+ *context = ®s[1];
+};
+```
+
+We store a pointer to the active task's state struct in the TSS when we enter
+userspace, and when an interrupt occurs, the CPU automatically loads that
+pointer into %rsp so we can trivially push all of the task's registers into the
+state struct.
+
+There is some weirdness to note here: the stack grows downwards. Each time you
+push, the stack pointer is decremented, then the pushed value is written there.
+So, we have to fill in this structure from the bottom up. Accordingly, we have
+to do something a bit unusual here: we don't store a pointer to the context
+object, but a pointer to the *end* of the context object. This is what
+&active[-1] does here.
+
+Hare has some memory safety features by default, such as bounds testing array
+accesses. Here we have to take advantage of some of Hare's escape hatches to
+accomplish the goal. First, we cast the pointer to an *unbounded array* of
+states — that's what the \*\[\*\] is for. Then we can take the address of
+element -1 without the compiler snitching on us.
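+
+As a hedged, standalone illustration of the same trick (this is not code from
+Helios, and the buffer here is purely hypothetical):
+
+```hare
+// Casting a pointer to an unbounded array type ([*]) disables bounds
+// checking, so we can take the address of the element *before* the pointer.
+fn example() void = {
+	let buf: [4]u64 = [1, 2, 3, 4];
+	let mid = &buf[2]: *[*]u64;
+	let prev = &mid[-1]; // points at buf[1]
+	assert(*prev == 2);
+};
+```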
+
+There is also a separate step here to save the fsbase register. This will be
+important later.
+
+This provides us with enough pieces to enter userspace:
+
+```hare
+// Immediately enters this task in userspace. Only used during system
+// initialization.
+export @noreturn fn enteruser(task: *caps::capability) void = {
+ const state = objects::task_getstate(task);
+ assert(objects::task_schedulable(state));
+ active = state;
+ objects::vspace_activate(&state.vspace)!;
+ arch::restore(&state.regs);
+ arch::enteruser();
+};
+```
+
+What we need next is a scheduler, and a periodic interrupt to invoke it, so that
+we can switch tasks every so often.
+
+Scheduler design is a complex subject, with performance and complexity
+implications ranging from subtle to substantial. For Helios's present
+needs we use a simple round-robin scheduler: each task gets the same time
+slice and we just switch to them one after another.
+
+The easy part is simply getting periodic interrupts. Again, this blog post isn't
+about interrupts, so I'll just give you the reader's digest:
+
+```hare
+arch::install_irq(arch::PIT_IRQ, &pit_irq);
+arch::pit_setphase(100);
+
+// ...
+
+fn pit_irq(state: *arch::state, irq: u8) void = {
+ sched::switchtask();
+ arch::pic_eoi(arch::PIT_IRQ);
+};
+```
+
+The PIT, or programmable interval timer, is a feature on x86\_64 which provides
+exactly what we need: periodic interrupts. This code configures it to tick at
+100 Hz and sets up a little IRQ handler which calls sched::switchtask to perform
+the actual context switch.
+
+Recall that, by the time sched::switchtask is invoked, the CPU and interrupt
+handler have already stashed all of the current task's registers into its state
+struct. All we have to do now is pick out the next task and restore its state.
+
+```hare
+// see idle.s
+let idle: arch::state;
+
+// Switches to the next task.
+export fn switchtask() void = {
+ // Save state
+ arch::save();
+
+ match (next()) {
+ case let task: *objects::taskstate =>
+ active = task;
+ objects::vspace_activate(&task.vspace)!;
+ arch::restore(&task.regs);
+ case null =>
+ arch::restore(&idle);
+ };
+};
+
+fn next() nullable *objects::taskstate = {
+ let next = active.next;
+ for (next != active) {
+ if (next == null) {
+ next = tasks;
+ continue;
+ };
+ const cand = next as *objects::taskstate;
+ if (objects::task_schedulable(cand)) {
+ return cand;
+ };
+ next = cand.next;
+ };
+ const next = next as *objects::taskstate;
+ if (objects::task_schedulable(next)) {
+ return next;
+ };
+ return null;
+};
+```
+
+Pretty straightforward. The scheduler [maintains a linked list][list] of tasks,
+picks the next one which is schedulable,[^6] then runs it. If there are no
+schedulable tasks, it runs the idle task.
+
+[list]: https://git.sr.ht/~sircmpwn/helios/tree/673c27e57684deeedbe88e1e6b7b940fdca5fc87/item/sched/tasks.ha
+
+[^6]: A task is schedulable if it is configured properly (with a cspace, vspace,
+ and IPC buffer) and is not currently blocking (i.e. waiting on I/O or
+ something).
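+
+As a rough illustration of that rule, here is a hedged sketch of what the
+schedulability check might boil down to; the real task\_schedulable in Helios
+may well check more than this (the cspace and vspace checks are elided here):
+
+```hare
+// A sketch only: a task can be scheduled once it has an IPC buffer configured
+// and is neither blocked nor suspended.
+export fn task_schedulable(task: *taskstate) bool = {
+	return task.ipc_buffer != 0 && task.status == task_status::ACTIVE;
+};
+```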
+
+Err, wait, what's the idle task? Simple: it's another state object (i.e. a set
+of CPU registers) which essentially works as a statically allocated do-nothing
+thread.
+
+```hare
+const idle_frame: [2]uintptr = [0, &pause: uintptr];
+
+// Initializes the state for the idle thread.
+export fn init_idle(idle: *state) void = {
+ *idle = state {
+ cs = seg::KCODE << 3,
+ ss = seg::KDATA << 3,
+ rflags = (1 << 21) | (1 << 9),
+ rip = &pause: uintptr: u64,
+ rbp = &idle_frame: uintptr: u64,
+ ...
+ };
+};
+```
+
+"pause" is a simple loop:
+
+```
+.global arch.pause
+arch.pause:
+ hlt
+ jmp arch.pause
+```
+
+In the situation where every task is blocking on I/O, there's nothing for the
+CPU to do until the operation finishes. So, we simply halt and wait for the next
+interrupt to wake us back up, hopefully unblocking some tasks so we can schedule
+them again. A more sophisticated kernel might take this opportunity to go into a
+lower power state, but for now this is quite sufficient.
+
+With this last piece in place, we now have a multi-threaded operating system.
+But there is one more piece to consider: when a task yields its time slice.
+
+Just because a task receives CPU time does not mean that it needs to use it. A
+task which has nothing useful to do can yield its time slice back to the kernel
+through the "yieldtask" syscall. On the face of it, this is quite simple:
+
+```hare
+// Yields the current time slice and switches to the next task.
+export @noreturn fn yieldtask() void = {
+ arch::sysret_set(&active.regs, 0, 0);
+ switchtask();
+ arch::enteruser();
+};
+```
+
+The "sysret\_set" updates the registers in the task state which correspond with
+system call return values to (0, 0), indicating a successful return from the
+yield syscall. But we don't actually return at all: we switch to the next task
+and then return to *that*.
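+
+For illustration, here is a hedged sketch of what sysret\_set could look like;
+which registers Helios actually uses to carry syscall return values is an
+assumption on my part (rax and rdx are typical choices on x86\_64):
+
+```hare
+// A sketch only: write the two return values into the saved register state,
+// to be picked up when this task next returns to userspace.
+export fn sysret_set(regs: *state, status: u64, value: u64) void = {
+	regs.rax = status;
+	regs.rdx = value;
+};
+```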
+
+In addition to being called from userspace, yieldtask is also useful whenever the
+kernel blocks a thread on some I/O or IPC operation. For example, tasks can wait
+on "notification" objects, which another task can signal to wake them up —
+a simple synchronization primitive. The implementation makes good use of
+sched::yieldtask:
+
+```hare
+// Blocks the active task until this notification is signalled. Does not return
+// if the operation is blocking.
+export fn wait(note: *caps::capability) uint = {
+ match (nbwait(note)) {
+ case let word: uint =>
+ return word;
+ case errors::would_block =>
+ let note = note_getstate(note);
+ assert(note.recv == null); // TODO: support multiple receivers
+ note.recv = sched::active;
+ sched::active.status = task_status::BLOCKED;
+ sched::yieldtask();
+ };
+};
+```
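+
+And for completeness, a hedged sketch of the corresponding signal path; the
+"word" field and the exact wake-up mechanics here are assumptions on my part,
+not the actual Helios implementation:
+
+```hare
+// A sketch only: record the signal, then unblock the waiting task (if any) so
+// that the scheduler will consider it again.
+export fn signal(note: *caps::capability) void = {
+	let note = note_getstate(note);
+	note.word |= 1;
+	match (note.recv) {
+	case null => void;
+	case let task: *taskstate =>
+		task.status = task_status::ACTIVE;
+		note.recv = null;
+	};
+};
+```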
+
+Finally, that's the last piece.
+
+### Threads in userspace
+
+Phew! That was a lot of kernel pieces to unpack. And now for userspace... in the
+next post! This one is getting pretty long. Here's what you have to look forward
+to:
+
+- Preparing the task and all of the objects it needs (such as a stack)
+- High-level operations: join, detach, exit, suspend, etc
+- Thread-local storage...
+ - in the Hare compiler
+ - in the ELF loader
+ - at runtime
+- Putting it all together to test the kernel
+
+We'll see you next time!