---
title: "Notes from kernel hacking in Hare, part 2: multi-threading"
date: 2022-10-02
---

I have long promised that Hare would not have multi-threading, and it seems that
I have broken that promise. However, I have remained true to the
not-invented-here approach which is typical of my style by introducing it only
after designing an entire kernel to implement it on top of.[^1]

[^1]: Jokes aside, for those curious about multi-threading and Hare: our
official stance is not actually as strict as "no threads, period", though in
practice for many people it might amount to that. There is nothing stopping
you from linking to pthreads or calling clone(2) to spin up threads in a Hare
program, but the standard library explicitly provides no multi-threading
support, synchronization primitives, or re-entrancy guarantees. That's not to
say, however, that one could not build their own Hare standard library which
does offer these features — and, in fact, that is exactly what the
Vulcan test framework for Helios provides in its Hare libraries.

For some background, [Helios][0] is a micro-kernel written in Hare. In addition
to the kernel, the [Vulcan][1] system is a small userspace designed to test it.

[0]: https://git.sr.ht/~sircmpwn/helios
[1]: https://git.sr.ht/~sircmpwn/helios/tree/master/item/vulcan

![A picture of a laptop running Helios and showing the results of the Vulcan test suite](https://l.sr.ht/JIya.jpg)

While I don't anticipate multi-threaded processes playing a huge role in the
complete Ares system in the future, they do have a place. In the long term, I
would like to be able to provide an implementation of pthreads for porting
existing software to the system. A more immediate concern is how to test the
various kernel primitives provided by Helios, such as those which facilitate
inter-process communication (IPC). It's much easier to test these with threads
than with processes, since spawning threads does not require spinning up a new
address space.

```hare
@test fn notification::wait() void = {
	const note = helios::newnote()!;
	defer helios::destroy(note)!;
	// Spawn a thread which blocks on the notification, then signal it.
	const thread = threads::new(&notification_wait, note)!;
	threads::start(thread)!;
	defer threads::join(thread)!;
	helios::signal(note)!;
};

fn notification_wait(note: u64) void = {
	const note = note: helios::cap;
	helios::wait(note)!;
};
```

So how does it work? Let's split this up into two domains: kernelspace and
userspace.

### Threads in the kernel

The basic primitive for threads and processes in Helios is a "task", which is
simply an object which receives some CPU time. A task has a capability space (so
it can invoke operations against kernel objects), a virtual address space (so
it has somewhere to map the process image and memory), and some state, such as
the values of its CPU registers. The task-related structures are:

```hare
// A task capability.
export type task = struct {
	caps::capability,
	state: uintptr,
	@offset(caps::LINK_OFFS) link: caps::link,
};

// Scheduling status of a task.
export type task_status = enum {
	ACTIVE,
	BLOCKED, // XXX: Can a task be both blocked and suspended?
	SUSPENDED,
};

// State for a task.
export type taskstate = struct {
	regs: arch::state,
	cspace: caps::cslot,
	vspace: caps::cslot,
	ipc_buffer: uintptr,
	status: task_status,
	// XXX: This is a virtual address, should be physical
	next: nullable *taskstate,
	prev: nullable *taskstate,
};
```

Here's a footnote to explain some off-topic curiosities about this code:[^2]

[^2]: Capabilities are essentially references to kernel objects. The kernel
object for a task is the taskstate struct, and there can be many task
capabilities which refer to this. Any task which possesses a task capability in
its capability space can invoke operations against this task, such as reading
or writing its registers.<br /><br />
The link field is used to create a linked list of capabilities across the
system. It has a doubly linked list for the next and previous capability, and
a link to its parent capability, such as the memory capability from which the
task state was allocated. The list is organized such that copies of the same
capability are always adjacent to one another, and children always follow
their parents.
<br /><br />
The answer to the XXX comment in task\_status is yes, by the way. Something to
fix later.

The most interesting part of this structure is arch::state, which stores the
task's CPU registers. On x86\_64,[^3] this structure is defined as follows:

[^3]: Only x86\_64 is supported for now, but a RISC-V port is in-progress and I
intend to do arm64 in the future.

```hare
export type state = struct {
	fs: u64,
	fsbase: u64,
	r15: u64,
	r14: u64,
	r13: u64,
	r12: u64,
	r11: u64,
	r10: u64,
	r9: u64,
	r8: u64,
	rbp: u64,
	rdi: u64,
	rsi: u64,
	rdx: u64,
	rcx: u64,
	rbx: u64,
	rax: u64,
	intno: u64,
	errcode: u64,
	rip: u64,
	cs: u64,
	rflags: u64,
	rsp: u64,
	ss: u64,
};
```

This structure is organized in part according to hardware constraints and in
part at the discretion of the kernel implementer. The last five fields, from
%rip to %ss, are constrained by the hardware. When an interrupt occurs, the CPU
pushes each of these registers to the stack (starting with %ss and ending with
%rip, so that %rip lands at the lowest address), then transfers control to the
system interrupt handler. The next two registers serve a special purpose within
our interrupt implementation, and the remainder are ordered arbitrarily.

In order to switch between two tasks, we need to save all of this state
somewhere, then load the same state for another task when returning from the
kernel to userspace. The save/restore process is handled in the interrupt
handler, in assembly:

```
.global isr_common
isr_common:
	_swapgs
	push %rax
	push %rbx
	push %rcx
	push %rdx
	push %rsi
	push %rdi
	push %rbp
	push %r8
	push %r9
	push %r10
	push %r11
	push %r12
	push %r13
	push %r14
	push %r15
	// Note: fsbase is handled elsewhere
	push $0
	push %fs
	cld
	mov %rsp, %rdi
	mov $_kernel_stack_top, %rsp
	call arch.isr_handler

_isr_exit:
	mov %rax, %rsp
	// Note: fsbase is handled elsewhere
	pop %fs
	pop %r15	// discards the fsbase slot pushed above with "push $0"
	pop %r15
	pop %r14
	pop %r13
	pop %r12
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rbp
	pop %rdi
	pop %rsi
	pop %rdx
	pop %rcx
	pop %rbx
	pop %rax
	_swapgs
	// Clean up error code and interrupt #
	add $16, %rsp
	iretq
```

I'm not going to go into too much detail on interrupts for this post (maybe in a
later post), but what's important here is the chain of push/pop instructions.
This automatically saves the CPU state for each task when entering the kernel.
The syscall handler has something similar.

This raises a question: where's the stack?

Helios has a single kernel stack,[^4] which is moved to %rsp from
$\_kernel\_stack\_top in this code. This is different from systems like Linux,
which have one kernel stack per thread; the rationale behind this design choice
is out of scope for this post.[^5] However, the "stack" being pushed to here is
not, in fact, a traditional stack.

[^4]: For now; in the future it will have one stack per CPU.

[^5]: Man, I could just go on and on and on.

x86\_64 has an interesting feature wherein an interrupt can be configured to use
a special "interrupt stack". The [task state segment][tss] is a bit of a
historical artifact which is of little interest to Helios, but in long mode
(64-bit mode) it serves a new purpose: to provide a list of addresses where
up to seven interrupt stacks are stored. The [interrupt descriptor table][idt]
includes a 3-bit "IST" field which, when nonzero, instructs the CPU to set the
stack pointer to the corresponding address in the TSS when that interrupt fires.
Helios sets all of these to one, then does something interesting:

[tss]: https://wiki.osdev.org/Task_State_Segment
[idt]: https://wiki.osdev.org/Interrupt_Descriptor_Table#Structure_on_x86-64

```hare
// Stores a pointer to the current state context.
export let context: **state = null: **state;

fn init_tss(i: size) void = {
	cpus[i].tstate = taskstate { ... };
	context = &cpus[i].tstate.ist[0]: **state;
};

// ...

export fn save() void = {
	// On x86_64, most registers are saved and restored by the ISR or
	// syscall service routines.
	let active = *context: *[*]state;
	let regs = &active[-1];
	regs.fsbase = rdmsr(0xC0000100);
};

export fn restore(regs: *state) void = {
	wrmsr(0xC0000100, regs.fsbase);
	const regs = regs: *[*]state;
	*context = &regs[1];
};
```

We store a pointer to the active task's state struct in the TSS when we enter
userspace, and when an interrupt occurs, the CPU automatically loads that
pointer into %rsp so we can trivially push all of the task's registers into it.

There is some weirdness to note here: the stack grows downwards. Each time you
push, the stack pointer is decremented, then the pushed value is written there.
So, we have to fill in this structure from the bottom up. Accordingly, we have
to do something a bit unusual here: we don't store a pointer to the context
object, but a pointer to the *end* of the context object. This is what
&active[-1] does here.

Hare has some memory safety features by default, such as bounds testing array
accesses. Here we have to take advantage of some of Hare's escape hatches to
accomplish the goal. First, we cast the pointer to an *unbounded array* of
states &mdash; that's what the \*\[\*\] is for. Then we can take the address of
element -1 without the compiler snitching on us.
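
As a toy illustration of this escape hatch (hypothetical code, not from the
kernel): once a pointer has been cast to an unbounded array type, indexing is
no longer checked, and keeping it within the underlying object becomes our
responsibility rather than the compiler's.

```hare
// Toy example: bounds checks apply to buf, but not to cells, so we can form
// a pointer one element past the end of the array (or, as in save() above,
// one element before a pointer we already hold).
fn one_past_end(buf: *[4]u64) *u64 = {
	const cells = buf: *[*]u64;
	return &cells[4];
};
```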

There is also a separate step here to save the fsbase register. This will be
important later.

This provides us with enough pieces to enter userspace:

```hare
// Immediately enters this task in userspace. Only used during system
// initialization.
export @noreturn fn enteruser(task: *caps::capability) void = {
	const state = objects::task_getstate(task);
	assert(objects::task_schedulable(state));
	active = state;
	objects::vspace_activate(&state.vspace)!;
	arch::restore(&state.regs);
	arch::enteruser();
};
```

What we need next is a scheduler, and a periodic interrupt to invoke it, so that
we can switch tasks every so often.

Scheduler design is a complex subject, with performance and complexity
implications ranging from subtle to substantial. For Helios's present needs we
use a simple round-robin scheduler: each task gets the same time slice and we
just switch to them one after another.

The easy part is simply getting periodic interrupts. Again, this blog post isn't
about interrupts, so I'll just give you the reader's digest:

```hare
arch::install_irq(arch::PIT_IRQ, &pit_irq);
arch::pit_setphase(100);

// ...

fn pit_irq(state: *arch::state, irq: u8) void = {
	sched::switchtask();
	arch::pic_eoi(arch::PIT_IRQ);
};
```

The PIT, or programmable interval timer, is a feature on x86\_64 which provides
exactly what we need: periodic interrupts. This code configures it to tick at
100 Hz (an interrupt every 10 ms) and sets up a little IRQ handler which calls
sched::switchtask to perform the actual context switch.
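
Setting the PIT's rate comes down to programming its divisor: the timer's
roughly 1.193182 MHz input clock is divided down to the requested frequency.
A rough sketch of how such a routine typically looks (hypothetical code, not
the kernel's actual implementation; outb stands in for whatever port I/O helper
the kernel provides):

```hare
// Sketch: program PIT channel 0 to fire hz interrupts per second.
// outb() is a stand-in for the kernel's port I/O helper.
fn pit_setphase(hz: u32) void = {
	const divisor = 1193182 / hz;
	outb(0x43, 0x36); // channel 0, lobyte/hibyte access, mode 3 (square wave)
	outb(0x40, (divisor & 0xff): u8); // divisor low byte
	outb(0x40, ((divisor >> 8) & 0xff): u8); // divisor high byte
};
```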

Recall that, by the time sched::switchtask is invoked, the CPU and interrupt
handler have already stashed all of the current task's registers into its state
struct. All we have to do now is pick out the next task and restore its state.

```hare
// see idle.s
let idle: arch::state;

// Switches to the next task.
export fn switchtask() void = {
	// Save state
	arch::save();

	match (next()) {
	case let task: *objects::taskstate =>
		active = task;
		objects::vspace_activate(&task.vspace)!;
		arch::restore(&task.regs);
	case null =>
		arch::restore(&idle);
	};
};

fn next() nullable *objects::taskstate = {
	let next = active.next;
	for (next != active) {
		if (next == null) {
			next = tasks;
			continue;
		};
		const cand = next as *objects::taskstate;
		if (objects::task_schedulable(cand)) {
			return cand;
		};
		next = cand.next;
	};
	const next = next as *objects::taskstate;
	if (objects::task_schedulable(next)) {
		return next;
	};
	return null;
};
```

Pretty straightforward. The scheduler [maintains a linked list][list] of tasks,
picks the next one which is schedulable,[^6] then runs it. If there are no
schedulable tasks, it runs the idle task.

[list]: https://git.sr.ht/~sircmpwn/helios/tree/673c27e57684deeedbe88e1e6b7b940fdca5fc87/item/sched/tasks.ha

[^6]: A task is schedulable if it is configured properly (with a cspace, vspace,
and IPC buffer) and is not currently blocking (i.e. waiting on I/O or
something).
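
Spelled out in code, that schedulability check presumably amounts to something
like this (a hypothetical sketch, not the kernel's actual routine; configured()
stands in for however the real code validates the capability slots):

```hare
// Hypothetical sketch of objects::task_schedulable as described in the
// footnote above; configured() is a stand-in, not a real helper.
export fn task_schedulable(task: *taskstate) bool = {
	return configured(&task.cspace)
		&& configured(&task.vspace)
		&& task.ipc_buffer != 0: uintptr
		&& task.status == task_status::ACTIVE;
};
```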

Err, wait, what's the idle task? Simple: it's another state object (i.e. a set
of CPU registers) which essentially works as a statically allocated do-nothing
thread.

```hare
const idle_frame: [2]uintptr = [0, &pause: uintptr];

// Initializes the state for the idle thread.
export fn init_idle(idle: *state) void = {
	*idle = state {
		cs = seg::KCODE << 3,
		ss = seg::KDATA << 3,
		rflags = (1 << 21) | (1 << 9),
		rip = &pause: uintptr: u64,
		rbp = &idle_frame: uintptr: u64,
		...
	};
};
```
  344. "pause" is a simple loop:
  345. ```hare
  346. .global arch.pause
  347. arch.pause:
  348. hlt
  349. jmp arch.pause
  350. ```

In the situation where every task is blocking on I/O, there's nothing for the
CPU to do until the operation finishes. So, we simply halt and wait for the next
interrupt to wake us back up, hopefully unblocking some tasks so we can schedule
them again. A more sophisticated kernel might take this opportunity to go into a
lower power state, perhaps, but for now this is quite sufficient.

With this last piece in place, we now have a multi-threaded operating system.
But there is one more piece to consider: when a task yields its time slice.

Just because a task receives CPU time does not mean that it needs to use it. A
task which has nothing useful to do can yield its time slice back to the kernel
through the "yieldtask" syscall. On the face of it, this is quite simple:

```hare
// Yields the current time slice and switches to the next task.
export @noreturn fn yieldtask() void = {
	arch::sysret_set(&active.regs, 0, 0);
	switchtask();
	arch::enteruser();
};
```

The "sysret\_set" call updates the registers in the task state which correspond
to system call return values, setting them to (0, 0) to indicate a successful
return from the yield syscall. But we don't actually return at all: we switch
to the next task and then return to *that*.
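
Which registers carry those return values is an ABI decision; purely as an
illustration (the choice of %rax and %rdx here is an assumption, not Helios's
documented ABI), such a helper might look like:

```hare
// Illustrative sketch only: stash a syscall's two return values in the
// registers the (assumed) syscall ABI uses for returns.
fn sysret_set(regs: *state, a: u64, b: u64) void = {
	regs.rax = a;
	regs.rdx = b;
};
```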

In addition to being called from userspace, this is also useful whenever the
kernel blocks a thread on some I/O or IPC operation. For example, tasks can wait
on "notification" objects, which another task can signal to wake them up &mdash;
a simple synchronization primitive. The implementation makes good use of
sched::yieldtask:

```hare
// Blocks the active task until this notification is signalled. Does not return
// if the operation is blocking.
export fn wait(note: *caps::capability) uint = {
	match (nbwait(note)) {
	case let word: uint =>
		return word;
	case errors::would_block =>
		let note = note_getstate(note);
		assert(note.recv == null); // TODO: support multiple receivers
		note.recv = sched::active;
		sched::active.status = task_status::BLOCKED;
		sched::yieldtask();
	};
};
```

Finally, that's the last piece.

### Threads in userspace

Phew! That was a lot of kernel pieces to unpack. And now for userspace... in the
next post! This one is getting pretty long. Here's what you have to look forward
to:

- Preparing the task and all of the objects it needs (such as a stack)
- High-level operations: join, detach, exit, suspend, etc
- Thread-local storage...
  - in the Hare compiler
  - in the ELF loader
  - at runtime
- Putting it all together to test the kernel

We'll see you next time!