---
title: "Notes from kernel hacking in Hare, part 2: multi-threading"
date: 2022-10-02
---

I have long promised that Hare would not have multi-threading, and it seems that
I have broken that promise. However, I have remained true to the
not-invented-here approach which is typical of my style by introducing it only
after designing an entire kernel to implement it on top of.[^1]

[^1]: Jokes aside, for those curious about multi-threading and Hare: our
official stance is not actually as strict as "no threads, period", though in
practice for many people it might amount to that. There is nothing stopping
you from linking to pthreads or calling clone(2) to spin up threads in a Hare
program, but the standard library explicitly provides no multi-threading
support, synchronization primitives, or re-entrancy guarantees. That's not to
say, however, that one could not build their own Hare standard library which
does offer these features — and, in fact, that is exactly what the
Vulcan test framework for Helios provides in its Hare libraries.

For some background, [Helios][0] is a micro-kernel written in Hare. In addition
to the kernel, the [Vulcan][1] system is a small userspace designed to test it.

[0]: https://git.sr.ht/~sircmpwn/helios
[1]: https://git.sr.ht/~sircmpwn/helios/tree/master/item/vulcan

![A picture of a laptop running Helios and showing the results of the Vulcan test suite](https://l.sr.ht/JIya.jpg)

While I don't anticipate multi-threaded processes playing a huge role in the
complete Ares system in the future, they do have a place. In the long term, I
would like to be able to provide an implementation of pthreads for porting
existing software to the system. A more immediate concern is how to test the
various kernel primitives provided by Helios, such as those which facilitate
inter-process communication (IPC). It's much easier to test these with threads
than with processes, since spawning threads does not require spinning up a new
address space.

```hare
@test fn notification::wait() void = {
	const note = helios::newnote()!;
	defer helios::destroy(note)!;
	// Spawn a thread which blocks on the notification, then signal it.
	const thread = threads::new(&notification_wait, note)!;
	threads::start(thread)!;
	defer threads::join(thread)!;
	helios::signal(note)!;
};

fn notification_wait(note: u64) void = {
	const note = note: helios::cap;
	helios::wait(note)!;
};
```

So how does it work? Let's split this up into two domains: kernelspace and
userspace.

### Threads in the kernel

The basic primitive for threads and processes in Helios is a "task", which is
simply an object which receives some CPU time. A task has a capability space (so
it can invoke operations against kernel objects), a virtual address space (so
it has somewhere to map the process image and memory), and some state, such as
the values of its CPU registers. The task-related structures are:

```hare
// A task capability.
export type task = struct {
	caps::capability,
	state: uintptr,
	@offset(caps::LINK_OFFS) link: caps::link,
};

// Scheduling status of a task.
export type task_status = enum {
	ACTIVE,
	BLOCKED, // XXX: Can a task be both blocked and suspended?
	SUSPENDED,
};

// State for a task.
export type taskstate = struct {
	regs: arch::state,
	cspace: caps::cslot,
	vspace: caps::cslot,
	ipc_buffer: uintptr,
	status: task_status,
	// XXX: This is a virtual address, should be physical
	next: nullable *taskstate,
	prev: nullable *taskstate,
};
```

Here's a footnote to explain some off-topic curiosities about this code:[^2]

[^2]: Capabilities are essentially references to kernel objects. The kernel
object for a task is the taskstate struct, and there can be many task
capabilities which refer to this. Any task which possesses a task capability in
its capability space can invoke operations against this task, such as reading
or writing its registers.<br /><br />
The link field is used to create a linked list of capabilities across the
system. It has a doubly linked list for the next and previous capability, and
a link to its parent capability, such as the memory capability from which the
task state was allocated. The list is organized such that copies of the same
capability are always adjacent to one another, and children always follow
their parents.
<br /><br />
The answer to the XXX comment in task\_status is yes, by the way. Something to
fix later.

The most interesting part of this structure is arch::state, which stores the
task's CPU registers. On x86\_64,[^3] this structure is defined as follows:

[^3]: Only x86\_64 is supported for now, but a RISC-V port is in-progress and I
intend to do arm64 in the future.

```hare
export type state = struct {
	fs: u64,
	fsbase: u64,
	r15: u64,
	r14: u64,
	r13: u64,
	r12: u64,
	r11: u64,
	r10: u64,
	r9: u64,
	r8: u64,
	rbp: u64,
	rdi: u64,
	rsi: u64,
	rdx: u64,
	rcx: u64,
	rbx: u64,
	rax: u64,
	intno: u64,
	errcode: u64,
	rip: u64,
	cs: u64,
	rflags: u64,
	rsp: u64,
	ss: u64,
};
```

This structure is organized in part according to hardware constraints and in
part at the discretion of the kernel implementer. The last five fields, from
%rip to %ss, are constrained by the hardware. When an interrupt occurs, the CPU
pushes each of these registers to the stack (starting with %ss and ending with
%rip, so that %rip lands at the lowest address), then transfers control to the
system interrupt handler. The next two registers serve a special purpose within
our interrupt implementation, and the remainder are ordered arbitrarily.

In order to switch between two tasks, we need to save all of this state
somewhere, then load the same state for another task when returning from the
kernel to userspace. The save/restore process is handled in the interrupt
handler, in assembly:

```
.global isr_common
isr_common:
	_swapgs
	push %rax
	push %rbx
	push %rcx
	push %rdx
	push %rsi
	push %rdi
	push %rbp
	push %r8
	push %r9
	push %r10
	push %r11
	push %r12
	push %r13
	push %r14
	push %r15
	// Note: fsbase is handled elsewhere
	push $0
	push %fs
	cld
	mov %rsp, %rdi
	mov $_kernel_stack_top, %rsp
	call arch.isr_handler

_isr_exit:
	mov %rax, %rsp
	// Note: fsbase is handled elsewhere
	pop %fs
	pop %r15	// discards the fsbase slot pushed above with "push $0"
	pop %r15
	pop %r14
	pop %r13
	pop %r12
	pop %r11
	pop %r10
	pop %r9
	pop %r8
	pop %rbp
	pop %rdi
	pop %rsi
	pop %rdx
	pop %rcx
	pop %rbx
	pop %rax
	_swapgs
	// Clean up error code and interrupt #
	add $16, %rsp
	iretq
```

I'm not going to go into too much detail on interrupts for this post (maybe in a
later post), but what's important here is the chain of push/pop instructions.
This automatically saves the CPU state for each task when entering the kernel.
The syscall handler has something similar.

This raises a question: where's the stack?

Helios has a single kernel stack,[^4] which is moved to %rsp from
$\_kernel\_stack\_top in this code. This is different from systems like Linux,
which have one kernel stack per thread; the rationale behind this design choice
is out of scope for this post.[^5] However, the "stack" being pushed to here is
not, in fact, a traditional stack.

[^4]: For now; in the future it will have one stack per CPU.

[^5]: Man, I could just go on and on and on.

x86\_64 has an interesting feature wherein an interrupt can be configured to use
a special "interrupt stack". The [task state segment][tss] is a bit of a
historical artifact which is of little interest to Helios, but in long mode
(64-bit mode) it serves a new purpose: to provide a list of addresses where
up to seven interrupt stacks are stored. The [interrupt descriptor table][idt]
includes a 3-bit "IST" field which, when nonzero, instructs the CPU to set the
stack pointer to the corresponding address in the TSS when that interrupt fires.
Helios sets all of these to one, then does something interesting:

[tss]: https://wiki.osdev.org/Task_State_Segment
[idt]: https://wiki.osdev.org/Interrupt_Descriptor_Table#Structure_on_x86-64

```hare
// Stores a pointer to the current state context.
export let context: **state = null: **state;

fn init_tss(i: size) void = {
	cpus[i].tstate = taskstate { ... };
	context = &cpus[i].tstate.ist[0]: **state;
};

// ...

export fn save() void = {
	// On x86_64, most registers are saved and restored by the ISR or
	// syscall service routines.
	let active = *context: *[*]state;
	let regs = &active[-1];
	regs.fsbase = rdmsr(0xC0000100);
};

export fn restore(regs: *state) void = {
	wrmsr(0xC0000100, regs.fsbase);
	const regs = regs: *[*]state;
	*context = &regs[1];
};
```

We store a pointer to the active task's state struct in the TSS when we enter
userspace, and when an interrupt occurs, the CPU automatically loads that
pointer into %rsp so we can trivially push all of the task's registers into it.

There is some weirdness to note here: the stack grows downwards. Each time you
push, the stack pointer is decremented, then the pushed value is written there.
So, we have to fill in this structure from the bottom up. Accordingly, we have
to do something a bit unusual here: we don't store a pointer to the context
object, but a pointer to the *end* of the context object. This is what
&active[-1] does here.

Hare has some memory safety features by default, such as bounds testing array
accesses. Here we have to take advantage of some of Hare's escape hatches to
accomplish the goal. First, we cast the pointer to an *unbounded array* of
states &mdash; that's what the \*\[\*\] is for. Then we can take the address of
element -1 without the compiler snitching on us.
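
As a toy illustration of this escape hatch (hypothetical code, not from the
kernel): once a pointer has been cast to an unbounded array type, indexing is
no longer checked, and keeping it within the underlying object becomes our
responsibility rather than the compiler's.

```hare
// Toy example: bounds checks apply to buf, but not to cells, so we can form
// a pointer one element past the end of the array (or, as in save() above,
// one element before a pointer we already hold).
fn one_past_end(buf: *[4]u64) *u64 = {
	const cells = buf: *[*]u64;
	return &cells[4];
};
```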

There is also a separate step here to save the fsbase register. This will be
important later.

This provides us with enough pieces to enter userspace:

```hare
// Immediately enters this task in userspace. Only used during system
// initialization.
export @noreturn fn enteruser(task: *caps::capability) void = {
	const state = objects::task_getstate(task);
	assert(objects::task_schedulable(state));
	active = state;
	objects::vspace_activate(&state.vspace)!;
	arch::restore(&state.regs);
	arch::enteruser();
};
```

What we need next is a scheduler, and a periodic interrupt to invoke it, so that
we can switch tasks every so often.

Scheduler design is a complex subject, with performance and complexity
implications ranging from subtle to substantial. For Helios's present needs we
use a simple round-robin scheduler: each task gets the same time slice and we
just switch to them one after another.

The easy part is simply getting periodic interrupts. Again, this blog post isn't
about interrupts, so I'll just give you the reader's digest:

```hare
arch::install_irq(arch::PIT_IRQ, &pit_irq);
arch::pit_setphase(100);

// ...

fn pit_irq(state: *arch::state, irq: u8) void = {
	sched::switchtask();
	arch::pic_eoi(arch::PIT_IRQ);
};
```

The PIT, or programmable interval timer, is a feature on x86\_64 which provides
exactly what we need: periodic interrupts. This code configures it to tick at
100 Hz (an interrupt every 10 ms) and sets up a little IRQ handler which calls
sched::switchtask to perform the actual context switch.
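
Setting the PIT's rate comes down to programming its divisor: the timer's
roughly 1.193182 MHz input clock is divided down to the requested frequency.
A rough sketch of how such a routine typically looks (hypothetical code, not
the kernel's actual implementation; outb stands in for whatever port I/O helper
the kernel provides):

```hare
// Sketch: program PIT channel 0 to fire hz interrupts per second.
// outb() is a stand-in for the kernel's port I/O helper.
fn pit_setphase(hz: u32) void = {
	const divisor = 1193182 / hz;
	outb(0x43, 0x36); // channel 0, lobyte/hibyte access, mode 3 (square wave)
	outb(0x40, (divisor & 0xff): u8); // divisor low byte
	outb(0x40, ((divisor >> 8) & 0xff): u8); // divisor high byte
};
```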

Recall that, by the time sched::switchtask is invoked, the CPU and interrupt
handler have already stashed all of the current task's registers into its state
struct. All we have to do now is pick out the next task and restore its state.

```hare
// see idle.s
let idle: arch::state;

// Switches to the next task.
export fn switchtask() void = {
	// Save state
	arch::save();

	match (next()) {
	case let task: *objects::taskstate =>
		active = task;
		objects::vspace_activate(&task.vspace)!;
		arch::restore(&task.regs);
	case null =>
		arch::restore(&idle);
	};
};

fn next() nullable *objects::taskstate = {
	let next = active.next;
	for (next != active) {
		if (next == null) {
			next = tasks;
			continue;
		};
		const cand = next as *objects::taskstate;
		if (objects::task_schedulable(cand)) {
			return cand;
		};
		next = cand.next;
	};
	const next = next as *objects::taskstate;
	if (objects::task_schedulable(next)) {
		return next;
	};
	return null;
};
```

Pretty straightforward. The scheduler [maintains a linked list][list] of tasks,
picks the next one which is schedulable,[^6] then runs it. If there are no
schedulable tasks, it runs the idle task.

[list]: https://git.sr.ht/~sircmpwn/helios/tree/673c27e57684deeedbe88e1e6b7b940fdca5fc87/item/sched/tasks.ha

[^6]: A task is schedulable if it is configured properly (with a cspace, vspace,
and IPC buffer) and is not currently blocking (i.e. waiting on I/O or
something).
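
Spelled out in code, that schedulability check presumably amounts to something
like this (a hypothetical sketch, not the kernel's actual routine; configured()
stands in for however the real code validates the capability slots):

```hare
// Hypothetical sketch of objects::task_schedulable as described in the
// footnote above; configured() is a stand-in, not a real helper.
export fn task_schedulable(task: *taskstate) bool = {
	return configured(&task.cspace)
		&& configured(&task.vspace)
		&& task.ipc_buffer != 0: uintptr
		&& task.status == task_status::ACTIVE;
};
```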

Err, wait, what's the idle task? Simple: it's another state object (i.e. a set
of CPU registers) which essentially works as a statically allocated do-nothing
thread.

```hare
const idle_frame: [2]uintptr = [0, &pause: uintptr];

// Initializes the state for the idle thread.
export fn init_idle(idle: *state) void = {
	*idle = state {
		cs = seg::KCODE << 3,
		ss = seg::KDATA << 3,
		rflags = (1 << 21) | (1 << 9),
		rip = &pause: uintptr: u64,
		rbp = &idle_frame: uintptr: u64,
		...
	};
};
```
  344. "pause" is a simple loop:
  345. ```hare
  346. .global arch.pause
  347. arch.pause:
  348. hlt
  349. jmp arch.pause
  350. ```

In the situation where every task is blocking on I/O, there's nothing for the
CPU to do until the operation finishes. So, we simply halt and wait for the next
interrupt to wake us back up, hopefully unblocking some tasks so we can schedule
them again. A more sophisticated kernel might take this opportunity to go into a
lower power state, perhaps, but for now this is quite sufficient.

With this last piece in place, we now have a multi-threaded operating system.
But there is one more piece to consider: when a task yields its time slice.

Just because a task receives CPU time does not mean that it needs to use it. A
task which has nothing useful to do can yield its time slice back to the kernel
through the "yieldtask" syscall. On the face of it, this is quite simple:

```hare
// Yields the current time slice and switches to the next task.
export @noreturn fn yieldtask() void = {
	arch::sysret_set(&active.regs, 0, 0);
	switchtask();
	arch::enteruser();
};
```

The "sysret\_set" call updates the registers in the task state which correspond
to system call return values, setting them to (0, 0) to indicate a successful
return from the yield syscall. But we don't actually return at all: we switch
to the next task and then return to *that*.
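
Which registers carry those return values is an ABI decision; purely as an
illustration (the choice of %rax and %rdx here is an assumption, not Helios's
documented ABI), such a helper might look like:

```hare
// Illustrative sketch only: stash a syscall's two return values in the
// registers the (assumed) syscall ABI uses for returns.
fn sysret_set(regs: *state, a: u64, b: u64) void = {
	regs.rax = a;
	regs.rdx = b;
};
```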

In addition to being called from userspace, this is also useful whenever the
kernel blocks a thread on some I/O or IPC operation. For example, tasks can wait
on "notification" objects, which another task can signal to wake them up &mdash;
a simple synchronization primitive. The implementation makes good use of
sched::yieldtask:

```hare
// Blocks the active task until this notification is signalled. Does not return
// if the operation is blocking.
export fn wait(note: *caps::capability) uint = {
	match (nbwait(note)) {
	case let word: uint =>
		return word;
	case errors::would_block =>
		let note = note_getstate(note);
		assert(note.recv == null); // TODO: support multiple receivers
		note.recv = sched::active;
		sched::active.status = task_status::BLOCKED;
		sched::yieldtask();
	};
};
```

Finally, that's the last piece.

### Threads in userspace

Phew! That was a lot of kernel pieces to unpack. And now for userspace... in the
next post! This one is getting pretty long. Here's what you have to look forward
to:

- Preparing the task and all of the objects it needs (such as a stack)
- High-level operations: join, detach, exit, suspend, etc
- Thread-local storage...
  - in the Hare compiler
  - in the ELF loader
  - at runtime
- Putting it all together to test the kernel

We'll see you next time!