drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

---
date: 2014-02-02
layout: post
title: The bug that hides from breakpoints
tags: [KnightOS, kernel hacking]
---

This is the story of the most difficult bug I ever had to solve. See if you can
figure it out before the conclusion.

### Background

For some years now, I've worked on a kernel for Texas
Instruments calculators called [KnightOS](https://github.com/KnightOS/kernel).
This kernel is written entirely in assembly, and targets the old-school z80
processor from back in 1976. This classic processor was built without any
concept of protection rings. It's an 8-bit processor, with 150-some instructions
and (in this application) 32K of RAM and 32K of Flash. This stuff is so old, I
ended up writing most of the KnightOS toolchain from scratch rather than try to
get archaic assemblers and compilers running on modern systems.

When you're working in an environment like this, there's no separation between
kernel and userland. All "userspace" programs run as root, and crashing the entire
system is a simple task. All the memory my kernel sets aside for the
process table, memory ownership, file handles, stacks, or any other executing
process - any program can modify this freely. Of course, we have to rely on the
userland to play nice, and it usually does. But when there are bugs, they can be a
real pain in the ass to hunt down.

### The elusive bug

The original bug report: **When running the counting demo and switching between
applications, the thread list graphics become corrupted.**

I can reproduce this problem, so I settle into my development environment and I
set a breakpoint near the thread list's graphical code. I fire up the emulator and
repeat the steps... but it doesn't happen. This happened consistently: **the bug
was not reproducible when a breakpoint was set**. Keep in mind, I'm running this
in a z80 emulator, so the environment is supposedly no different. There's no
debugger attached here.

Though this is quite strange, I don't immediately despair. Instead of a formal
breakpoint, I try setting a "breakpoint" by dropping an infinite loop into the
code. I figure that I can halt the program flow manually and open the
debugger to inspect the problem. However, the bug wouldn't be tamed quite so
easily. The bug was unreproducible when I had this pseudo-breakpoint in place,
too.

At this point, I started to get a little frustrated. How do I debug a problem that
disappears when you debug it? I decided to try and find out what caused it after
it had taken place, by setting the breakpoint to be hit only after the graphical
corruption happened. Here, I gained some ground. I was able to reproduce it, and
*then* halt the machine, and I could examine memory and such after the bug was
given a chance to have its way over the system.

I discovered the reason the graphics were being corrupted. The kernel kept the
length of the process table at a fixed address. The thread list, in order to draw
the list of active threads, looks to this value to determine how many threads it
should draw. Well, when the bug occurred, the value was too high! The thread list
was drawing threads that did not exist, and the text rendering puked non-ASCII
characters all over the display. But why was that value being corrupted?

It was an oddly specific address to change. None of the surrounding memory was
touched. Making it even more odd were the very specific conditions this happened
under - only when the counting demo was running. I asked myself, "what makes the
counting demo unique?" It hit me after a moment of thought. The counting demo
existed to demonstrate non-suspendable threads. The kernel would stop executing
threads (or "suspend" them) when they lost focus, in an attempt to keep the
system's very limited resources available. The counting demo was marked as
non-suspendable, a feature that had been implemented a few months prior. It
showed a number on the screen that counted up forever, and the idea was that you
could go give some other application focus, come back, and the number would have
been counting up while you were away. A background task, if you will.

A more accurate description of the bug emerged: "the length of the kernel process
table gets corrupted when launching the thread list when a non-suspendable thread
is running". What followed was hours and hours of crawling through the hundreds of
lines of assembly between summoning the thread list, and actually seeing it. I'll
spare you the details, because they are very boring. We'll pick the story back up
at the point where I had isolated the area in which it occurred: applib.

The KnightOS userland offered "applib", a library of common functions applications
would need to get the general UX of the system. Among these was the function
`applibGetKey`, which was a wrapper around the kernel's `getKey` function. The
idea was that it would work the same way (return the last key pressed), but for
special keys, it would do the appropriate action for you. For example, if you
pressed the F5 key, it would suspend the current thread and launch the thread
list. This is the mechanism with which most applications transfer control out of
their own thread and into the thread list.

Eager that I had found the source of the issue, I placed a breakpoint nearby. That
same issue from before struck again - the bug vanished when the breakpoint was
set. I tried a more creative approach: instead of using a proper breakpoint, I
asked the emulator to halt whenever that address was written to. Even still - the
bug hid itself whenever this happened.

I decided to dive into the kernel's getKey function. Here's the start of the
function, as it appeared at the time:

```
getKey:
    call hasKeypadLock
    jr _
    xor a
    ret
_:  push bc
    ; ...
```

I started going through this code line-by-line, trying to see if there was
anything here that could conceivably touch the thread table. I noticed a minor
error here, and corrected it without thinking:

```
getKey:
    call hasKeypadLock
    jr z, _
    xor a
    ret
_:  push bc
    ; ...
```

The simple error I had corrected: getKey was pressing forward, even when the
current thread didn't have control of the keyboard hardware. This was a silly
error - only two characters were omitted, the `z,` that makes the jump conditional.

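
For readers who don't speak z80 assembly, here is a rough C rendering of the two versions of that prologue. It is only a sketch: the thread ids, the key codes, and the two globals are hypothetical stand-ins, and it assumes (as the corrected `jr z, _` suggests) that `hasKeypadLock` reports whether the current thread owns the keypad.

```c
#include <stdbool.h>

/* Hypothetical key codes; the real values are not given in this post. */
enum { KEY_NONE = 0, KEY_F5 = 5 };

/* Hypothetical stand-ins for kernel and hardware state. */
int keypad_lock_owner;  /* id of the thread that owns the keypad */
int key_held;           /* key currently held down, or KEY_NONE */

/* Models hasKeypadLock: true when `thread` owns the keypad. */
bool has_keypad_lock(int thread) {
    return thread == keypad_lock_owner;
}

/* Before the fix: the unconditional `jr _` always skips the early return,
 * so the result of the lock check is never acted upon. */
int get_key_before(int thread) {
    (void)has_keypad_lock(thread);  /* computed, then ignored */
    return key_held;
}

/* After the fix: `jr z, _` only continues when the caller owns the keypad;
 * otherwise the function returns 0 (`xor a` / `ret`). */
int get_key_after(int thread) {
    if (!has_keypad_lock(thread)) {
        return KEY_NONE;
    }
    return key_held;
}
```

In this model, a thread that does not own the keypad still sees the held key through `get_key_before`, but gets `KEY_NONE` from `get_key_after`.
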
A moment after I fixed that issue, the answer set in - this was the source of the
entire problem. Confirming it, I booted up the emulator with this change applied
and the bug was indeed resolved.

Can you guess what happened here? Here's the other piece of the puzzle to help you
out, translated more or less into C for readability:

```c
int applibGetKey() {
    int key = getKey();
    if (key == KEY_F5) {
        launch_threadlist();
        suspend_thread();
    }
    return key;
}
```

Two more details you might not have picked up on:

* applibGetKey is non-blocking
* suspend_thread suspends the current thread immediately, so it doesn't return
  until the thread resumes.

### The bug, uncovered

Here's what actually happened. For most threads (the suspendable kind), that
thread stops processing when `suspend_thread()` is called. The usually
non-blocking applibGetKey function blocks until the thread is resumed in this
scenario. However, the counting demo was *non-suspendable*. For such a thread,
the suspend_thread function has no effect, by design. So, suspend_thread did not
block, and the keypress was returned straight away. By this point, the thread
list had launched properly and it was given control of the keyboard.

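
The difference between the two kinds of thread can be sketched in a few lines of C. This is a simplified model, not the kernel's code: the `struct thread` layout and the boolean return value are assumptions made for illustration, and "blocking" is modelled as a flag the scheduler would consult.

```c
#include <stdbool.h>

/* Hypothetical thread record; the kernel's real layout is not shown here. */
struct thread {
    bool suspendable;  /* false for threads like the counting demo */
    bool suspended;    /* true while the scheduler skips this thread */
};

/* Models suspend_thread. Returns true if the thread actually stopped.
 * A suspended thread is not scheduled again until it is resumed, which
 * is what makes the call appear to block from the caller's side. */
bool suspend_thread(struct thread *t) {
    if (!t->suspendable) {
        return false;  /* no effect, by design */
    }
    t->suspended = true;
    return true;
}
```

For a non-suspendable thread the call falls straight through, so whatever follows it (here, returning the key to the caller) runs immediately.
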
However, the counting demo went back into its main loop, and started calling
applibGetKey again. Since the average user's finger remained pressed against the
button for a few moments more, applibGetKey *continued to launch the thread list,
over and over*. The thread list itself is a special thread, and it doesn't
actually have a user-friendly name. It was designed to ignore itself when it drew
the active threads. However, it was *not* designed to ignore other instances of
itself, the reason being that there would never be two of them running at once.
When attempting to draw these other instances, the thread list started rendering
text that wasn't there, causing the corruption.

This bug vanished whenever I set a breakpoint because the breakpoint halted the
whole system, keyboard processing included. By the time I let it move on, I had
already lifted my finger from the key.

The solution was to make the kernel's getKey function respect hardware locks by
fixing that simple, two-character typo. That way, the counting demo, which had no
right to know what keys were being pressed, would not know that the key was still
being pressed.

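
The whole chain of events can be replayed in a small, single-threaded C simulation. It is a sketch under several assumptions: the thread ids and key codes are made up, each loop iteration stands for one pass of the counting demo's main loop, and the thread list is assumed to keep the keypad for the whole simulated window.

```c
#include <stdbool.h>

enum { KEY_NONE = 0, KEY_F5 = 5 };                /* hypothetical key codes */
enum { DEMO_THREAD = 1, THREADLIST_THREAD = 2 };  /* hypothetical thread ids */

/*
 * Counts how many thread lists a non-suspendable thread launches while it
 * polls the keypad `polls` times, with F5 held down for the first
 * `held_polls` of those polls. `respect_lock` selects the fixed getKey.
 */
int count_threadlist_launches(int polls, int held_polls, bool respect_lock) {
    int lock_owner = DEMO_THREAD;  /* the demo starts in the foreground */
    int launches = 0;

    for (int i = 0; i < polls; i++) {
        int key = (i < held_polls) ? KEY_F5 : KEY_NONE;

        /* getKey: once fixed, a thread without the keypad lock sees nothing. */
        if (respect_lock && lock_owner != DEMO_THREAD) {
            key = KEY_NONE;
        }

        /* applibGetKey: F5 launches the thread list, which takes the keypad. */
        if (key == KEY_F5) {
            launches++;
            lock_owner = THREADLIST_THREAD;
            /* suspend_thread() would be called here, but it is a no-op for
             * a non-suspendable thread, so the loop simply keeps polling. */
        }
    }
    return launches;
}
```

With the key held for five polls, `count_threadlist_launches(10, 5, false)` returns 5 (one thread list per poll), while `count_threadlist_launches(10, 5, true)` returns 1. Setting `held_polls` to 1 models the breakpoint case, where the finger is lifted before the next poll: even the unfixed version then launches only one thread list.
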
The debugging described by this blog post took approximately three weeks.

[Discussion on Hacker News](https://news.ycombinator.com/item?id=7688700)