drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

The-case-against-fork.md (7842B)


  1. ---
  2. date: 2018-01-02
  3. layout: post
  4. title: fork is not my favorite syscall
  5. tags: [unix]
  6. ---
  7. This article has been on my to-write list for a while now. In my opinion, fork
  8. is one of the most questionable design choices of Unix. I don't understand the
  9. circumstances that led to its creation, and I grieve over the legacy rationale
  10. that keeps it alive to this day.
  11. Let's set the scene. It's 1971 and you're a fly on the wall in Bell Labs,
  12. watching the first edition of Unix being designed for the PDP-11/20. This
  13. machine has a 16-bit address space with no more than 248 kilobytes of memory.
  14. They're discussing how they're going to support programs that spawn new
  15. programs, and someone has a brilliant idea. "What if we copied the entire
  16. address space of the program into a new process running from the same spot, then
  17. let them overwrite themselves with the new program?" This got a rousing laugh
  18. out of everyone present, then they moved on to a better design which would
  19. become immortalized in the most popular and influential operating system of all
  20. time.
  21. At least, that's the story I'd like to have been told. In actual fact, the
  22. laughter becomes consensus. There's an obvious problem with this approach: every
  23. time you want to execute a new program, the entire process space is copied and
  24. promptly discarded when the new program begins. Usually when I complain about
  25. fork, this is the point when its supporters play the virtual memory card, pointing
  26. out that modern operating systems don't actually have to copy the whole address
  27. space. We'll get to that, but first: First Edition Unix *does* copy the
  28. whole process space, so this excuse wouldn't have held up at the time. By Fourth
  29. Edition Unix (the next one for which kernel sources survived), they had wised
  30. up a bit, and started only copying segments when they faulted.
  31. This model leads to a number of problems. One is that the new process inherits
  32. *all* of the parent's file descriptors, so you have to close them all before
  33. you exec another process. However, unless you're manually keeping tabs on your
  34. open file descriptors, there is no way to know what file handles you must close!
  35. The hack that solves this is `CLOEXEC`, the first of many hacks that deal with
  36. fork's poor design choices. This file descriptor problem balloons a bit -
  37. consider for example if you want to set up a pipe. You have to establish a piped
  38. pair of file descriptors in the parent, then close every fd *but* the pipe in
  39. the child, then `dup2` the pipe file descriptor over the (just closed)
  40. file descriptor 1. By this point you've probably had to do several non-trivial
  41. operations and utilize a handful of variables from the parent process space,
  42. which *hopefully* were on the stack so that we don't end up copying segments
  43. into the new process space anyway.
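
To make the pipe dance concrete, here is a minimal sketch in Python, whose `os` module wraps the same syscalls. The name `run_with_pipe` is mine and purely illustrative. One caveat: Python marks the descriptors returned by `os.pipe()` as close-on-exec by default, so the "close everything else" step is hidden here; in C you would have to do it by hand.

```python
import os


def run_with_pipe(argv):
    """Fork, wire the child's stdout to a pipe, exec argv.

    Returns (captured_output, exit_code). Illustrative helper, not a library API.
    """
    read_fd, write_fd = os.pipe()  # both ends are close-on-exec in Python
    pid = os.fork()
    if pid == 0:
        # Child: a full "copy" of the parent, about to be thrown away by exec.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)   # the duplicate on fd 1 survives the exec
            os.close(write_fd)
            os.execvp(argv[0], argv)
        finally:
            # Only reached if something above failed; never return into
            # the parent's code from the child.
            os._exit(127)

    # Parent: close our copy of the write end, or read() never sees EOF.
    os.close(write_fd)
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        code = -os.WTERMSIG(status)
    return b"".join(chunks), code
```

Calling `run_with_pipe(["echo", "hello"])` should hand back the child's output and its exit code. Note how much bookkeeping there is on both sides of the fork for what amounts to "run this and give me its stdout".
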
  44. These problems, however, pale in comparison to my number one complaint with the
  45. fork model. Fork is the direct cause of the *stupidest* component I've *ever*
  46. heard of in an operating system: the out-of-memory (aka OOM) killer. Say you
  47. have a process which is using half of the physical memory on your system, and
  48. wants to spawn a tiny program. Since fork "copies" the entire process, you might
  49. be inclined to think that this would make fork fail. But, on Linux and many
  50. other operating systems since, it does not fail! They agree that it's stupid to
  51. copy the entire process just to exec something else, but because fork is
  52. Important for Backwards Compatibility, they just fake it and reuse the same
  53. memory map (except read-only), then trap the faults and actually copy later.
  54. The hope is that the child will get on with it and exec before this happens.
  55. However, nothing prevents the child from doing something other than exec -
  56. it's free to use the memory space however it desires! This approach now leads to
  57. *memory overcommitment* - Linux has promised memory it does not have. As a
  58. result, when it really does run out of physical memory, Linux will just kill off
  59. processes until it has some memory back. Linux makes an awfully big fuss about
  60. "never breaking userspace" for a kernel that will lie about memory it doesn't
  61. have, then kill programs that try to use the back-alley memory they were given.
  62. That this nearly 50-year-old crappy design choice has come to this astonishes
  63. me.
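
The "fake it and copy later" behaviour is easy to observe from userspace. The sketch below is only a demonstration of the semantics, assuming a Unix-like system; the function name `child_write_is_private` and the 1 MiB buffer size are arbitrary choices of mine. It shows that nothing stops the child from scribbling over all the memory it "inherited", and that the parent never sees those writes, so the kernel has to come up with real pages at that moment.

```python
import os


def child_write_is_private(size=1 << 20):
    """Fork, let the child overwrite a buffer, and check the parent's copy.

    Returns (child_succeeded, parent_buffer_untouched). Illustrative only.
    """
    buf = bytearray(size)  # zero-filled buffer owned by the parent
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            # The child does something other than exec: it dirties every
            # page of the buffer, so the pages can no longer be shared.
            buf[:] = b"\xff" * size
            code = 0 if buf.count(0xFF) == size else 1
        finally:
            os._exit(code)

    _, status = os.waitpid(pid, 0)
    child_ok = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    # The parent still sees its original zeros.
    return child_ok, buf.count(0) == size
```

Scale the buffer up to half of physical memory and you have the scenario described above: the fork succeeds, and the bill only arrives when the child starts writing.
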
  64. Alas, I cannot rant forever without discussing the alternatives. There **are**
  65. better process models that have been developed since Unix!
  66. The first attempt I know of is BSD's `vfork` syscall, which is, in a nutshell,
  67. the same as fork but with severe limitations on what you do in the child process
  68. (i.e. nothing other than calling exec straight away). There are *loads* of
  69. problems with `vfork`. It only handles the most basic of use cases: you cannot
  70. set up a pipe, cannot set up a pty, and can't even close open file descriptors
  71. you inherited from the parent. Also, you couldn't really be sure of what
  72. variables you were and weren't editing or allowed to edit, considering the
  73. limitations of the C specification. Overall this syscall ended up being pretty
  74. useless.
  75. Another model is `posix_spawn`, which is a hell of an interface. It's far too
  76. complicated for me to detail here, and in my opinion far too complicated to ever
  77. consider using in practice. Even if it could be understood by mortals, it's a
  78. really bad implementation of the spawn paradigm: it basically operates
  79. like fork backwards, and inherits many of the same flaws. You still have to deal
  80. with children inheriting your file descriptors, for example, only now you do it
  81. in the parent process. It's also straight-up impossible to make a genuine pipe
  82. with `posix_spawn`. (*Note: a reader corrected me - this is indeed possible via
  83. posix_spawn_file_actions_adddup2*.)
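
For the record, here is roughly what the reader's correction looks like, sketched in Python. `os.posix_spawnp` and its `file_actions` tuples are Python's binding for `posix_spawnp` and the `posix_spawn_file_actions_*` calls (available in Python 3.8 and later); the wrapper name `spawn_with_pipe` is hypothetical. Treat it as a sketch under those assumptions rather than the canonical way to use the interface.

```python
import os


def spawn_with_pipe(argv):
    """Spawn argv with stdout connected to a pipe, without calling fork.

    Returns (captured_output, exit_code). Illustrative helper.
    """
    read_fd, write_fd = os.pipe()
    # The fd juggling is described up front, in the parent, and applied
    # in the child before the new program starts.
    file_actions = [
        (os.POSIX_SPAWN_CLOSE, read_fd),
        (os.POSIX_SPAWN_DUP2, write_fd, 1),
        (os.POSIX_SPAWN_CLOSE, write_fd),
    ]
    pid = os.posix_spawnp(argv[0], argv, os.environ,
                          file_actions=file_actions)

    os.close(write_fd)  # the parent must drop its write end to see EOF
    chunks = []
    while True:
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_fd)

    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
    else:
        code = -os.WTERMSIG(status)
    return b"".join(chunks), code
```

This is the "fork backwards" shape: the same close and dup2 steps, just queued up as a list of actions in the parent instead of being executed in the child.
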
  84. Let's talk about the good models - `rfork` and spawn (at least, if spawn is done
  85. right). `rfork` originated from plan9 and is a beautiful little coconut of a
  86. syscall, much like the rest of plan9. They also implement fork, but it's a
  87. special case of `rfork`. plan9 does not distinguish between processes and
  88. threads - all threads are processes and vice versa. However, new processes in
  89. plan9 are not the everything-must-go fuckfest of your typical fork call.
  90. Instead, you specify exactly what the child should get from you. You can choose
  91. to include (or not include) your memory space, file descriptors, environment, or
  92. a number of other things specific to plan9. There's a cool flag that makes it so
  93. you don't have to reap the process, too, which is nice because reaping children
  94. is another really stupid idea. It still has some problems, mainly around
  95. creating pipes without tremendous file descriptor fuckery, but it's basically as
  96. good as the fork model gets. Note: Linux offers this via the `clone` syscall
  97. now, but everyone just fork+execs anyway.
  98. The other model is the spawn model, which I prefer. This is the approach I took
  99. in my own kernel for KnightOS, and I think it's also used in NT (Microsoft's
  100. kernel). I don't really know much about NT, but I can tell you how it works in
  101. KnightOS. Basically, when you create a new process, it is kept in limbo until
  102. the parent consents to begin. You are given a handle with which you can
  103. configure the process - you can change its environment, load it up with file
  104. descriptors to your liking, and so on. When you're ready for it to begin, you
  105. give the go-ahead and it's off to the races. The spawn model has none of the
  106. flaws of fork.
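
To give a feel for the shape of that interface, here is a hypothetical sketch. It is *not* the KnightOS API: the class `PendingProcess` and its methods are invented for illustration, and it is emulated on top of `os.posix_spawnp` only so that the example actually runs on a Unix-like system. What matters is the flow of create, configure, then start.

```python
import os


class PendingProcess:
    """Hypothetical handle for a process that is described but not yet running."""

    def __init__(self, argv):
        self.argv = list(argv)
        self.env = {}            # the child gets only what we put here
        self.file_actions = []   # fds the child is explicitly handed
        self.pid = None

    def set_env(self, name, value):
        self.env[name] = value

    def give_fd(self, parent_fd, child_fd):
        # The child will see parent_fd under the number child_fd.
        self.file_actions.append((os.POSIX_SPAWN_DUP2, parent_fd, child_fd))

    def start(self):
        # The go-ahead: only now does the new program begin executing.
        self.pid = os.posix_spawnp(self.argv[0], self.argv, self.env,
                                   file_actions=self.file_actions)
        return self.pid

    def wait(self):
        _, status = os.waitpid(self.pid, 0)
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        return -os.WTERMSIG(status)
```

A caller would build a `PendingProcess`, call `set_env` and `give_fd` as needed, and then `start()` it. Nothing is inherited by accident, because the child starts from an explicit description rather than from a copy of the parent.
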
  107. Both fork and exec can be useful at times, but spawning is much better for 90%
  108. of their use-cases. If I were to write a new kernel today, I'd probably take a
  109. leaf from plan9's book and find a happy medium between `rfork` and spawn, so you
  110. could use spawn to start new threads in your process space as well. To the
  111. brave OS designers of the future, ready to shrug off the weight of legacy:
  112. please reconsider fork.