The-case-against-fork.md - drewdevault.com - [mirror] blog and personal website of Drew DeVault

The-case-against-fork.md (7842B)

---
date: 2018-01-02
layout: post
title: fork is not my favorite syscall
tags: [unix]
---
This article has been on my to-write list for a while now. In my opinion, fork
is one of the most questionable design choices of Unix. I don't understand the
circumstances that led to its creation, and I grieve over the legacy rationale
that keeps it alive to this day.
Let's set the scene. It's 1971 and you're a fly on the wall in Bell Labs,
watching the first edition of Unix being designed for the PDP-11/20. This
machine has a 16-bit address space with no more than 248 kilobytes of memory.
They're discussing how they're going to support programs that spawn new
programs, and someone has a brilliant idea. "What if we copied the entire
address space of the program into a new process running from the same spot, then
let them overwrite themselves with the new program?" This got a rousing laugh
out of everyone present, then they moved on to a better design which would
become immortalized in the most popular and influential operating system of all
time.
At least, that's the story I'd like to have been told. In actual fact, the
laughter becomes consensus. There's an obvious problem with this approach: every
time you want to execute a new program, the entire process space is copied and
promptly discarded when the new program begins.  Usually when I complain about
fork, this the point when its supporters play the virtual memory card, pointing
out that modern operating systems don't actually have to copy the whole address
space. We'll get to that, but first &mdash; First Edition Unix *does* copy the
whole process space, so this excuse wouldn't have held up at the time. By Fourth
Edition Unix (the next one for which kernel sources survived), they had wisened
up a bit, and started only copying segments when they faulted.
This model leads to a number of problems. One is that the new process inherits
*all* of the parent's process descriptors, so you have to close them all before
you exec another process. However, unless you're manually keeping tabs on your
open file descriptors, there is no way to know what file handles you must close!
The hack that solves this is `CLOEXEC`, the first of many hacks that deal with
fork's poor design choices. This file descriptors problem balloons a bit -
consider for example if you want to set up a pipe. You have to establish a piped
pair of file descriptors in the parent, then close every fd *but* the pipe in
the child, then `dup2` the pipe file descriptor over the (now recently closed)
file descriptor 1. By this point you've probably had to do several non-trivial
operations and utilize a handful of variables from the parent process space,
which *hopefully* were on the stack so that we don't end up copying segments
into the new process space anyway.
These problems, however, pale in comparison to my number one complaint with the
fork model. Fork is the direct cause of the *stupidest* component I've *ever*
heard of in an operating system: the out-of-memory (aka OOM) killer. Say you
have a process which is using half of the physical memory on your system, and
wants to spawn a tiny program. Since fork "copies" the entire process, you might
be inclined to think that this would make fork fail. But, on Linux and many
other operating systems since, it does not fail! They agree that it's stupid to
copy the entire process just to exec something else, but because fork is
Important for Backwards Compatibility, they just fake it and reuse the same
memory map (except read-only), then trap the faults and actually copy later.
The hope is that the child will get on with it and exec before this happens.
However, nothing prevents the child from doing something other than exec -
it's free to use the memory space however it desires! This approach now leads to
*memory overcommittment* - Linux has promised memory it does not have. As a
result, when it really does run out of physical memory, Linux will just kill off
processes until it has some memory back. Linux makes an awfully big fuss about
"never breaking userspace" for a kernel that will lie about memory it doesn't
have, then kill programs that try to use the back-alley memory they were given.
That this nearly 50 year old crappy design choice has come to this astonishes
me.
Alas, I cannot rant forever without discussing the alternatives. There **are**
better process models that have been developed since Unix!
The first attempt I know of is BSD's `vfork` syscall, which is, in a nutshell,
the same as fork but with severe limitations on what you do in the child process
(i.e. nothing other than calling exec straight away). There are *loads* of
problems with `vfork`. It only handles the most basic of use cases: you cannot
set up a pipe, cannot set up a pty, and can't even close open file descriptors
you inherited from the parent. Also, you couldn't really be sure of what
variables you were and weren't editing or allowed to edit, considering the
limitations of the C specification. Overall this syscall ended up being pretty
useless.
Another model is `posix_spawn`, which is a hell of an interface. It's far too
complicated for me to detail here, and in my opinion far too complicated to ever
consider using in practice. Even if it could be understood by mortals, it's a
really bad implementation of the spawn paradigm &mdash; it basically operates
like fork backwards, and inherits many of the same flaws. You still have to deal
with children inheriting your file descriptors, for example, only now you do it
in the parent process. It's also straight-up impossible to make a genuine pipe
with `posix_spawn`. (*Note: a reader corrected me - this is indeed possible via
posix_spawn_file_actions_adddup2*.)
Let's talk about the good models - `rfork` and spawn (at least, if spawn is done
right). `rfork` originated from plan9 and is a beautiful little coconut of a
syscall, much like the rest of plan9. They also implement fork, but it's a
special case of `rfork`. plan9 does not distinguish between processes and
threads - all threads are processes and vice versa. However, new processes in
plan9 are not the everything-must-go fuckfest of your typical fork call.
Instead, you specify exactly what the child should get from you. You can choose
to include (or not include) your memory space, file descriptors, environment, or
a number of other things specific to plan9. There's a cool flag that makes it so
you don't have to reap the process, too, which is nice because reaping children
is another really stupid idea. It still has some problems, mainly around
creating pipes without tremendous file descriptor fuckery, but it's basically as
good as the fork model gets. Note: Linux offers this via the `clone` syscall
now, but everyone just fork+execs anyway.
The other model is the spawn model, which I prefer. This is the approach I took
in my own kernel for KnightOS, and I think it's also used in NT (Microsoft's
kernel). I don't really know much about NT, but I can tell you how it works in
KnightOS. Basically, when you create a new process, it is kept in limbo until
the parent consents to begin. You are given a handle with which you can
configure the process - you can change its environment, load it up with file
descriptors to your liking, and so on. When you're ready for it to begin, you
give the go-ahead and it's off to the races. The spawn model has none of the
flaws of fork.
Both fork and exec can be useful at times, but spawning is much better for 90%
of their use-cases. If I were to write a new kernel today, I'd probably take a
leaf from plan9's book and find a happy medium between `rfork` and spawn, so you
could use spawn to start new threads in your process space as well. To the
brave OS designers of the future, ready to shrug off the weight of legacy:
please reconsider fork.