commit: 314610c3e2cf89b0ba10b378fb85872fc1fe676e
parent 6c7da484381a4da993cca1486ba5c805f951e84d
Author: Drew DeVault <sir@cmpwn.com>
Date: Fri, 28 Jan 2022 14:10:24 +0100
MIME database
Diffstat:
1 file changed, 482 insertions(+), 0 deletions(-)
diff --git a/content/blog/Implementing-mime-in-xxxx.md b/content/blog/Implementing-mime-in-xxxx.md
@@ -0,0 +1,482 @@
+---
+title: Implementing a MIME database in XXXX
+date: 2022-01-28
+---
+
+<em>This is a (redacted) post from the internal blog of a new systems
+programming language we're developing. The project is being kept under wraps
+until we're done with it, so for this post I'll be calling it <span
+class="redacted">XXXX</span>. If you are interested in participating, <a
+href="mailto:sir@cmpwn.com">send me an email</a> with some details about your
+background and I'll get back to you.</em>
+
+<style>
+.redacted {
+ color: black;
+ background: black;
+}
+</style>
+
+Recently, I have been working on implementing a parser for media types (commonly
+called MIME types) and a database which maps media types to file extensions and
+vice-versa. I thought this would be an interesting module to blog about, given
+that it's only about 250 lines of code, does something useful and interesting,
+and demonstrates a few interesting <span class="redacted">xxxx</span> concepts.
+
+The format for media types is more-or-less defined by [RFC 2045], specifically
+[section 5.1]. The specification is not great. The grammar shown here is copied
+and pasted from parts of larger grammars in older RFCs, RFCs which are equally
+poorly defined. For example, the quoted-string nonterminal is never defined
+here, but instead comes from RFC 822, which defines it but also states that it
+can be "folded", which technically makes the following a valid Media Type:
+
+```
+text/plain;charset="hello
+ world"
+```
+
+Or so I would presume, but the qtext terminal "cannot include CR", which is the
+mechanism by which folding is performed in the first place, and... bleh. Let's
+just implement a "reasonable subset" of the spec instead and side-step the whole
+folding issue.[^1] This post will first cover parsing media types, then address
+our second goal: providing a database which maps media types to file extensions
+and vice versa.
+
+[RFC 2045]: https://datatracker.ietf.org/doc/html/rfc2045
+[section 5.1]: https://datatracker.ietf.org/doc/html/rfc2045#section-5.1
+[^1]: Any time we implement a "reasonable subset" of a specification rather than the whole specification, I add the module to the list of modules likely to be moved out of the standard library and into a standalone module at some point prior to release. Another module on this list is our XML parser.
+
+## Parsing Media Types
+
+So, here's what we're going to implement today: we want to parse the following
+string:
+
+```
+text/plain; charset=utf-8; foo="bar baz"
+```
+
+The code for that I came up with for is as follows:
+
+```hare
+// Parses a Media Type, returning a tuple of the content type (e.g.
+// "text/plain") and a parameter parser object, or [[errors::invalid]] if the
+// input cannot be parsed.
+//
+// To enumerate the Media Type parameter list, pass the type_params object into
+// [[next_param]]. If you do not need the parameter list, you can safely discard
+// the object. Note that any format errors following the ";" token will not
+// cause [[errors::invalid]] to be returned unless [[next_param]] is used to
+// enumerate all of the parameters.
+export fn parse(in: str) ((str, type_params) | errors::invalid) = {
+ const items = strings::cut(in, ";");
+ const mtype = items.0, params = items.1;
+ const items = strings::cut(mtype, "/");
+ if (len(items.0) < 1 || len(items.1) < 1) {
+ return errors::invalid;
+ };
+ typevalid(items.0)?;
+ typevalid(items.1)?;
+ return (mtype, strings::tokenize(params, ";"));
+};
+```
+
+This function accepts a string as input, then returns a tagged union which
+contains either a tuple of `(str, type_params)`, or a syntax error.
+
+I designed this with particular attention to the memory management semantics.
+<span class="redacted">xxxx</span> uses manual memory management, and if possible it's desirable to avoid
+allocating any additional memory so that the user of our APIs remains in control
+of the memory semantics. The return value is a sub-string borrowed from the
+"text/plain" part, as well as a tokenizer which is prepared to split the
+remainder of the string along the ";" tokens.
+
+Inspiration for strings::cut comes from [Go]:
+
+[Go]: https://github.com/golang/go/issues/46336
+
+<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">cut</span>
+<span class="c1">// Returns a string "cut" along the first instance of a delimiter, returning
+</span><span class="c1">// everything up to the delimiter, and everything after the delimiter, in a
+</span><span class="c1">// tuple.
+</span><span class="c1">//
+</span><span class="c1">// strings::cut("hello=world=foobar", "=") // ("hello", "world=foobar")
+</span><span class="c1">// strings::cut("hello world", "=") // ("hello world", "")
+</span><span class="c1">//
+</span><span class="c1">// The return value is borrowed from the 'in' parameter.
+</span><span class="c1"></span><span class="k">fn</span> <span class="n">cut</span><span class="p">(</span><span class="n">in</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="p">(</span><span class="kt">str</span><span class="p">,</span> <span class="kt">str</span><span class="p">);</span>
+</code></pre></div>
+
+And strings::tokenize works like so:
+
+<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">tokenize</span>
+<span class="c1">// Returns a tokenizer which yields sub-strings tokenized by a delimiter.
+</span><span class="c1">//
+</span><span class="c1">// let tok = strings::tokenize("hello, my name is drew", " ");
+</span><span class="c1">// assert(strings::next_token(tok) == "hello,");
+</span><span class="c1">// assert(strings::next_token(tok) == "my");
+</span><span class="c1">// assert(strings::next_token(tok) == "name");
+</span><span class="c1">// assert(strings::remaining_tokens(tok) == "is drew");
+</span><span class="c1"></span><span class="k">fn</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">s</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="n">tokenizer</span><span class="p">;</span>
+</code></pre></div>
+
+The RFC limits the acceptable characters for the media type and subtype, which
+we test with the typevalid function.
+
+The user of this module often only cares about the media type and not its type
+parameters, so the tokenizer can be safely abandoned on the stack to get cleaned
+up when the stack frame exits if they don't care about the rest.
+
+This is enough to write a little test:
+
+```hare
+@test fn parse() void = {
+ const res = parse("text/plain")!;
+ assert(res.0 == "text/plain");
+
+ const res = parse("image/png")!;
+ assert(res.0 == "image/png");
+
+ const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
+ assert(res.0 == "application/svg+xml");
+};
+```
+
+<pre>$ <span class="redacted">xxxx</span> test mime::parse
+mime::parse..................................... OK
+
+1 passed; 0 failed; 1 tests completed in 0.10s</pre>
+
+To handle the type parameters in the third case, we add this function:
+
+```hare
+// Returns the next parameter as a (key, value) tuple from a [[type_params]]
+// object that was prepared via [[parse]], void if there are no remaining
+// parameters, and [[errors::invalid]] if a syntax error was encountered.
+export fn next_param(in: *type_params) ((str, str) | void | errors::invalid) = {
+ const tok = match (strings::next_token(in)) {
+ case let s: str =>
+ if (s == "") {
+ // empty parameter
+ return errors::invalid;
+ };
+ yield s;
+ case void =>
+ return;
+ };
+
+ const items = strings::cut(tok, "=");
+ // The RFC does not permit whitespace here, but whitespace is very
+ // common in the wild. ¯\_(ツ)_/¯
+ items.0 = strings::trim(items.0);
+ items.1 = strings::trim(items.1);
+ if (len(items.0) == 0 || len(items.1) == 0) {
+ return errors::invalid;
+ };
+
+ if (strings::hasprefix(items.1, "\"")) {
+ items.1 = quoted(items.1)?;
+ };
+
+ return (items.0, items.1);
+};
+```
+
+This returns a (key, value) tuple and advances to the next parameter, or returns
+void if there are no further parameters (or, if necessary, an error). This is
+pretty straightforward: the tokenizer prepared by parse is splitting the string
+on `;` tokens, so we first fetch the next token. We then use strings::cut again
+to split it over the `=` token, and after a quick trim to fix another RFC
+oversight, we can return it to the caller. Unless it's using this pesky
+quoted-string terminal, which is where our implementation starts to show its
+weaknesses:
+
+```hare
+fn quoted(in: str) (str | errors::invalid) = {
+ // We have only a basic implementation of quoted-string. It has a couple
+ // of problems:
+ //
+ // 1. The RFC does not define it very well
+ // 2. The parts of the RFC which are ill-defined are rarely used
+ // 3. Implementing quoted-pair would require allocating a new string
+ //
+ // This implementation should handle most Media Types seen in practice
+ // unless they're doing something weird and ill-advised with them.
+ in = strings::trim(in, '"');
+ if (strings::contains(in, "\\")
+ || strings::contains(in, "\r")
+ || strings::contains(in, "\n")) {
+ return errors::invalid;
+ };
+ return in;
+};
+```
+
+I think this implementation speaks for itself. It could be a bit faster if we
+didn't do 3 × O(n) strings::contains calls, but someone will send a patch
+if they care. The completed test for this is:
+
+```hare
+@test fn parse() void = {
+ const res = parse("text/plain")!;
+ assert(res.0 == "text/plain");
+
+ const res = parse("image/png")!;
+ assert(res.0 == "image/png");
+
+ const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
+ assert(res.0 == "application/svg+xml");
+ const params = res.1;
+ const param = next_param(¶ms)! as (str, str);
+ assert(param.0 == "charset" && param.1 == "utf-8");
+ const param = next_param(¶ms)! as (str, str);
+ assert(param.0 == "foo" && param.1 == "bar baz");
+ assert(next_param(¶ms) is void);
+
+ assert(parse("hi") is errors::invalid);
+ assert(parse("text/ spaces ") is errors::invalid);
+ assert(parse("text/@") is errors::invalid);
+
+ const res = parse("text/plain;charset")!;
+ assert(res.0 == "text/plain");
+ assert(next_param(&res.1) is errors::invalid);
+};
+```
+
+## The Media Type database
+
+The second part of this module is the Media Type database. This comes in two
+parts:
+
+1. An internal database which is populated by <span class="redacted">xxxx</span> modules. For example, an
+ image::png module might register the "image/png" mimetype with the internal
+ MIME database, similar to protocol registration for net::dial.
+2. A system-provided database, usually via /etc/mime.types, which is more
+ comprehensive, but may not be available at runtime.
+
+I plan on doing the second part later on, so for now we'll just focus on the
+first; most of the interesting bits are there anyway.
+
+Again, special consideration is given to memory management here. The essence of
+a good <span class="redacted">xxxx</span> program or API design can be ascertained from how well it handles
+memory management. As such, I have set aside separate lists to handle statically
+allocated MIME info (such as those provided by image::png et al) versus the
+forthcoming dynamically-allocated system database.
+
+```hare
+// A pair of a Media Type and a list of file extensions associated with it. The
+// extension list does not include the leading '.' character.
+export type mimetype = struct {
+ mime: str,
+ exts: []str,
+};
+
+// List of media types with statically allocated fields (though the list itself
+// is dynamically allocated).
+let static_db: []mimetype = [];
+
+// List of media types with heap-allocated fields, used when loading mime types
+// from the system database.
+let heap_db: []mimetype = [];
+
+const builtins: [_]mimetype = [
+ mimetype {
+ mime = "text/plain",
+ exts = ["txt"],
+ },
+ mimetype {
+ mime = "text/x-xxxx", // redacted for public blog post
+ exts = ["xx"],
+ },
+];
+
+@init fn init() void = {
+ register(builtins...);
+};
+
+@fini fn fini() void = {
+ for (let i = 0z; i < len(heap_db); i += 1) {
+ free(heap_db[i].mime);
+ strings::freeall(heap_db[i].exts);
+ };
+ free(heap_db);
+ free(static_db);
+};
+```
+
+The register function will be used from @init functions like this one to
+register media types with the internal database. This code has minimal
+allocations for the internal database, but we do actually do some allocating
+here to store the "static_db" slice. In theory we could eliminate this by
+statically provisioning a small number of slots to store the internal database
+in, but for this use-case the trade-off makes sense. There are use-cases where
+the trade-off does not make as much sense, however. For example, here's how the
+command line arguments are stored for your program in the "os" module:
+
+```hare
+// The command line arguments provided to the program. By convention, the first
+// member is usually the name of the program.
+export let args: []str = [];
+
+// Statically allocate arg strings if there are few enough arguments, saves a
+// syscall if we don't need it.
+let args_static: [32]str = [""...];
+
+@init fn init_environ() void = {
+ if (rt::argc < len(args_static)) {
+ args = args_static[..rt::argc];
+ for (let i = 0z; i < rt::argc; i += 1) {
+ args[i] = strings::fromc(rt::argv[i]);
+ };
+ } else {
+ args = alloc([], rt::argc);
+ for (let i = 0z; i < rt::argc; i += 1) {
+ append(args, strings::fromc(rt::argv[i]));
+ };
+ };
+
+};
+
+@fini fn fini_environ() void = {
+ if (rt::argc >= len(args_static)) {
+ free(args);
+ };
+};
+```
+
+A similar approach is also used on yyp's RISC-V kernel for [storing serial
+devices][0] without any runtime memory allocations.
+
+[0]: https://paste.sr.ht/~sircmpwn/acaa1e61e6bcb3e22e8b4bce7f233dcd844565eb
+
+The internal database is likely to be small, but the system database is likely
+to have a lot of media types and file extensions registered, so it makes sense
+to build out an efficient means of accessing them. For this purpose I have
+implemented a simple hash map. <span class="redacted">xxxx</span> does not have a built-in map construct, nor
+generics. The design constraints of <span class="redacted">xxxx</span> are closer to C than to anything else,
+and as such, the trade-offs for first-class maps are similar to C, which is to
+say that they don't make sense with our design. However, this use-case does not
+call for much sophistication, so a simple map will suffice.
+
+```hare
+use hash::fnv;
+
+def MIME_BUCKETS: size = 256;
+
+// Hash tables for efficient database lookup by mimetype or extension
+let mimetable: [MIME_BUCKETS][]*mimetype = [[]...];
+let exttable: [MIME_BUCKETS][]*mimetype = [[]...];
+
+// Registers a Media Type and its extensions in the internal MIME database. This
+// function is designed to be used by @init functions for modules which
+// implement new Media Types.
+export fn register(mime: mimetype...) void = {
+ let i = len(static_db);
+ append(static_db, mime...);
+ for (i < len(static_db); i += 1) {
+ const item = &static_db[i];
+ const hash = fnv::string(item.mime);
+ let bucket = &mimetable[hash % len(mimetable)];
+ append(bucket, item);
+
+ for (let i = 0z; i < len(item.exts); i += 1) {
+ const hash = fnv::string(item.exts[i]);
+ let bucket = &exttable[hash % len(exttable)];
+ append(bucket, item);
+ };
+ };
+};
+```
+
+A fixed-length array of slices is a common approach to hash tables in <span
+class="redacted">xxxx</span>. It's
+not a great design for hash tables whose size is not reasonably predictable in
+advance or which need to be frequently resized and rehashed, but it is pretty
+easy to implement and provides sufficient performance for use-cases like this. A
+re-sizable hash table, or tables using an alternate hash function, or the use of
+linked lists instead of slices, and so on — all of this is possible if the
+use-case calls for it, but must be written by hand.
+
+Finally, we implement the look-up functions, which are very simple:
+
+```hare
+// Looks up a Media Type based on the mime type string, returning null if
+// unknown.
+export fn lookup_mime(mime: str) const nullable *mimetype = {
+ const hash = fnv::string(mime);
+ const bucket = &mimetable[hash % len(mimetable)];
+ for (let i = 0z; i < len(bucket); i += 1) {
+ const item = bucket[i];
+ if (item.mime == mime) {
+ return item;
+ };
+ };
+ return null;
+};
+
+// Looks up a Media Type based on a file extension, with or without the leading
+// '.' character, returning null if unknown.
+export fn lookup_ext(ext: str) const nullable *mimetype = {
+ ext = strings::ltrim(ext, '.');
+ const hash = fnv::string(ext);
+ const bucket = &exttable[hash % len(exttable)];
+ for (let i = 0z; i < len(bucket); i += 1) {
+ const item = bucket[i];
+ for (let j = 0z; j < len(item.exts); j += 1) {
+ if (item.exts[j] == ext) {
+ return item;
+ };
+ };
+ };
+ return null;
+};
+```
+
+For the sake of completeness, here are the tests:
+
+```hare
+@test fn lookup_mime() void = {
+ assert(lookup_mime("foo/bar") == null);
+
+ const result = lookup_mime("text/plain");
+ assert(result != null);
+ const result = result: *mimetype;
+ assert(result.mime == "text/plain");
+ assert(len(result.exts) == 1);
+ assert(result.exts[0] == "txt");
+
+ const result = lookup_mime("text/x-xxxx");
+ assert(result != null);
+ const result = result: *mimetype;
+ assert(result.mime == "text/x-xxxx");
+ assert(len(result.exts) == 1);
+ assert(result.exts[0] == "xx");
+};
+
+@test fn lookup_ext() void = {
+ assert(lookup_ext("foo") == null);
+ assert(lookup_ext(".foo") == null);
+
+ const result = lookup_ext("txt");
+ assert(result != null);
+ const result = result: *mimetype;
+ assert(result.mime == "text/plain");
+ assert(len(result.exts) == 1);
+ assert(result.exts[0] == "txt");
+
+ const result = lookup_ext(".txt");
+ assert(result != null);
+ const result = result: *mimetype;
+ assert(result.mime == "text/plain");
+
+ const result = lookup_ext("xx");
+ assert(result != null);
+ const result = result: *mimetype;
+ assert(result.mime == "text/x-xxxx");
+ assert(len(result.exts) == 1);
+ assert(result.exts[0] == "xx");
+};
+```
+
+There you have it! I will later implement some code which parses /etc/mime.types
+in @init and fills up the heap_db slice, and this lookup code should work with
+it without any additional changes.