MIME database - drewdevault.com - [mirror] blog and personal website of Drew DeVault

commit: 314610c3e2cf89b0ba10b378fb85872fc1fe676e
parent 6c7da484381a4da993cca1486ba5c805f951e84d
Author: Drew DeVault <sir@cmpwn.com>
Date:   Fri, 28 Jan 2022 14:10:24 +0100

MIME database

Diffstat:
A content/blog/Implementing-mime-in-xxxx.md 482 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

1 file changed, 482 insertions(+), 0 deletions(-)
diff --git a/content/blog/Implementing-mime-in-xxxx.md b/content/blog/Implementing-mime-in-xxxx.md
@@ -0,0 +1,482 @@
+---
+title: Implementing a MIME database in XXXX
+date: 2022-01-28
+---
+
+<em>This is a (redacted) post from the internal blog of a new systems
+programming language we're developing. The project is being kept under wraps
+until we're done with it, so for this post I'll be calling it <span
+class="redacted">XXXX</span>. If you are interested in participating, <a
+href="mailto:sir@cmpwn.com">send me an email</a> with some details about your
+background and I'll get back to you.</em>
+
+<style>
+.redacted {
+  color: black;
+  background: black;
+}
+</style>
+
+Recently, I have been working on implementing a parser for media types (commonly
+called MIME types) and a database which maps media types to file extensions and
+vice-versa. I thought this would be an interesting module to blog about, given
+that it's only about 250 lines of code, does something useful and interesting,
+and demonstrates a few interesting <span class="redacted">xxxx</span> concepts.
+
+The format for media types is more-or-less defined by [RFC 2045], specifically
+[section 5.1]. The specification is not great. The grammar shown here is copied
+and pasted from parts of larger grammars in older RFCs, RFCs which are equally
+poorly defined. For example, the quoted-string nonterminal is never defined
+here, but instead comes from RFC 822, which defines it but also states that it
+can be "folded", which technically makes the following a valid Media Type:
+
+```
+text/plain;charset="hello
+ world"
+```
+
+Or so I would presume, but the qtext terminal "cannot include CR", which is the
+mechanism by which folding is performed in the first place, and... bleh. Let's
+just implement a "reasonable subset" of the spec instead and side-step the whole
+folding issue.[^1] This post will first cover parsing media types, then address
+our second goal: providing a database which maps media types to file extensions
+and vice versa.
+
+[RFC 2045]: https://datatracker.ietf.org/doc/html/rfc2045
+[section 5.1]: https://datatracker.ietf.org/doc/html/rfc2045#section-5.1
+[^1]: Any time we implement a "reasonable subset" of a specification rather than the whole specification, I add the module to the list of modules likely to be moved out of the standard library and into a standalone module at some point prior to release. Another module on this list is our XML parser.
+
+## Parsing Media Types
+
+So, here's what we're going to implement today: we want to parse the following
+string:
+
+```
+text/plain; charset=utf-8; foo="bar baz"
+```
+
+The code for that I came up with for is as follows:
+
+```hare
+// Parses a Media Type, returning a tuple of the content type (e.g.
+// "text/plain") and a parameter parser object, or [[errors::invalid]] if the
+// input cannot be parsed.
+//
+// To enumerate the Media Type parameter list, pass the type_params object into
+// [[next_param]]. If you do not need the parameter list, you can safely discard
+// the object. Note that any format errors following the ";" token will not
+// cause [[errors::invalid]] to be returned unless [[next_param]] is used to
+// enumerate all of the parameters.
+export fn parse(in: str) ((str, type_params) | errors::invalid) = {
+	const items = strings::cut(in, ";");
+	const mtype = items.0, params = items.1;
+	const items = strings::cut(mtype, "/");
+	if (len(items.0) < 1 || len(items.1) < 1) {
+		return errors::invalid;
+	};
+	typevalid(items.0)?;
+	typevalid(items.1)?;
+	return (mtype, strings::tokenize(params, ";"));
+};
+```
+
+This function accepts a string as input, then returns a tagged union which
+contains either a tuple of `(str, type_params)`, or a syntax error.
+
+I designed this with particular attention to the memory management semantics.
+<span class="redacted">xxxx</span> uses manual memory management, and if possible it's desirable to avoid
+allocating any additional memory so that the user of our APIs remains in control
+of the memory semantics. The return value is a sub-string borrowed from the
+"text/plain" part, as well as a tokenizer which is prepared to split the
+remainder of the string along the ";" tokens.
+
+Inspiration for strings::cut comes from [Go]:
+
+[Go]: https://github.com/golang/go/issues/46336
+
+<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">cut</span>
+<span class="c1">// Returns a string "cut" along the first instance of a delimiter, returning
+</span><span class="c1">// everything up to the delimiter, and everything after the delimiter, in a
+</span><span class="c1">// tuple.
+</span><span class="c1">//
+</span><span class="c1">//      strings::cut("hello=world=foobar", "=") // ("hello", "world=foobar")
+</span><span class="c1">//      strings::cut("hello world", "=")        // ("hello world", "")
+</span><span class="c1">//
+</span><span class="c1">// The return value is borrowed from the 'in' parameter.
+</span><span class="c1"></span><span class="k">fn</span> <span class="n">cut</span><span class="p">(</span><span class="n">in</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="p">(</span><span class="kt">str</span><span class="p">,</span> <span class="kt">str</span><span class="p">);</span>
+</code></pre></div>
+
+And strings::tokenize works like so:
+
+<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">tokenize</span>
+<span class="c1">// Returns a tokenizer which yields sub-strings tokenized by a delimiter.
+</span><span class="c1">//
+</span><span class="c1">//      let tok = strings::tokenize("hello, my name is drew", " ");
+</span><span class="c1">//      assert(strings::next_token(tok) == "hello,");
+</span><span class="c1">//      assert(strings::next_token(tok) == "my");
+</span><span class="c1">//      assert(strings::next_token(tok) == "name");
+</span><span class="c1">//      assert(strings::remaining_tokens(tok) == "is drew");
+</span><span class="c1"></span><span class="k">fn</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">s</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="n">tokenizer</span><span class="p">;</span>
+</code></pre></div>
+
+The RFC limits the acceptable characters for the media type and subtype, which
+we test with the typevalid function.
+
+The user of this module often only cares about the media type and not its type
+parameters, so the tokenizer can be safely abandoned on the stack to get cleaned
+up when the stack frame exits if they don't care about the rest.
+
+This is enough to write a little test:
+
+```hare
+@test fn parse() void = {
+	const res = parse("text/plain")!;
+	assert(res.0 == "text/plain");
+
+	const res = parse("image/png")!;
+	assert(res.0 == "image/png");
+
+	const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
+	assert(res.0 == "application/svg+xml");
+};
+```
+
+<pre>$ <span class="redacted">xxxx</span> test mime::parse
+mime::parse..................................... OK
+
+1 passed; 0 failed; 1 tests completed in 0.10s</pre>
+
+To handle the type parameters in the third case, we add this function:
+
+```hare
+// Returns the next parameter as a (key, value) tuple from a [[type_params]]
+// object that was prepared via [[parse]], void if there are no remaining
+// parameters, and [[errors::invalid]] if a syntax error was encountered.
+export fn next_param(in: *type_params) ((str, str) | void | errors::invalid) = {
+	const tok = match (strings::next_token(in)) {
+	case let s: str =>
+		if (s == "") {
+			// empty parameter
+			return errors::invalid;
+		};
+		yield s;
+	case void =>
+		return;
+	};
+
+	const items = strings::cut(tok, "=");
+	// The RFC does not permit whitespace here, but whitespace is very
+	// common in the wild. ¯\_(ツ)_/¯
+	items.0 = strings::trim(items.0);
+	items.1 = strings::trim(items.1);
+	if (len(items.0) == 0 || len(items.1) == 0) {
+		return errors::invalid;
+	};
+
+	if (strings::hasprefix(items.1, "\"")) {
+		items.1 = quoted(items.1)?;
+	};
+
+	return (items.0, items.1);
+};
+```
+
+This returns a (key, value) tuple and advances to the next parameter, or returns
+void if there are no further parameters (or, if necessary, an error). This is
+pretty straightforward: the tokenizer prepared by parse is splitting the string
+on `;` tokens, so we first fetch the next token. We then use strings::cut again
+to split it over the `=` token, and after a quick trim to fix another RFC
+oversight, we can return it to the caller. Unless it's using this pesky
+quoted-string terminal, which is where our implementation starts to show its
+weaknesses:
+
+```hare
+fn quoted(in: str) (str | errors::invalid) = {
+	// We have only a basic implementation of quoted-string. It has a couple
+	// of problems:
+	//
+	// 1. The RFC does not define it very well
+	// 2. The parts of the RFC which are ill-defined are rarely used
+	// 3. Implementing quoted-pair would require allocating a new string
+	//
+	// This implementation should handle most Media Types seen in practice
+	// unless they're doing something weird and ill-advised with them.
+	in = strings::trim(in, '"');
+	if (strings::contains(in, "\\")
+			|| strings::contains(in, "\r")
+			|| strings::contains(in, "\n")) {
+		return errors::invalid;
+	};
+	return in;
+};
+```
+
+I think this implementation speaks for itself. It could be a bit faster if we
+didn't do 3 &times; O(n) strings::contains calls, but someone will send a patch
+if they care. The completed test for this is:
+
+```hare
+@test fn parse() void = {
+	const res = parse("text/plain")!;
+	assert(res.0 == "text/plain");
+
+	const res = parse("image/png")!;
+	assert(res.0 == "image/png");
+
+	const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
+	assert(res.0 == "application/svg+xml");
+	const params = res.1;
+	const param = next_param(&params)! as (str, str);
+	assert(param.0 == "charset" && param.1 == "utf-8");
+	const param = next_param(&params)! as (str, str);
+	assert(param.0 == "foo" && param.1 == "bar baz");
+	assert(next_param(&params) is void);
+
+	assert(parse("hi") is errors::invalid);
+	assert(parse("text/ spaces ") is errors::invalid);
+	assert(parse("text/@") is errors::invalid);
+
+	const res = parse("text/plain;charset")!;
+	assert(res.0 == "text/plain");
+	assert(next_param(&res.1) is errors::invalid);
+};
+```
+
+## The Media Type database
+
+The second part of this module is the Media Type database. This comes in two
+parts:
+
+1. An internal database which is populated by <span class="redacted">xxxx</span> modules. For example, an
+   image::png module might register the "image/png" mimetype with the internal
+   MIME database, similar to protocol registration for net::dial.
+2. A system-provided database, usually via /etc/mime.types, which is more
+   comprehensive, but may not be available at runtime.
+
+I plan on doing the second part later on, so for now we'll just focus on the
+first; most of the interesting bits are there anyway.
+
+Again, special consideration is given to memory management here. The essence of
+a good <span class="redacted">xxxx</span> program or API design can be ascertained from how well it handles
+memory management. As such, I have set aside separate lists to handle statically
+allocated MIME info (such as those provided by image::png et al) versus the
+forthcoming dynamically-allocated system database.
+
+```hare
+// A pair of a Media Type and a list of file extensions associated with it. The
+// extension list does not include the leading '.' character.
+export type mimetype = struct {
+	mime: str,
+	exts: []str,
+};
+
+// List of media types with statically allocated fields (though the list itself
+// is dynamically allocated).
+let static_db: []mimetype = [];
+
+// List of media types with heap-allocated fields, used when loading mime types
+// from the system database.
+let heap_db: []mimetype = [];
+
+const builtins: [_]mimetype = [
+	mimetype {
+		mime = "text/plain",
+		exts = ["txt"],
+	},
+	mimetype {
+		mime = "text/x-xxxx", // redacted for public blog post
+		exts = ["xx"],
+	},
+];
+
+@init fn init() void = {
+	register(builtins...);
+};
+
+@fini fn fini() void = {
+	for (let i = 0z; i < len(heap_db); i += 1) {
+		free(heap_db[i].mime);
+		strings::freeall(heap_db[i].exts);
+	};
+	free(heap_db);
+	free(static_db);
+};
+```
+
+The register function will be used from @init functions like this one to
+register media types with the internal database. This code has minimal
+allocations for the internal database, but we do actually do some allocating
+here to store the "static_db" slice. In theory we could eliminate this by
+statically provisioning a small number of slots to store the internal database
+in, but for this use-case the trade-off makes sense. There are use-cases where
+the trade-off does not make as much sense, however. For example, here's how the
+command line arguments are stored for your program in the "os" module:
+
+```hare
+// The command line arguments provided to the program. By convention, the first
+// member is usually the name of the program.
+export let args: []str = [];
+
+// Statically allocate arg strings if there are few enough arguments, saves a
+// syscall if we don't need it.
+let args_static: [32]str = [""...];
+
+@init fn init_environ() void = {
+	if (rt::argc < len(args_static)) {
+		args = args_static[..rt::argc];
+		for (let i = 0z; i < rt::argc; i += 1) {
+			args[i] = strings::fromc(rt::argv[i]);
+		};
+	} else {
+		args = alloc([], rt::argc);
+		for (let i = 0z; i < rt::argc; i += 1) {
+			append(args, strings::fromc(rt::argv[i]));
+		};
+	};
+
+};
+
+@fini fn fini_environ() void = {
+	if (rt::argc >= len(args_static)) {
+		free(args);
+	};
+};
+```
+
+A similar approach is also used on yyp's RISC-V kernel for [storing serial
+devices][0] without any runtime memory allocations.
+
+[0]: https://paste.sr.ht/~sircmpwn/acaa1e61e6bcb3e22e8b4bce7f233dcd844565eb
+
+The internal database is likely to be small, but the system database is likely
+to have a lot of media types and file extensions registered, so it makes sense
+to build out an efficient means of accessing them. For this purpose I have
+implemented a simple hash map. <span class="redacted">xxxx</span> does not have a built-in map construct, nor
+generics. The design constraints of <span class="redacted">xxxx</span> are closer to C than to anything else,
+and as such, the trade-offs for first-class maps are similar to C, which is to
+say that they don't make sense with our design. However, this use-case does not
+call for much sophistication, so a simple map will suffice.
+
+```hare
+use hash::fnv;
+
+def MIME_BUCKETS: size = 256;
+
+// Hash tables for efficient database lookup by mimetype or extension
+let mimetable: [MIME_BUCKETS][]*mimetype = [[]...];
+let exttable: [MIME_BUCKETS][]*mimetype = [[]...];
+
+// Registers a Media Type and its extensions in the internal MIME database. This
+// function is designed to be used by @init functions for modules which
+// implement new Media Types.
+export fn register(mime: mimetype...) void = {
+	let i = len(static_db);
+	append(static_db, mime...);
+	for (i < len(static_db); i += 1) {
+		const item = &static_db[i];
+		const hash = fnv::string(item.mime);
+		let bucket = &mimetable[hash % len(mimetable)];
+		append(bucket, item);
+
+		for (let i = 0z; i < len(item.exts); i += 1) {
+			const hash = fnv::string(item.exts[i]);
+			let bucket = &exttable[hash % len(exttable)];
+			append(bucket, item);
+		};
+	};
+};
+```
+
+A fixed-length array of slices is a common approach to hash tables in <span
+class="redacted">xxxx</span>. It's
+not a great design for hash tables whose size is not reasonably predictable in
+advance or which need to be frequently resized and rehashed, but it is pretty
+easy to implement and provides sufficient performance for use-cases like this. A
+re-sizable hash table, or tables using an alternate hash function, or the use of
+linked lists instead of slices, and so on &mdash; all of this is possible if the
+use-case calls for it, but must be written by hand.
+
+Finally, we implement the look-up functions, which are very simple:
+
+```hare
+// Looks up a Media Type based on the mime type string, returning null if
+// unknown.
+export fn lookup_mime(mime: str) const nullable *mimetype = {
+	const hash = fnv::string(mime);
+	const bucket = &mimetable[hash % len(mimetable)];
+	for (let i = 0z; i < len(bucket); i += 1) {
+		const item = bucket[i];
+		if (item.mime == mime) {
+			return item;
+		};
+	};
+	return null;
+};
+
+// Looks up a Media Type based on a file extension, with or without the leading
+// '.' character, returning null if unknown.
+export fn lookup_ext(ext: str) const nullable *mimetype = {
+	ext = strings::ltrim(ext, '.');
+	const hash = fnv::string(ext);
+	const bucket = &exttable[hash % len(exttable)];
+	for (let i = 0z; i < len(bucket); i += 1) {
+		const item = bucket[i];
+		for (let j = 0z; j < len(item.exts); j += 1) {
+			if (item.exts[j] == ext) {
+				return item;
+			};
+		};
+	};
+	return null;
+};
+```
+
+For the sake of completeness, here are the tests:
+
+```hare
+@test fn lookup_mime() void = {
+	assert(lookup_mime("foo/bar") == null);
+
+	const result = lookup_mime("text/plain");
+	assert(result != null);
+	const result = result: *mimetype;
+	assert(result.mime == "text/plain");
+	assert(len(result.exts) == 1);
+	assert(result.exts[0] == "txt");
+
+	const result = lookup_mime("text/x-xxxx");
+	assert(result != null);
+	const result = result: *mimetype;
+	assert(result.mime == "text/x-xxxx");
+	assert(len(result.exts) == 1);
+	assert(result.exts[0] == "xx");
+};
+
+@test fn lookup_ext() void = {
+	assert(lookup_ext("foo") == null);
+	assert(lookup_ext(".foo") == null);
+
+	const result = lookup_ext("txt");
+	assert(result != null);
+	const result = result: *mimetype;
+	assert(result.mime == "text/plain");
+	assert(len(result.exts) == 1);
+	assert(result.exts[0] == "txt");
+
+	const result = lookup_ext(".txt");
+	assert(result != null);
+	const result = result: *mimetype;
+	assert(result.mime == "text/plain");
+
+	const result = lookup_ext("xx");
+	assert(result != null);
+	const result = result: *mimetype;
+	assert(result.mime == "text/x-xxxx");
+	assert(len(result.exts) == 1);
+	assert(result.exts[0] == "xx");
+};
+```
+
+There you have it! I will later implement some code which parses /etc/mime.types
+in @init and fills up the heap_db slice, and this lookup code should work with
+it without any additional changes.