Implementing-mime-in-xxxx.md - drewdevault.com - [mirror] blog and personal website of Drew DeVault

Implementing-mime-in-xxxx.md (18625B)

---
title: Implementing a MIME database in XXXX
date: 2022-01-28
---
<em>This is a (redacted) post from the internal blog of a new systems
programming language we're developing. The project is being kept under wraps
until we're done with it, so for this post I'll be calling it <span
class="redacted">XXXX</span>. If you are interested in participating, <a
href="mailto:sir@cmpwn.com">send me an email</a> with some details about your
background and I'll get back to you.</em>
<style>
.redacted {
  color: black;
  background: black;
}
</style>
Recently, I have been working on implementing a parser for media types (commonly
called MIME types) and a database which maps media types to file extensions and
vice-versa. I thought this would be an interesting module to blog about, given
that it's only about 250 lines of code, does something useful and interesting,
and demonstrates a few interesting <span class="redacted">xxxx</span> concepts.
The format for media types is more-or-less defined by [RFC 2045], specifically
[section 5.1]. The specification is not great. The grammar shown here is copied
and pasted from parts of larger grammars in older RFCs, RFCs which are equally
poorly defined. For example, the quoted-string nonterminal is never defined
here, but instead comes from RFC 822, which defines it but also states that it
can be "folded", which technically makes the following a valid Media Type:
```
text/plain;charset="hello
 world"
```
Or so I would presume, but the qtext terminal "cannot include CR", which is the
mechanism by which folding is performed in the first place, and... bleh. Let's
just implement a "reasonable subset" of the spec instead and side-step the whole
folding issue.[^1] This post will first cover parsing media types, then address
our second goal: providing a database which maps media types to file extensions
and vice versa.
[RFC 2045]: https://datatracker.ietf.org/doc/html/rfc2045
[section 5.1]: https://datatracker.ietf.org/doc/html/rfc2045#section-5.1
[^1]: Any time we implement a "reasonable subset" of a specification rather than the whole specification, I add the module to the list of modules likely to be moved out of the standard library and into a standalone module at some point prior to release. Another module on this list is our XML parser.
## Parsing Media Types
So, here's what we're going to implement today: we want to parse the following
string:
```
text/plain; charset=utf-8; foo="bar baz"
```
The code for that I came up with for is as follows:
```hare
// Parses a Media Type, returning a tuple of the content type (e.g.
// "text/plain") and a parameter parser object, or [[errors::invalid]] if the
// input cannot be parsed.
//
// To enumerate the Media Type parameter list, pass the type_params object into
// [[next_param]]. If you do not need the parameter list, you can safely discard
// the object. Note that any format errors following the ";" token will not
// cause [[errors::invalid]] to be returned unless [[next_param]] is used to
// enumerate all of the parameters.
export fn parse(in: str) ((str, type_params) | errors::invalid) = {
	const items = strings::cut(in, ";");
	const mtype = items.0, params = items.1;
	const items = strings::cut(mtype, "/");
	if (len(items.0) < 1 || len(items.1) < 1) {
		return errors::invalid;
	};
	typevalid(items.0)?;
	typevalid(items.1)?;
	return (mtype, strings::tokenize(params, ";"));
};
```
This function accepts a string as input, then returns a tagged union which
contains either a tuple of `(str, type_params)`, or a syntax error.
I designed this with particular attention to the memory management semantics.
<span class="redacted">xxxx</span> uses manual memory management, and if possible it's desirable to avoid
allocating any additional memory so that the user of our APIs remains in control
of the memory semantics. The return value is a sub-string borrowed from the
"text/plain" part, as well as a tokenizer which is prepared to split the
remainder of the string along the ";" tokens.
Inspiration for strings::cut comes from [Go]:
[Go]: https://github.com/golang/go/issues/46336
<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">cut</span>
<span class="c1">// Returns a string "cut" along the first instance of a delimiter, returning
</span><span class="c1">// everything up to the delimiter, and everything after the delimiter, in a
</span><span class="c1">// tuple.
</span><span class="c1">//
</span><span class="c1">//      strings::cut("hello=world=foobar", "=") // ("hello", "world=foobar")
</span><span class="c1">//      strings::cut("hello world", "=")        // ("hello world", "")
</span><span class="c1">//
</span><span class="c1">// The return value is borrowed from the 'in' parameter.
</span><span class="c1"></span><span class="k">fn</span> <span class="n">cut</span><span class="p">(</span><span class="n">in</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="p">(</span><span class="kt">str</span><span class="p">,</span> <span class="kt">str</span><span class="p">);</span>
</code></pre></div>
And strings::tokenize works like so:
<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">tokenize</span>
<span class="c1">// Returns a tokenizer which yields sub-strings tokenized by a delimiter.
</span><span class="c1">//
</span><span class="c1">//      let tok = strings::tokenize("hello, my name is drew", " ");
</span><span class="c1">//      assert(strings::next_token(tok) == "hello,");
</span><span class="c1">//      assert(strings::next_token(tok) == "my");
</span><span class="c1">//      assert(strings::next_token(tok) == "name");
</span><span class="c1">//      assert(strings::remaining_tokens(tok) == "is drew");
</span><span class="c1"></span><span class="k">fn</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">s</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="n">tokenizer</span><span class="p">;</span>
</code></pre></div>
The RFC limits the acceptable characters for the media type and subtype, which
we test with the typevalid function.
The user of this module often only cares about the media type and not its type
parameters, so the tokenizer can be safely abandoned on the stack to get cleaned
up when the stack frame exits if they don't care about the rest.
This is enough to write a little test:
```hare
@test fn parse() void = {
	const res = parse("text/plain")!;
	assert(res.0 == "text/plain");
	const res = parse("image/png")!;
	assert(res.0 == "image/png");
	const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
	assert(res.0 == "application/svg+xml");
};
```
<pre>$ <span class="redacted">xxxx</span> test mime::parse
mime::parse..................................... OK
1 passed; 0 failed; 1 tests completed in 0.10s</pre>
To handle the type parameters in the third case, we add this function:
```hare
// Returns the next parameter as a (key, value) tuple from a [[type_params]]
// object that was prepared via [[parse]], void if there are no remaining
// parameters, and [[errors::invalid]] if a syntax error was encountered.
export fn next_param(in: *type_params) ((str, str) | void | errors::invalid) = {
	const tok = match (strings::next_token(in)) {
	case let s: str =>
		if (s == "") {
			// empty parameter
			return errors::invalid;
		};
		yield s;
	case void =>
		return;
	};
	const items = strings::cut(tok, "=");
	// The RFC does not permit whitespace here, but whitespace is very
	// common in the wild. ¯\_(ツ)_/¯
	items.0 = strings::trim(items.0);
	items.1 = strings::trim(items.1);
	if (len(items.0) == 0 || len(items.1) == 0) {
		return errors::invalid;
	};
	if (strings::hasprefix(items.1, "\"")) {
		items.1 = quoted(items.1)?;
	};
	return (items.0, items.1);
};
```
This returns a (key, value) tuple and advances to the next parameter, or returns
void if there are no further parameters (or, if necessary, an error). This is
pretty straightforward: the tokenizer prepared by parse is splitting the string
on `;` tokens, so we first fetch the next token. We then use strings::cut again
to split it over the `=` token, and after a quick trim to fix another RFC
oversight, we can return it to the caller. Unless it's using this pesky
quoted-string terminal, which is where our implementation starts to show its
weaknesses:
```hare
fn quoted(in: str) (str | errors::invalid) = {
	// We have only a basic implementation of quoted-string. It has a couple
	// of problems:
	//
	// 1. The RFC does not define it very well
	// 2. The parts of the RFC which are ill-defined are rarely used
	// 3. Implementing quoted-pair would require allocating a new string
	//
	// This implementation should handle most Media Types seen in practice
	// unless they're doing something weird and ill-advised with them.
	in = strings::trim(in, '"');
	if (strings::contains(in, "\\")
			|| strings::contains(in, "\r")
			|| strings::contains(in, "\n")) {
		return errors::invalid;
	};
	return in;
};
```
I think this implementation speaks for itself. It could be a bit faster if we
didn't do 3 &times; O(n) strings::contains calls, but someone will send a patch
if they care. The completed test for this is:
```hare
@test fn parse() void = {
	const res = parse("text/plain")!;
	assert(res.0 == "text/plain");
	const res = parse("image/png")!;
	assert(res.0 == "image/png");
	const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
	assert(res.0 == "application/svg+xml");
	const params = res.1;
	const param = next_param(&params)! as (str, str);
	assert(param.0 == "charset" && param.1 == "utf-8");
	const param = next_param(&params)! as (str, str);
	assert(param.0 == "foo" && param.1 == "bar baz");
	assert(next_param(&params) is void);
	assert(parse("hi") is errors::invalid);
	assert(parse("text/ spaces ") is errors::invalid);
	assert(parse("text/@") is errors::invalid);
	const res = parse("text/plain;charset")!;
	assert(res.0 == "text/plain");
	assert(next_param(&res.1) is errors::invalid);
};
```
## The Media Type database
The second part of this module is the Media Type database. This comes in two
parts:
1. An internal database which is populated by <span class="redacted">xxxx</span> modules. For example, an
   image::png module might register the "image/png" mimetype with the internal
   MIME database, similar to protocol registration for net::dial.
2. A system-provided database, usually via /etc/mime.types, which is more
   comprehensive, but may not be available at runtime.
I plan on doing the second part later on, so for now we'll just focus on the
first; most of the interesting bits are there anyway.
Again, special consideration is given to memory management here. The essence of
a good <span class="redacted">xxxx</span> program or API design can be ascertained from how well it handles
memory management. As such, I have set aside separate lists to handle statically
allocated MIME info (such as those provided by image::png et al) versus the
forthcoming dynamically-allocated system database.
```hare
// A pair of a Media Type and a list of file extensions associated with it. The
// extension list does not include the leading '.' character.
export type mimetype = struct {
	mime: str,
	exts: []str,
};
// List of media types with statically allocated fields (though the list itself
// is dynamically allocated).
let static_db: []mimetype = [];
// List of media types with heap-allocated fields, used when loading mime types
// from the system database.
let heap_db: []mimetype = [];
const builtins: [_]mimetype = [
	mimetype {
		mime = "text/plain",
		exts = ["txt"],
	},
	mimetype {
		mime = "text/x-xxxx", // redacted for public blog post
		exts = ["xx"],
	},
];
@init fn init() void = {
	register(builtins...);
};
@fini fn fini() void = {
	for (let i = 0z; i < len(heap_db); i += 1) {
		free(heap_db[i].mime);
		strings::freeall(heap_db[i].exts);
	};
	free(heap_db);
	free(static_db);
};
```
The register function will be used from @init functions like this one to
register media types with the internal database. This code has minimal
allocations for the internal database, but we do actually do some allocating
here to store the "static_db" slice. In theory we could eliminate this by
statically provisioning a small number of slots to store the internal database
in, but for this use-case the trade-off makes sense. There are use-cases where
the trade-off does not make as much sense, however. For example, here's how the
command line arguments are stored for your program in the "os" module:
```hare
// The command line arguments provided to the program. By convention, the first
// member is usually the name of the program.
export let args: []str = [];
// Statically allocate arg strings if there are few enough arguments, saves a
// syscall if we don't need it.
let args_static: [32]str = [""...];
@init fn init_environ() void = {
	if (rt::argc < len(args_static)) {
		args = args_static[..rt::argc];
		for (let i = 0z; i < rt::argc; i += 1) {
			args[i] = strings::fromc(rt::argv[i]);
		};
	} else {
		args = alloc([], rt::argc);
		for (let i = 0z; i < rt::argc; i += 1) {
			append(args, strings::fromc(rt::argv[i]));
		};
	};
};
@fini fn fini_environ() void = {
	if (rt::argc >= len(args_static)) {
		free(args);
	};
};
```
A similar approach is also used on yyp's RISC-V kernel for [storing serial
devices][0] without any runtime memory allocations.
[0]: https://paste.sr.ht/~sircmpwn/acaa1e61e6bcb3e22e8b4bce7f233dcd844565eb
The internal database is likely to be small, but the system database is likely
to have a lot of media types and file extensions registered, so it makes sense
to build out an efficient means of accessing them. For this purpose I have
implemented a simple hash map. <span class="redacted">xxxx</span> does not have a built-in map construct, nor
generics. The design constraints of <span class="redacted">xxxx</span> are closer to C than to anything else,
and as such, the trade-offs for first-class maps are similar to C, which is to
say that they don't make sense with our design. However, this use-case does not
call for much sophistication, so a simple map will suffice.
```hare
use hash::fnv;
def MIME_BUCKETS: size = 256;
// Hash tables for efficient database lookup by mimetype or extension
let mimetable: [MIME_BUCKETS][]*mimetype = [[]...];
let exttable: [MIME_BUCKETS][]*mimetype = [[]...];
// Registers a Media Type and its extensions in the internal MIME database. This
// function is designed to be used by @init functions for modules which
// implement new Media Types.
export fn register(mime: mimetype...) void = {
	let i = len(static_db);
	append(static_db, mime...);
	for (i < len(static_db); i += 1) {
		const item = &static_db[i];
		const hash = fnv::string(item.mime);
		let bucket = &mimetable[hash % len(mimetable)];
		append(bucket, item);
		for (let i = 0z; i < len(item.exts); i += 1) {
			const hash = fnv::string(item.exts[i]);
			let bucket = &exttable[hash % len(exttable)];
			append(bucket, item);
		};
	};
};
```
A fixed-length array of slices is a common approach to hash tables in <span
class="redacted">xxxx</span>. It's
not a great design for hash tables whose size is not reasonably predictable in
advance or which need to be frequently resized and rehashed, but it is pretty
easy to implement and provides sufficient performance for use-cases like this. A
re-sizable hash table, or tables using an alternate hash function, or the use of
linked lists instead of slices, and so on &mdash; all of this is possible if the
use-case calls for it, but must be written by hand.
Finally, we implement the look-up functions, which are very simple:
```hare
// Looks up a Media Type based on the mime type string, returning null if
// unknown.
export fn lookup_mime(mime: str) const nullable *mimetype = {
	const hash = fnv::string(mime);
	const bucket = &mimetable[hash % len(mimetable)];
	for (let i = 0z; i < len(bucket); i += 1) {
		const item = bucket[i];
		if (item.mime == mime) {
			return item;
		};
	};
	return null;
};
// Looks up a Media Type based on a file extension, with or without the leading
// '.' character, returning null if unknown.
export fn lookup_ext(ext: str) const nullable *mimetype = {
	ext = strings::ltrim(ext, '.');
	const hash = fnv::string(ext);
	const bucket = &exttable[hash % len(exttable)];
	for (let i = 0z; i < len(bucket); i += 1) {
		const item = bucket[i];
		for (let j = 0z; j < len(item.exts); j += 1) {
			if (item.exts[j] == ext) {
				return item;
			};
		};
	};
	return null;
};
```
For the sake of completeness, here are the tests:
```hare
@test fn lookup_mime() void = {
	assert(lookup_mime("foo/bar") == null);
	const result = lookup_mime("text/plain");
	assert(result != null);
	const result = result: *mimetype;
	assert(result.mime == "text/plain");
	assert(len(result.exts) == 1);
	assert(result.exts[0] == "txt");
	const result = lookup_mime("text/x-xxxx");
	assert(result != null);
	const result = result: *mimetype;
	assert(result.mime == "text/x-xxxx");
	assert(len(result.exts) == 1);
	assert(result.exts[0] == "xx");
};
@test fn lookup_ext() void = {
	assert(lookup_ext("foo") == null);
	assert(lookup_ext(".foo") == null);
	const result = lookup_ext("txt");
	assert(result != null);
	const result = result: *mimetype;
	assert(result.mime == "text/plain");
	assert(len(result.exts) == 1);
	assert(result.exts[0] == "txt");
	const result = lookup_ext(".txt");
	assert(result != null);
	const result = result: *mimetype;
	assert(result.mime == "text/plain");
	const result = lookup_ext("xx");
	assert(result != null);
	const result = result: *mimetype;
	assert(result.mime == "text/x-xxxx");
	assert(len(result.exts) == 1);
	assert(result.exts[0] == "xx");
};
```
There you have it! I will later implement some code which parses /etc/mime.types
in @init and fills up the heap_db slice, and this lookup code should work with
it without any additional changes.