logo

drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git
commit: 314610c3e2cf89b0ba10b378fb85872fc1fe676e
parent 6c7da484381a4da993cca1486ba5c805f951e84d
Author: Drew DeVault <sir@cmpwn.com>
Date:   Fri, 28 Jan 2022 14:10:24 +0100

MIME database

Diffstat:

Acontent/blog/Implementing-mime-in-xxxx.md482+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 482 insertions(+), 0 deletions(-)

diff --git a/content/blog/Implementing-mime-in-xxxx.md b/content/blog/Implementing-mime-in-xxxx.md @@ -0,0 +1,482 @@ +--- +title: Implementing a MIME database in XXXX +date: 2022-01-28 +--- + +<em>This is a (redacted) post from the internal blog of a new systems +programming language we're developing. The project is being kept under wraps +until we're done with it, so for this post I'll be calling it <span +class="redacted">XXXX</span>. If you are interested in participating, <a +href="mailto:sir@cmpwn.com">send me an email</a> with some details about your +background and I'll get back to you.</em> + +<style> +.redacted { + color: black; + background: black; +} +</style> + +Recently, I have been working on implementing a parser for media types (commonly +called MIME types) and a database which maps media types to file extensions and +vice-versa. I thought this would be an interesting module to blog about, given +that it's only about 250 lines of code, does something useful and interesting, +and demonstrates a few interesting <span class="redacted">xxxx</span> concepts. + +The format for media types is more-or-less defined by [RFC 2045], specifically +[section 5.1]. The specification is not great. The grammar shown here is copied +and pasted from parts of larger grammars in older RFCs, RFCs which are equally +poorly defined. For example, the quoted-string nonterminal is never defined +here, but instead comes from RFC 822, which defines it but also states that it +can be "folded", which technically makes the following a valid Media Type: + +``` +text/plain;charset="hello + world" +``` + +Or so I would presume, but the qtext terminal "cannot include CR", which is the +mechanism by which folding is performed in the first place, and... bleh. Let's +just implement a "reasonable subset" of the spec instead and side-step the whole +folding issue.[^1] This post will first cover parsing media types, then address +our second goal: providing a database which maps media types to file extensions +and vice versa. + +[RFC 2045]: https://datatracker.ietf.org/doc/html/rfc2045 +[section 5.1]: https://datatracker.ietf.org/doc/html/rfc2045#section-5.1 +[^1]: Any time we implement a "reasonable subset" of a specification rather than the whole specification, I add the module to the list of modules likely to be moved out of the standard library and into a standalone module at some point prior to release. Another module on this list is our XML parser. + +## Parsing Media Types + +So, here's what we're going to implement today: we want to parse the following +string: + +``` +text/plain; charset=utf-8; foo="bar baz" +``` + +The code for that I came up with for is as follows: + +```hare +// Parses a Media Type, returning a tuple of the content type (e.g. +// "text/plain") and a parameter parser object, or [[errors::invalid]] if the +// input cannot be parsed. +// +// To enumerate the Media Type parameter list, pass the type_params object into +// [[next_param]]. If you do not need the parameter list, you can safely discard +// the object. Note that any format errors following the ";" token will not +// cause [[errors::invalid]] to be returned unless [[next_param]] is used to +// enumerate all of the parameters. +export fn parse(in: str) ((str, type_params) | errors::invalid) = { + const items = strings::cut(in, ";"); + const mtype = items.0, params = items.1; + const items = strings::cut(mtype, "/"); + if (len(items.0) < 1 || len(items.1) < 1) { + return errors::invalid; + }; + typevalid(items.0)?; + typevalid(items.1)?; + return (mtype, strings::tokenize(params, ";")); +}; +``` + +This function accepts a string as input, then returns a tagged union which +contains either a tuple of `(str, type_params)`, or a syntax error. + +I designed this with particular attention to the memory management semantics. +<span class="redacted">xxxx</span> uses manual memory management, and if possible it's desirable to avoid +allocating any additional memory so that the user of our APIs remains in control +of the memory semantics. The return value is a sub-string borrowed from the +"text/plain" part, as well as a tokenizer which is prepared to split the +remainder of the string along the ";" tokens. + +Inspiration for strings::cut comes from [Go]: + +[Go]: https://github.com/golang/go/issues/46336 + +<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">cut</span> +<span class="c1">// Returns a string "cut" along the first instance of a delimiter, returning +</span><span class="c1">// everything up to the delimiter, and everything after the delimiter, in a +</span><span class="c1">// tuple. +</span><span class="c1">// +</span><span class="c1">// strings::cut("hello=world=foobar", "=") // ("hello", "world=foobar") +</span><span class="c1">// strings::cut("hello world", "=") // ("hello world", "") +</span><span class="c1">// +</span><span class="c1">// The return value is borrowed from the 'in' parameter. +</span><span class="c1"></span><span class="k">fn</span> <span class="n">cut</span><span class="p">(</span><span class="n">in</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="p">(</span><span class="kt">str</span><span class="p">,</span> <span class="kt">str</span><span class="p">);</span> +</code></pre></div> + +And strings::tokenize works like so: + +<div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">tokenize</span> +<span class="c1">// Returns a tokenizer which yields sub-strings tokenized by a delimiter. +</span><span class="c1">// +</span><span class="c1">// let tok = strings::tokenize("hello, my name is drew", " "); +</span><span class="c1">// assert(strings::next_token(tok) == "hello,"); +</span><span class="c1">// assert(strings::next_token(tok) == "my"); +</span><span class="c1">// assert(strings::next_token(tok) == "name"); +</span><span class="c1">// assert(strings::remaining_tokens(tok) == "is drew"); +</span><span class="c1"></span><span class="k">fn</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">s</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="n">tokenizer</span><span class="p">;</span> +</code></pre></div> + +The RFC limits the acceptable characters for the media type and subtype, which +we test with the typevalid function. + +The user of this module often only cares about the media type and not its type +parameters, so the tokenizer can be safely abandoned on the stack to get cleaned +up when the stack frame exits if they don't care about the rest. + +This is enough to write a little test: + +```hare +@test fn parse() void = { + const res = parse("text/plain")!; + assert(res.0 == "text/plain"); + + const res = parse("image/png")!; + assert(res.0 == "image/png"); + + const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!; + assert(res.0 == "application/svg+xml"); +}; +``` + +<pre>$ <span class="redacted">xxxx</span> test mime::parse +mime::parse..................................... OK + +1 passed; 0 failed; 1 tests completed in 0.10s</pre> + +To handle the type parameters in the third case, we add this function: + +```hare +// Returns the next parameter as a (key, value) tuple from a [[type_params]] +// object that was prepared via [[parse]], void if there are no remaining +// parameters, and [[errors::invalid]] if a syntax error was encountered. +export fn next_param(in: *type_params) ((str, str) | void | errors::invalid) = { + const tok = match (strings::next_token(in)) { + case let s: str => + if (s == "") { + // empty parameter + return errors::invalid; + }; + yield s; + case void => + return; + }; + + const items = strings::cut(tok, "="); + // The RFC does not permit whitespace here, but whitespace is very + // common in the wild. ¯\_(ツ)_/¯ + items.0 = strings::trim(items.0); + items.1 = strings::trim(items.1); + if (len(items.0) == 0 || len(items.1) == 0) { + return errors::invalid; + }; + + if (strings::hasprefix(items.1, "\"")) { + items.1 = quoted(items.1)?; + }; + + return (items.0, items.1); +}; +``` + +This returns a (key, value) tuple and advances to the next parameter, or returns +void if there are no further parameters (or, if necessary, an error). This is +pretty straightforward: the tokenizer prepared by parse is splitting the string +on `;` tokens, so we first fetch the next token. We then use strings::cut again +to split it over the `=` token, and after a quick trim to fix another RFC +oversight, we can return it to the caller. Unless it's using this pesky +quoted-string terminal, which is where our implementation starts to show its +weaknesses: + +```hare +fn quoted(in: str) (str | errors::invalid) = { + // We have only a basic implementation of quoted-string. It has a couple + // of problems: + // + // 1. The RFC does not define it very well + // 2. The parts of the RFC which are ill-defined are rarely used + // 3. Implementing quoted-pair would require allocating a new string + // + // This implementation should handle most Media Types seen in practice + // unless they're doing something weird and ill-advised with them. + in = strings::trim(in, '"'); + if (strings::contains(in, "\\") + || strings::contains(in, "\r") + || strings::contains(in, "\n")) { + return errors::invalid; + }; + return in; +}; +``` + +I think this implementation speaks for itself. It could be a bit faster if we +didn't do 3 &times; O(n) strings::contains calls, but someone will send a patch +if they care. The completed test for this is: + +```hare +@test fn parse() void = { + const res = parse("text/plain")!; + assert(res.0 == "text/plain"); + + const res = parse("image/png")!; + assert(res.0 == "image/png"); + + const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!; + assert(res.0 == "application/svg+xml"); + const params = res.1; + const param = next_param(&params)! as (str, str); + assert(param.0 == "charset" && param.1 == "utf-8"); + const param = next_param(&params)! as (str, str); + assert(param.0 == "foo" && param.1 == "bar baz"); + assert(next_param(&params) is void); + + assert(parse("hi") is errors::invalid); + assert(parse("text/ spaces ") is errors::invalid); + assert(parse("text/@") is errors::invalid); + + const res = parse("text/plain;charset")!; + assert(res.0 == "text/plain"); + assert(next_param(&res.1) is errors::invalid); +}; +``` + +## The Media Type database + +The second part of this module is the Media Type database. This comes in two +parts: + +1. An internal database which is populated by <span class="redacted">xxxx</span> modules. For example, an + image::png module might register the "image/png" mimetype with the internal + MIME database, similar to protocol registration for net::dial. +2. A system-provided database, usually via /etc/mime.types, which is more + comprehensive, but may not be available at runtime. + +I plan on doing the second part later on, so for now we'll just focus on the +first; most of the interesting bits are there anyway. + +Again, special consideration is given to memory management here. The essence of +a good <span class="redacted">xxxx</span> program or API design can be ascertained from how well it handles +memory management. As such, I have set aside separate lists to handle statically +allocated MIME info (such as those provided by image::png et al) versus the +forthcoming dynamically-allocated system database. + +```hare +// A pair of a Media Type and a list of file extensions associated with it. The +// extension list does not include the leading '.' character. +export type mimetype = struct { + mime: str, + exts: []str, +}; + +// List of media types with statically allocated fields (though the list itself +// is dynamically allocated). +let static_db: []mimetype = []; + +// List of media types with heap-allocated fields, used when loading mime types +// from the system database. +let heap_db: []mimetype = []; + +const builtins: [_]mimetype = [ + mimetype { + mime = "text/plain", + exts = ["txt"], + }, + mimetype { + mime = "text/x-xxxx", // redacted for public blog post + exts = ["xx"], + }, +]; + +@init fn init() void = { + register(builtins...); +}; + +@fini fn fini() void = { + for (let i = 0z; i < len(heap_db); i += 1) { + free(heap_db[i].mime); + strings::freeall(heap_db[i].exts); + }; + free(heap_db); + free(static_db); +}; +``` + +The register function will be used from @init functions like this one to +register media types with the internal database. This code has minimal +allocations for the internal database, but we do actually do some allocating +here to store the "static_db" slice. In theory we could eliminate this by +statically provisioning a small number of slots to store the internal database +in, but for this use-case the trade-off makes sense. There are use-cases where +the trade-off does not make as much sense, however. For example, here's how the +command line arguments are stored for your program in the "os" module: + +```hare +// The command line arguments provided to the program. By convention, the first +// member is usually the name of the program. +export let args: []str = []; + +// Statically allocate arg strings if there are few enough arguments, saves a +// syscall if we don't need it. +let args_static: [32]str = [""...]; + +@init fn init_environ() void = { + if (rt::argc < len(args_static)) { + args = args_static[..rt::argc]; + for (let i = 0z; i < rt::argc; i += 1) { + args[i] = strings::fromc(rt::argv[i]); + }; + } else { + args = alloc([], rt::argc); + for (let i = 0z; i < rt::argc; i += 1) { + append(args, strings::fromc(rt::argv[i])); + }; + }; + +}; + +@fini fn fini_environ() void = { + if (rt::argc >= len(args_static)) { + free(args); + }; +}; +``` + +A similar approach is also used on yyp's RISC-V kernel for [storing serial +devices][0] without any runtime memory allocations. + +[0]: https://paste.sr.ht/~sircmpwn/acaa1e61e6bcb3e22e8b4bce7f233dcd844565eb + +The internal database is likely to be small, but the system database is likely +to have a lot of media types and file extensions registered, so it makes sense +to build out an efficient means of accessing them. For this purpose I have +implemented a simple hash map. <span class="redacted">xxxx</span> does not have a built-in map construct, nor +generics. The design constraints of <span class="redacted">xxxx</span> are closer to C than to anything else, +and as such, the trade-offs for first-class maps are similar to C, which is to +say that they don't make sense with our design. However, this use-case does not +call for much sophistication, so a simple map will suffice. + +```hare +use hash::fnv; + +def MIME_BUCKETS: size = 256; + +// Hash tables for efficient database lookup by mimetype or extension +let mimetable: [MIME_BUCKETS][]*mimetype = [[]...]; +let exttable: [MIME_BUCKETS][]*mimetype = [[]...]; + +// Registers a Media Type and its extensions in the internal MIME database. This +// function is designed to be used by @init functions for modules which +// implement new Media Types. +export fn register(mime: mimetype...) void = { + let i = len(static_db); + append(static_db, mime...); + for (i < len(static_db); i += 1) { + const item = &static_db[i]; + const hash = fnv::string(item.mime); + let bucket = &mimetable[hash % len(mimetable)]; + append(bucket, item); + + for (let i = 0z; i < len(item.exts); i += 1) { + const hash = fnv::string(item.exts[i]); + let bucket = &exttable[hash % len(exttable)]; + append(bucket, item); + }; + }; +}; +``` + +A fixed-length array of slices is a common approach to hash tables in <span +class="redacted">xxxx</span>. It's +not a great design for hash tables whose size is not reasonably predictable in +advance or which need to be frequently resized and rehashed, but it is pretty +easy to implement and provides sufficient performance for use-cases like this. A +re-sizable hash table, or tables using an alternate hash function, or the use of +linked lists instead of slices, and so on &mdash; all of this is possible if the +use-case calls for it, but must be written by hand. + +Finally, we implement the look-up functions, which are very simple: + +```hare +// Looks up a Media Type based on the mime type string, returning null if +// unknown. +export fn lookup_mime(mime: str) const nullable *mimetype = { + const hash = fnv::string(mime); + const bucket = &mimetable[hash % len(mimetable)]; + for (let i = 0z; i < len(bucket); i += 1) { + const item = bucket[i]; + if (item.mime == mime) { + return item; + }; + }; + return null; +}; + +// Looks up a Media Type based on a file extension, with or without the leading +// '.' character, returning null if unknown. +export fn lookup_ext(ext: str) const nullable *mimetype = { + ext = strings::ltrim(ext, '.'); + const hash = fnv::string(ext); + const bucket = &exttable[hash % len(exttable)]; + for (let i = 0z; i < len(bucket); i += 1) { + const item = bucket[i]; + for (let j = 0z; j < len(item.exts); j += 1) { + if (item.exts[j] == ext) { + return item; + }; + }; + }; + return null; +}; +``` + +For the sake of completeness, here are the tests: + +```hare +@test fn lookup_mime() void = { + assert(lookup_mime("foo/bar") == null); + + const result = lookup_mime("text/plain"); + assert(result != null); + const result = result: *mimetype; + assert(result.mime == "text/plain"); + assert(len(result.exts) == 1); + assert(result.exts[0] == "txt"); + + const result = lookup_mime("text/x-xxxx"); + assert(result != null); + const result = result: *mimetype; + assert(result.mime == "text/x-xxxx"); + assert(len(result.exts) == 1); + assert(result.exts[0] == "xx"); +}; + +@test fn lookup_ext() void = { + assert(lookup_ext("foo") == null); + assert(lookup_ext(".foo") == null); + + const result = lookup_ext("txt"); + assert(result != null); + const result = result: *mimetype; + assert(result.mime == "text/plain"); + assert(len(result.exts) == 1); + assert(result.exts[0] == "txt"); + + const result = lookup_ext(".txt"); + assert(result != null); + const result = result: *mimetype; + assert(result.mime == "text/plain"); + + const result = lookup_ext("xx"); + assert(result != null); + const result = result: *mimetype; + assert(result.mime == "text/x-xxxx"); + assert(len(result.exts) == 1); + assert(result.exts[0] == "xx"); +}; +``` + +There you have it! I will later implement some code which parses /etc/mime.types +in @init and fills up the heap_db slice, and this lookup code should work with +it without any additional changes.