logo

drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

Implementing-mime-in-xxxx.md (18625B)


  1. ---
  2. title: Implementing a MIME database in XXXX
  3. date: 2022-01-28
  4. ---
  5. <em>This is a (redacted) post from the internal blog of a new systems
  6. programming language we're developing. The project is being kept under wraps
  7. until we're done with it, so for this post I'll be calling it <span
  8. class="redacted">XXXX</span>. If you are interested in participating, <a
  9. href="mailto:sir@cmpwn.com">send me an email</a> with some details about your
  10. background and I'll get back to you.</em>
  11. <style>
  12. .redacted {
  13. color: black;
  14. background: black;
  15. }
  16. </style>
  17. Recently, I have been working on implementing a parser for media types (commonly
  18. called MIME types) and a database which maps media types to file extensions and
  19. vice-versa. I thought this would be an interesting module to blog about, given
  20. that it's only about 250 lines of code, does something useful and interesting,
  21. and demonstrates a few interesting <span class="redacted">xxxx</span> concepts.
  22. The format for media types is more-or-less defined by [RFC 2045], specifically
  23. [section 5.1]. The specification is not great. The grammar shown here is copied
  24. and pasted from parts of larger grammars in older RFCs, RFCs which are equally
  25. poorly defined. For example, the quoted-string nonterminal is never defined
  26. here, but instead comes from RFC 822, which defines it but also states that it
  27. can be "folded", which technically makes the following a valid Media Type:
  28. ```
  29. text/plain;charset="hello
  30. world"
  31. ```
  32. Or so I would presume, but the qtext terminal "cannot include CR", which is the
  33. mechanism by which folding is performed in the first place, and... bleh. Let's
  34. just implement a "reasonable subset" of the spec instead and side-step the whole
  35. folding issue.[^1] This post will first cover parsing media types, then address
  36. our second goal: providing a database which maps media types to file extensions
  37. and vice versa.
  38. [RFC 2045]: https://datatracker.ietf.org/doc/html/rfc2045
  39. [section 5.1]: https://datatracker.ietf.org/doc/html/rfc2045#section-5.1
  40. [^1]: Any time we implement a "reasonable subset" of a specification rather than the whole specification, I add the module to the list of modules likely to be moved out of the standard library and into a standalone module at some point prior to release. Another module on this list is our XML parser.
  41. ## Parsing Media Types
  42. So, here's what we're going to implement today: we want to parse the following
  43. string:
  44. ```
  45. text/plain; charset=utf-8; foo="bar baz"
  46. ```
  47. The code for that I came up with for is as follows:
  48. ```hare
  49. // Parses a Media Type, returning a tuple of the content type (e.g.
  50. // "text/plain") and a parameter parser object, or [[errors::invalid]] if the
  51. // input cannot be parsed.
  52. //
  53. // To enumerate the Media Type parameter list, pass the type_params object into
  54. // [[next_param]]. If you do not need the parameter list, you can safely discard
  55. // the object. Note that any format errors following the ";" token will not
  56. // cause [[errors::invalid]] to be returned unless [[next_param]] is used to
  57. // enumerate all of the parameters.
  58. export fn parse(in: str) ((str, type_params) | errors::invalid) = {
  59. const items = strings::cut(in, ";");
  60. const mtype = items.0, params = items.1;
  61. const items = strings::cut(mtype, "/");
  62. if (len(items.0) < 1 || len(items.1) < 1) {
  63. return errors::invalid;
  64. };
  65. typevalid(items.0)?;
  66. typevalid(items.1)?;
  67. return (mtype, strings::tokenize(params, ";"));
  68. };
  69. ```
  70. This function accepts a string as input, then returns a tagged union which
  71. contains either a tuple of `(str, type_params)`, or a syntax error.
  72. I designed this with particular attention to the memory management semantics.
  73. <span class="redacted">xxxx</span> uses manual memory management, and if possible it's desirable to avoid
  74. allocating any additional memory so that the user of our APIs remains in control
  75. of the memory semantics. The return value is a sub-string borrowed from the
  76. "text/plain" part, as well as a tokenizer which is prepared to split the
  77. remainder of the string along the ";" tokens.
  78. Inspiration for strings::cut comes from [Go]:
  79. [Go]: https://github.com/golang/go/issues/46336
  80. <div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">cut</span>
  81. <span class="c1">// Returns a string "cut" along the first instance of a delimiter, returning
  82. </span><span class="c1">// everything up to the delimiter, and everything after the delimiter, in a
  83. </span><span class="c1">// tuple.
  84. </span><span class="c1">//
  85. </span><span class="c1">// strings::cut("hello=world=foobar", "=") // ("hello", "world=foobar")
  86. </span><span class="c1">// strings::cut("hello world", "=") // ("hello world", "")
  87. </span><span class="c1">//
  88. </span><span class="c1">// The return value is borrowed from the 'in' parameter.
  89. </span><span class="c1"></span><span class="k">fn</span> <span class="n">cut</span><span class="p">(</span><span class="n">in</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="p">(</span><span class="kt">str</span><span class="p">,</span> <span class="kt">str</span><span class="p">);</span>
  90. </code></pre></div>
  91. And strings::tokenize works like so:
  92. <div class="highlight"><pre class="chroma"><code class="language-hare" data-lang="hare"><span class="err">$</span> <span class="n"><span class="redacted">xxxx</span>doc</span> <span class="n">strings</span><span class="o">::</span><span class="n">tokenize</span>
  93. <span class="c1">// Returns a tokenizer which yields sub-strings tokenized by a delimiter.
  94. </span><span class="c1">//
  95. </span><span class="c1">// let tok = strings::tokenize("hello, my name is drew", " ");
  96. </span><span class="c1">// assert(strings::next_token(tok) == "hello,");
  97. </span><span class="c1">// assert(strings::next_token(tok) == "my");
  98. </span><span class="c1">// assert(strings::next_token(tok) == "name");
  99. </span><span class="c1">// assert(strings::remaining_tokens(tok) == "is drew");
  100. </span><span class="c1"></span><span class="k">fn</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">s</span><span class="o">:</span> <span class="kt">str</span><span class="p">,</span> <span class="n">delim</span><span class="o">:</span> <span class="kt">str</span><span class="p">)</span> <span class="n">tokenizer</span><span class="p">;</span>
  101. </code></pre></div>
  102. The RFC limits the acceptable characters for the media type and subtype, which
  103. we test with the typevalid function.
  104. The user of this module often only cares about the media type and not its type
  105. parameters, so the tokenizer can be safely abandoned on the stack to get cleaned
  106. up when the stack frame exits if they don't care about the rest.
  107. This is enough to write a little test:
  108. ```hare
  109. @test fn parse() void = {
  110. const res = parse("text/plain")!;
  111. assert(res.0 == "text/plain");
  112. const res = parse("image/png")!;
  113. assert(res.0 == "image/png");
  114. const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
  115. assert(res.0 == "application/svg+xml");
  116. };
  117. ```
  118. <pre>$ <span class="redacted">xxxx</span> test mime::parse
  119. mime::parse..................................... OK
  120. 1 passed; 0 failed; 1 tests completed in 0.10s</pre>
  121. To handle the type parameters in the third case, we add this function:
  122. ```hare
  123. // Returns the next parameter as a (key, value) tuple from a [[type_params]]
  124. // object that was prepared via [[parse]], void if there are no remaining
  125. // parameters, and [[errors::invalid]] if a syntax error was encountered.
  126. export fn next_param(in: *type_params) ((str, str) | void | errors::invalid) = {
  127. const tok = match (strings::next_token(in)) {
  128. case let s: str =>
  129. if (s == "") {
  130. // empty parameter
  131. return errors::invalid;
  132. };
  133. yield s;
  134. case void =>
  135. return;
  136. };
  137. const items = strings::cut(tok, "=");
  138. // The RFC does not permit whitespace here, but whitespace is very
  139. // common in the wild. ¯\_(ツ)_/¯
  140. items.0 = strings::trim(items.0);
  141. items.1 = strings::trim(items.1);
  142. if (len(items.0) == 0 || len(items.1) == 0) {
  143. return errors::invalid;
  144. };
  145. if (strings::hasprefix(items.1, "\"")) {
  146. items.1 = quoted(items.1)?;
  147. };
  148. return (items.0, items.1);
  149. };
  150. ```
  151. This returns a (key, value) tuple and advances to the next parameter, or returns
  152. void if there are no further parameters (or, if necessary, an error). This is
  153. pretty straightforward: the tokenizer prepared by parse is splitting the string
  154. on `;` tokens, so we first fetch the next token. We then use strings::cut again
  155. to split it over the `=` token, and after a quick trim to fix another RFC
  156. oversight, we can return it to the caller. Unless it's using this pesky
  157. quoted-string terminal, which is where our implementation starts to show its
  158. weaknesses:
  159. ```hare
  160. fn quoted(in: str) (str | errors::invalid) = {
  161. // We have only a basic implementation of quoted-string. It has a couple
  162. // of problems:
  163. //
  164. // 1. The RFC does not define it very well
  165. // 2. The parts of the RFC which are ill-defined are rarely used
  166. // 3. Implementing quoted-pair would require allocating a new string
  167. //
  168. // This implementation should handle most Media Types seen in practice
  169. // unless they're doing something weird and ill-advised with them.
  170. in = strings::trim(in, '"');
  171. if (strings::contains(in, "\\")
  172. || strings::contains(in, "\r")
  173. || strings::contains(in, "\n")) {
  174. return errors::invalid;
  175. };
  176. return in;
  177. };
  178. ```
  179. I think this implementation speaks for itself. It could be a bit faster if we
  180. didn't do 3 &times; O(n) strings::contains calls, but someone will send a patch
  181. if they care. The completed test for this is:
  182. ```hare
  183. @test fn parse() void = {
  184. const res = parse("text/plain")!;
  185. assert(res.0 == "text/plain");
  186. const res = parse("image/png")!;
  187. assert(res.0 == "image/png");
  188. const res = parse("application/svg+xml; charset=utf-8; foo=\"bar baz\"")!;
  189. assert(res.0 == "application/svg+xml");
  190. const params = res.1;
  191. const param = next_param(&params)! as (str, str);
  192. assert(param.0 == "charset" && param.1 == "utf-8");
  193. const param = next_param(&params)! as (str, str);
  194. assert(param.0 == "foo" && param.1 == "bar baz");
  195. assert(next_param(&params) is void);
  196. assert(parse("hi") is errors::invalid);
  197. assert(parse("text/ spaces ") is errors::invalid);
  198. assert(parse("text/@") is errors::invalid);
  199. const res = parse("text/plain;charset")!;
  200. assert(res.0 == "text/plain");
  201. assert(next_param(&res.1) is errors::invalid);
  202. };
  203. ```
  204. ## The Media Type database
  205. The second part of this module is the Media Type database. This comes in two
  206. parts:
  207. 1. An internal database which is populated by <span class="redacted">xxxx</span> modules. For example, an
  208. image::png module might register the "image/png" mimetype with the internal
  209. MIME database, similar to protocol registration for net::dial.
  210. 2. A system-provided database, usually via /etc/mime.types, which is more
  211. comprehensive, but may not be available at runtime.
  212. I plan on doing the second part later on, so for now we'll just focus on the
  213. first; most of the interesting bits are there anyway.
  214. Again, special consideration is given to memory management here. The essence of
  215. a good <span class="redacted">xxxx</span> program or API design can be ascertained from how well it handles
  216. memory management. As such, I have set aside separate lists to handle statically
  217. allocated MIME info (such as those provided by image::png et al) versus the
  218. forthcoming dynamically-allocated system database.
  219. ```hare
  220. // A pair of a Media Type and a list of file extensions associated with it. The
  221. // extension list does not include the leading '.' character.
  222. export type mimetype = struct {
  223. mime: str,
  224. exts: []str,
  225. };
  226. // List of media types with statically allocated fields (though the list itself
  227. // is dynamically allocated).
  228. let static_db: []mimetype = [];
  229. // List of media types with heap-allocated fields, used when loading mime types
  230. // from the system database.
  231. let heap_db: []mimetype = [];
  232. const builtins: [_]mimetype = [
  233. mimetype {
  234. mime = "text/plain",
  235. exts = ["txt"],
  236. },
  237. mimetype {
  238. mime = "text/x-xxxx", // redacted for public blog post
  239. exts = ["xx"],
  240. },
  241. ];
  242. @init fn init() void = {
  243. register(builtins...);
  244. };
  245. @fini fn fini() void = {
  246. for (let i = 0z; i < len(heap_db); i += 1) {
  247. free(heap_db[i].mime);
  248. strings::freeall(heap_db[i].exts);
  249. };
  250. free(heap_db);
  251. free(static_db);
  252. };
  253. ```
  254. The register function will be used from @init functions like this one to
  255. register media types with the internal database. This code has minimal
  256. allocations for the internal database, but we do actually do some allocating
  257. here to store the "static_db" slice. In theory we could eliminate this by
  258. statically provisioning a small number of slots to store the internal database
  259. in, but for this use-case the trade-off makes sense. There are use-cases where
  260. the trade-off does not make as much sense, however. For example, here's how the
  261. command line arguments are stored for your program in the "os" module:
  262. ```hare
  263. // The command line arguments provided to the program. By convention, the first
  264. // member is usually the name of the program.
  265. export let args: []str = [];
  266. // Statically allocate arg strings if there are few enough arguments, saves a
  267. // syscall if we don't need it.
  268. let args_static: [32]str = [""...];
  269. @init fn init_environ() void = {
  270. if (rt::argc < len(args_static)) {
  271. args = args_static[..rt::argc];
  272. for (let i = 0z; i < rt::argc; i += 1) {
  273. args[i] = strings::fromc(rt::argv[i]);
  274. };
  275. } else {
  276. args = alloc([], rt::argc);
  277. for (let i = 0z; i < rt::argc; i += 1) {
  278. append(args, strings::fromc(rt::argv[i]));
  279. };
  280. };
  281. };
  282. @fini fn fini_environ() void = {
  283. if (rt::argc >= len(args_static)) {
  284. free(args);
  285. };
  286. };
  287. ```
  288. A similar approach is also used on yyp's RISC-V kernel for [storing serial
  289. devices][0] without any runtime memory allocations.
  290. [0]: https://paste.sr.ht/~sircmpwn/acaa1e61e6bcb3e22e8b4bce7f233dcd844565eb
  291. The internal database is likely to be small, but the system database is likely
  292. to have a lot of media types and file extensions registered, so it makes sense
  293. to build out an efficient means of accessing them. For this purpose I have
  294. implemented a simple hash map. <span class="redacted">xxxx</span> does not have a built-in map construct, nor
  295. generics. The design constraints of <span class="redacted">xxxx</span> are closer to C than to anything else,
  296. and as such, the trade-offs for first-class maps are similar to C, which is to
  297. say that they don't make sense with our design. However, this use-case does not
  298. call for much sophistication, so a simple map will suffice.
  299. ```hare
  300. use hash::fnv;
  301. def MIME_BUCKETS: size = 256;
  302. // Hash tables for efficient database lookup by mimetype or extension
  303. let mimetable: [MIME_BUCKETS][]*mimetype = [[]...];
  304. let exttable: [MIME_BUCKETS][]*mimetype = [[]...];
  305. // Registers a Media Type and its extensions in the internal MIME database. This
  306. // function is designed to be used by @init functions for modules which
  307. // implement new Media Types.
  308. export fn register(mime: mimetype...) void = {
  309. let i = len(static_db);
  310. append(static_db, mime...);
  311. for (i < len(static_db); i += 1) {
  312. const item = &static_db[i];
  313. const hash = fnv::string(item.mime);
  314. let bucket = &mimetable[hash % len(mimetable)];
  315. append(bucket, item);
  316. for (let i = 0z; i < len(item.exts); i += 1) {
  317. const hash = fnv::string(item.exts[i]);
  318. let bucket = &exttable[hash % len(exttable)];
  319. append(bucket, item);
  320. };
  321. };
  322. };
  323. ```
  324. A fixed-length array of slices is a common approach to hash tables in <span
  325. class="redacted">xxxx</span>. It's
  326. not a great design for hash tables whose size is not reasonably predictable in
  327. advance or which need to be frequently resized and rehashed, but it is pretty
  328. easy to implement and provides sufficient performance for use-cases like this. A
  329. re-sizable hash table, or tables using an alternate hash function, or the use of
  330. linked lists instead of slices, and so on &mdash; all of this is possible if the
  331. use-case calls for it, but must be written by hand.
  332. Finally, we implement the look-up functions, which are very simple:
  333. ```hare
  334. // Looks up a Media Type based on the mime type string, returning null if
  335. // unknown.
  336. export fn lookup_mime(mime: str) const nullable *mimetype = {
  337. const hash = fnv::string(mime);
  338. const bucket = &mimetable[hash % len(mimetable)];
  339. for (let i = 0z; i < len(bucket); i += 1) {
  340. const item = bucket[i];
  341. if (item.mime == mime) {
  342. return item;
  343. };
  344. };
  345. return null;
  346. };
  347. // Looks up a Media Type based on a file extension, with or without the leading
  348. // '.' character, returning null if unknown.
  349. export fn lookup_ext(ext: str) const nullable *mimetype = {
  350. ext = strings::ltrim(ext, '.');
  351. const hash = fnv::string(ext);
  352. const bucket = &exttable[hash % len(exttable)];
  353. for (let i = 0z; i < len(bucket); i += 1) {
  354. const item = bucket[i];
  355. for (let j = 0z; j < len(item.exts); j += 1) {
  356. if (item.exts[j] == ext) {
  357. return item;
  358. };
  359. };
  360. };
  361. return null;
  362. };
  363. ```
  364. For the sake of completeness, here are the tests:
  365. ```hare
  366. @test fn lookup_mime() void = {
  367. assert(lookup_mime("foo/bar") == null);
  368. const result = lookup_mime("text/plain");
  369. assert(result != null);
  370. const result = result: *mimetype;
  371. assert(result.mime == "text/plain");
  372. assert(len(result.exts) == 1);
  373. assert(result.exts[0] == "txt");
  374. const result = lookup_mime("text/x-xxxx");
  375. assert(result != null);
  376. const result = result: *mimetype;
  377. assert(result.mime == "text/x-xxxx");
  378. assert(len(result.exts) == 1);
  379. assert(result.exts[0] == "xx");
  380. };
  381. @test fn lookup_ext() void = {
  382. assert(lookup_ext("foo") == null);
  383. assert(lookup_ext(".foo") == null);
  384. const result = lookup_ext("txt");
  385. assert(result != null);
  386. const result = result: *mimetype;
  387. assert(result.mime == "text/plain");
  388. assert(len(result.exts) == 1);
  389. assert(result.exts[0] == "txt");
  390. const result = lookup_ext(".txt");
  391. assert(result != null);
  392. const result = result: *mimetype;
  393. assert(result.mime == "text/plain");
  394. const result = lookup_ext("xx");
  395. assert(result != null);
  396. const result = result: *mimetype;
  397. assert(result.mime == "text/x-xxxx");
  398. assert(len(result.exts) == 1);
  399. assert(result.exts[0] == "xx");
  400. };
  401. ```
  402. There you have it! I will later implement some code which parses /etc/mime.types
  403. in @init and fills up the heap_db slice, and this lookup code should work with
  404. it without any additional changes.