logo

drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

Our-self-hosted-parser-design.md (19689B)


  1. ---
  2. title: "Parsers all the way down: writing a self-hosting parser"
  3. date: 2021-04-22
  4. outputs: [html, gemtext]
  5. ---
  6. One of the things we're working on in [my new programming language][0] is a
  7. self-hosting compiler. Having a self-hosted compiler is a critical step in the
  8. development of (some) programming languages: it signals that the language is
  9. mature enough to be comfortably used to implement itself. While this isn't right
  10. for some languages (e.g. shell scripts), for a systems programming language like
  11. ours, this is a crucial step in our bootstrapping plan. Our self-hosted parser
  12. design was completed this week, and today I'll share some details about how it
  13. works and how it came to be.
  14. [0]: https://drewdevault.com/2021/03/19/A-new-systems-language.html
  15. This is the third parser which has been implemented for this language. We wrote
  16. a sacrificial compiler prototype upfront to help inform the language design, and
  17. that first compiler used [yacc][yacc] for its parser. Using yacc was helpful at
  18. first because it makes it reasonably simple to iterate on the parser when the
  19. language is still undergoing frequent and far-reaching design changes. Another
  20. nice side-effect starting with a yacc parser is that it makes it quite easy to
  21. produce a formal grammar when you settle on the design. Here's a peek at some of
  22. our original parser code:
  23. [yacc]: https://en.wikipedia.org/wiki/Yacc
  24. ```
  25. struct_type
  26. : T_STRUCT '{' struct_fields '}' {
  27. $$.flags = 0;
  28. $$.storage = TYPE_STRUCT;
  29. allocfrom((void **)&$$.fields, &$3, sizeof($3));
  30. }
  31. | T_UNION '{' struct_fields '}' {
  32. $$.flags = 0;
  33. $$.storage = TYPE_UNION;
  34. allocfrom((void **)&$$.fields, &$3, sizeof($3));
  35. }
  36. ;
  37. struct_fields
  38. : struct_field
  39. | struct_field ',' { $$ = $1; }
  40. | struct_field ',' struct_fields {
  41. $$ = $1;
  42. allocfrom((void **)&$$.next, &$3, sizeof($3));
  43. }
  44. ;
  45. struct_field
  46. : T_IDENT ':' type {
  47. $$.name = $1;
  48. allocfrom((void**)&$$.type, &$3, sizeof($3));
  49. $$.next = NULL;
  50. }
  51. ;
  52. ```
  53. This approach has you writing code which is already almost a formal grammar in
  54. its own right. If we strip out the C code, we get the following:
  55. ```
  56. struct_type
  57. : T_STRUCT '{' struct_fields '}'
  58. | T_UNION '{' struct_fields '}'
  59. ;
  60. struct_fields
  61. : struct_field
  62. | struct_field ','
  63. | struct_field ',' struct_fields
  64. ;
  65. struct_field
  66. : T_IDENT ':' type
  67. ;
  68. ```
  69. This gives us a reasonably clean path to writing a formal grammar (and
  70. specification) for the language, which is what we did next.
  71. ![A screenshot of a PDF file which shows a formal grammar similar to the sample
  72. given above.](https://l.sr.ht/CFo0.png)
  73. All of these samples describe a struct type. The following example shows what
  74. this grammar looks like in real code — starting from the word "struct" and
  75. including up to the "}" at the end.
  76. ```hare
  77. type coordinates = struct {
  78. x: int,
  79. y: int,
  80. z: int,
  81. };
  82. ```
  83. In order to feed our parser tokens to work with, we also need a lexer, or a
  84. *lexical analyzer*. This turns a series of characters like "struct" into a
  85. single token, like the T_STRUCT we used in the yacc code. Like the original
  86. compiler used yacc as a parser generator, we also used [lex][lex] as a lexer
  87. generator. It's simply a list of regexes and the names of the tokens that match
  88. those regexes, plus a little bit of extra code to do things like turning "1234"
  89. into an int with a value of 1234. Our lexer also kept track of line and column
  90. numbers as it consumed characters from input files.
  91. [lex]: https://en.wikipedia.org/wiki/Lex_(software)
  92. ```
  93. "struct" { _lineno(); return T_STRUCT; }
  94. "union" { _lineno(); return T_UNION; }
  95. "{" { _lineno(); return '{'; }
  96. "}" { _lineno(); return '}'; }
  97. [a-zA-Z][a-zA-Z0-9_]* {
  98. _lineno();
  99. yylval.sval = strdup(yytext);
  100. return T_IDENTIFIER;
  101. }
  102. ```
  103. After we settled on the design with our prototype compiler, which was able to
  104. compile some simple test programs to give us a feel for our language design, we
  105. set it aside and wrote the specification, and, alongside it, a second compiler.
  106. This new compiler was written in C — the language was not ready to
  107. self-host yet — and uses a hand-written [recursive descent][rd] parser.
  108. [rd]: https://en.wikipedia.org/wiki/Recursive_descent_parser
  109. To simplify the parser, we deliberately designed a context-free LL(1) grammar,
  110. which means it (a) can parse an input unambiguously without needing additional
  111. context, and (b) only requires one token of look-ahead. This makes our parser
  112. design a lot simpler, which was a deliberate goal of the language design. Our
  113. hand-rolled lexer is *slightly* more complicated: it requires two characters of
  114. lookahead to distinguish between the ".", "..", and "..." tokens.
  115. I'll skip going in depth on the design of the second parser, because the hosted
  116. parser is more interesting, and a pretty similar design anyway. Let's start by
  117. taking a look at our hosted lexer. Our lexer is initialized with an input source
  118. (e.g. a file) from which it can read a stream of characters. Then, each time we
  119. need a token, we'll ask it to read the next one out. It will read as many
  120. characters as it needs to unambiguously identify the next token, then hand it up
  121. to the caller.
  122. Our specification provides some information to guide the lexer design:
  123. > A token is the smallest unit of meaning in the <span style="background:
  124. > black">****</span> grammar. The lexical analysis phase processes a UTF-8
  125. > source file to produce a stream of tokens by matching the terminals with the
  126. > input text.
  127. >
  128. > Tokens may be separated by *white-space* characters, which are defined as the
  129. > Unicode code-points `U+0009` (horizontal tabulation), `U+000A` (line feed), and
  130. > `U+0020` (space). Any number of whitespace characters may be inserted between
  131. > tokens, either to disambiguate from subsequent tokens, or for aesthetic
  132. > purposes. This whitespace is discarded during the lexical analysis phase.
  133. >
  134. > *Within a single token, white-space is meaningful. For example, the
  135. > string-literal token is defined by two quotation marks **"** enclosing any
  136. > number of literal characters. The enclosed characters are considered part of
  137. > the string-literal token and any whitespace therein is not discarded.*
  138. >
  139. > The lexical analysis process consumes Unicode characters from the source file
  140. > input until it is exhausted, performing the following steps in order: it shall
  141. > consume and discard white-space characters until a non-white-space character
  142. > is found, then consume the longest sequence of characters which could
  143. > constitute a token, and emit it to the token stream.
  144. There are a few different kinds of tokens our lexer is going to need to handle:
  145. operators, like "+" and "-"; keywords, like "struct" and "return"; user-defined
  146. identifiers, like variable names; and constants, like string and numeric
  147. literals.
  148. In short, given the following source code:
  149. ```hare
  150. fn add2(x: int, y: int) int = x + y;
  151. ```
  152. We need to return the following sequence of tokens:
  153. ```
  154. fn (keyword)
  155. add2 (identifier)
  156. ( (operator)
  157. x
  158. :
  159. int
  160. ,
  161. y
  162. int
  163. )
  164. int
  165. =
  166. x
  167. +
  168. y
  169. ;
  170. ```
  171. This way, our parser doesn't have to deal with whitespace, or distinguishing
  172. "int" (keyword) from "integer" (identifier), or handling invalid tokens like
  173. "$". To actually implement this behavior, we'll start with an initialization
  174. function which populates a state structure.
  175. ```hare
  176. // Initializes a new lexer for the given input stream. The path is borrowed.
  177. export fn init(in: *io::stream, path: str, flags: flags...) lexer = {
  178. return lexer {
  179. in = in,
  180. path = path,
  181. loc = (1, 1),
  182. un = void,
  183. rb = [void...],
  184. };
  185. };
  186. export type lexer = struct {
  187. in: *io::stream,
  188. path: str,
  189. loc: (uint, uint),
  190. rb: [2](rune | io::EOF | void),
  191. };
  192. ```
  193. This state structure holds, respectively:
  194. - The input I/O stream
  195. - The path to the current input file
  196. - The current (line, column) number
  197. - A buffer of un-read characters from the input, for lookahead
  198. The main entry point for doing the actual lexing will look like this:
  199. ```hare
  200. // Returns the next token from the lexer.
  201. export fn lex(lex: *lexer) (token | error);
  202. // A single lexical token, the value it represents, and its location in a file.
  203. export type token = (ltok, value, location);
  204. // A token value, used for tokens such as '1337' (an integer).
  205. export type value = (str | rune | i64 | u64 | f64 | void);
  206. // A location in a source file.
  207. export type location = struct {
  208. path: str,
  209. line: uint,
  210. col: uint
  211. };
  212. // A lexical token class.
  213. export type ltok = enum uint {
  214. UNDERSCORE,
  215. ABORT,
  216. ALLOC,
  217. APPEND,
  218. AS,
  219. // ... continued ...
  220. EOF,
  221. };
  222. ```
  223. The idea is that when the caller needs another token, they will call `lex`, and
  224. receive either a token or an error. The purpose of our lex function is to read
  225. out the next character and decide what kind of tokens it might be the start of,
  226. and dispatch to more specific lexing functions to handle each case.
  227. ```hare
  228. export fn lex(lex: *lexer) (token | error) = {
  229. let loc = location { ... };
  230. let rn: rune = match (nextw(lex)?) {
  231. _: io::EOF => return (ltok::EOF, void, mkloc(lex)),
  232. rl: (rune, location) => {
  233. loc = rl.1;
  234. rl.0;
  235. },
  236. };
  237. if (is_name(rn, false)) {
  238. unget(lex, rn);
  239. return lex_name(lex, loc, true);
  240. };
  241. if (ascii::isdigit(rn)) {
  242. unget(lex, rn);
  243. return lex_literal(lex, loc);
  244. };
  245. let tok: ltok = switch (rn) {
  246. * => return syntaxerr(loc, "invalid character"),
  247. '"', '\'' => {
  248. unget(lex, rn);
  249. return lex_rn_str(lex, loc);
  250. },
  251. '.', '<', '>' => return lex3(lex, loc, rn),
  252. '^', '*', '%', '/', '+', '-', ':', '!', '&', '|', '=' => {
  253. return lex2(lex, loc, rn);
  254. },
  255. '~' => ltok::BNOT,
  256. ',' => ltok::COMMA,
  257. '{' => ltok::LBRACE,
  258. '[' => ltok::LBRACKET,
  259. '(' => ltok::LPAREN,
  260. '}' => ltok::RBRACE,
  261. ']' => ltok::RBRACKET,
  262. ')' => ltok::RPAREN,
  263. ';' => ltok::SEMICOLON,
  264. '?' => ltok::QUESTION,
  265. };
  266. return (tok, void, loc);
  267. };
  268. ```
  269. Aside from the EOF case, and simple single-character operators like ";", both of
  270. which this function handles itself, its role is to dispatch work to various
  271. sub-lexers.
  272. <details>
  273. <summary>Expand me to read the helper functions</summary>
  274. <pre>
  275. fn nextw(lex: *lexer) ((rune, location) | io::EOF | io::error) = {
  276. for (true) {
  277. let loc = mkloc(lex);
  278. match (next(lex)) {
  279. e: (io::error | io::EOF) => return e,
  280. r: rune => if (!ascii::isspace(r)) {
  281. return (r, loc);
  282. } else {
  283. free(lex.comment);
  284. lex.comment = "";
  285. },
  286. };
  287. };
  288. abort();
  289. };
  290. </pre>
  291. <pre>
  292. fn unget(lex: *lexer, r: (rune | io::EOF)) void = {
  293. if (!(lex.rb[0] is void)) {
  294. assert(lex.rb[1] is void, "ungot too many runes");
  295. lex.rb[1] = lex.rb[0];
  296. };
  297. lex.rb[0] = r;
  298. };
  299. </pre>
  300. <pre>
  301. fn is_name(r: rune, num: bool) bool =
  302. ascii::isalpha(r) || r == '_' || r == '@' || (num && ascii::isdigit(r));
  303. </pre>
  304. </details>
  305. The sub-lexers handle more specific cases. The lex_name function handles things
  306. which look like identifiers, including keywords; the lex_literal function
  307. handles things which look like literals (e.g. "1234"); lex_rn_str handles rune
  308. and string literals (e.g. "hello world" and '\n'); and lex2 and lex3
  309. respectively handle two- and three-character operators like "&&" and "\>\>=".
  310. lex_name is the most complicated of these. Because the only thing which
  311. distinguishes a keyword from an identifier is that the former matches a specific
  312. list of strings, we start by reading a "name" into a buffer, then binary
  313. searching against a list of known keywords to see if it matches something there.
  314. To facilitate this, "bmap" is a pre-sorted array of keyword names.
  315. ```hare
  316. const bmap: [_]str = [
  317. // Keep me alpha-sorted and consistent with the ltok enum.
  318. "_",
  319. "abort",
  320. "alloc",
  321. "append",
  322. "as",
  323. "assert",
  324. "bool",
  325. // ...
  326. ];
  327. fn lex_name(lex: *lexer, loc: location, keyword: bool) (token | error) = {
  328. let buf = strio::dynamic();
  329. match (next(lex)) {
  330. r: rune => {
  331. assert(is_name(r, false));
  332. strio::appendrune(buf, r);
  333. },
  334. _: (io::EOF | io::error) => abort(), // Invariant
  335. };
  336. for (true) match (next(lex)?) {
  337. _: io::EOF => break,
  338. r: rune => {
  339. if (!is_name(r, true)) {
  340. unget(lex, r);
  341. break;
  342. };
  343. strio::appendrune(buf, r);
  344. },
  345. };
  346. let name = strio::finish(buf);
  347. if (!keyword) {
  348. return (ltok::NAME, name, loc);
  349. };
  350. return match (sort::search(bmap[..ltok::LAST_KEYWORD+1],
  351. size(str), &name, &namecmp)) {
  352. null => (ltok::NAME, name, loc),
  353. v: *void => {
  354. defer free(name);
  355. let tok = v: uintptr - &bmap[0]: uintptr;
  356. tok /= size(str): uintptr;
  357. (tok: ltok, void, loc);
  358. },
  359. };
  360. };
  361. ```
  362. The rest of the code is more of the same, but I've [put it up here][lex.ha] if
  363. you want to read it.
  364. Let's move on to parsing: we need to turn this one dimensional stream of tokens
  365. into an structured form: the *Abstract Syntax Tree*. Consider the following
  366. sample code:
  367. [lex.ha]: https://paste.sr.ht/~sircmpwn/25871787b0d41db2b0af573ba1c93e1b6438b942
  368. ```hare
  369. let x: int = add2(40, 2);
  370. ```
  371. Our token stream looks like this:
  372. ```
  373. let x : int = add2 ( 40 , 2 ) ;
  374. ```
  375. But what we need is something more structured, like this:
  376. ```
  377. binding
  378. name="x"
  379. type="int"
  380. initializer=call-expression
  381. => func="add2"
  382. parameters
  383. constant value="40"
  384. constant value="2"
  385. ```
  386. We know at each step what kinds of tokens are valid in each situation. After we
  387. see "let", we know that we're parsing a binding, so we look for a name ("x")
  388. and a colon token, a type for the variable, an equals sign, and an expression
  389. which initializes it. To parse the initializer, we see an identifier, "add2",
  390. then an open parenthesis, so we know we're in a call expression, and we can
  391. start parsing arguments.
  392. To make our parser code expressive, and to handle errors neatly, we're going to
  393. implement a few helper function that lets us describe these states in terms of
  394. what the parser wants from the lexer. We have a few functions to accomplish
  395. this:
  396. ```hare
  397. // Requires the next token to have a matching ltok. Returns that token, or an error.
  398. fn want(lexer: *lex::lexer, want: lex::ltok...) (lex::token | error) = {
  399. let tok = lex::lex(lexer)?;
  400. if (len(want) == 0) {
  401. return tok;
  402. };
  403. for (let i = 0z; i < len(want); i += 1) {
  404. if (tok.0 == want[i]) {
  405. return tok;
  406. };
  407. };
  408. let buf = strio::dynamic();
  409. defer io::close(buf);
  410. for (let i = 0z; i < len(want); i += 1) {
  411. fmt::fprintf(buf, "'{}'", lex::tokstr((want[i], void, mkloc(lexer))));
  412. if (i + 1 < len(want)) {
  413. fmt::fprint(buf, ", ");
  414. };
  415. };
  416. return syntaxerr(mkloc(lexer), "Unexpected '{}', was expecting {}",
  417. lex::tokstr(tok), strio::string(buf));
  418. };
  419. // Looks for a matching ltok from the lexer, and if not present, unlexes the
  420. // token and returns void. If found, the token is consumed from the lexer and is
  421. // returned.
  422. fn try(
  423. lexer: *lex::lexer,
  424. want: lex::ltok...
  425. ) (lex::token | error | void) = {
  426. let tok = lex::lex(lexer)?;
  427. assert(len(want) > 0);
  428. for (let i = 0z; i < len(want); i += 1) {
  429. if (tok.0 == want[i]) {
  430. return tok;
  431. };
  432. };
  433. lex::unlex(lexer, tok);
  434. };
  435. // Looks for a matching ltok from the lexer, unlexes the token, and returns
  436. // it; or void if it was not a ltok.
  437. fn peek(
  438. lexer: *lex::lexer,
  439. want: lex::ltok...
  440. ) (lex::token | error | void) = {
  441. let tok = lex::lex(lexer)?;
  442. lex::unlex(lexer, tok);
  443. if (len(want) == 0) {
  444. return tok;
  445. };
  446. for (let i = 0z; i < len(want); i += 1) {
  447. if (tok.0 == want[i]) {
  448. return tok;
  449. };
  450. };
  451. };
  452. ```
  453. Let's say we're looking for a binding like our sample code to show up next. The
  454. grammar from the spec is as follows:
  455. ![](https://l.sr.ht/W2AY.png)
  456. And here's the code that parses that:
  457. ```hare
  458. fn binding(lexer: *lex::lexer) (ast::expr | error) = {
  459. const is_static: bool = try(lexer, ltok::STATIC)? is lex::token;
  460. const is_const = switch (want(lexer, ltok::LET, ltok::CONST)?.0) {
  461. ltok::LET => false,
  462. ltok::CONST => true,
  463. };
  464. let bindings: []ast::binding = [];
  465. for (true) {
  466. const name = want(lexer, ltok::NAME)?.1 as str;
  467. const btype: nullable *ast::_type =
  468. if (try(lexer, ltok::COLON)? is lex::token) {
  469. alloc(_type(lexer)?);
  470. } else null;
  471. want(lexer, ltok::EQUAL)?;
  472. const init = alloc(expression(lexer)?);
  473. append(bindings, ast::binding {
  474. name = name,
  475. _type = btype,
  476. init = init,
  477. });
  478. match (try(lexer, ltok::COMMA)?) {
  479. _: void => break,
  480. _: lex::token => void,
  481. };
  482. };
  483. return ast::binding_expr {
  484. is_static = is_static,
  485. is_const = is_const,
  486. bindings = bindings,
  487. };
  488. };
  489. ```
  490. Hopefully the flow of this code is fairly apparent. The goal is to fill in the
  491. following AST structure:
  492. ```hare
  493. // A single variable biding. For example:
  494. //
  495. // foo: int = bar
  496. export type binding = struct {
  497. name: str,
  498. _type: nullable *_type,
  499. init: *expr,
  500. };
  501. // A variable binding expression. For example:
  502. //
  503. // let foo: int = bar, ...
  504. export type binding_expr = struct {
  505. is_static: bool,
  506. is_const: bool,
  507. bindings: []binding,
  508. };
  509. ```
  510. The rest of the code is pretty similar, though some corners of the grammar are a
  511. bit hairier than others. One example is how we parse infix operators for binary
  512. arithmetic expressions (such as "2 + 2"):
  513. ```hare
  514. fn binarithm(
  515. lexer: *lex::lexer,
  516. lvalue: (ast::expr | void),
  517. i: int,
  518. ) (ast::expr | error) = {
  519. // Precedence climbing parser
  520. // https://en.wikipedia.org/wiki/Operator-precedence_parser
  521. let lvalue = match (lvalue) {
  522. _: void => cast(lexer, void)?,
  523. expr: ast::expr => expr,
  524. };
  525. let tok = lex::lex(lexer)?;
  526. for (let j = precedence(tok); j >= i; j = precedence(tok)) {
  527. const op = binop_for_tok(tok);
  528. let rvalue = cast(lexer, void)?;
  529. tok = lex::lex(lexer)?;
  530. for (let k = precedence(tok); k > j; k = precedence(tok)) {
  531. lex::unlex(lexer, tok);
  532. rvalue = binarithm(lexer, rvalue, k)?;
  533. tok = lex::lex(lexer)?;
  534. };
  535. let expr = ast::binarithm_expr {
  536. op = op,
  537. lvalue = alloc(lvalue),
  538. rvalue = alloc(rvalue),
  539. };
  540. lvalue = expr;
  541. };
  542. lex::unlex(lexer, tok);
  543. return lvalue;
  544. };
  545. fn precedence(tok: lex::token) int = switch (tok.0) {
  546. ltok::LOR => 0,
  547. ltok::LXOR => 1,
  548. ltok::LAND => 2,
  549. ltok::LEQUAL, ltok::NEQUAL => 3,
  550. ltok::LESS, ltok::LESSEQ, ltok::GREATER, ltok::GREATEREQ => 4,
  551. ltok::BOR => 5,
  552. ltok::BXOR => 6,
  553. ltok::BAND => 7,
  554. ltok::LSHIFT, ltok::RSHIFT => 8,
  555. ltok::PLUS, ltok::MINUS => 9,
  556. ltok::TIMES, ltok::DIV, ltok::MODULO => 10,
  557. * => -1,
  558. };
  559. ```
  560. I don't really grok this algorithm, to be honest, but hey, it works. Whenever I
  561. write a precedence climbing parser, I'll stare at the Wikipedia page for 15
  562. minutes, quickly write a parser, and then immediately forget how it works. Maybe
  563. I'll write a blog post about it someday.
  564. Anyway, ultimately, this code lives in our standard library and is used for
  565. several things, including our (early in development) self-hosted compiler.
  566. Here's an example of its usage, taken from our documentation generator:
  567. ```hare
  568. fn scan(path: str) (ast::subunit | error) = {
  569. const input = match (os::open(path)) {
  570. s: *io::stream => s,
  571. err: fs::error => fmt::fatal("Error reading {}: {}",
  572. path, fs::strerror(err)),
  573. };
  574. defer io::close(input);
  575. const lexer = lex::init(input, path, lex::flags::COMMENTS);
  576. return parse::subunit(&lexer)?;
  577. };
  578. ```
  579. Where the "ast::subunit" type is:
  580. ```hare
  581. // A sub-unit, typically representing a single source file.
  582. export type subunit = struct {
  583. imports: []import,
  584. decls: []decl,
  585. };
  586. ```
  587. Pretty straightforward! Having this as part of the standard library should make
  588. it much easier for users to build language-aware tooling with the language
  589. itself. We also plan on having our type checker in the stdlib as well. This is
  590. something that I drew inspiration for from Golang &mdash; having a lot of their
  591. toolchain components in the standard library makes it really easy to write
  592. Go-aware tools.
  593. So, there you have it: the next stage in the development of our language. I hope
  594. you're looking forward to it!