logo

drewdevault.com

[mirror] blog and personal website of Drew DeVault git clone https://hacktivis.me/git/mirror/drewdevault.com.git

Our-self-hosted-parser-design.md (19686B)


  1. ---
  2. title: "Parsers all the way down: writing a self-hosting parser"
  3. date: 2021-04-22
  4. ---
  5. One of the things we're working on in [my new programming language][0] is a
  6. self-hosting compiler. Having a self-hosted compiler is a critical step in the
  7. development of (some) programming languages: it signals that the language is
  8. mature enough to be comfortably used to implement itself. While this isn't right
  9. for some languages (e.g. shell scripts), for a systems programming language like
  10. ours, this is a crucial step in our bootstrapping plan. Our self-hosted parser
  11. design was completed this week, and today I'll share some details about how it
  12. works and how it came to be.
  13. [0]: https://drewdevault.com/2021/03/19/A-new-systems-language.html
  14. This is the third parser which has been implemented for this language. We wrote
  15. a sacrificial compiler prototype upfront to help inform the language design, and
  16. that first compiler used [yacc][yacc] for its parser. Using yacc was helpful at
  17. first because it makes it reasonably simple to iterate on the parser when the
  18. language is still undergoing frequent and far-reaching design changes. Another
  19. nice side-effect starting with a yacc parser is that it makes it quite easy to
  20. produce a formal grammar when you settle on the design. Here's a peek at some of
  21. our original parser code:
  22. [yacc]: https://en.wikipedia.org/wiki/Yacc
  23. ```
  24. struct_type
  25. : T_STRUCT '{' struct_fields '}' {
  26. $$.flags = 0;
  27. $$.storage = TYPE_STRUCT;
  28. allocfrom((void **)&$$.fields, &$3, sizeof($3));
  29. }
  30. | T_UNION '{' struct_fields '}' {
  31. $$.flags = 0;
  32. $$.storage = TYPE_UNION;
  33. allocfrom((void **)&$$.fields, &$3, sizeof($3));
  34. }
  35. ;
  36. struct_fields
  37. : struct_field
  38. | struct_field ',' { $$ = $1; }
  39. | struct_field ',' struct_fields {
  40. $$ = $1;
  41. allocfrom((void **)&$$.next, &$3, sizeof($3));
  42. }
  43. ;
  44. struct_field
  45. : T_IDENT ':' type {
  46. $$.name = $1;
  47. allocfrom((void**)&$$.type, &$3, sizeof($3));
  48. $$.next = NULL;
  49. }
  50. ;
  51. ```
  52. This approach has you writing code which is already almost a formal grammar in
  53. its own right. If we strip out the C code, we get the following:
  54. ```
  55. struct_type
  56. : T_STRUCT '{' struct_fields '}'
  57. | T_UNION '{' struct_fields '}'
  58. ;
  59. struct_fields
  60. : struct_field
  61. | struct_field ','
  62. | struct_field ',' struct_fields
  63. ;
  64. struct_field
  65. : T_IDENT ':' type
  66. ;
  67. ```
  68. This gives us a reasonably clean path to writing a formal grammar (and
  69. specification) for the language, which is what we did next.
  70. ![A screenshot of a PDF file which shows a formal grammar similar to the sample
  71. given above.](https://redacted.moe/f/00bba3bd.png)
  72. All of these samples describe a struct type. The following example shows what
  73. this grammar looks like in real code — starting from the word "struct" and
  74. including up to the "}" at the end.
  75. ```hare
  76. type coordinates = struct {
  77. x: int,
  78. y: int,
  79. z: int,
  80. };
  81. ```
  82. In order to feed our parser tokens to work with, we also need a lexer, or a
  83. *lexical analyzer*. This turns a series of characters like "struct" into a
  84. single token, like the T_STRUCT we used in the yacc code. Like the original
  85. compiler used yacc as a parser generator, we also used [lex][lex] as a lexer
  86. generator. It's simply a list of regexes and the names of the tokens that match
  87. those regexes, plus a little bit of extra code to do things like turning "1234"
  88. into an int with a value of 1234. Our lexer also kept track of line and column
  89. numbers as it consumed characters from input files.
  90. [lex]: https://en.wikipedia.org/wiki/Lex_(software)
  91. ```
  92. "struct" { _lineno(); return T_STRUCT; }
  93. "union" { _lineno(); return T_UNION; }
  94. "{" { _lineno(); return '{'; }
  95. "}" { _lineno(); return '}'; }
  96. [a-zA-Z][a-zA-Z0-9_]* {
  97. _lineno();
  98. yylval.sval = strdup(yytext);
  99. return T_IDENTIFIER;
  100. }
  101. ```
  102. After we settled on the design with our prototype compiler, which was able to
  103. compile some simple test programs to give us a feel for our language design, we
  104. set it aside and wrote the specification, and, alongside it, a second compiler.
  105. This new compiler was written in C — the language was not ready to
  106. self-host yet — and uses a hand-written [recursive descent][rd] parser.
  107. [rd]: https://en.wikipedia.org/wiki/Recursive_descent_parser
  108. To simplify the parser, we deliberately designed a context-free LL(1) grammar,
  109. which means it (a) can parse an input unambiguously without needing additional
  110. context, and (b) only requires one token of look-ahead. This makes our parser
  111. design a lot simpler, which was a deliberate goal of the language design. Our
  112. hand-rolled lexer is *slightly* more complicated: it requires two characters of
  113. lookahead to distinguish between the ".", "..", and "..." tokens.
  114. I'll skip going in depth on the design of the second parser, because the hosted
  115. parser is more interesting, and a pretty similar design anyway. Let's start by
  116. taking a look at our hosted lexer. Our lexer is initialized with an input source
  117. (e.g. a file) from which it can read a stream of characters. Then, each time we
  118. need a token, we'll ask it to read the next one out. It will read as many
  119. characters as it needs to unambiguously identify the next token, then hand it up
  120. to the caller.
  121. Our specification provides some information to guide the lexer design:
  122. > A token is the smallest unit of meaning in the <span style="background:
  123. > black">****</span> grammar. The lexical analysis phase processes a UTF-8
  124. > source file to produce a stream of tokens by matching the terminals with the
  125. > input text.
  126. >
  127. > Tokens may be separated by *white-space* characters, which are defined as the
  128. > Unicode code-points `U+0009` (horizontal tabulation), `U+000A` (line feed), and
  129. > `U+0020` (space). Any number of whitespace characters may be inserted between
  130. > tokens, either to disambiguate from subsequent tokens, or for aesthetic
  131. > purposes. This whitespace is discarded during the lexical analysis phase.
  132. >
  133. > *Within a single token, white-space is meaningful. For example, the
  134. > string-literal token is defined by two quotation marks **"** enclosing any
  135. > number of literal characters. The enclosed characters are considered part of
  136. > the string-literal token and any whitespace therein is not discarded.*
  137. >
  138. > The lexical analysis process consumes Unicode characters from the source file
  139. > input until it is exhausted, performing the following steps in order: it shall
  140. > consume and discard white-space characters until a non-white-space character
  141. > is found, then consume the longest sequence of characters which could
  142. > constitute a token, and emit it to the token stream.
  143. There are a few different kinds of tokens our lexer is going to need to handle:
  144. operators, like "+" and "-"; keywords, like "struct" and "return"; user-defined
  145. identifiers, like variable names; and constants, like string and numeric
  146. literals.
  147. In short, given the following source code:
  148. ```hare
  149. fn add2(x: int, y: int) int = x + y;
  150. ```
  151. We need to return the following sequence of tokens:
  152. ```
  153. fn (keyword)
  154. add2 (identifier)
  155. ( (operator)
  156. x
  157. :
  158. int
  159. ,
  160. y
  161. int
  162. )
  163. int
  164. =
  165. x
  166. +
  167. y
  168. ;
  169. ```
  170. This way, our parser doesn't have to deal with whitespace, or distinguishing
  171. "int" (keyword) from "integer" (identifier), or handling invalid tokens like
  172. "$". To actually implement this behavior, we'll start with an initialization
  173. function which populates a state structure.
  174. ```hare
  175. // Initializes a new lexer for the given input stream. The path is borrowed.
  176. export fn init(in: *io::stream, path: str, flags: flags...) lexer = {
  177. return lexer {
  178. in = in,
  179. path = path,
  180. loc = (1, 1),
  181. un = void,
  182. rb = [void...],
  183. };
  184. };
  185. export type lexer = struct {
  186. in: *io::stream,
  187. path: str,
  188. loc: (uint, uint),
  189. rb: [2](rune | io::EOF | void),
  190. };
  191. ```
  192. This state structure holds, respectively:
  193. - The input I/O stream
  194. - The path to the current input file
  195. - The current (line, column) number
  196. - A buffer of un-read characters from the input, for lookahead
  197. The main entry point for doing the actual lexing will look like this:
  198. ```hare
  199. // Returns the next token from the lexer.
  200. export fn lex(lex: *lexer) (token | error);
  201. // A single lexical token, the value it represents, and its location in a file.
  202. export type token = (ltok, value, location);
  203. // A token value, used for tokens such as '1337' (an integer).
  204. export type value = (str | rune | i64 | u64 | f64 | void);
  205. // A location in a source file.
  206. export type location = struct {
  207. path: str,
  208. line: uint,
  209. col: uint
  210. };
  211. // A lexical token class.
  212. export type ltok = enum uint {
  213. UNDERSCORE,
  214. ABORT,
  215. ALLOC,
  216. APPEND,
  217. AS,
  218. // ... continued ...
  219. EOF,
  220. };
  221. ```
  222. The idea is that when the caller needs another token, they will call `lex`, and
  223. receive either a token or an error. The purpose of our lex function is to read
  224. out the next character and decide what kind of tokens it might be the start of,
  225. and dispatch to more specific lexing functions to handle each case.
  226. ```hare
  227. export fn lex(lex: *lexer) (token | error) = {
  228. let loc = location { ... };
  229. let rn: rune = match (nextw(lex)?) {
  230. _: io::EOF => return (ltok::EOF, void, mkloc(lex)),
  231. rl: (rune, location) => {
  232. loc = rl.1;
  233. rl.0;
  234. },
  235. };
  236. if (is_name(rn, false)) {
  237. unget(lex, rn);
  238. return lex_name(lex, loc, true);
  239. };
  240. if (ascii::isdigit(rn)) {
  241. unget(lex, rn);
  242. return lex_literal(lex, loc);
  243. };
  244. let tok: ltok = switch (rn) {
  245. * => return syntaxerr(loc, "invalid character"),
  246. '"', '\'' => {
  247. unget(lex, rn);
  248. return lex_rn_str(lex, loc);
  249. },
  250. '.', '<', '>' => return lex3(lex, loc, rn),
  251. '^', '*', '%', '/', '+', '-', ':', '!', '&', '|', '=' => {
  252. return lex2(lex, loc, rn);
  253. },
  254. '~' => ltok::BNOT,
  255. ',' => ltok::COMMA,
  256. '{' => ltok::LBRACE,
  257. '[' => ltok::LBRACKET,
  258. '(' => ltok::LPAREN,
  259. '}' => ltok::RBRACE,
  260. ']' => ltok::RBRACKET,
  261. ')' => ltok::RPAREN,
  262. ';' => ltok::SEMICOLON,
  263. '?' => ltok::QUESTION,
  264. };
  265. return (tok, void, loc);
  266. };
  267. ```
  268. Aside from the EOF case, and simple single-character operators like ";", both of
  269. which this function handles itself, its role is to dispatch work to various
  270. sub-lexers.
  271. <details>
  272. <summary>Expand me to read the helper functions</summary>
  273. <pre>
  274. fn nextw(lex: *lexer) ((rune, location) | io::EOF | io::error) = {
  275. for (true) {
  276. let loc = mkloc(lex);
  277. match (next(lex)) {
  278. e: (io::error | io::EOF) => return e,
  279. r: rune => if (!ascii::isspace(r)) {
  280. return (r, loc);
  281. } else {
  282. free(lex.comment);
  283. lex.comment = "";
  284. },
  285. };
  286. };
  287. abort();
  288. };
  289. </pre>
  290. <pre>
  291. fn unget(lex: *lexer, r: (rune | io::EOF)) void = {
  292. if (!(lex.rb[0] is void)) {
  293. assert(lex.rb[1] is void, "ungot too many runes");
  294. lex.rb[1] = lex.rb[0];
  295. };
  296. lex.rb[0] = r;
  297. };
  298. </pre>
  299. <pre>
  300. fn is_name(r: rune, num: bool) bool =
  301. ascii::isalpha(r) || r == '_' || r == '@' || (num && ascii::isdigit(r));
  302. </pre>
  303. </details>
  304. The sub-lexers handle more specific cases. The lex_name function handles things
  305. which look like identifiers, including keywords; the lex_literal function
  306. handles things which look like literals (e.g. "1234"); lex_rn_str handles rune
  307. and string literals (e.g. "hello world" and '\n'); and lex2 and lex3
  308. respectively handle two- and three-character operators like "&&" and "\>\>=".
  309. lex_name is the most complicated of these. Because the only thing which
  310. distinguishes a keyword from an identifier is that the former matches a specific
  311. list of strings, we start by reading a "name" into a buffer, then binary
  312. searching against a list of known keywords to see if it matches something there.
  313. To facilitate this, "bmap" is a pre-sorted array of keyword names.
  314. ```hare
  315. const bmap: [_]str = [
  316. // Keep me alpha-sorted and consistent with the ltok enum.
  317. "_",
  318. "abort",
  319. "alloc",
  320. "append",
  321. "as",
  322. "assert",
  323. "bool",
  324. // ...
  325. ];
  326. fn lex_name(lex: *lexer, loc: location, keyword: bool) (token | error) = {
  327. let buf = strio::dynamic();
  328. match (next(lex)) {
  329. r: rune => {
  330. assert(is_name(r, false));
  331. strio::appendrune(buf, r);
  332. },
  333. _: (io::EOF | io::error) => abort(), // Invariant
  334. };
  335. for (true) match (next(lex)?) {
  336. _: io::EOF => break,
  337. r: rune => {
  338. if (!is_name(r, true)) {
  339. unget(lex, r);
  340. break;
  341. };
  342. strio::appendrune(buf, r);
  343. },
  344. };
  345. let name = strio::finish(buf);
  346. if (!keyword) {
  347. return (ltok::NAME, name, loc);
  348. };
  349. return match (sort::search(bmap[..ltok::LAST_KEYWORD+1],
  350. size(str), &name, &namecmp)) {
  351. null => (ltok::NAME, name, loc),
  352. v: *void => {
  353. defer free(name);
  354. let tok = v: uintptr - &bmap[0]: uintptr;
  355. tok /= size(str): uintptr;
  356. (tok: ltok, void, loc);
  357. },
  358. };
  359. };
  360. ```
  361. The rest of the code is more of the same, but I've [put it up here][lex.ha] if
  362. you want to read it.
  363. Let's move on to parsing: we need to turn this one dimensional stream of tokens
  364. into an structured form: the *Abstract Syntax Tree*. Consider the following
  365. sample code:
  366. [lex.ha]: https://paste.sr.ht/~sircmpwn/25871787b0d41db2b0af573ba1c93e1b6438b942
  367. ```hare
  368. let x: int = add2(40, 2);
  369. ```
  370. Our token stream looks like this:
  371. ```
  372. let x : int = add2 ( 40 , 2 ) ;
  373. ```
  374. But what we need is something more structured, like this:
  375. ```
  376. binding
  377. name="x"
  378. type="int"
  379. initializer=call-expression
  380. => func="add2"
  381. parameters
  382. constant value="40"
  383. constant value="2"
  384. ```
  385. We know at each step what kinds of tokens are valid in each situation. After we
  386. see "let", we know that we're parsing a binding, so we look for a name ("x")
  387. and a colon token, a type for the variable, an equals sign, and an expression
  388. which initializes it. To parse the initializer, we see an identifier, "add2",
  389. then an open parenthesis, so we know we're in a call expression, and we can
  390. start parsing arguments.
  391. To make our parser code expressive, and to handle errors neatly, we're going to
  392. implement a few helper function that lets us describe these states in terms of
  393. what the parser wants from the lexer. We have a few functions to accomplish
  394. this:
  395. ```hare
  396. // Requires the next token to have a matching ltok. Returns that token, or an error.
  397. fn want(lexer: *lex::lexer, want: lex::ltok...) (lex::token | error) = {
  398. let tok = lex::lex(lexer)?;
  399. if (len(want) == 0) {
  400. return tok;
  401. };
  402. for (let i = 0z; i < len(want); i += 1) {
  403. if (tok.0 == want[i]) {
  404. return tok;
  405. };
  406. };
  407. let buf = strio::dynamic();
  408. defer io::close(buf);
  409. for (let i = 0z; i < len(want); i += 1) {
  410. fmt::fprintf(buf, "'{}'", lex::tokstr((want[i], void, mkloc(lexer))));
  411. if (i + 1 < len(want)) {
  412. fmt::fprint(buf, ", ");
  413. };
  414. };
  415. return syntaxerr(mkloc(lexer), "Unexpected '{}', was expecting {}",
  416. lex::tokstr(tok), strio::string(buf));
  417. };
  418. // Looks for a matching ltok from the lexer, and if not present, unlexes the
  419. // token and returns void. If found, the token is consumed from the lexer and is
  420. // returned.
  421. fn try(
  422. lexer: *lex::lexer,
  423. want: lex::ltok...
  424. ) (lex::token | error | void) = {
  425. let tok = lex::lex(lexer)?;
  426. assert(len(want) > 0);
  427. for (let i = 0z; i < len(want); i += 1) {
  428. if (tok.0 == want[i]) {
  429. return tok;
  430. };
  431. };
  432. lex::unlex(lexer, tok);
  433. };
  434. // Looks for a matching ltok from the lexer, unlexes the token, and returns
  435. // it; or void if it was not a ltok.
  436. fn peek(
  437. lexer: *lex::lexer,
  438. want: lex::ltok...
  439. ) (lex::token | error | void) = {
  440. let tok = lex::lex(lexer)?;
  441. lex::unlex(lexer, tok);
  442. if (len(want) == 0) {
  443. return tok;
  444. };
  445. for (let i = 0z; i < len(want); i += 1) {
  446. if (tok.0 == want[i]) {
  447. return tok;
  448. };
  449. };
  450. };
  451. ```
  452. Let's say we're looking for a binding like our sample code to show up next. The
  453. grammar from the spec is as follows:
  454. ![](https://redacted.moe/f/f0e1d07d.png)
  455. And here's the code that parses that:
  456. ```hare
  457. fn binding(lexer: *lex::lexer) (ast::expr | error) = {
  458. const is_static: bool = try(lexer, ltok::STATIC)? is lex::token;
  459. const is_const = switch (want(lexer, ltok::LET, ltok::CONST)?.0) {
  460. ltok::LET => false,
  461. ltok::CONST => true,
  462. };
  463. let bindings: []ast::binding = [];
  464. for (true) {
  465. const name = want(lexer, ltok::NAME)?.1 as str;
  466. const btype: nullable *ast::_type =
  467. if (try(lexer, ltok::COLON)? is lex::token) {
  468. alloc(_type(lexer)?);
  469. } else null;
  470. want(lexer, ltok::EQUAL)?;
  471. const init = alloc(expression(lexer)?);
  472. append(bindings, ast::binding {
  473. name = name,
  474. _type = btype,
  475. init = init,
  476. });
  477. match (try(lexer, ltok::COMMA)?) {
  478. _: void => break,
  479. _: lex::token => void,
  480. };
  481. };
  482. return ast::binding_expr {
  483. is_static = is_static,
  484. is_const = is_const,
  485. bindings = bindings,
  486. };
  487. };
  488. ```
  489. Hopefully the flow of this code is fairly apparent. The goal is to fill in the
  490. following AST structure:
  491. ```hare
  492. // A single variable biding. For example:
  493. //
  494. // foo: int = bar
  495. export type binding = struct {
  496. name: str,
  497. _type: nullable *_type,
  498. init: *expr,
  499. };
  500. // A variable binding expression. For example:
  501. //
  502. // let foo: int = bar, ...
  503. export type binding_expr = struct {
  504. is_static: bool,
  505. is_const: bool,
  506. bindings: []binding,
  507. };
  508. ```
  509. The rest of the code is pretty similar, though some corners of the grammar are a
  510. bit hairier than others. One example is how we parse infix operators for binary
  511. arithmetic expressions (such as "2 + 2"):
  512. ```hare
  513. fn binarithm(
  514. lexer: *lex::lexer,
  515. lvalue: (ast::expr | void),
  516. i: int,
  517. ) (ast::expr | error) = {
  518. // Precedence climbing parser
  519. // https://en.wikipedia.org/wiki/Operator-precedence_parser
  520. let lvalue = match (lvalue) {
  521. _: void => cast(lexer, void)?,
  522. expr: ast::expr => expr,
  523. };
  524. let tok = lex::lex(lexer)?;
  525. for (let j = precedence(tok); j >= i; j = precedence(tok)) {
  526. const op = binop_for_tok(tok);
  527. let rvalue = cast(lexer, void)?;
  528. tok = lex::lex(lexer)?;
  529. for (let k = precedence(tok); k > j; k = precedence(tok)) {
  530. lex::unlex(lexer, tok);
  531. rvalue = binarithm(lexer, rvalue, k)?;
  532. tok = lex::lex(lexer)?;
  533. };
  534. let expr = ast::binarithm_expr {
  535. op = op,
  536. lvalue = alloc(lvalue),
  537. rvalue = alloc(rvalue),
  538. };
  539. lvalue = expr;
  540. };
  541. lex::unlex(lexer, tok);
  542. return lvalue;
  543. };
  544. fn precedence(tok: lex::token) int = switch (tok.0) {
  545. ltok::LOR => 0,
  546. ltok::LXOR => 1,
  547. ltok::LAND => 2,
  548. ltok::LEQUAL, ltok::NEQUAL => 3,
  549. ltok::LESS, ltok::LESSEQ, ltok::GREATER, ltok::GREATEREQ => 4,
  550. ltok::BOR => 5,
  551. ltok::BXOR => 6,
  552. ltok::BAND => 7,
  553. ltok::LSHIFT, ltok::RSHIFT => 8,
  554. ltok::PLUS, ltok::MINUS => 9,
  555. ltok::TIMES, ltok::DIV, ltok::MODULO => 10,
  556. * => -1,
  557. };
  558. ```
  559. I don't really grok this algorithm, to be honest, but hey, it works. Whenever I
  560. write a precedence climbing parser, I'll stare at the Wikipedia page for 15
  561. minutes, quickly write a parser, and then immediately forget how it works. Maybe
  562. I'll write a blog post about it someday.
  563. Anyway, ultimately, this code lives in our standard library and is used for
  564. several things, including our (early in development) self-hosted compiler.
  565. Here's an example of its usage, taken from our documentation generator:
  566. ```hare
  567. fn scan(path: str) (ast::subunit | error) = {
  568. const input = match (os::open(path)) {
  569. s: *io::stream => s,
  570. err: fs::error => fmt::fatal("Error reading {}: {}",
  571. path, fs::strerror(err)),
  572. };
  573. defer io::close(input);
  574. const lexer = lex::init(input, path, lex::flags::COMMENTS);
  575. return parse::subunit(&lexer)?;
  576. };
  577. ```
  578. Where the "ast::subunit" type is:
  579. ```hare
  580. // A sub-unit, typically representing a single source file.
  581. export type subunit = struct {
  582. imports: []import,
  583. decls: []decl,
  584. };
  585. ```
  586. Pretty straightforward! Having this as part of the standard library should make
  587. it much easier for users to build language-aware tooling with the language
  588. itself. We also plan on having our type checker in the stdlib as well. This is
  589. something that I drew inspiration for from Golang &mdash; having a lot of their
  590. toolchain components in the standard library makes it really easy to write
  591. Go-aware tools.
  592. So, there you have it: the next stage in the development of our language. I hope
  593. you're looking forward to it!