oasis-root

Compiled tree of Oasis Linux, based on my own branch at <https://hacktivis.me/git/oasis/>.

git clone https://anongit.hacktivis.me/git/oasis-root.git

lexer.lua (92478B)


  1. -- Copyright 2006-2024 Mitchell. See LICENSE.
  2. --- Lexes Scintilla documents and source code with Lua and LPeg.
  3. --
  4. -- ### Writing Lua Lexers
  5. --
  6. -- Lexers recognize and tag elements of source code for syntax highlighting. Scintilla (the
  7. -- editing component behind [Textadept][] and [SciTE][]) traditionally uses static, compiled C++
  8. -- lexers which are notoriously difficult to create and/or extend. On the other hand, Lua makes
  9. -- it easy to rapidly create new lexers, extend existing ones, and embed lexers within one
  10. -- another. Lua lexers tend to be more readable than C++ lexers too.
  11. --
  12. -- While lexers can be written in plain Lua, Scintillua prefers using Parsing Expression
  13. -- Grammars, or PEGs, composed with the Lua [LPeg library][]. As a result, this document is
  14. -- devoted to writing LPeg lexers. The following table comes from the LPeg documentation and
  15. -- summarizes all you need to know about constructing basic LPeg patterns. This module provides
  16. -- convenience functions for creating and working with other more advanced patterns and concepts.
  17. --
  18. -- Operator | Description
  19. -- -|-
  20. -- `lpeg.P(string)` | Matches `string` literally.
  21. -- `lpeg.P(`_`n`_`)` | Matches exactly _`n`_ characters.
  22. -- `lpeg.S(string)` | Matches any character in set `string`.
  23. -- `lpeg.R("`_`xy`_`")` | Matches any character in the range `x` to `y`.
  24. -- `patt^`_`n`_ | Matches at least _`n`_ repetitions of `patt`.
  25. -- `patt^-`_`n`_ | Matches at most _`n`_ repetitions of `patt`.
  26. -- `patt1 * patt2` | Matches `patt1` followed by `patt2`.
  27. -- `patt1 + patt2` | Matches `patt1` or `patt2` (ordered choice).
  28. -- `patt1 - patt2` | Matches `patt1` if `patt2` does not also match.
  29. -- `-patt` | Matches if `patt` does not match, consuming no input.
  30. -- `#patt` | Matches `patt` but consumes no input.
  31. --
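-- As a brief illustration (a sketch, not from the LPeg docs), these operators compose into
-- larger patterns. The following is roughly equivalent to the predefined `lexer.word`:
--
-- local alpha = lpeg.R('AZ', 'az')
-- local word = (alpha + '_') * (alpha + lpeg.R('09') + '_')^0
--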
  32. -- The first part of this document deals with rapidly constructing a simple lexer. The next part
  33. -- deals with more advanced techniques, such as embedding lexers within one another. Following
  34. -- that is a discussion about code folding, or being able to tell Scintilla which code blocks
  35. -- are "foldable" (temporarily hideable from view). After that are instructions on how to use
  36. -- Lua lexers with the aforementioned Textadept and SciTE editors. Finally there are comments
  37. -- on lexer performance and limitations.
  38. --
  39. -- [LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
  40. -- [Textadept]: https://orbitalquark.github.io/textadept
  41. -- [SciTE]: https://scintilla.org/SciTE.html
  42. --
  43. -- ### Lexer Basics
  44. --
  45. -- The *lexers/* directory contains all of Scintillua's Lua lexers, including any new ones you
  46. -- write. Before attempting to write one from scratch though, first determine if your programming
  47. -- language is similar to any of the 100+ languages supported. If so, you may be able to copy
  48. -- and modify that lexer, or inherit from it, saving some time and effort. The filename of your
  49. -- lexer should be the name of your programming language in lower case followed by a *.lua*
  50. -- extension. For example, a new Lua lexer has the name *lua.lua*.
  51. --
  52. -- Note: Try to refrain from using one-character language names like "c", "d", or "r". For
  53. -- example, Scintillua uses "ansi_c", "dmd", and "rstats", respectively.
  54. --
  55. -- #### New Lexer Template
  56. --
  57. -- There is a *lexers/template.txt* file that contains a simple template for a new lexer. Feel
  58. -- free to use it, replacing the '?' with the name of your lexer. Consider this snippet from
  59. -- the template:
  60. --
  61. -- -- ? LPeg lexer.
  62. --
  63. -- local lexer = lexer
  64. -- local P, S = lpeg.P, lpeg.S
  65. --
  66. -- local lex = lexer.new(...)
  67. --
  68. -- [... lexer rules ...]
  69. --
  70. -- -- Identifier.
  71. -- local identifier = lex:tag(lexer.IDENTIFIER, lexer.word)
  72. -- lex:add_rule('identifier', identifier)
  73. --
  74. -- [... more lexer rules ...]
  75. --
  76. -- return lex
  77. --
  78. -- The first line of code is a Lua convention to store a global variable into a local variable
  79. -- for quick access. The second line simply defines often used convenience variables. The third
  80. -- and last lines [define](#lexer.new) and return the lexer object Scintillua uses; they are
  81. -- very important and must be part of every lexer. Note the `...` passed to `lexer.new()` is
  82. -- literal: the lexer will assume the name of its filename or an alternative name specified
  83. -- by `lexer.load()` in embedded lexer applications. The fourth line uses something called a
  84. -- "tag", an essential component of lexers. You will learn about tags shortly. The fifth line
  85. -- defines a lexer grammar rule, which you will learn about later. (Be aware that it is common
  86. -- practice to combine these two lines for short rules.) Note, however, the `local` prefix in
  87. -- front of variables, which is needed so as not to affect Lua's global environment. All in all,
  88. -- this is a minimal, working lexer that you can build on.
  89. --
  90. -- #### Tags
  91. --
  92. -- Take a moment to think about your programming language's structure. What kind of key elements
  93. -- does it have? Most languages have elements like keywords, strings, and comments. The
  94. -- lexer's job is to break down source code into these elements and "tag" them for syntax
  95. -- highlighting. Therefore, tags are an essential component of lexers. It is up to you how
  96. -- specific your lexer is when it comes to tagging elements. Perhaps only distinguishing between
  97. -- keywords and identifiers is necessary, or maybe recognizing constants and built-in functions,
  98. -- methods, or libraries is desirable. The Lua lexer, for example, tags the following elements:
  99. -- keywords, functions, constants, identifiers, strings, comments, numbers, labels, attributes,
  100. -- and operators. Even though functions and constants are subsets of identifiers, Lua programmers
  101. -- find it helpful for the lexer to distinguish between them all. It is perfectly acceptable
  102. -- to just recognize keywords and identifiers.
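--
-- For instance, a bare-bones lexer in that spirit might contain only two tagged rules, using
-- functions introduced below (the keyword list here is a placeholder):
--
-- local lex = lexer.new(...)
-- lex:add_rule('keyword', lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD)))
-- lex:add_rule('identifier', lex:tag(lexer.IDENTIFIER, lexer.word))
-- lex:set_word_list(lexer.KEYWORD, {'if', 'else', 'while'})
-- return lex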
  103. --
  104. -- In a lexer, LPeg patterns that match particular sequences of characters are tagged with a
  105. -- tag name using the `lexer.tag()` function. Let us examine the "identifier" tag used in
  106. -- the template shown earlier:
  107. --
  108. -- local identifier = lex:tag(lexer.IDENTIFIER, lexer.word)
  109. --
  110. -- At first glance, the first argument does not appear to be a string name and the second
  111. -- argument does not appear to be an LPeg pattern. Perhaps you expected something like:
  112. --
  113. -- lex:tag('identifier', (lpeg.R('AZ', 'az') + '_') * (lpeg.R('AZ', 'az', '09') + '_')^0)
  114. --
  115. -- The `lexer` module actually provides a convenient list of common tag names and common LPeg
  116. -- patterns for you to use. Tag names for programming languages include (but are not limited
  117. -- to) `lexer.DEFAULT`, `lexer.COMMENT`, `lexer.STRING`, `lexer.NUMBER`, `lexer.KEYWORD`,
  118. -- `lexer.IDENTIFIER`, `lexer.OPERATOR`, `lexer.ERROR`, `lexer.PREPROCESSOR`, `lexer.CONSTANT`,
  119. -- `lexer.CONSTANT_BUILTIN`, `lexer.VARIABLE`, `lexer.VARIABLE_BUILTIN`, `lexer.FUNCTION`,
  120. -- `lexer.FUNCTION_BUILTIN`, `lexer.FUNCTION_METHOD`, `lexer.CLASS`, `lexer.TYPE`, `lexer.LABEL`,
  121. -- `lexer.REGEX`, `lexer.EMBEDDED`, and `lexer.ANNOTATION`. Tag names for markup languages include
  122. -- (but are not limited to) `lexer.TAG`, `lexer.ATTRIBUTE`, `lexer.HEADING`, `lexer.BOLD`,
  123. -- `lexer.ITALIC`, `lexer.UNDERLINE`, `lexer.CODE`, `lexer.LINK`, `lexer.REFERENCE`, and
  124. -- `lexer.LIST`. Patterns include `lexer.any`, `lexer.alpha`, `lexer.digit`, `lexer.alnum`,
  125. -- `lexer.lower`, `lexer.upper`, `lexer.xdigit`, `lexer.graph`, `lexer.punct`, `lexer.space`,
  126. -- `lexer.newline`, `lexer.nonnewline`, `lexer.dec_num`, `lexer.hex_num`, `lexer.oct_num`,
  127. -- `lexer.bin_num`, `lexer.integer`, `lexer.float`, `lexer.number`, and `lexer.word`. You may
  128. -- use your own tag names if none of the above fit your language, but an advantage to using
  129. -- predefined tag names is that the language elements your lexer recognizes will inherit any
  130. -- universal syntax highlighting color theme that your editor uses. You can also "subclass"
  131. -- existing tag names by appending a '.*subclass*' string to them. For example, the HTML lexer
  132. -- tags unknown tags as `lexer.TAG .. '.unknown'`. This gives editors the opportunity to style
  133. -- those subclassed tags in a different way than normal tags, or fall back to styling them as
  134. -- normal tags.
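--
-- A hypothetical sketch of creating such a subclassed tag, in the spirit of the HTML example
-- above:
--
-- local unknown_tag = lex:tag(lexer.TAG .. '.unknown', lexer.word)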
  135. --
  136. -- ##### Example Tags
  137. --
  138. -- So, how might you recognize and tag elements like keywords, comments, and strings? Here are
  139. -- some examples.
  140. --
  141. -- **Keywords**
  142. --
  143. -- Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered choices, use one
  144. -- of the following methods:
  145. --
  146. -- 1. Use the convenience function `lexer.word_match()` optionally coupled with
  147. -- `lexer.set_word_list()`. It is much easier and more efficient to write word matches like:
  148. --
  149. -- local keyword = lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD))
  150. -- [...]
  151. -- lex:set_word_list(lexer.KEYWORD, {
  152. --   'keyword_1', 'keyword_2', ..., 'keyword_n'
  153. -- })
  154. --
  155. -- local case_insensitive_word = lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD, true))
  156. -- [...]
  157. -- lex:set_word_list(lexer.KEYWORD, {
  158. --   'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
  159. -- })
  160. --
  161. -- local hyphenated_keyword = lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD))
  162. -- [...]
  163. -- lex:set_word_list(lexer.KEYWORD, {
  164. --   'keyword-1', 'keyword-2', ..., 'keyword-n'
  165. -- })
  166. --
  167. -- The benefit of using this method is that other lexers that inherit from, embed, or embed
  168. -- themselves into your lexer can set, replace, or extend these word lists. For example,
  169. -- the TypeScript lexer inherits from JavaScript, but extends JavaScript's keyword and type
  170. -- lists with more options.
  171. --
  172. -- This method also allows applications that use your lexer to extend or replace your word
  173. -- lists. For example, the Lua lexer includes keywords and functions for the latest version
  174. -- of Lua (5.4 at the time of writing). However, editors using that lexer might want to use
  175. -- keywords from Lua version 5.1, which is still quite popular.
  176. --
  177. -- Note that calling `lex:set_word_list()` is completely optional. Your lexer is allowed to
  178. -- expect the editor using it to supply word lists. Scintilla-based editors can do so via
  179. -- Scintilla's `ILexer5` interface.
  180. --
  181. -- 2. Use the lexer-agnostic form of `lexer.word_match()`:
  182. --
  183. -- local keyword = lex:tag(lexer.KEYWORD, lexer.word_match{
  184. --   'keyword_1', 'keyword_2', ..., 'keyword_n'
  185. -- })
  186. --
  187. -- local case_insensitive_keyword = lex:tag(lexer.KEYWORD, lexer.word_match({
  188. --   'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
  189. -- }, true))
  190. --
  191. -- local hyphenated_keyword = lex:tag(lexer.KEYWORD, lexer.word_match{
  192. --   'keyword-1', 'keyword-2', ..., 'keyword-n'
  193. -- })
  194. --
  195. -- For short keyword lists, you can use a single string of words. For example:
  196. --
  197. -- local keyword = lex:tag(lexer.KEYWORD, lexer.word_match('key_1 key_2 ... key_n'))
  198. --
  199. -- You can use this method for static word lists that do not change, or where it does not
  200. -- make sense to allow applications or other lexers to extend or replace a word list.
  201. --
  202. -- **Comments**
  203. --
  204. -- Line-style comments beginning with one or more prefix characters are easy to express:
  205. --
  206. -- local shell_comment = lex:tag(lexer.COMMENT, lexer.to_eol('#'))
  207. -- local c_line_comment = lex:tag(lexer.COMMENT, lexer.to_eol('//', true))
  208. --
  209. -- The comments above start with a '#' or "//" and go to the end of the line (EOL). The second
  210. -- comment recognizes the next line also as a comment if the current line ends with a '\'
  211. -- escape character.
  212. --
  213. -- C-style "block" comments with a start and end delimiter are also easy to express:
  214. --
  215. -- local c_comment = lex:tag(lexer.COMMENT, lexer.range('/*', '*/'))
  216. --
  217. -- This comment starts with a "/\*" sequence and contains anything up to and including an ending
  218. -- "\*/" sequence. The ending "\*/" is optional so the lexer can recognize unfinished comments
  219. -- as comments and highlight them properly.
  220. --
  221. -- **Strings**
  222. --
  223. -- Most programming languages allow escape sequences in strings such that a sequence like
  224. -- "\\&quot;" in a double-quoted string indicates that the '&quot;' is not the end of the
  225. -- string. `lexer.range()` handles escapes inherently.
  226. --
  227. -- local dq_str = lexer.range('"')
  228. -- local sq_str = lexer.range("'")
  229. -- local string = lex:tag(lexer.STRING, dq_str + sq_str)
  230. --
  231. -- In this case, the lexer treats '\' as an escape character in a string sequence.
  232. --
  233. -- **Numbers**
  234. --
  235. -- Most programming languages have the same format for integers and floats, so it might be as
  236. -- simple as using a predefined LPeg pattern:
  237. --
  238. -- local number = lex:tag(lexer.NUMBER, lexer.number)
  239. --
  240. -- However, some languages allow postfix characters on integers.
  241. --
  242. -- local integer = P('-')^-1 * (lexer.dec_num * S('lL')^-1)
  243. -- local number = lex:tag(lexer.NUMBER, lexer.float + lexer.hex_num + integer)
  244. --
  245. -- Other languages allow separators within numbers for better readability.
  246. --
  247. -- local number = lex:tag(lexer.NUMBER, lexer.number_('_')) -- recognize 1_000_000
  248. --
  249. -- Your language may need other tweaks, but it is up to you how fine-grained you want your
  250. -- highlighting to be. After all, you are not writing a compiler or interpreter!
  251. --
  252. -- #### Rules
  253. --
  254. -- Programming languages have grammars, which specify valid syntactic structure. For example,
  255. -- comments usually cannot appear within a string, and valid identifiers (like variable names)
  256. -- cannot be keywords. In Lua lexers, grammars consist of LPeg pattern rules, many of which
  257. -- are tagged. Recall from the lexer template the `lexer.add_rule()` call, which adds a rule
  258. -- to the lexer's grammar:
  259. --
  260. -- lex:add_rule('identifier', identifier)
  261. --
  262. -- Each rule has an associated name, but rule names are completely arbitrary and serve only to
  263. -- identify and distinguish between different rules. Rule order is important: if text does not
  264. -- match the first rule added to the grammar, the lexer tries to match the second rule added, and
  265. -- so on. Right now this lexer simply matches identifiers under a rule named "identifier".
  266. --
  267. -- To illustrate the importance of rule order, here is an example of a simplified Lua lexer:
  268. --
  269. -- lex:add_rule('keyword', lex:tag(lexer.KEYWORD, ...))
  270. -- lex:add_rule('identifier', lex:tag(lexer.IDENTIFIER, ...))
  271. -- lex:add_rule('string', lex:tag(lexer.STRING, ...))
  272. -- lex:add_rule('comment', lex:tag(lexer.COMMENT, ...))
  273. -- lex:add_rule('number', lex:tag(lexer.NUMBER, ...))
  274. -- lex:add_rule('label', lex:tag(lexer.LABEL, ...))
  275. -- lex:add_rule('operator', lex:tag(lexer.OPERATOR, ...))
  276. --
  277. -- Notice how identifiers come _after_ keywords. In Lua, as with most programming languages,
  278. -- the characters allowed in keywords and identifiers are in the same set (alphanumerics plus
  279. -- underscores). If the lexer added the "identifier" rule before the "keyword" rule, all keywords
  280. -- would match identifiers and thus would be incorrectly tagged (and likewise incorrectly
  281. -- highlighted) as identifiers instead of keywords. The same idea applies to function names,
  282. -- constants, etc. that you may want to distinguish between: their rules should come before
  283. -- identifiers.
  284. --
  285. -- So what about text that does not match any rules? For example, in Lua the '!' character is
  286. -- meaningless outside a string or comment. Normally the lexer skips over such text. If instead
  287. -- you want to highlight these "syntax errors", add a final rule:
  288. --
  289. -- lex:add_rule('keyword', keyword)
  290. -- ...
  291. -- lex:add_rule('error', lex:tag(lexer.ERROR, lexer.any))
  292. --
  293. -- This identifies and tags any character not matched by an existing rule as a `lexer.ERROR`.
  294. --
  295. -- Even though the rules defined in the examples above contain a single tagged pattern, rules may
  296. -- consist of multiple tagged patterns. For example, the rule for an HTML tag could consist of a
  297. -- tagged tag followed by an arbitrary number of tagged attributes, separated by whitespace. This
  298. -- allows the lexer to produce all tags separately, but in a single, convenient rule. That rule
  299. -- might look something like this:
  300. --
  301. -- local ws = lex:get_rule('whitespace') -- predefined rule for all lexers
  302. -- lex:add_rule('tag', tag_start * (ws * attributes)^0 * tag_end^-1)
  303. --
  304. -- Note however that lexers with complex rules like these are more prone to lose track of their
  305. -- state, especially if they span multiple lines.
  306. --
  307. -- #### Summary
  308. --
  309. -- Lexers primarily consist of tagged patterns and grammar rules. These patterns match language
  310. -- elements like keywords, comments, and strings, and rules dictate the order in which patterns
  311. -- are matched. At your disposal are a number of convenience patterns and functions for rapidly
  312. -- creating a lexer. If you choose to use predefined tag names (or perhaps even subclassed
  313. -- names) for your patterns, you do not have to update your editor's theme to specify how to
  314. -- syntax-highlight those patterns. Your language's elements will inherit the default syntax
  315. -- highlighting color theme your editor uses.
  316. --
  317. -- ### Advanced Techniques
  318. --
  319. -- #### Line Lexers
  320. --
  321. -- By default, lexers match the arbitrary chunks of text passed to them by Scintilla. These
  322. -- chunks may be a full document, only the visible part of a document, or even just portions
  323. -- of lines. Some lexers need to match whole lines. For example, a lexer for the output of a
  324. -- file "diff" needs to know if the line started with a '+' or '-' and then style the entire
  325. -- line accordingly. To indicate that your lexer matches by line, create the lexer with an
  326. -- extra parameter:
  327. --
  328. -- local lex = lexer.new(..., {lex_by_line = true})
  329. --
  330. -- Now the input text for the lexer is a single line at a time. Keep in mind that line lexers
  331. -- do not have the ability to look ahead to subsequent lines.
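--
-- As a sketch, a minimal diff-like line lexer might tag whole lines by their first character
-- (the 'addition' and 'deletion' tag names are illustrative, not predefined):
--
-- local lex = lexer.new(..., {lex_by_line = true})
-- lex:add_rule('addition', lex:tag('addition', lexer.to_eol('+')))
-- lex:add_rule('deletion', lex:tag('deletion', lexer.to_eol('-')))
-- return lex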
  332. --
  333. -- #### Embedded Lexers
  334. --
  335. -- Scintillua lexers embed within one another very easily, requiring minimal effort. In the
  336. -- following sections, the lexer being embedded is called the "child" lexer and the lexer a child
  337. -- is being embedded in is called the "parent". For example, consider an HTML lexer and a CSS
  338. -- lexer. Each lexer stands alone, styling its respective file type. However, CSS
  339. -- can be embedded inside HTML. In this specific case, the CSS lexer is the "child" lexer with
  340. -- the HTML lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This sounds
  341. -- a lot like the case with CSS, but there is a subtle difference: PHP _embeds itself into_
  342. -- HTML while CSS is _embedded in_ HTML. This fundamental difference results in two types of
  343. -- embedded lexers: a parent lexer that embeds other child lexers in it (like HTML embedding CSS),
  344. -- and a child lexer that embeds itself into a parent lexer (like PHP embedding itself in HTML).
  345. --
  346. -- ##### Parent Lexer
  347. --
  348. -- Before embedding a child lexer into a parent lexer, the parent lexer needs to load the child
  349. -- lexer. This is done with the `lexer.load()` function. For example, loading the CSS lexer
  350. -- within the HTML lexer looks like:
  351. --
  352. -- local css = lexer.load('css')
  353. --
  354. -- The next part of the embedding process is telling the parent lexer when to switch over
  355. -- to the child lexer and when to switch back. The lexer refers to these indications as the
  356. -- "start rule" and "end rule", respectively, and are just LPeg patterns. Continuing with the
  357. -- HTML/CSS example, the transition from HTML to CSS is when the lexer encounters a "style"
  358. -- tag with a "type" attribute whose value is "text/css":
  359. --
  360. -- local css_tag = P('<style') * P(function(input, index)
  361. --   if input:find('^[^>]+type="text/css"', index) then return true end
  362. -- end)
  363. --
  364. -- This pattern looks for the beginning of a "style" tag and searches its attribute list for
  365. -- the text "`type="text/css"`". (In this simplified example, the Lua pattern does not allow
  366. -- whitespace around the '=', nor does it accept single-quoted attribute values.) If there
  367. -- is a match, the functional pattern returns `true`. However, we ultimately want to style the
  368. -- "style" tag as an HTML tag, so the actual start rule looks like this:
  369. --
  370. -- local css_start_rule = #css_tag * tag
  371. --
  372. -- Now that the parent knows when to switch to the child, it needs to know when to switch
  373. -- back. In the case of HTML/CSS, the switch back occurs when the lexer encounters an ending
  374. -- "style" tag, though the lexer should still style the tag as an HTML tag:
  375. --
  376. -- local css_end_rule = #P('</style>') * tag
  377. --
  378. -- Once the parent loads the child lexer and defines the child's start and end rules, it embeds
  379. -- the child with the `lexer.embed()` function:
  380. --
  381. -- lex:embed(css, css_start_rule, css_end_rule)
  382. --
  383. -- ##### Child Lexer
  384. --
  385. -- The process for instructing a child lexer to embed itself into a parent is very similar to
  386. -- embedding a child into a parent: first, load the parent lexer into the child lexer with the
  387. -- `lexer.load()` function and then create start and end rules for the child lexer. However,
  388. -- in this case, call `lexer.embed()` with switched arguments. For example, in the PHP lexer:
  389. --
  390. -- local html = lexer.load('html')
  391. -- local php_start_rule = lex:tag('php_tag', '<?php' * lexer.space)
  392. -- local php_end_rule = lex:tag('php_tag', '?>')
  393. -- html:embed(lex, php_start_rule, php_end_rule)
  394. --
  395. -- Note that the use of a 'php_tag' tag will require the editor using the lexer to specify how
  396. -- to highlight text with that tag. In order to avoid this, you could use the `lexer.PREPROCESSOR`
  397. -- tag instead.
  398. --
  399. -- #### Lexers with Complex State
  400. --
  401. -- The vast majority of lexers are not stateful and can operate on any chunk of text in a
  402. -- document. However, there may be rare cases where a lexer does need to keep track of some
  403. -- sort of persistent state. Rather than using `lpeg.P` function patterns that set state
  404. -- variables, it is recommended to make use of Scintilla's built-in, per-line state integers via
  405. -- `lexer.line_state`. It was designed to accommodate up to 32 bit-flags for tracking state.
  406. -- `lexer.line_from_position()` will return the line for any position given to an `lpeg.P`
  407. -- function pattern. (Any positions derived from that position argument will also work.)
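--
-- As an illustrative sketch only (assuming `lexer.line_state` is indexed by line number and
-- that bit-flag 1 tracks some condition of interest):
--
-- local update_state = P(function(input, index)
--   local line = lexer.line_from_position(index)
--   local prev = line > 1 and lexer.line_state[line - 1] or 0
--   lexer.line_state[line] = prev | 1 -- carry flags forward and set bit 1
--   return index -- succeed without consuming input
-- end)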
  408. --
  409. -- Writing more elaborate stateful lexers is beyond the scope of this document.
  410. --
  411. -- ### Code Folding
  412. --
  413. -- When reading source code, it is occasionally helpful to temporarily hide blocks of code like
  414. -- functions, classes, comments, etc. This is the concept of "folding". In the Textadept and
  415. -- SciTE editors for example, little indicators in the editor margins appear next to code that
  416. -- can be folded at places called "fold points". When the user clicks an indicator, the editor
  417. -- hides the code associated with the indicator until the user clicks the indicator again. The
  418. -- lexer specifies these fold points and what code exactly to fold.
  419. --
  420. -- The fold points for most languages occur on keywords or character sequences. Examples of
  421. -- fold keywords are "if" and "end" in Lua and examples of fold character sequences are '{',
  422. -- '}', "/\*", and "\*/" in C for code block and comment delimiters, respectively. However,
  423. -- these fold points cannot occur just anywhere. For example, lexers should not recognize fold
  424. -- keywords that appear within strings or comments. The `lexer.add_fold_point()` function allows
  425. -- you to conveniently define fold points with such granularity. For example, consider C:
  426. --
  427. -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
  428. -- lex:add_fold_point(lexer.COMMENT, '/*', '*/')
  429. --
  430. -- The first call states that any '{' or '}' that the lexer tagged as a `lexer.OPERATOR`
  431. -- is a fold point. Likewise, the second call states that any "/\*" or "\*/" that the
  432. -- lexer tagged as part of a `lexer.COMMENT` is a fold point. The lexer does not consider any
  433. -- occurrences of these characters outside their tagged elements (such as in a string) as fold
  434. -- points. How do you specify fold keywords? Here is an example for Lua:
  435. --
  436. -- lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
  437. -- lex:add_fold_point(lexer.KEYWORD, 'do', 'end')
  438. -- lex:add_fold_point(lexer.KEYWORD, 'function', 'end')
  439. -- lex:add_fold_point(lexer.KEYWORD, 'repeat', 'until')
  440. --
  441. -- If your lexer has case-insensitive keywords as fold points, simply add a
  442. -- `case_insensitive_fold_points = true` option to `lexer.new()`, and specify keywords in
  443. -- lower case.
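--
-- For example:
--
-- local lex = lexer.new(..., {case_insensitive_fold_points = true})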
  444. --
  445. -- If your lexer needs to do some additional processing in order to determine if a tagged element
  446. -- is a fold point, pass a function to `lex:add_fold_point()` that returns an integer. A return
  447. -- value of `1` indicates the element is a beginning fold point and a return value of `-1`
  448. -- indicates the element is an ending fold point. A return value of `0` indicates the element
  449. -- is not a fold point. For example:
  450. --
  451. -- local function fold_strange_element(text, pos, line, s, symbol)
  452. --   if ... then
  453. --     return 1 -- beginning fold point
  454. --   elseif ... then
  455. --     return -1 -- ending fold point
  456. --   end
  457. --   return 0
  458. -- end
  459. --
  460. -- lex:add_fold_point('strange_element', '|', fold_strange_element)
  461. --
  462. -- Any time the lexer encounters a '|' that is tagged as a "strange_element", it calls the
  463. -- `fold_strange_element` function to determine if '|' is a fold point. The lexer calls these
  464. -- functions with the following arguments: the text to identify fold points in, the beginning
  465. -- position of the current line in the text to fold, the current line's text, the position in
  466. -- the current line the fold point text starts at, and the fold point text itself.
  467. --
  468. -- #### Fold by Indentation
  469. --
  470. -- Some languages have significant whitespace and/or no delimiters that indicate fold points. If
  471. -- your lexer falls into this category and you would like to mark fold points based on changes
  472. -- in indentation, create the lexer with a `fold_by_indentation = true` option:
  473. --
  474. -- local lex = lexer.new(..., {fold_by_indentation = true})
  475. --
  476. -- ### Using Lexers
  477. --
  478. -- **Textadept**
  479. --
  480. -- Place your lexer in your *~/.textadept/lexers/* directory so you do not overwrite it when
  481. -- upgrading Textadept. Also, lexers in this directory override default lexers. Thus, Textadept
  482. -- loads a user *lua* lexer instead of the default *lua* lexer. This is convenient for tweaking
  483. -- a default lexer to your liking. Then add a [file extension](#lexer.detect_extensions) for
  484. -- your lexer if necessary.
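--
-- For example, a sketch mapping a hypothetical *.foo* extension to a lexer named "foo"
-- (assuming `lexer.detect_extensions` maps extensions, without the dot, to lexer names):
--
-- lexer.detect_extensions.foo = 'foo'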
  485. --
  486. -- **SciTE**
  487. --
  488. -- Create a *.properties* file for your lexer and `import` it in either your *SciTEUser.properties*
  489. -- or *SciTEGlobal.properties*. The *.properties* file should contain:
  490. --
  491. -- file.patterns.[lexer_name]=[file_patterns]
  492. -- lexer.$(file.patterns.[lexer_name])=scintillua.[lexer_name]
  493. -- keywords.$(file.patterns.[lexer_name])=scintillua
  494. -- keywords2.$(file.patterns.[lexer_name])=scintillua
  495. -- ...
  496. -- keywords9.$(file.patterns.[lexer_name])=scintillua
  497. --
  498. -- where `[lexer_name]` is the name of your lexer (minus the *.lua* extension) and
  499. -- `[file_patterns]` is a set of file extensions to use your lexer for. The `keyword` settings are
  500. -- only needed if another SciTE properties file has defined keyword sets for `[file_patterns]`.
  501. -- The `scintillua` keyword setting instructs Scintillua to use the keyword sets defined within
  502. -- the lexer. You can override a lexer's keyword set(s) by specifying your own in the same order
  503. -- that the lexer calls `lex:set_word_list()`. For example, the Lua lexer's first set of keywords
  504. -- is for reserved words, the second is for built-in global functions, the third is for library
  505. -- functions, the fourth is for built-in global constants, and the fifth is for library constants.
  506. --
  507. -- SciTE assigns styles to tag names in order to perform syntax highlighting. Since the set of
  508. -- tag names used for a given language may change between releases, your *.properties* file
  509. -- should specify styles for tag names instead of style numbers. For example:
  510. --
  511. -- scintillua.styles.my_tag=$(scintillua.styles.keyword),bold
  512. --
  513. -- ### Migrating Legacy Lexers
  514. --
  515. -- Legacy lexers are of the form:
  516. --
  517. -- local lexer = require('lexer')
  518. -- local token, word_match = lexer.token, lexer.word_match
  519. -- local P, S = lpeg.P, lpeg.S
  520. --
  521. -- local lex = lexer.new('?')
  522. --
  523. -- -- Whitespace.
  524. -- lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
  525. --
  526. -- -- Keywords.
  527. -- lex:add_rule('keyword', token(lexer.KEYWORD, word_match{
  528. --   [...]
  529. -- }))
  530. --
  531. -- [... other rule definitions ...]
  532. --
  533. -- -- Custom.
  534. -- lex:add_rule('custom_rule', token('custom_token', ...))
  535. -- lex:add_style('custom_token', lexer.styles.keyword .. {bold = true})
  536. --
  537. -- -- Fold points.
  538. -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
  539. --
  540. -- return lex
  541. --
  542. -- While Scintillua will mostly handle such legacy lexers just fine without any changes, it is
  543. -- recommended that you migrate yours. The migration process is fairly straightforward:
  544. --
  545. -- 1. `lexer` exists in the default lexer environment, so `require('lexer')` should be replaced
  546. -- by simply `lexer`. (Keep in mind `local lexer = lexer` is a Lua idiom.)
  547. -- 2. Every lexer created using `lexer.new()` should no longer specify a lexer name by string,
  548. -- but should instead use `...` (three dots), which evaluates to the lexer's filename or
  549. -- alternative name in embedded lexer applications.
  550. -- 3. Every lexer created using `lexer.new()` now includes a rule to match whitespace. Unless
  551. -- your lexer has significant whitespace, you can remove your legacy lexer's whitespace
  552. -- token and rule. Otherwise, your defined whitespace rule will replace the default one.
  553. -- 4. The concept of tokens has been replaced with tags. Instead of calling a `token()` function,
  554. -- call [`lex:tag()`](#lexer.tag) instead.
  555. -- 5. Lexers now support replaceable word lists. Instead of calling `lexer.word_match()` with
  556. -- large word lists, call it as an instance method with an identifier string (typically
  557. -- something like `lexer.KEYWORD`). Then at the end of the lexer (before `return lex`), call
  558. -- [`lex:set_word_list()`](#lexer.set_word_list) with the same identifier and the usual
  559. -- list of words to match. This allows users of your lexer to call `lex:set_word_list()`
  560. -- with their own set of words should they wish to.
  561. -- 6. Lexers no longer specify styling information. Remove any calls to `lex:add_style()`. You
  562. -- may need to add styling information for custom tags to your editor's theme.
  563. -- 7. `lexer.last_char_includes()` has been deprecated in favor of the new `lexer.after_set()`.
  564. -- Use the character set and pattern as arguments to that new function.
  565. --
  566. -- As an example, consider the following sample legacy lexer:
  567. --
  568. -- local lexer = require('lexer')
  569. -- local token, word_match = lexer.token, lexer.word_match
  570. -- local P, S = lpeg.P, lpeg.S
  571. --
  572. -- local lex = lexer.new('legacy')
  573. --
  574. -- lex:add_rule('whitespace', token(lexer.WHITESPACE, lexer.space^1))
  575. -- lex:add_rule('keyword', token(lexer.KEYWORD, word_match('foo bar baz')))
  576. -- lex:add_rule('custom', token('custom', 'quux'))
  577. -- lex:add_style('custom', lexer.styles.keyword .. {bold = true})
  578. -- lex:add_rule('identifier', token(lexer.IDENTIFIER, lexer.word))
  579. -- lex:add_rule('string', token(lexer.STRING, lexer.range('"')))
  580. -- lex:add_rule('comment', token(lexer.COMMENT, lexer.to_eol('#')))
  581. -- lex:add_rule('number', token(lexer.NUMBER, lexer.number))
  582. -- lex:add_rule('operator', token(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))
  583. --
  584. -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
  585. --
  586. -- return lex
  587. --
  588. -- Following the migration steps would yield:
  589. --
  590. -- local lexer = lexer
  591. -- local P, S = lpeg.P, lpeg.S
  592. --
  593. -- local lex = lexer.new(...)
  594. --
  595. -- lex:add_rule('keyword', lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD)))
  596. -- lex:add_rule('custom', lex:tag('custom', 'quux'))
  597. -- lex:add_rule('identifier', lex:tag(lexer.IDENTIFIER, lexer.word))
  598. -- lex:add_rule('string', lex:tag(lexer.STRING, lexer.range('"')))
  599. -- lex:add_rule('comment', lex:tag(lexer.COMMENT, lexer.to_eol('#')))
  600. -- lex:add_rule('number', lex:tag(lexer.NUMBER, lexer.number))
  601. -- lex:add_rule('operator', lex:tag(lexer.OPERATOR, S('+-*/%^=<>,.()[]{}')))
  602. --
  603. -- lex:add_fold_point(lexer.OPERATOR, '{', '}')
  604. --
  605. -- lex:set_word_list(lexer.KEYWORD, {'foo', 'bar', 'baz'})
  606. --
  607. -- return lex
  608. --
  609. -- Any editors using this lexer would have to add a style for the 'custom' tag.
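--
-- In SciTE, for example, that might look like the *.properties* style setting shown earlier:
--
-- scintillua.styles.custom=$(scintillua.styles.keyword),bold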
  610. --
  611. -- ### Considerations
  612. --
  613. -- #### Performance
  614. --
  615. -- There might be some slight overhead when initializing a lexer, but loading a file from disk
  616. -- into Scintilla is usually more expensive. Actually painting the syntax highlighted text to
  617. -- the screen is often more expensive than the lexing operation. On modern computer systems,
  618. -- I see no difference in speed between Lua lexers and Scintilla's C++ ones. Optimize lexers for
  619. -- speed by re-arranging `lexer.add_rule()` calls so that the most common rules match first. Do
  620. -- keep in mind that order matters for similar rules.
  621. --
  622. -- In some cases, folding may be far more expensive than lexing, particularly in lexers with a
  623. -- lot of potential fold points. If your lexer is exhibiting signs of slowness, try disabling
  624. -- folding in your text editor first. If that speeds things up, you can try reducing the number
  625. -- of fold points you added, overriding `lexer.fold()` with your own implementation, or simply
  626. -- eliminating folding support from your lexer.
  627. --
  628. -- #### Limitations
  629. --
  630. -- Embedded preprocessor languages like PHP cannot completely embed themselves into their parent
  631. -- languages because the parent's tagged patterns do not support start and end rules. This
  632. -- mostly goes unnoticed, but code like
  633. --
  634. -- <div id="<?php echo $id; ?>">
  635. --
  636. -- will not style correctly. Also, these types of languages cannot currently embed themselves
  637. -- into their parent's child languages either.
  638. --
  639. -- A language cannot embed itself into something like an interpolated string because, if
  640. -- lexing happens to start within the embedded entity, the lexer cannot detect that it did,
  641. -- so a child to parent transition cannot happen. For example, the following Ruby code will
  642. -- not style correctly:
  643. --
  644. -- sum = "1 + 2 = #{1 + 2}"
  645. --
  646. -- Also, there is the potential for infinite recursion when a language embeds itself within itself.
  647. --
  648. -- #### Troubleshooting
  649. --
  650. -- Errors in lexers can be tricky to debug. Lexers print Lua errors to `io.stderr` and `_G.print()`
  651. -- statements to `io.stdout`. Running your editor from a terminal is the easiest way to see
  652. -- errors as they occur.
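--
-- As a throwaway debugging sketch, a function pattern can print where the lexer is currently
-- matching (remove it once the problem is found):
--
-- local probe = P(function(input, index)
--   print('lexing at position ' .. index)
--   return index -- succeed without consuming input
-- end)
-- lex:add_rule('keyword', probe * keyword)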
  653. --
  654. -- #### Risks
  655. --
  656. -- Poorly written lexers have the ability to crash Scintilla (and thus its containing application),
  657. -- so unsaved data might be lost. However, I have only observed these crashes in early lexer
  658. -- development, when syntax errors or pattern errors are present. Once the lexer actually
  659. -- starts processing and tagging text (either correctly or incorrectly, it does not matter),
  660. -- I have not observed any crashes.
  661. --
  662. -- #### Acknowledgements
  663. --
  664. -- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list that provided inspiration,
  665. -- and thanks to Roberto Ierusalimschy for LPeg.
  666. --
  667. -- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html
  668. -- @module lexer
  669. local M = {}
  670. --- The tag name for default elements.
  671. -- @field DEFAULT
  672. --- The tag name for comment elements.
  673. -- @field COMMENT
  674. --- The tag name for string elements.
  675. -- @field STRING
  676. --- The tag name for number elements.
  677. -- @field NUMBER
  678. --- The tag name for keyword elements.
  679. -- @field KEYWORD
  680. --- The tag name for identifier elements.
  681. -- @field IDENTIFIER
  682. --- The tag name for operator elements.
  683. -- @field OPERATOR
  684. --- The tag name for error elements.
  685. -- @field ERROR
  686. --- The tag name for preprocessor elements.
  687. -- @field PREPROCESSOR
  688. --- The tag name for constant elements.
  689. -- @field CONSTANT
  690. --- The tag name for variable elements.
  691. -- @field VARIABLE
  692. --- The tag name for function elements.
  693. -- @field FUNCTION
  694. --- The tag name for class elements.
  695. -- @field CLASS
  696. --- The tag name for type elements.
  697. -- @field TYPE
  698. --- The tag name for label elements.
  699. -- @field LABEL
  700. --- The tag name for regex elements.
  701. -- @field REGEX
  702. --- The tag name for embedded elements.
  703. -- @field EMBEDDED
  704. --- The tag name for builtin function elements.
  705. -- @field FUNCTION_BUILTIN
  706. --- The tag name for builtin constant elements.
  707. -- @field CONSTANT_BUILTIN
  708. --- The tag name for function method elements.
  709. -- @field FUNCTION_METHOD
  710. --- The tag name for tag elements, typically in markup.
  711. -- @field TAG
  712. --- The tag name for attribute elements, typically in markup.
  713. -- @field ATTRIBUTE
  714. --- The tag name for builtin variable elements.
  715. -- @field VARIABLE_BUILTIN
  716. --- The tag name for heading elements, typically in markup.
  717. -- @field HEADING
  718. --- The tag name for bold elements, typically in markup.
  719. -- @field BOLD
  720. --- The tag name for italic elements, typically in markup.
  721. -- @field ITALIC
  722. --- The tag name for underlined elements, typically in markup.
  723. -- @field UNDERLINE
  724. --- The tag name for code elements, typically in markup.
  725. -- @field CODE
  726. --- The tag name for link elements, typically in markup.
  727. -- @field LINK
  728. --- The tag name for reference elements, typically in markup.
  729. -- @field REFERENCE
  730. --- The tag name for annotation elements.
  731. -- @field ANNOTATION
  732. --- The tag name for list item elements, typically in markup.
  733. -- @field LIST
  734. --- The initial (root) fold level.
  735. -- @field FOLD_BASE
  736. --- Flag indicating that the line is blank.
  737. -- @field FOLD_BLANK
  738. --- Flag indicating that the line is a fold point.
  739. -- @field FOLD_HEADER
  740. -- This comment is needed for LDoc to process the previous field.
  741. if not lpeg then lpeg = require('lpeg') end -- Scintillua's Lua environment defines _G.lpeg
  742. local lpeg = lpeg
  743. local P, R, S, V, B = lpeg.P, lpeg.R, lpeg.S, lpeg.V, lpeg.B
  744. local Ct, Cc, Cp, Cmt, C = lpeg.Ct, lpeg.Cc, lpeg.Cp, lpeg.Cmt, lpeg.C
  745. --- Default tags.
  746. local default = {
  747. 'whitespace', 'comment', 'string', 'number', 'keyword', 'identifier', 'operator', 'error',
  748. 'preprocessor', 'constant', 'variable', 'function', 'class', 'type', 'label', 'regex', 'embedded',
  749. 'function.builtin', 'constant.builtin', 'function.method', 'tag', 'attribute', 'variable.builtin',
  750. 'heading', 'bold', 'italic', 'underline', 'code', 'link', 'reference', 'annotation', 'list'
  751. }
  752. for _, name in ipairs(default) do M[name:upper():gsub('%.', '_')] = name end
  753. --- Names for predefined Scintilla styles.
  754. -- Having these here simplifies style number handling between Scintillua and Scintilla.
  755. local predefined = {
  756. 'default', 'line.number', 'brace.light', 'brace.bad', 'control.char', 'indent.guide', 'call.tip',
  757. 'fold.display.text'
  758. }
  759. for _, name in ipairs(predefined) do M[name:upper():gsub('%.', '_')] = name end
  760. --- Creates and returns a pattern that tags pattern *patt* with name *name* in lexer *lexer*.
  761. -- If *name* is not a predefined tag name, its Scintilla style will likely need to be defined
  762. -- by the editor or theme using this lexer.
  763. -- @param lexer The lexer to tag the given pattern in.
  764. -- @param name The name to use.
  765. -- @param patt The LPeg pattern to tag.
  766. -- @return pattern
  767. -- @usage local number = lex:tag(lexer.NUMBER, lexer.number)
  768. -- @usage local addition = lex:tag('addition', '+' * lexer.word)
  769. function M.tag(lexer, name, patt)
  770.   if not lexer._TAGS then
  771.     -- Create the initial maps for tag names to style numbers and styles.
  772.     local tags = {}
  773.     for i, name in ipairs(default) do tags[name], tags[i] = i, name end
  774.     for i, name in ipairs(predefined) do tags[name], tags[i + 32] = i + 32, name end
  775.     lexer._TAGS, lexer._num_styles = tags, #default + 1
  776.     lexer._extra_tags = {}
  777.   end
  778.   if not assert(lexer._TAGS, 'not a lexer instance')[name] then
  779.     local num_styles = lexer._num_styles
  780.     if num_styles == 33 then num_styles = num_styles + 8 end -- skip predefined
  781.     assert(num_styles <= 256, 'too many styles defined (256 MAX)')
  782.     lexer._TAGS[name], lexer._TAGS[num_styles], lexer._num_styles = num_styles, name, num_styles + 1
  783.     lexer._extra_tags[name] = true
  784.     -- If the lexer is a proxy or a child that embedded itself, make this tag name known to
  785.     -- the parent lexer.
  786.     if lexer._lexer then lexer._lexer:tag(name, false) end
  787.   end
  788.   return Cc(name) * (P(patt) / 0) * Cp()
  789. end
  790. --- Returns a unique grammar rule name for the given lexer's i-th word list.
  791. local function word_list_id(lexer, i) return lexer._name .. '_wordlist' .. i end
  792. --- Either returns a pattern for lexer *lexer* (if given) that matches one word in the word list
  793. -- identified by string *word_list*, ignoring case if *case_insensitive* is `true`, or, if *lexer*
  794. -- is not given, creates and returns a pattern that matches any single word in list or string
  795. -- *word_list*, ignoring case if *case_insensitive* is `true`.
  796. -- This is a convenience function for simplifying a set of ordered choice word patterns and
  797. -- potentially allowing downstream users to configure word lists.
  798. -- If there is ultimately no word list set via `set_word_list()`, no error will be raised,
  799. -- but the returned pattern will not match anything.
  800. -- @param[opt] lexer Optional lexer to match a word in a wordlist for. This parameter may be
  801. -- omitted for lexer-agnostic matching.
  802. -- @param word_list Either a string name of the word list to match from if *lexer* is given,
  803. -- or, if *lexer* is omitted, a list of words or a string list of words separated by spaces.
  804. -- @param[opt] case_insensitive Optional boolean flag indicating whether or not the word match
  805. -- is case-insensitive. The default value is `false`.
  806. -- @return pattern
  807. -- @usage lex:add_rule('keyword', lex:tag(lexer.KEYWORD, lex:word_match(lexer.KEYWORD)))
  808. -- @usage local keyword = lex:tag(lexer.KEYWORD, lexer.word_match{'foo', 'bar', 'baz'})
  809. -- @usage local keyword = lex:tag(lexer.KEYWORD, lexer.word_match({'foo-bar', 'foo-baz',
  810. --   'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar'}, true))
  811. -- @usage local keyword = lex:tag(lexer.KEYWORD, lexer.word_match('foo bar baz'))
  812. function M.word_match(lexer, word_list, case_insensitive)
  813.   if type(lexer) == 'table' and getmetatable(lexer) then
  814.     if lexer._lexer then
  815.       -- If this lexer is a proxy (e.g. rails), get the true parent (ruby) in order to get the
  816.       -- parent's word list. If this lexer is a child embedding itself (e.g. php), continue
  817.       -- getting its word list, not the parent's (html).
  818.       local parent = lexer._lexer
  819.       if not parent._CHILDREN or not parent._CHILDREN[lexer] then lexer = parent end
  820.     end
  821.     if not lexer._WORDLISTS then lexer._WORDLISTS = {case_insensitive = {}} end
  822.     local i = lexer._WORDLISTS[word_list] or #lexer._WORDLISTS + 1
  823.     lexer._WORDLISTS[word_list], lexer._WORDLISTS[i] = i, '' -- empty placeholder word list
  824.     lexer._WORDLISTS.case_insensitive[i] = case_insensitive
  825.     return V(word_list_id(lexer, i))
  826.   end
  827.   -- Lexer-agnostic word match.
  828.   word_list, case_insensitive = lexer, word_list
  829.   if type(word_list) == 'string' then
  830.     local words = word_list -- space-separated list of words
  831.     word_list = {}
  832.     for word in words:gmatch('%S+') do word_list[#word_list + 1] = word end
  833.   end
  834.   local word_chars = M.alnum + '_'
  835.   local extra_chars = ''
  836.   for _, word in ipairs(word_list) do
  837.     word_list[case_insensitive and word:lower() or word] = true
  838.     for char in word:gmatch('[^%w_%s]') do
  839.       if not extra_chars:find(char, 1, true) then extra_chars = extra_chars .. char end
  840.     end
  841.   end
  842.   if extra_chars ~= '' then word_chars = word_chars + S(extra_chars) end
  843.   -- Optimize small word sets as ordered choice. "Small" is arbitrary.
  844.   if #word_list <= 6 and not case_insensitive then
  845.     local choice = P(false)
  846.     for _, word in ipairs(word_list) do choice = choice + word:match('%S+') end
  847.     return choice * -word_chars
  848.   end
  849.   return Cmt(word_chars^1, function(input, index, word)
  850.     if case_insensitive then word = word:lower() end
  851.     return word_list[word]
  852.   end)
  853. end
  854. --- Sets in lexer *lexer* the word list identified by string or number *name* to string or
  855. -- list *word_list*, appending to any existing word list if *append* is `true`.
  856. -- This only has an effect if *lexer* uses `word_match()` to reference the given list.
  857. -- Case-insensitivity is specified by `word_match()`.
  858. -- @param lexer The lexer to add the given word list to.
  859. -- @param name The string name or number of the word list to set.
  860. -- @param word_list A list of words or a string list of words separated by spaces.
  861. -- @param append Whether or not to append *word_list* to the existing word list (if any). The
  862. -- default value is `false`.
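-- @usage lex:set_word_list(lexer.KEYWORD, {'foo', 'bar', 'baz'})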
  863. function M.set_word_list(lexer, name, word_list, append)
  864.   if word_list == 'scintillua' then return end -- for SciTE
  865.   if lexer._lexer then
  866.     -- If this lexer is a proxy (e.g. rails), get the true parent (ruby) in order to set the
  867.     -- parent's word list. If this lexer is a child embedding itself (e.g. php), continue
  868.     -- setting its word list, not the parent's (html).
  869.     local parent = lexer._lexer
  870.     if not parent._CHILDREN or not parent._CHILDREN[lexer] then lexer = parent end
  871.   end
  872.   assert(lexer._WORDLISTS, 'lexer has no word lists')
  873.   local i = tonumber(lexer._WORDLISTS[name]) or name -- lexer._WORDLISTS[name] --> i
  874.   if type(i) ~= 'number' or i > #lexer._WORDLISTS then return end -- silently return
  875.   if type(word_list) == 'string' then
  876.     local list = {}
  877.     for word in word_list:gmatch('%S+') do list[#list + 1] = word end
  878.     word_list = list
  879.   end
  880.   if not append or lexer._WORDLISTS[i] == '' then
  881.     lexer._WORDLISTS[i] = word_list
  882.   else
  883.     local list = lexer._WORDLISTS[i]
  884.     for _, word in ipairs(word_list) do list[#list + 1] = word end
  885.   end
  886.   lexer._grammar_table = nil -- invalidate
  887. end
  888. --- Adds pattern *rule* identified by string *id* to the ordered list of rules for lexer *lexer*.
  889. -- @param lexer The lexer to add the given rule to.
  890. -- @param id The id associated with this rule. It does not have to be the same as the name
  891. -- passed to `tag()`.
  892. -- @param rule The LPeg pattern of the rule.
  893. -- @see modify_rule
  894. function M.add_rule(lexer, id, rule)
  895.   if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  896.   if not lexer._rules then lexer._rules = {} end
  897.   if id == 'whitespace' and lexer._rules[id] then -- legacy
  898.     lexer:modify_rule(id, rule)
  899.     return
  900.   end
  901.   lexer._rules[#lexer._rules + 1], lexer._rules[id] = id, rule
  902.   lexer._grammar_table = nil -- invalidate
  903. end
  904. --- Replaces in lexer *lexer* the existing rule identified by string *id* with pattern *rule*.
  905. -- @param lexer The lexer to modify.
  906. -- @param id The id associated with this rule.
  907. -- @param rule The LPeg pattern of the rule.
  908. function M.modify_rule(lexer, id, rule)
  909.   if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  910.   assert(lexer._rules[id], 'rule does not exist')
  911.   lexer._rules[id] = rule
  912.   lexer._grammar_table = nil -- invalidate
  913. end
  914. --- Returns a unique grammar rule name for the given lexer's rule name.
  915. local function rule_id(lexer, name) return lexer._name .. '.' .. name end
  916. --- Returns the rule identified by string *id*.
  917. -- @param lexer The lexer to fetch a rule from.
  918. -- @param id The id of the rule to fetch.
  919. -- @return pattern
  920. function M.get_rule(lexer, id)
  921.   if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  922.   if id == 'whitespace' then return V(rule_id(lexer, id)) end -- special case
  923.   return assert(lexer._rules[id], 'rule does not exist')
  924. end
  925. --- Embeds child lexer *child* in parent lexer *lexer* using patterns *start_rule* and *end_rule*,
  926. -- which signal the beginning and end of the embedded lexer, respectively.
  927. -- @param lexer The parent lexer.
  928. -- @param child The child lexer.
  929. -- @param start_rule The pattern that signals the beginning of the embedded lexer.
  930. -- @param end_rule The pattern that signals the end of the embedded lexer.
  931. -- @usage html:embed(css, css_start_rule, css_end_rule)
  932. -- @usage html:embed(lex, php_start_rule, php_end_rule) -- from php lexer
  933. function M.embed(lexer, child, start_rule, end_rule)
  934.   if lexer._lexer then lexer = lexer._lexer end -- proxy; get true parent
  935.   -- Add child rules.
  936.   assert(child._rules, 'cannot embed lexer with no rules')
  937.   if not child._start_rules then child._start_rules = {} end
  938.   if not child._end_rules then child._end_rules = {} end
  939.   child._start_rules[lexer], child._end_rules[lexer] = start_rule, end_rule
  940.   if not lexer._CHILDREN then lexer._CHILDREN = {} end
  941.   lexer._CHILDREN[#lexer._CHILDREN + 1], lexer._CHILDREN[child] = child, true
  942.   -- Add child tags.
  943.   for name in pairs(child._extra_tags) do lexer:tag(name, true) end
  944.   -- Add child fold symbols.
  945.   if child._fold_points then
  946.     for tag_name, symbols in pairs(child._fold_points) do
  947.       if tag_name ~= '_symbols' then
  948.         for symbol, v in pairs(symbols) do lexer:add_fold_point(tag_name, symbol, v) end
  949.       end
  950.     end
  951.   end
  952.   -- Add child word lists.
  953.   if child._WORDLISTS then
  954.     for name, i in pairs(child._WORDLISTS) do
  955.       if type(name) == 'string' and type(i) == 'number' then
  956.         name = child._name .. '.' .. name
  957.         lexer:word_match(name) -- for side effects
  958.         lexer:set_word_list(name, child._WORDLISTS[i])
  959.       end
  960.     end
  961.   end
  962.   child._lexer = lexer -- use parent's rules if child is embedding itself
  963. end
  964. --- Adds to lexer *lexer* a fold point whose beginning and end points are tagged with string
  965. -- *tag_name* tags and have string content *start_symbol* and *end_symbol*, respectively.
  966. -- In the event that *start_symbol* may or may not be a fold point depending on context, and that
  967. -- additional processing is required, *end_symbol* may be a function that ultimately returns
  968. -- `1` (indicating a beginning fold point), `-1` (indicating an ending fold point), or `0`
  969. -- (indicating no fold point). That function is passed the following arguments:
  970. --
  971. -- - `text`: The text being processed for fold points.
  972. -- - `pos`: The position in *text* of the beginning of the line currently being processed.
  973. -- - `line`: The text of the line currently being processed.
  974. -- - `s`: The position of *start_symbol* in *line*.
  975. -- - `symbol`: *start_symbol* itself.
  976. -- @param lexer The lexer to add a fold point to.
  977. -- @param tag_name The tag name for text that indicates a fold point.
  978. -- @param start_symbol The text that indicates the beginning of a fold point.
  979. -- @param end_symbol Either the text that indicates the end of a fold point, or a function that
  980. -- returns whether or not *start_symbol* is a beginning fold point (1), an ending fold point
  981. -- (-1), or not a fold point at all (0).
  982. -- @usage lex:add_fold_point(lexer.OPERATOR, '{', '}')
  983. -- @usage lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
  984. -- @usage lex:add_fold_point('custom', function(text, pos, line, s, symbol) ... end)
  985. function M.add_fold_point(lexer, tag_name, start_symbol, end_symbol)
  986.   if not start_symbol and not end_symbol then return end -- from legacy fold_consecutive_lines()
  987.   if not lexer._fold_points then lexer._fold_points = {_symbols = {}} end
  988.   local symbols = lexer._fold_points._symbols
  989.   if not lexer._fold_points[tag_name] then lexer._fold_points[tag_name] = {} end
  990.   if lexer._case_insensitive_fold_points then
  991.     start_symbol = start_symbol:lower()
  992.     if type(end_symbol) == 'string' then end_symbol = end_symbol:lower() end
  993.   end
  994.   if type(end_symbol) == 'string' then
  995.     if not symbols[end_symbol] then symbols[#symbols + 1], symbols[end_symbol] = end_symbol, true end
  996.     lexer._fold_points[tag_name][start_symbol] = 1
  997.     lexer._fold_points[tag_name][end_symbol] = -1
  998.   else
  999.     lexer._fold_points[tag_name][start_symbol] = end_symbol -- function or int
  1000.   end
  1001.   if not symbols[start_symbol] then
  1002.     symbols[#symbols + 1], symbols[start_symbol] = start_symbol, true
  1003.   end
  1004.   -- If the lexer is a proxy or a child that embedded itself, copy this fold point to the
  1005.   -- parent lexer.
  1006.   if lexer._lexer then lexer._lexer:add_fold_point(tag_name, start_symbol, end_symbol) end
  1007. end
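-- Illustrative sketch of a function-valued *end_symbol* (the heuristic is an
-- assumption for demonstration): treat 'if' as a fold header only when its line
-- does not also close the block.
--
--   lex:add_fold_point(lexer.KEYWORD, 'if', function(text, pos, line, s, symbol)
--     return not line:find('end%s*$') and 1 or 0 -- crude single-line heuristic
--   end)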
  1008. --- Recursively adds the rules for the given lexer and its children to the given grammar.
  1009. -- @param g The grammar to add rules to.
  1010. -- @param lexer The lexer whose rules to add.
  1011. local function add_lexer(g, lexer)
  1012. local rule = P(false)
  1013. -- Add this lexer's rules.
  1014. for _, name in ipairs(lexer._rules) do
  1015. local id = rule_id(lexer, name)
  1016. g[id] = lexer._rules[name] -- ['lua.keyword'] = keyword_patt
  1017. rule = rule + V(id) -- V('lua.keyword') + V('lua.function') + V('lua.constant') + ...
  1018. end
  1019. local any_id = lexer._name .. '_fallback'
  1020. g[any_id] = lexer:tag(M.DEFAULT, M.any) -- ['lua_fallback'] = any_char
  1021. rule = rule + V(any_id) -- ... + V('lua.operator') + V('lua_fallback')
  1022. -- Add this lexer's word lists.
  1023. if lexer._WORDLISTS then
  1024. for i = 1, #lexer._WORDLISTS do
  1025. local id = word_list_id(lexer, i)
  1026. local list, case_insensitive = lexer._WORDLISTS[i], lexer._WORDLISTS.case_insensitive[i]
  1027. local patt = list ~= '' and M.word_match(list, case_insensitive) or P(false)
  1028. g[id] = patt -- ['lua_wordlist.1'] = word_match_patt or P(false)
  1029. end
  1030. end
  1031. -- Add this child lexer's end rules.
  1032. if lexer._end_rules then
  1033. for parent, end_rule in pairs(lexer._end_rules) do
  1034. local back_id = lexer._name .. '_to_' .. parent._name
  1035. g[back_id] = end_rule -- ['css_to_html'] = css_end_rule
  1036. rule = rule - V(back_id) + -- (V('css.property') + ... + V('css_fallback')) - V('css_to_html')
  1037. V(back_id) * V(parent._name) -- V('css_to_html') * V('html')
  1038. end
  1039. end
  1040. -- Add this child lexer's start rules.
  1041. if lexer._start_rules then
  1042. for parent, start_rule in pairs(lexer._start_rules) do
  1043. local to_id = parent._name .. '_to_' .. lexer._name
  1044. g[to_id] = start_rule * V(lexer._name) -- ['html_to_css'] = css_start_rule * V('css')
  1045. end
  1046. end
  1047. -- Finish adding this lexer's rules.
  1048. local rule_id = lexer._name .. '_rule'
  1049. g[rule_id] = rule -- ['lua_rule'] = V('lua.keyword') + ... + V('lua_fallback')
  1050. g[lexer._name] = V(rule_id)^0 -- ['lua'] = V('lua_rule')^0
  1051. -- Add this lexer's children's rules.
  1052. -- TODO: preprocessor languages like PHP should also embed themselves into their parent's
1053. -- children like HTML's CSS and JavaScript.
  1054. if not lexer._CHILDREN then return end
  1055. for _, child in ipairs(lexer._CHILDREN) do
  1056. add_lexer(g, child)
  1057. local to_id = lexer._name .. '_to_' .. child._name
  1058. g[rule_id] = V(to_id) + g[rule_id] -- ['html_rule'] = V('html_to_css') + V('html.comment') + ...
  1059. -- Add a child's inherited parent's rules (e.g. rhtml parent with rails child inheriting ruby).
  1060. if child._parent_name then
  1061. local name = child._name
  1062. child._name = child._parent_name -- ensure parent and transition rule names are correct
  1063. add_lexer(g, child)
  1064. child._name = name -- restore
  1065. local to_id = lexer._name .. '_to_' .. child._parent_name
  1066. g[rule_id] = V(to_id) + g[rule_id] -- ['html_rule'] = V('html_to_ruby') + V('html.comment') + ...
  1067. end
  1068. end
  1069. end
  1070. --- Returns a grammar for the given lexer and initial rule, (re)constructing it if necessary.
  1071. -- @param lexer The lexer to build a grammar for.
  1072. -- @param init_style The current style. Multiple-language lexers use this to determine which
  1073. -- language to start lexing in.
  1074. local function build_grammar(lexer, init_style)
  1075. if not lexer._rules then return end
  1076. if not lexer._initial_rule then lexer._initial_rule = lexer._parent_name or lexer._name end
  1077. if not lexer._grammar_table then
  1078. local grammar = {lexer._initial_rule}
  1079. if not lexer._parent_name then
  1080. add_lexer(grammar, lexer)
  1081. -- {'lua',
  1082. -- ['lua.keyword'] = patt, ['lua.function'] = patt, ...,
  1083. -- ['lua_wordlist.1'] = patt, ['lua_wordlist.2'] = patt, ...,
  1084. -- ['lua_rule'] = V('lua.keyword') + ... + V('lua_fallback'),
  1085. -- ['lua'] = V('lua_rule')^0
  1086. -- }
  1087. -- {'html'
  1088. -- ['html.comment'] = patt, ['html.doctype'] = patt, ...,
  1089. -- ['html_wordlist.1'] = patt, ['html_wordlist.2'] = patt, ...,
  1090. -- ['html_rule'] = V('html_to_css') * V('css') + V('html.comment') + ... + V('html_fallback'),
1091. -- ['html'] = V('html_rule')^0,
  1092. -- ['css.property'] = patt, ['css.value'] = patt, ...,
  1093. -- ['css_wordlist.1'] = patt, ['css_wordlist.2'] = patt, ...,
  1094. -- ['css_to_html'] = patt,
  1095. -- ['css_rule'] = ((V('css.property') + ... + V('css_fallback')) - V('css_to_html')) +
  1096. -- V('css_to_html') * V('html'),
  1097. -- ['html_to_css'] = patt,
  1098. -- ['css'] = V('css_rule')^0
  1099. -- }
  1100. else
  1101. local name = lexer._name
  1102. lexer._name = lexer._parent_name -- ensure parent and transition rule names are correct
  1103. add_lexer(grammar, lexer)
  1104. lexer._name = name -- restore
  1105. -- {'html',
  1106. -- ...
  1107. -- ['html_rule'] = V('html_to_php') * V('php') + V('html_to_css') * V('css') +
  1108. -- V('html.comment') + ... + V('html_fallback'),
  1109. -- ...
  1110. -- ['php.keyword'] = patt, ['php.type'] = patt, ...,
  1111. -- ['php_wordlist.1'] = patt, ['php_wordlist.2'] = patt, ...,
  1112. -- ['php_to_html'] = patt,
  1113. -- ['php_rule'] = ((V('php.keyword') + ... + V('php_fallback')) - V('php_to_html')) +
  1114. -- V('php_to_html') * V('html')
  1115. -- ['html_to_php'] = patt,
  1116. -- ['php'] = V('php_rule')^0
  1117. -- }
  1118. end
  1119. lexer._grammar, lexer._grammar_table = Ct(P(grammar)), grammar
  1120. end
  1121. -- For multilang lexers, build a new grammar whose initial rule is the current language
  1122. -- if necessary. LPeg does not allow a variable initial rule.
  1123. if lexer._CHILDREN then
  1124. for style_num, tag in ipairs(lexer._TAGS) do
  1125. if style_num == init_style then
  1126. local lexer_name = tag:match('^whitespace%.(.+)$') or lexer._parent_name or lexer._name
  1127. if lexer._initial_rule == lexer_name then break end
  1128. if not lexer._grammar_table[lexer_name] then
  1129. -- For proxy lexers like RHTML, the 'whitespace.rhtml' tag would produce the 'rhtml'
  1130. -- lexer name, but there is no 'rhtml' rule. It should be the 'html' rule (parent)
  1131. -- instead.
  1132. lexer_name = lexer._parent_name or lexer._name
  1133. end
  1134. lexer._initial_rule = lexer_name
  1135. lexer._grammar_table[1] = lexer._initial_rule
  1136. lexer._grammar = Ct(P(lexer._grammar_table))
  1137. return lexer._grammar
  1138. end
  1139. end
  1140. end
  1141. return lexer._grammar
  1142. end
  1143. --- Lexes a chunk of text *text* (that has an initial style number of *init_style*) using lexer
  1144. -- *lexer*, returning a list of tag names and positions.
  1145. -- @param lexer The lexer to lex text with.
  1146. -- @param text The text in the buffer to lex.
  1147. -- @param init_style The current style. Multiple-language lexers use this to determine which
  1148. -- language to start lexing in.
  1149. -- @return list of tag names and positions.
  1150. function M.lex(lexer, text, init_style)
  1151. local grammar = build_grammar(lexer, init_style)
  1152. if not grammar then return {M.DEFAULT, #text + 1} end
  1153. if M._standalone then M._text, M.line_state = text, {} end
  1154. if lexer._lex_by_line then
  1155. local line_from_position = M.line_from_position
  1156. local function append(tags, line_tags, offset)
  1157. for i = 1, #line_tags, 2 do
  1158. tags[#tags + 1], tags[#tags + 2] = line_tags[i], line_tags[i + 1] + offset
  1159. end
  1160. end
  1161. local tags = {}
  1162. local offset = 0
  1163. rawset(M, 'line_from_position', function(pos) return line_from_position(pos + offset) end)
  1164. for line in text:gmatch('[^\r\n]*\r?\n?') do
  1165. local line_tags = grammar:match(line)
  1166. if line_tags then append(tags, line_tags, offset) end
  1167. offset = offset + #line
  1168. -- Use the default tag to the end of the line if none was specified.
  1169. if tags[#tags] ~= offset + 1 then
  1170. tags[#tags + 1], tags[#tags + 2] = 'default', offset + 1
  1171. end
  1172. end
  1173. rawset(M, 'line_from_position', line_from_position)
  1174. return tags
  1175. end
  1176. return grammar:match(text)
  1177. end
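-- Illustrative standalone usage (assumes this module and its lexers are on the
-- package path; see load() below): lex a chunk of Lua code and print the resulting
-- tag name/position pairs.
--
--   local lexer = require('lexer')
--   local lua = lexer.load('lua')
--   local tags = lua:lex('local x = 1')
--   for i = 1, #tags, 2 do print(tags[i], tags[i + 1]) end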
  1178. --- Determines fold points in a chunk of text *text* using lexer *lexer*, returning a table of
  1179. -- fold levels associated with line numbers.
  1180. -- *text* starts on line number *start_line* with a beginning fold level of *start_level*
  1181. -- in the buffer.
  1182. -- @param lexer The lexer to fold text with.
  1183. -- @param text The text in the buffer to fold.
  1184. -- @param start_line The line number *text* starts on, counting from 1.
  1185. -- @param start_level The fold level *text* starts on.
  1186. -- @return table of fold levels associated with line numbers.
  1187. function M.fold(lexer, text, start_line, start_level)
  1188. local folds = {}
  1189. if text == '' then return folds end
  1190. local fold = M.property_int['fold'] > 0
  1191. local FOLD_BASE = M.FOLD_BASE or 0x400
  1192. local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER or 0x2000, M.FOLD_BLANK or 0x1000
  1193. if M._standalone then M._text, M.line_state = text, {} end
  1194. if fold and lexer._fold_points then
  1195. local lines = {}
  1196. for p, l in (text .. '\n'):gmatch('()(.-)\r?\n') do lines[#lines + 1] = {p, l} end
  1197. local fold_zero_sum_lines = M.property_int['fold.scintillua.on.zero.sum.lines'] > 0
  1198. local fold_compact = M.property_int['fold.scintillua.compact'] > 0
  1199. local fold_points = lexer._fold_points
  1200. local fold_point_symbols = fold_points._symbols
  1201. local style_at, fold_level = M.style_at, M.fold_level
  1202. local line_num, prev_level = start_line, start_level
  1203. local current_level = prev_level
  1204. for _, captures in ipairs(lines) do
  1205. local pos, line = captures[1], captures[2]
  1206. if line ~= '' then
  1207. if lexer._case_insensitive_fold_points then line = line:lower() end
  1208. local ranges = {}
  1209. local function is_valid_range(s, e)
  1210. if not s or not e then return false end
  1211. for i = 1, #ranges - 1, 2 do
  1212. local range_s, range_e = ranges[i], ranges[i + 1]
  1213. if s >= range_s and s <= range_e or e >= range_s and e <= range_e then
  1214. return false
  1215. end
  1216. end
  1217. ranges[#ranges + 1] = s
  1218. ranges[#ranges + 1] = e
  1219. return true
  1220. end
  1221. local level_decreased = false
  1222. for _, symbol in ipairs(fold_point_symbols) do
  1223. local word = not symbol:find('[^%w_]')
  1224. local s, e = line:find(symbol, 1, true)
  1225. while is_valid_range(s, e) do
  1226. -- if not word or line:find('^%f[%w_]' .. symbol .. '%f[^%w_]', s) then
  1227. local word_before = s > 1 and line:find('^[%w_]', s - 1)
  1228. local word_after = line:find('^[%w_]', e + 1)
  1229. if not word or not (word_before or word_after) then
  1230. local style_name = style_at[pos + s - 1]
  1231. local symbols = fold_points[style_name]
  1232. if not symbols and style_name:find('%.') then
  1233. symbols = fold_points[style_name:match('^[^.]+')]
  1234. end
  1235. local level = symbols and symbols[symbol]
  1236. if type(level) == 'function' then
  1237. level = level(text, pos, line, s, symbol)
  1238. end
  1239. if type(level) == 'number' then
  1240. current_level = current_level + level
  1241. if level < 0 and current_level < prev_level then
  1242. -- Potential zero-sum line. If the level were to go back up on the same line,
  1243. -- the line may be marked as a fold header.
  1244. level_decreased = true
  1245. end
  1246. end
  1247. end
  1248. s, e = line:find(symbol, s + 1, true)
  1249. end
  1250. end
  1251. folds[line_num] = prev_level
  1252. if current_level > prev_level then
  1253. folds[line_num] = prev_level + FOLD_HEADER
  1254. elseif level_decreased and current_level == prev_level and fold_zero_sum_lines then
  1255. if line_num > start_line then
  1256. folds[line_num] = prev_level - 1 + FOLD_HEADER
  1257. else
  1258. -- Typing within a zero-sum line.
  1259. local level = fold_level[line_num] - 1
  1260. if level > FOLD_HEADER then level = level - FOLD_HEADER end
  1261. if level > FOLD_BLANK then level = level - FOLD_BLANK end
  1262. folds[line_num] = level + FOLD_HEADER
  1263. current_level = current_level + 1
  1264. end
  1265. end
  1266. if current_level < FOLD_BASE then current_level = FOLD_BASE end
  1267. prev_level = current_level
  1268. else
  1269. folds[line_num] = prev_level + (fold_compact and FOLD_BLANK or 0)
  1270. end
  1271. line_num = line_num + 1
  1272. end
  1273. elseif fold and
  1274. (lexer._fold_by_indentation or M.property_int['fold.scintillua.by.indentation'] > 0) then
  1275. -- Indentation based folding.
  1276. -- Calculate indentation per line.
  1277. local indentation = {}
  1278. for indent, line in (text .. '\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
  1279. indentation[#indentation + 1] = line ~= '' and #indent
  1280. end
  1281. -- Find the first non-blank line before start_line. If the current line is indented, make
1282. -- that previous line a header and update the levels of any blank lines in between. If the
  1283. -- current line is blank, match the level of the previous non-blank line.
  1284. local current_level = start_level
  1285. for i = start_line, 1, -1 do
  1286. local level = M.fold_level[i]
  1287. if level >= FOLD_HEADER then level = level - FOLD_HEADER end
  1288. if level < FOLD_BLANK then
  1289. local indent = M.indent_amount[i]
  1290. if indentation[1] and indentation[1] > indent then
  1291. folds[i] = FOLD_BASE + indent + FOLD_HEADER
  1292. for j = i + 1, start_line - 1 do folds[j] = start_level + FOLD_BLANK end
  1293. elseif not indentation[1] then
  1294. current_level = FOLD_BASE + indent
  1295. end
  1296. break
  1297. end
  1298. end
  1299. -- Iterate over lines, setting fold numbers and fold flags.
  1300. for i = 1, #indentation do
  1301. if indentation[i] then
  1302. current_level = FOLD_BASE + indentation[i]
  1303. folds[start_line + i - 1] = current_level
  1304. for j = i + 1, #indentation do
  1305. if indentation[j] then
  1306. if FOLD_BASE + indentation[j] > current_level then
  1307. folds[start_line + i - 1] = current_level + FOLD_HEADER
  1308. current_level = FOLD_BASE + indentation[j] -- for any blanks below
  1309. end
  1310. break
  1311. end
  1312. end
  1313. else
  1314. folds[start_line + i - 1] = current_level + FOLD_BLANK
  1315. end
  1316. end
  1317. else
  1318. -- No folding, reset fold levels if necessary.
  1319. local current_line = start_line
  1320. for _ in text:gmatch('\r?\n') do
  1321. folds[current_line] = start_level
  1322. current_line = current_line + 1
  1323. end
  1324. end
  1325. return folds
  1326. end
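-- Illustrative call shape (in a Scintilla host, the `style_at` and `fold_level`
-- tables and the fold properties are provided by the editor; names follow the
-- code above, `lex` and `text` are assumptions):
--
--   lexer.property['fold'] = 1 -- enable fold point calculation
--   local levels = lex:fold(text, 1, lexer.FOLD_BASE)
--   -- levels[line_num] holds the fold level bit-mask for each line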
1327. --- Creates and returns a new lexer with the given name.
  1328. -- @param name The lexer's name.
  1329. -- @param opts Table of lexer options. Options currently supported:
  1330. -- - `lex_by_line`: Whether or not the lexer only processes whole lines of text (instead of
  1331. -- arbitrary chunks of text) at a time. Line lexers cannot look ahead to subsequent lines.
  1332. -- The default value is `false`.
1333. -- - `fold_by_indentation`: Whether or not fold points are calculated based on changes in
1334. -- line indentation, rather than from fold points defined by the lexer. The default value
1335. -- is `false`.
  1336. -- - `case_insensitive_fold_points`: Whether or not fold points added via
  1337. -- `lexer.add_fold_point()` ignore case. The default value is `false`.
  1338. -- - `no_user_word_lists`: Does not automatically allocate word lists that can be set by
  1339. -- users. This should really only be set by non-programming languages like markup languages.
  1340. -- - `inherit`: Lexer to inherit from. The default value is `nil`.
  1341. -- @usage lexer.new('rhtml', {inherit = lexer.load('html')})
  1342. function M.new(name, opts)
  1343. local lexer = setmetatable({
  1344. _name = assert(name, 'lexer name expected'), _lex_by_line = opts and opts['lex_by_line'],
  1345. _fold_by_indentation = opts and opts['fold_by_indentation'],
  1346. _case_insensitive_fold_points = opts and opts['case_insensitive_fold_points'],
  1347. _no_user_word_lists = opts and opts['no_user_word_lists'], _lexer = opts and opts['inherit']
  1348. }, {
  1349. __index = {
  1350. tag = M.tag, word_match = M.word_match, set_word_list = M.set_word_list,
  1351. add_rule = M.add_rule, modify_rule = M.modify_rule, get_rule = M.get_rule,
  1352. add_fold_point = M.add_fold_point, embed = M.embed, lex = M.lex, fold = M.fold, --
  1353. add_style = function() end -- legacy
  1354. }
  1355. })
  1356. -- Add initial whitespace rule.
  1357. -- Use a unique whitespace tag name since embedded lexing relies on these unique names.
  1358. lexer:add_rule('whitespace', lexer:tag('whitespace.' .. name, M.space^1))
  1359. return lexer
  1360. end
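-- Illustrative sketch of a minimal lexer built with new() (the language name and
-- word list are hypothetical):
--
--   local lex = lexer.new('mylang')
--   lex:add_rule('keyword', lex:tag(lexer.KEYWORD, lexer.word_match('if else end')))
--   lex:add_rule('number', lex:tag(lexer.NUMBER, lexer.number))
--   lex:add_fold_point(lexer.KEYWORD, 'if', 'end')
--   return lex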
  1361. --- Creates a substitute for some Scintilla tables and functions that Scintillua depends on
  1362. -- when using it as a standalone module.
  1363. local function initialize_standalone_library()
  1364. M.property = setmetatable({['scintillua.lexers'] = package.path:gsub('/%?%.lua', '/lexers')}, {
  1365. __index = function() return '' end, __newindex = function(t, k, v) rawset(t, k, tostring(v)) end
  1366. })
  1367. M.line_from_position = function(pos)
  1368. local line = 1
  1369. for s in M._text:gmatch('[^\n]*()') do
  1370. if pos <= s then return line end
  1371. line = line + 1
  1372. end
  1373. return line - 1 -- should not get to here
  1374. end
  1375. M.indent_amount = setmetatable({}, {
  1376. __index = function(_, line)
  1377. local current_line = 1
  1378. for s in M._text:gmatch('()[^\n]*') do
  1379. if current_line == line then
  1380. return #M._text:match('^[ \t]*', s):gsub('\t', string.rep(' ', 8))
  1381. end
  1382. current_line = current_line + 1
  1383. end
  1384. end
  1385. })
  1386. M._standalone = true
  1387. end
  1388. --- Searches for the given *name* in the given *path*.
  1389. -- This is a safe implementation of Lua 5.2's `package.searchpath()` function that does not
  1390. -- require the package module to be loaded.
  1391. local function searchpath(name, path)
  1392. local tried = {}
  1393. for part in path:gmatch('[^;]+') do
  1394. local filename = part:gsub('%?', name)
  1395. local ok, errmsg = loadfile(filename)
  1396. if ok or not errmsg:find('cannot open') then return filename end
  1397. tried[#tried + 1] = string.format("no file '%s'", filename)
  1398. end
  1399. return nil, table.concat(tried, '\n')
  1400. end
  1401. --- Initializes or loads and then returns the lexer of string name *name*.
  1402. -- Scintilla calls this function in order to load a lexer. Parent lexers also call this function
  1403. -- in order to load child lexers and vice-versa. The user calls this function in order to load
  1404. -- a lexer when using Scintillua as a Lua library.
  1405. -- @param name The name of the lexing language.
  1406. -- @param[opt] alt_name Optional alternate name of the lexing language. This is useful for
  1407. -- embedding the same child lexer with multiple sets of start and end tags.
  1408. -- @return lexer object
  1409. function M.load(name, alt_name)
  1410. assert(name, 'no lexer given')
  1411. if not M.property then initialize_standalone_library() end
  1412. if not M.property_int then
  1413. -- Separate from initialize_standalone_library() so applications that choose to define
  1414. -- M.property do not also have to define this.
  1415. M.property_int = setmetatable({}, {
  1416. __index = function(t, k) return tonumber(M.property[k]) or 0 end,
  1417. __newindex = function() error('read-only property') end
  1418. })
  1419. end
  1420. -- Load the language lexer with its rules, tags, etc.
  1421. local path = M.property['scintillua.lexers']:gsub(';', '/?.lua;') .. '/?.lua'
  1422. local ro_lexer = setmetatable({
  1423. WHITESPACE = 'whitespace.' .. (alt_name or name) -- legacy
  1424. }, {__index = M})
  1425. local env = {
  1426. 'assert', 'error', 'ipairs', 'math', 'next', 'pairs', 'print', 'select', 'string', 'table',
  1427. 'tonumber', 'tostring', 'type', 'utf8', '_VERSION', lexer = ro_lexer, lpeg = lpeg, --
  1428. require = function() return ro_lexer end -- legacy
  1429. }
  1430. for _, name in ipairs(env) do env[name] = _G[name] end
  1431. local lexer = assert(loadfile(assert(searchpath(name, path)), 't', env))(alt_name or name)
  1432. assert(lexer, string.format("'%s.lua' did not return a lexer", name))
  1433. -- If the lexer is a proxy or a child that embedded itself, set the parent to be the main
  1434. -- lexer. Keep a reference to the old parent name since embedded child start and end rules
  1435. -- reference and use that name.
  1436. if lexer._lexer then
  1437. lexer = lexer._lexer
  1438. lexer._parent_name, lexer._name = lexer._name, alt_name or name
  1439. end
  1440. M.property['scintillua.comment.' .. (alt_name or name)] = M.property['scintillua.comment']
  1441. return lexer
  1442. end
  1443. --- Returns a list of all known lexer names.
  1444. -- This function is not available to lexers and requires the LuaFileSystem (`lfs`) module to
  1445. -- be available.
  1446. -- @param[opt] path Optional ';'-delimited list of directories to search for lexers in. The
  1447. -- default value is Scintillua's configured lexer path.
  1448. -- @return lexer name list
  1449. function M.names(path)
  1450. local lfs = require('lfs')
  1451. if not path then path = M.property and M.property['scintillua.lexers'] end
  1452. if not path or path == '' then
  1453. for part in package.path:gmatch('[^;]+') do
  1454. local dir = part:match('^(.-[/\\]?lexers)[/\\]%?%.lua$')
  1455. if dir then
  1456. path = dir
  1457. break
  1458. end
  1459. end
  1460. end
  1461. local lexers = {}
  1462. for dir in assert(path, 'lexer path not configured or found'):gmatch('[^;]+') do
  1463. if lfs.attributes(dir, 'mode') == 'directory' then
  1464. for file in lfs.dir(dir) do
  1465. local name = file:match('^(.+)%.lua$')
  1466. if name and name ~= 'lexer' and not lexers[name] then
  1467. lexers[#lexers + 1], lexers[name] = name, true
  1468. end
  1469. end
  1470. end
  1471. end
  1472. table.sort(lexers)
  1473. return lexers
  1474. end
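-- Illustrative: print the names of all installed lexers (requires the LuaFileSystem
-- module, as noted above).
--
--   for _, name in ipairs(lexer.names()) do print(name) end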
  1475. --- Map of file extensions, without the '.' prefix, to their associated lexer names.
  1476. -- This map has precedence over Scintillua's built-in map.
  1477. -- @see detect
  1478. M.detect_extensions = {}
  1479. --- Map of line patterns to their associated lexer names.
  1480. -- These are Lua string patterns, not LPeg patterns.
  1481. -- This map has precedence over Scintillua's built-in map.
  1482. -- @see detect
  1483. M.detect_patterns = {}
  1484. --- Returns the name of the lexer often associated with filename *filename* and/or content
  1485. -- line *line*.
  1486. -- @param[opt] filename Optional string filename. The default value is read from the
  1487. -- 'lexer.scintillua.filename' property.
  1488. -- @param[opt] line Optional string first content line, such as a shebang line. The default
  1489. -- value is read from the 'lexer.scintillua.line' property.
  1490. -- @return string lexer name to pass to `load()`, or `nil` if none was detected
  1491. -- @see detect_extensions
  1492. -- @see detect_patterns
  1493. function M.detect(filename, line)
  1494. if not filename then filename = M.property and M.property['lexer.scintillua.filename'] or '' end
  1495. if not line then line = M.property and M.property['lexer.scintillua.line'] or '' end
  1496. -- Locally scoped in order to avoid persistence in memory.
  1497. local extensions = {
  1498. as = 'actionscript', asc = 'actionscript', --
  1499. adb = 'ada', ads = 'ada', --
  1500. g = 'antlr', g4 = 'antlr', --
  1501. ans = 'apdl', inp = 'apdl', mac = 'apdl', --
  1502. apl = 'apl', --
  1503. applescript = 'applescript', --
  1504. asm = 'asm', ASM = 'asm', s = 'asm', S = 'asm', --
  1505. asa = 'asp', asp = 'asp', hta = 'asp', --
  1506. ahk = 'autohotkey', --
  1507. au3 = 'autoit', a3x = 'autoit', --
  1508. awk = 'awk', --
  1509. bat = 'batch', cmd = 'batch', --
  1510. bib = 'bibtex', --
  1511. boo = 'boo', --
  1512. cs = 'csharp', --
  1513. c = 'ansi_c', C = 'ansi_c', cc = 'cpp', cpp = 'cpp', cxx = 'cpp', ['c++'] = 'cpp', h = 'cpp',
  1514. hh = 'cpp', hpp = 'cpp', hxx = 'cpp', ['h++'] = 'cpp', --
  1515. ck = 'chuck', --
  1516. clj = 'clojure', cljs = 'clojure', cljc = 'clojure', edn = 'clojure', --
  1517. ['CMakeLists.txt'] = 'cmake', cmake = 'cmake', ['cmake.in'] = 'cmake', ctest = 'cmake',
  1518. ['ctest.in'] = 'cmake', --
  1519. coffee = 'coffeescript', --
  1520. cr = 'crystal', --
  1521. css = 'css', --
  1522. cu = 'cuda', cuh = 'cuda', --
  1523. d = 'dmd', di = 'dmd', --
  1524. dart = 'dart', --
  1525. desktop = 'desktop', --
  1526. diff = 'diff', patch = 'diff', --
  1527. Dockerfile = 'dockerfile', --
  1528. dot = 'dot', --
  1529. e = 'eiffel', eif = 'eiffel', --
  1530. ex = 'elixir', exs = 'elixir', --
  1531. elm = 'elm', --
  1532. erl = 'erlang', hrl = 'erlang', --
  1533. fs = 'fsharp', --
  1534. fan = 'fantom', --
  1535. dsp = 'faust', --
  1536. fnl = 'fennel', --
  1537. fish = 'fish', --
  1538. forth = 'forth', frt = 'forth', --
  1539. f = 'fortran', ['for'] = 'fortran', ftn = 'fortran', fpp = 'fortran', f77 = 'fortran',
  1540. f90 = 'fortran', f95 = 'fortran', f03 = 'fortran', f08 = 'fortran', --
  1541. fstab = 'fstab', --
  1542. gd = 'gap', gi = 'gap', gap = 'gap', --
  1543. gmi = 'gemini', --
  1544. po = 'gettext', pot = 'gettext', --
  1545. feature = 'gherkin', --
  1546. gleam = 'gleam', --
  1547. glslf = 'glsl', glslv = 'glsl', --
  1548. dem = 'gnuplot', plt = 'gnuplot', --
  1549. go = 'go', --
  1550. groovy = 'groovy', gvy = 'groovy', --
  1551. gtkrc = 'gtkrc', --
  1552. ha = 'hare', --
  1553. hs = 'haskell', --
  1554. htm = 'html', html = 'html', shtm = 'html', shtml = 'html', xhtml = 'html', vue = 'html', --
  1555. icn = 'icon', --
  1556. idl = 'idl', odl = 'idl', --
  1557. ni = 'inform', --
  1558. cfg = 'ini', cnf = 'ini', inf = 'ini', ini = 'ini', reg = 'ini', --
  1559. io = 'io_lang', --
  1560. bsh = 'java', java = 'java', --
  1561. js = 'javascript', jsfl = 'javascript', --
  1562. jq = 'jq', --
  1563. json = 'json', --
  1564. jsp = 'jsp', --
  1565. jl = 'julia', --
  1566. bbl = 'latex', dtx = 'latex', ins = 'latex', ltx = 'latex', tex = 'latex', sty = 'latex', --
  1567. ledger = 'ledger', journal = 'ledger', --
  1568. less = 'less', --
  1569. lily = 'lilypond', ly = 'lilypond', --
  1570. cl = 'lisp', el = 'lisp', lisp = 'lisp', lsp = 'lisp', --
  1571. litcoffee = 'litcoffee', --
  1572. lgt = 'logtalk', --
  1573. lua = 'lua', --
  1574. GNUmakefile = 'makefile', iface = 'makefile', mak = 'makefile', makefile = 'makefile',
  1575. Makefile = 'makefile', --
  1576. md = 'markdown', --
  1577. ['meson.build'] = 'meson', --
  1578. moon = 'moonscript', --
  1579. myr = 'myrddin', --
  1580. n = 'nemerle', --
  1581. link = 'networkd', network = 'networkd', netdev = 'networkd', --
  1582. nim = 'nim', --
  1583. nsh = 'nsis', nsi = 'nsis', nsis = 'nsis', --
  1584. obs = 'objeck', --
  1585. m = 'objective_c', mm = 'objective_c', objc = 'objective_c', --
  1586. caml = 'caml', ml = 'caml', mli = 'caml', mll = 'caml', mly = 'caml', --
  1587. dpk = 'pascal', dpr = 'pascal', p = 'pascal', pas = 'pascal', --
  1588. al = 'perl', perl = 'perl', pl = 'perl', pm = 'perl', pod = 'perl', --
  1589. inc = 'php', php = 'php', php3 = 'php', php4 = 'php', phtml = 'php', --
  1590. p8 = 'pico8', --
  1591. pike = 'pike', pmod = 'pike', --
  1592. PKGBUILD = 'pkgbuild', --
  1593. pony = 'pony', --
  1594. eps = 'ps', ps = 'ps', --
  1595. ps1 = 'powershell', --
  1596. prolog = 'prolog', --
  1597. props = 'props', properties = 'props', --
  1598. proto = 'protobuf', --
  1599. pure = 'pure', --
  1600. sc = 'python', py = 'python', pyw = 'python', --
  1601. R = 'rstats', Rout = 'rstats', Rhistory = 'rstats', Rt = 'rstats', ['Rout.save'] = 'rstats',
  1602. ['Rout.fail'] = 'rstats', --
  1603. re = 'reason', --
  1604. r = 'rebol', reb = 'rebol', --
  1605. rst = 'rest', --
  1606. orx = 'rexx', rex = 'rexx', --
  1607. erb = 'rhtml', rhtml = 'rhtml', --
  1608. rsc = 'routeros', --
  1609. spec = 'rpmspec', --
  1610. Rakefile = 'ruby', rake = 'ruby', rb = 'ruby', rbw = 'ruby', --
  1611. rs = 'rust', --
  1612. sass = 'sass', scss = 'sass', --
  1613. scala = 'scala', --
  1614. sch = 'scheme', scm = 'scheme', --
  1615. bash = 'bash', bashrc = 'bash', bash_profile = 'bash', configure = 'bash', csh = 'bash',
  1616. ksh = 'bash', mksh = 'bash', sh = 'bash', zsh = 'bash', --
  1617. changes = 'smalltalk', st = 'smalltalk', sources = 'smalltalk', --
  1618. sml = 'sml', fun = 'sml', sig = 'sml', --
  1619. sno = 'snobol4', SNO = 'snobol4', --
  1620. spin = 'spin', --
  1621. ddl = 'sql', sql = 'sql', --
  1622. automount = 'systemd', device = 'systemd', mount = 'systemd', path = 'systemd',
  1623. scope = 'systemd', service = 'systemd', slice = 'systemd', socket = 'systemd', swap = 'systemd',
  1624. target = 'systemd', timer = 'systemd', --
  1625. taskpaper = 'taskpaper', --
  1626. tcl = 'tcl', tk = 'tcl', --
  1627. texi = 'texinfo', --
  1628. toml = 'toml', --
  1629. ['1'] = 'troff', ['2'] = 'troff', ['3'] = 'troff', ['4'] = 'troff', ['5'] = 'troff',
  1630. ['6'] = 'troff', ['7'] = 'troff', ['8'] = 'troff', ['9'] = 'troff', ['1x'] = 'troff',
  1631. ['2x'] = 'troff', ['3x'] = 'troff', ['4x'] = 'troff', ['5x'] = 'troff', ['6x'] = 'troff',
  1632. ['7x'] = 'troff', ['8x'] = 'troff', ['9x'] = 'troff', --
  1633. t2t = 'txt2tags', --
  1634. ts = 'typescript', --
  1635. vala = 'vala', --
  1636. vcf = 'vcard', vcard = 'vcard', --
  1637. v = 'verilog', ver = 'verilog', --
  1638. vh = 'vhdl', vhd = 'vhdl', vhdl = 'vhdl', --
  1639. bas = 'vb', cls = 'vb', ctl = 'vb', dob = 'vb', dsm = 'vb', dsr = 'vb', frm = 'vb', pag = 'vb',
  1640. vb = 'vb', vba = 'vb', vbs = 'vb', --
  1641. wsf = 'wsf', --
  1642. dtd = 'xml', svg = 'xml', xml = 'xml', xsd = 'xml', xsl = 'xml', xslt = 'xml', xul = 'xml', --
  1643. xs = 'xs', xsin = 'xs', xsrc = 'xs', --
  1644. xtend = 'xtend', --
  1645. yaml = 'yaml', yml = 'yaml', --
  1646. zig = 'zig'
  1647. }
  1648. local patterns = {
  1649. ['^#!.+[/ ][gm]?awk'] = 'awk', ['^#!.+[/ ]lua'] = 'lua', ['^#!.+[/ ]octave'] = 'matlab',
  1650. ['^#!.+[/ ]perl'] = 'perl', ['^#!.+[/ ]php'] = 'php', ['^#!.+[/ ]python'] = 'python',
  1651. ['^#!.+[/ ]ruby'] = 'ruby', ['^#!.+[/ ]bash'] = 'bash', ['^#!.+/m?ksh'] = 'bash',
  1652. ['^#!.+/sh'] = 'bash', ['^%s*class%s+%S+%s*<%s*ApplicationController'] = 'rails',
  1653. ['^%s*class%s+%S+%s*<%s*ActionController::Base'] = 'rails',
  1654. ['^%s*class%s+%S+%s*<%s*ActiveRecord::Base'] = 'rails',
  1655. ['^%s*class%s+%S+%s*<%s*ActiveRecord::Migration'] = 'rails', ['^%s*<%?xml%s'] = 'xml',
  1656. ['^#cloud%-config'] = 'yaml'
  1657. }
  1658. for patt, name in pairs(M.detect_patterns) do if line:find(patt) then return name end end
  1659. for patt, name in pairs(patterns) do if line:find(patt) then return name end end
  1660. local name, ext = filename:match('[^/\\]+$'), filename:match('[^.]*$')
  1661. return M.detect_extensions[name] or extensions[name] or M.detect_extensions[ext] or
  1662. extensions[ext]
  1663. end
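-- Illustrative sketch: detect and then load a lexer for a file (the custom
-- extension mapping is hypothetical).
--
--   lexer.detect_extensions.luadoc = 'lua' -- treat *.luadoc files as Lua
--   local name = lexer.detect('init.lua') --> 'lua'
--   if name then local lex = lexer.load(name) end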
  1664. -- The following are utility functions lexers will have access to.
  1665. -- Common patterns.
  1666. --- A pattern that matches any single character.
  1667. M.any = P(1)
  1668. --- A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').
  1669. M.alpha = R('AZ', 'az')
  1670. --- A pattern that matches any digit ('0'-'9').
  1671. M.digit = R('09')
  1672. --- A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z', '0'-'9').
  1673. M.alnum = R('AZ', 'az', '09')
  1674. --- A pattern that matches any lower case character ('a'-'z').
  1675. M.lower = R('az')
  1676. --- A pattern that matches any upper case character ('A'-'Z').
  1677. M.upper = R('AZ')
  1678. --- A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
  1679. M.xdigit = R('09', 'AF', 'af')
  1680. --- A pattern that matches any graphical character ('!' to '~').
  1681. M.graph = R('!~')
1682. --- A pattern that matches any punctuation character ('!' to '/', ':' to '@', '[' to '`', '{'
1683. -- to '~').
1684. M.punct = R('!/', ':@', '[`', '{~')
  1685. --- A pattern that matches any whitespace character ('\t', '\v', '\f', '\n', '\r', space).
  1686. M.space = S('\t\v\f\n\r ')
  1687. --- A pattern that matches a sequence of end of line characters.
  1688. M.newline = P('\r')^-1 * '\n'
  1689. --- A pattern that matches any single, non-newline character.
  1690. M.nonnewline = 1 - M.newline
  1691. --- Returns a pattern that matches a decimal number, whose digits may be separated by character
  1692. -- *c*.
  1693. function M.dec_num_(c) return M.digit * (P(c)^-1 * M.digit)^0 end
  1694. --- Returns a pattern that matches a hexadecimal number, whose digits may be separated by
  1695. -- character *c*.
  1696. function M.hex_num_(c) return '0' * S('xX') * (P(c)^-1 * M.xdigit)^1 end
  1697. --- Returns a pattern that matches an octal number, whose digits may be separated by character *c*.
  1698. function M.oct_num_(c) return '0' * (P(c)^-1 * R('07'))^1 * -M.xdigit end
  1699. --- Returns a pattern that matches a binary number, whose digits may be separated by character *c*.
  1700. function M.bin_num_(c) return '0' * S('bB') * (P(c)^-1 * S('01'))^1 * -M.xdigit end
  1701. --- Returns a pattern that matches either a decimal, hexadecimal, octal, or binary number,
  1702. -- whose digits may be separated by character *c*.
  1703. function M.integer_(c)
  1704. return S('+-')^-1 * (M.hex_num_(c) + M.bin_num_(c) + M.oct_num_(c) + M.dec_num_(c))
  1705. end
  1706. local function exp_(c) return S('eE') * S('+-')^-1 * M.digit * (P(c)^-1 * M.digit)^0 end
  1707. --- Returns a pattern that matches a floating point number, whose digits may be separated by
  1708. -- character *c*.
  1709. function M.float_(c)
  1710. return S('+-')^-1 *
  1711. ((M.dec_num_(c)^-1 * '.' * M.dec_num_(c) + M.dec_num_(c) * '.' * M.dec_num_(c)^-1 * -P('.')) *
  1712. exp_(c)^-1 + (M.dec_num_(c) * exp_(c)))
  1713. end
  1714. --- Returns a pattern that matches a typical number, either a floating point, decimal, hexadecimal,
  1715. -- octal, or binary number, and whose digits may be separated by character *c*.
  1716. function M.number_(c) return M.float_(c) + M.integer_(c) end
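-- Illustrative: numbers whose digits may be separated by "'", as in C++14's
-- 1'000'000 (the variable name is arbitrary).
--
--   local cpp_number = lexer.number_("'")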
  1717. --- A pattern that matches a decimal number.
  1718. M.dec_num = M.dec_num_(false)
  1719. --- A pattern that matches a hexadecimal number.
  1720. M.hex_num = M.hex_num_(false)
  1721. --- A pattern that matches an octal number.
  1722. M.oct_num = M.oct_num_(false)
  1723. --- A pattern that matches a binary number.
  1724. M.bin_num = M.bin_num_(false)
  1725. --- A pattern that matches either a decimal, hexadecimal, octal, or binary number.
  1726. M.integer = M.integer_(false)
  1727. --- A pattern that matches a floating point number.
  1728. M.float = M.float_(false)
  1729. --- A pattern that matches a typical number, either a floating point, decimal, hexadecimal,
  1730. -- octal, or binary number.
  1731. M.number = M.number_(false)
  1732. --- A pattern that matches a typical word. Words begin with a letter or underscore and consist
  1733. -- of alphanumeric and underscore characters.
  1734. M.word = (M.alpha + '_') * (M.alnum + '_')^0
  1735. --- Creates and returns a pattern that matches from string or pattern *prefix* until the end of
  1736. -- the line.
  1737. -- *escape* indicates whether the end of the line can be escaped with a '\' character.
  1738. -- @param[opt] prefix Optional string or pattern prefix to start matching at. The default value
  1739. -- is any non-newline character.
  1740. -- @param[opt] escape Optional flag indicating whether or not newlines can be escaped by a '\'
  1741. -- character. The default value is `false`.
  1742. -- @return pattern
  1743. -- @usage local line_comment = lexer.to_eol('//')
  1744. -- @usage local line_comment = lexer.to_eol(S('#;'))
  1745. function M.to_eol(prefix, escape)
  1746. return (prefix or M.nonnewline) *
  1747. (not escape and M.nonnewline or 1 - (M.newline + '\\') + '\\' * M.any)^0
  1748. end
  1749. --- Creates and returns a pattern that matches a range of text bounded by strings or patterns *s*
  1750. -- and *e*.
  1751. -- This is a convenience function for matching more complicated ranges like strings with escape
  1752. -- characters, balanced parentheses, and block comments (nested or not). *e* is optional and
  1753. -- defaults to *s*. *single_line* indicates whether or not the range must be on a single line;
  1754. -- *escapes* indicates whether or not to allow '\' as an escape character; and *balanced*
  1755. -- indicates whether or not to handle balanced ranges like parentheses, and requires *s* and *e*
  1756. -- to be different.
  1757. -- @param s String or pattern start of a range.
  1758. -- @param[opt] e Optional string or pattern end of a range. The default value is *s*.
  1759. -- @param[opt] single_line Optional flag indicating whether or not the range must be on a single
  1760. -- line. The default value is `false`.
  1761. -- @param[opt] escapes Optional flag indicating whether or not the range end may be escaped
  1762. -- by a '\' character. The default value is `false` unless *s* and *e* are identical,
  1763. -- single-character strings. In that case, the default value is `true`.
  1764. -- @param[opt] balanced Optional flag indicating whether or not to match a balanced range,
  1765. -- like the "%b" Lua pattern. This flag only applies if *s* and *e* are different.
  1766. -- @return pattern
  1767. -- @usage local dq_str_escapes = lexer.range('"')
  1768. -- @usage local dq_str_noescapes = lexer.range('"', false, false)
  1769. -- @usage local unbalanced_parens = lexer.range('(', ')')
  1770. -- @usage local balanced_parens = lexer.range('(', ')', false, false, true)
  1771. function M.range(s, e, single_line, escapes, balanced)
  1772. if type(e) ~= 'string' and type(e) ~= 'userdata' then
  1773. e, single_line, escapes, balanced = s, e, single_line, escapes
  1774. end
  1775. local any = M.any - e
  1776. if single_line then any = any - '\n' end
  1777. if balanced then any = any - s end
  1778. -- Only allow escapes by default for ranges with identical, single-character string delimiters.
  1779. if escapes == nil then escapes = type(s) == 'string' and #s == 1 and s == e end
  1780. if escapes then any = any - '\\' + '\\' * M.any end
  1781. if balanced and s ~= e then return P{s * (any + V(1))^0 * P(e)^-1} end
  1782. return s * any^0 * P(e)^-1
  1783. end
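-- Illustrative: a balanced range also handles nested block comments like D's
-- /+ ... +/ (the delimiters are an assumption for demonstration).
--
--   local nested_comment = lexer.range('/+', '+/', false, false, true)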
  1784. --- Creates and returns a pattern that matches pattern *patt* only when it comes after one of
  1785. -- the characters in string *set* (or when there are no characters behind *patt*), skipping
  1786. -- over any characters in string *skip*, which is whitespace by default.
  1787. -- @param set String character set like one passed to `lpeg.S()`.
  1788. -- @param patt The LPeg pattern to match after a set character.
  1789. -- @param skip String character set to skip over. The default value is ' \t\r\n\v\f' (whitespace).
  1790. -- @usage local regex = lexer.after_set('+-*!%^&|=,([{', lexer.range('/'))
  1791. function M.after_set(set, patt, skip)
  1792. if not skip then skip = ' \t\r\n\v\f' end
  1793. local set_chars, skip_chars = {}, {}
  1794. -- Note: cannot use utf8.codes() because Lua 5.1 is still supported.
  1795. for char in set:gmatch('.') do set_chars[string.byte(char)] = true end
  1796. for char in skip:gmatch('.') do skip_chars[string.byte(char)] = true end
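-- First alternative: a set character (or the start of input) immediately precedes
-- patt. Second alternative: match patt, then scan backwards over skip characters
-- before checking for a set character or the start of input.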
  1797. return (B(S(set)) + -B(1)) * patt + Cmt(C(patt), function(input, index, match, ...)
  1798. local pos = index - #match
  1799. if #skip > 0 then while pos > 1 and skip_chars[input:byte(pos - 1)] do pos = pos - 1 end end
  1800. if pos == 1 or set_chars[input:byte(pos - 1)] then return index, ... end
  1801. return nil
  1802. end)
  1803. end
  1804. --- Creates and returns a pattern that matches pattern *patt* only at the beginning of a line,
  1805. -- or after any line indentation if *allow_indent* is `true`.
  1806. -- @param patt The LPeg pattern to match on the beginning of a line.
  1807. -- @param allow_indent Whether or not to consider line indentation as the start of a line. The
  1808. -- default value is `false`.
  1809. -- @return pattern
  1810. -- @usage local preproc = lex:tag(lexer.PREPROCESSOR, lexer.starts_line(lexer.to_eol('#')))
  1811. function M.starts_line(patt, allow_indent)
  1812. return M.after_set('\r\n\v\f', patt, allow_indent and ' \t' or '')
  1813. end
  1814. M.colors = {} -- legacy
  1815. M.styles = setmetatable({}, { -- legacy
  1816. __index = function() return setmetatable({}, {__concat = function() return nil end}) end,
  1817. __newindex = function() end
  1818. })
  1819. M.property_expanded = setmetatable({}, {__index = function() return '' end}) -- legacy
1820. -- Legacy function that creates and returns a token pattern with token name *name* and pattern
  1821. -- *patt*.
  1822. -- Use `tag()` instead.
  1823. -- @param name The name of token.
  1824. -- @param patt The LPeg pattern associated with the token.
  1825. -- @return pattern
  1826. -- @usage local number = token(lexer.NUMBER, lexer.number)
  1827. -- @usage local addition = token('addition', '+' * lexer.word)
  1828. function M.token(name, patt) return Cc(name) * (P(patt) / 0) * Cp() end
  1829. -- Legacy function that creates and returns a pattern that verifies the first non-whitespace
  1830. -- character behind the current match position is in string set *s*.
  1831. -- @param s String character set like one passed to `lpeg.S()`.
  1832. -- @return pattern
  1833. -- @usage local regex = #P('/') * lexer.last_char_includes('+-*!%^&|=,([{') * lexer.range('/')
  1834. function M.last_char_includes(s) return M.after_set(s, true) end
  1835. function M.fold_consecutive_lines() end -- legacy
  1836. -- The functions and fields below were defined in C.
  1837. --- Table of fold level bit-masks for line numbers starting from 1. (Read-only)
  1838. -- Fold level masks are composed of an integer level combined with any of the following bits:
  1839. --
  1840. -- - `lexer.FOLD_BASE`
  1841. -- The initial fold level.
  1842. -- - `lexer.FOLD_BLANK`
  1843. -- The line is blank.
  1844. -- - `lexer.FOLD_HEADER`
  1845. -- The line is a header, or fold point.
  1846. -- @table fold_level
  1847. --- Table of indentation amounts in character columns, for line numbers starting from
  1848. -- 1. (Read-only)
  1849. -- @table indent_amount
  1850. --- Table of integer line states for line numbers starting from 1.
  1851. -- Line states can be used by lexers for keeping track of persistent states. For example,
  1852. -- the output lexer uses this to mark lines that have warnings or errors.
  1853. -- @table line_state
  1854. --- Map of key-value string pairs.
  1855. -- @table property
  1856. --- Map of key-value pairs with values interpreted as numbers, or `0` if not found. (Read-only)
  1857. -- @table property_int
  1858. --- Table of style names at positions in the buffer starting from 1. (Read-only)
  1859. -- @table style_at
  1860. --- Returns the line number (starting from 1) of the line that contains position *pos*, which
  1861. -- starts from 1.
  1862. -- @param pos The position to get the line number of.
  1863. -- @return number
  1864. -- @function line_from_position
  1865. return M