oasis-root

Compiled tree of Oasis Linux based on own branch at <https://hacktivis.me/git/oasis/> git clone https://anongit.hacktivis.me/git/oasis-root.git

lex.1p (37027B)


  1. '\" et
  2. .TH LEX "1P" 2017 "IEEE/The Open Group" "POSIX Programmer's Manual"
  3. .\"
  4. .SH PROLOG
  5. This manual page is part of the POSIX Programmer's Manual.
  6. The Linux implementation of this interface may differ (consult
  7. the corresponding Linux manual page for details of Linux behavior),
  8. or the interface may not be implemented on Linux.
  9. .\"
  10. .SH NAME
  11. lex
  12. \(em generate programs for lexical tasks (\fBDEVELOPMENT\fP)
  13. .SH SYNOPSIS
  14. .LP
  15. .nf
  16. lex \fB[\fR-t\fB] [\fR-n|-v\fB] [\fIfile\fR...\fB]\fR
  17. .fi
  18. .SH DESCRIPTION
  19. The
  20. .IR lex
  21. utility shall generate C programs to be used in lexical processing of
  22. character input, and that can be used as an interface to
  23. .IR yacc .
  24. The C programs shall be generated from
  25. .IR lex
  26. source code and conform to the ISO\ C standard, without depending on any undefined,
  27. unspecified, or implementation-defined behavior, except in cases where
  28. the code is copied directly from the supplied source, or in cases that
  29. are documented by the implementation. Usually, the
  30. .IR lex
  31. utility shall write the program it generates to the file
  32. .BR lex.yy.c ;
  33. the state of this file is unspecified if
  34. .IR lex
  35. exits with a non-zero exit status. See the EXTENDED DESCRIPTION
  36. section for a complete description of the
  37. .IR lex
  38. input language.
  39. .SH OPTIONS
  40. The
  41. .IR lex
  42. utility shall conform to the Base Definitions volume of POSIX.1\(hy2017,
  43. .IR "Section 12.2" ", " "Utility Syntax Guidelines",
  44. except for Guideline 9.
  45. .P
  46. The following options shall be supported:
  47. .IP "\fB\-n\fP" 10
  48. Suppress the summary of statistics usually written with the
  49. .BR \-v
  50. option. If no table sizes are specified in the
  51. .IR lex
  52. source code and the
  53. .BR \-v
  54. option is not specified, then
  55. .BR \-n
  56. is implied.
  57. .IP "\fB\-t\fP" 10
  58. Write the resulting program to standard output instead of
  59. .BR lex.yy.c .
  60. .IP "\fB\-v\fP" 10
  61. Write a summary of
  62. .IR lex
  63. statistics to the standard output. (See the discussion of
  64. .IR lex
  65. table sizes in
  66. .IR "Definitions in lex".)
  67. If the
  68. .BR \-t
  69. option is specified and
  70. .BR \-n
  71. is not specified, this report shall be written to standard error. If
  72. table sizes are specified in the
  73. .IR lex
  74. source code, and if the
  75. .BR \-n
  76. option is not specified, the
  77. .BR \-v
  78. option may be enabled.
  79. .SH OPERANDS
  80. The following operand shall be supported:
  81. .IP "\fIfile\fR" 10
  82. A pathname of an input file. If more than one such
  83. .IR file
  84. is specified, all files shall be concatenated to produce a single
  85. .IR lex
  86. program. If no
  87. .IR file
  88. operands are specified, or if a
  89. .IR file
  90. operand is
  91. .BR '\-' ,
  92. the standard input shall be used.
  93. .SH STDIN
  94. The standard input shall be used if no
  95. .IR file
  96. operands are specified, or if a
  97. .IR file
  98. operand is
  99. .BR '\-' .
  100. See INPUT FILES.
  101. .SH "INPUT FILES"
  102. The input files shall be text files containing
  103. .IR lex
  104. source code, as described in the EXTENDED DESCRIPTION section.
  105. .SH "ENVIRONMENT VARIABLES"
  106. The following environment variables shall affect the execution of
  107. .IR lex :
  108. .IP "\fILANG\fP" 10
  109. Provide a default value for the internationalization variables that are
  110. unset or null. (See the Base Definitions volume of POSIX.1\(hy2017,
  111. .IR "Section 8.2" ", " "Internationalization Variables"
  112. for the precedence of internationalization variables used to determine
  113. the values of locale categories.)
  114. .IP "\fILC_ALL\fP" 10
  115. If set to a non-empty string value, override the values of all the
  116. other internationalization variables.
  117. .IP "\fILC_COLLATE\fP" 10
  118. .br
  119. Determine the locale for the behavior of ranges, equivalence classes,
  120. and multi-character collating elements within regular expressions. If
  121. this variable is not set to the POSIX locale, the results are
  122. unspecified.
  123. .IP "\fILC_CTYPE\fP" 10
  124. Determine the locale for the interpretation of sequences of bytes of
  125. text data as characters (for example, single-byte as opposed to
  126. multi-byte characters in arguments and input files), and the behavior
  127. of character classes within regular expressions. If this variable is
  128. not set to the POSIX locale, the results are unspecified.
  129. .IP "\fILC_MESSAGES\fP" 10
  130. .br
  131. Determine the locale that should be used to affect the format and
  132. contents of diagnostic messages written to standard error.
  133. .IP "\fINLSPATH\fP" 10
  134. Determine the location of message catalogs for the processing of
  135. .IR LC_MESSAGES .
  136. .SH "ASYNCHRONOUS EVENTS"
  137. Default.
  138. .SH STDOUT
  139. If the
  140. .BR \-t
  141. option is specified, the text file of C source code output of
  142. .IR lex
  143. shall be written to standard output.
  144. .P
  145. If the
  146. .BR \-t
  147. option is not specified:
  148. .IP " *" 4
  149. Implementation-defined informational, error, and warning messages
  150. concerning the contents of
  151. .IR lex
  152. source code input shall be written to either the standard output or
  153. standard error.
  154. .IP " *" 4
  155. If the
  156. .BR \-v
  157. option is specified and the
  158. .BR \-n
  159. option is not specified,
  160. .IR lex
  161. statistics shall also be written to either the standard output or
  162. standard error, in an implementation-defined format. These
  163. statistics may also be generated if table sizes are specified with a
  164. .BR '%'
  165. operator in the
  166. .IR Definitions
  167. section, as long as the
  168. .BR \-n
  169. option is not specified.
  170. .SH STDERR
  171. If the
  172. .BR \-t
  173. option is specified, implementation-defined informational, error, and
  174. warning messages concerning the contents of
  175. .IR lex
  176. source code input shall be written to the standard error.
  177. .P
  178. If the
  179. .BR \-t
  180. option is not specified:
  181. .IP " 1." 4
  182. Implementation-defined informational, error, and warning messages
  183. concerning the contents of
  184. .IR lex
  185. source code input shall be written to either the standard output or
  186. standard error.
  187. .IP " 2." 4
  188. If the
  189. .BR \-v
  190. option is specified and the
  191. .BR \-n
  192. option is not specified,
  193. .IR lex
  194. statistics shall also be written to either the standard output or
  195. standard error, in an implementation-defined format. These
  196. statistics may also be generated if table sizes are specified with a
  197. .BR '%'
  198. operator in the
  199. .IR Definitions
  200. section, as long as the
  201. .BR \-n
  202. option is not specified.
  203. .SH "OUTPUT FILES"
  204. A text file containing C source code shall be written to
  205. .BR lex.yy.c ,
  206. or to the standard output if the
  207. .BR \-t
  208. option is present.
  209. .SH "EXTENDED DESCRIPTION"
  210. Each input file shall contain
  211. .IR lex
  212. source code, which is a table of regular expressions with corresponding
  213. actions in the form of C program fragments.
  214. .P
  215. When
  216. .BR lex.yy.c
  217. is compiled and linked with the
  218. .IR lex
  219. library (using the
  220. .BR "\-l\ l"
  221. operand with
  222. .IR c99 ),
  223. the resulting program shall read character input from the standard
  224. input and shall partition it into strings that match the given
  225. expressions.
  226. .br
  227. .P
  228. When an expression is matched, these actions shall occur:
  229. .IP " *" 4
  230. The input string that was matched shall be left in
  231. .IR yytext
  232. as a null-terminated string;
  233. .IR yytext
  234. shall either be an external character array or a pointer to a
  235. character string. As explained in
  236. .IR "Definitions in lex",
  237. the type can be explicitly selected using the
  238. .BR %array
  239. or
  240. .BR %pointer
  241. declarations, but the default is implementation-defined.
  242. .IP " *" 4
  243. The external
  244. .BR int
  245. .IR yyleng
  246. shall be set to the length of the matching string.
  247. .IP " *" 4
  248. The expression's corresponding program fragment, or action, shall be
  249. executed.
  250. .P
  251. During pattern matching,
  252. .IR lex
  253. shall search the set of patterns for the single longest possible
  254. match. Among rules that match the same number of characters, the rule
  255. given first shall be chosen.
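.P
As an illustrative sketch (the token codes
.BR IF
and
.BR ID
are hypothetical and assumed to be defined elsewhere), consider the
following two rules:
.sp
.RS 4
.nf
if         return IF;
[a\-z]+     return ID;
.fi
.P
.RE
.P
For the input
.BR \(dqif\(dq
both rules match two characters, so the rule given first would be
chosen; for the input
.BR \(dqiffy\(dq
the second rule would be chosen because it yields the longer match.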
  256. .P
  257. The general format of
  258. .IR lex
  259. source shall be:
  260. .sp
  261. .RS
  262. .IR Definitions
  263. .BR %%
  264. .IR Rules
  265. .BR %%
  266. .IR "User Subroutines"
  267. .RE
  268. .P
  269. The first
  270. .BR \(dq%%\(dq
  271. is required to mark the beginning of the rules (regular expressions and
  272. actions); the second
  273. .BR \(dq%%\(dq
  274. is required only if user subroutines follow.
  275. .P
  276. Any line in the
  277. .IR Definitions
  278. section beginning with a
  279. <blank>
  280. shall be assumed to be a C program fragment and shall be copied to the
  281. external definition area of the
  282. .BR lex.yy.c
  283. file. Similarly, anything in the
  284. .IR Definitions
  285. section included between delimiter lines containing only
  286. .BR \(dq%{\(dq
  287. and
  288. .BR \(dq%}\(dq
  289. shall also be copied unchanged to the external definition area of the
  290. .BR lex.yy.c
  291. file.
  292. .P
  293. Any such input (beginning with a
  294. <blank>
  295. or within
  296. .BR \(dq%{\(dq
  297. and
  298. .BR \(dq%}\(dq
  299. delimiter lines) appearing at the beginning of the
  300. .IR Rules
  301. section before any rules are specified shall be written to
  302. .BR lex.yy.c
  303. after the declarations of variables for the
  304. \fIyylex\fR()
  305. function and before the first line of code in
  306. \fIyylex\fR().
  307. Thus, user variables local to
  308. \fIyylex\fR()
  309. can be declared here, as well as application code to execute upon entry
  310. to
  311. \fIyylex\fR().
  312. .P
  313. The action taken by
  314. .IR lex
  315. when encountering any input beginning with a
  316. <blank>
  317. or within
  318. .BR \(dq%{\(dq
  319. and
  320. .BR \(dq%}\(dq
  321. delimiter lines appearing in the
  322. .IR Rules
  323. section but coming after one or more rules is undefined. The presence
  324. of such input may result in an erroneous definition of the
  325. \fIyylex\fR()
  326. function.
  327. .P
  328. C-language code in the input shall not contain C-language trigraphs.
  329. The C-language code within
  330. .BR \(dq%{\(dq
  331. and
  332. .BR \(dq%}\(dq
  333. delimiter lines shall not contain any lines consisting only of
  334. .BR \(dq%}\(dq ,
  335. or only of
  336. .BR \(dq%%\(dq .
  337. .SS "Definitions in lex"
  338. .P
  339. .IR Definitions
  340. appear before the first
  341. .BR \(dq%%\(dq
  342. delimiter. Any line in this section not contained between
  343. .BR \(dq%{\(dq
  344. and
  345. .BR \(dq%}\(dq
  346. lines and not beginning with a
  347. <blank>
  348. shall be assumed to define a
  349. .IR lex
  350. substitution string. The format of these lines shall be:
  351. .sp
  352. .RS 4
  353. .nf
  354. \fIname substitute\fR
  355. .fi
  356. .P
  357. .RE
  358. .P
  359. If a
  360. .IR name
  361. does not meet the requirements for identifiers in the ISO\ C standard, the result
  362. is undefined. The string
  363. .IR substitute
  364. shall replace the string {\c
  365. .IR name }
  366. when it is used in a rule. The
  367. .IR name
  368. string shall be recognized in this context only when the braces are
  369. provided and when it does not appear within a bracket expression or
  370. within double-quotes.
  371. .P
  372. In the
  373. .IR Definitions
  374. section, any line beginning with a
  375. <percent-sign>
  376. (\c
  377. .BR '%' )
  378. character and followed by an alphanumeric word beginning with either
  379. .BR 's'
  380. or
  381. .BR 'S'
  382. shall define a set of start conditions. Any line beginning with a
  383. .BR '%'
  384. followed by a word beginning with either
  385. .BR 'x'
  386. or
  387. .BR 'X'
  388. shall define a set of exclusive start conditions. When the generated
  389. scanner is in a
  390. .BR %s
  391. state, patterns with no state specified shall also be active; in a
  392. .BR %x
  393. state, such patterns shall not be active. The rest of the line, after
  394. the first word, shall be considered to be one or more
  395. <blank>-separated
  396. names of start conditions. Start condition names shall be constructed
  397. in the same way as definition names. Start conditions can be used to
  398. restrict the matching of regular expressions to one or more states as
  399. described in
  400. .IR "Regular Expressions in lex".
  401. .P
  402. Implementations shall accept either of the following two
  403. mutually-exclusive declarations in the
  404. .IR Definitions
  405. section:
  406. .IP "\fB%array\fR" 10
  407. Declare the type of
  408. .IR yytext
  409. to be a null-terminated character array.
  410. .IP "\fB%pointer\fR" 10
  411. Declare the type of
  412. .IR yytext
  413. to be a pointer to a null-terminated character string.
  414. .P
  415. The default type of
  416. .IR yytext
  417. is implementation-defined. If an application refers to
  418. .IR yytext
  419. outside of the scanner source file (that is, via an
  420. .BR extern ),
  421. the application shall include the appropriate
  422. .BR %array
  423. or
  424. .BR %pointer
  425. declaration in the scanner source file.
  426. .P
  427. Implementations shall accept declarations in the
  428. .IR Definitions
  429. section for setting certain internal table sizes. The declarations are
  430. shown in the following table.
  431. .sp
  432. .ce 1
  433. \fBTable: Table Size Declarations in \fIlex\fP\fR
  434. .TS
  435. center tab(!) box;
  436. cB | cB | cB
  437. l | l | n.
  438. Declaration!Description!Minimum Value
  439. _
  440. %\fBp \fIn\fR!Number of positions!2\|500
  441. %\fBn \fIn\fR!Number of states!500
  442. %\fBa \fIn\fR!Number of transitions!2\|000
  443. %\fBe \fIn\fR!Number of parse tree nodes!1\|000
  444. %\fBk \fIn\fR!Number of packed character classes!1\|000
  445. %\fBo \fIn\fR!Size of the output array!3\|000
  446. .TE
  447. .P
  448. In the table,
  449. .IR n
  450. represents a positive decimal integer, preceded by one or more
  451. <blank>
  452. characters. The exact meaning of these table size numbers is
  453. implementation-defined. The implementation shall document how these
  454. numbers affect the
  455. .IR lex
  456. utility and how they are related to any output that may be generated by
  457. the implementation should limitations be encountered during the
  458. execution of
  459. .IR lex .
  460. It shall be possible to determine from this output which of the table
  461. size values needs to be modified to permit
  462. .IR lex
  463. to successfully generate tables for the input language. The values in
  464. the column Minimum Value represent the lowest values conforming
  465. implementations shall provide.
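.P
As a sketch only (the names
.BR DIGIT ,
.BR ID ,
and
.BR COMMENT
are hypothetical and chosen for illustration), a
.IR Definitions
section combining the constructs described above might look like this:
.sp
.RS 4
.nf
%{
#include <stdio.h>
%}
DIGIT    [0\-9]
ID       [A\-Za\-z_][A\-Za\-z0\-9_]*
%x COMMENT
%pointer
%p 3000
.fi
.P
.RE
.P
In this sketch,
.BR {DIGIT}
and
.BR {ID}
become available as substitutions in the
.IR Rules
section,
.BR COMMENT
is declared as an exclusive start condition,
.IR yytext
is requested to be a pointer, and the number of positions is set above
the minimum value of 2\|500 listed in the table.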
  466. .SS "Rules in lex"
  467. .P
  468. The rules in
  469. .IR lex
  470. source files are a table in which the left column contains regular
  471. expressions and the right column contains actions (C program fragments)
  472. to be executed when the expressions are recognized.
  473. .sp
  474. .RS 4
  475. .nf
  476. \fIERE action
  477. ERE action\fP
  478. \&...
  479. .fi
  480. .P
  481. .RE
  482. .P
  483. The extended regular expression (ERE) portion of a row shall be
  484. separated from
  485. .IR action
  486. by one or more
  487. <blank>
  488. characters. A regular expression containing
  489. <blank>
  490. characters shall be recognized under one of the following conditions:
  491. .IP " *" 4
  492. The entire expression appears within double-quotes.
  493. .IP " *" 4
  494. The
  495. <blank>
  496. characters appear within double-quotes or square brackets.
  497. .IP " *" 4
  498. Each
  499. <blank>
  500. is preceded by a
  501. <backslash>
  502. character.
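.P
For illustration, each of the following lines is an alternative way of
writing a rule whose ERE matches the input
.BR "\(dqhello world\(dq" ,
one for each of the conditions listed above (the pattern and the
.BR ECHO
action are arbitrary choices for this sketch):
.sp
.RS 4
.nf
"hello world"    ECHO;
hello[ ]world    ECHO;
hello\e world     ECHO;
.fi
.P
.RE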
  503. .SS "User Subroutines in lex"
  504. .P
  505. Anything in the user subroutines section shall be copied to
  506. .BR lex.yy.c
  507. following
  508. \fIyylex\fR().
  509. .SS "Regular Expressions in lex"
  510. .P
  511. The
  512. .IR lex
  513. utility shall support the set of extended regular expressions (see the Base Definitions volume of POSIX.1\(hy2017,
  514. .IR "Section 9.4" ", " "Extended Regular Expressions"),
  515. with the following additions and exceptions to the syntax:
  516. .IP "\fR\&\(dq...\&\(dq\fR" 10
  517. Any string enclosed in double-quotes shall represent the characters
  518. within the double-quotes as themselves, except that
  519. <backslash>-escapes
  520. (which appear in the following table) shall be recognized. Any
  521. <backslash>-escape
  522. sequence shall be terminated by the closing quote. For example,
  523. .BR \(dq\e01\(dq \c
  524. .BR \(dq1\(dq
  525. represents a single string: the octal value 1 followed by the
  526. character
  527. .BR '1' .
  528. .IP "<\fIstate\fR>\fIr\fR,\ <\fIstate1,state2,\fR.\|.\|.>\fIr\fR" 10
  529. .br
  530. The regular expression
  531. .IR r
  532. shall be matched only when the program is in one of the start
  533. conditions indicated by
  534. .IR state ,
  535. .IR state1 ,
  536. and so on; see
  537. .IR "Actions in lex".
  538. (As an exception to the typographical conventions of the rest of this volume of POSIX.1\(hy2017,
  539. in this case <\fIstate\fP> does not represent a metavariable, but the
  540. literal angle-bracket characters surrounding a symbol.) The start
  541. condition shall be recognized as such only at the beginning of a
  542. regular expression.
  543. .IP "\fIr\fP/\fIx\fP" 10
  544. The regular expression
  545. .IR r
  546. shall be matched only if it is followed by an occurrence of regular
  547. expression
  548. .IR x
  549. (\c
  550. .IR x
  551. is the instance of trailing context, further defined below). The token
  552. returned in
  553. .IR yytext
  554. shall only match
  555. .IR r .
  556. If the trailing portion of
  557. .IR r
  558. matches the beginning of
  559. .IR x ,
  560. the result is unspecified. The
  561. .IR r
  562. expression cannot include further trailing context or the
  563. .BR '$'
  564. (match-end-of-line) operator;
  565. .IR x
  566. cannot include the
  567. .BR '\(ha'
  568. (match-beginning-of-line) operator, nor trailing context, nor the
  569. .BR '$'
  570. operator. That is, only one occurrence of trailing context is allowed
  571. in a
  572. .IR lex
  573. regular expression, and the
  574. .BR '\(ha'
  575. operator can be used only at the beginning of such an expression.
  576. .IP "{\fIname\fR}" 10
  577. When
  578. .IR name
  579. is one of the substitution symbols from the
  580. .IR Definitions
  581. section, the string, including the enclosing braces, shall be replaced
  582. by the
  583. .IR substitute
  584. value. The
  585. .IR substitute
  586. value shall be treated in the extended regular expression as if it were
  587. enclosed in parentheses. No substitution shall occur if {\c
  588. .IR name }
  589. occurs within a bracket expression or within double-quotes.
  590. .P
  591. Within an ERE, a
  592. <backslash>
  593. character shall be considered to begin an escape sequence as specified
  594. in the table in the Base Definitions volume of POSIX.1\(hy2017,
  595. .IR "Chapter 5" ", " "File Format Notation"
  596. (\c
  597. .BR '\e\e' ,
  598. .BR '\ea' ,
  599. .BR '\eb' ,
  600. .BR '\ef' ,
  601. .BR '\en' ,
  602. .BR '\er' ,
  603. .BR '\et' ,
  604. .BR '\ev' ).
  605. In addition, the escape sequences in the following table shall be
  606. recognized.
  607. .P
  608. A literal
  609. <newline>
  610. cannot occur within an ERE; the escape sequence
  611. .BR '\en'
  612. can be used to represent a
  613. <newline>.
  614. A
  615. <newline>
  616. shall not be matched by a period operator.
  617. .br
  618. .sp
  619. .ce 1
  620. \fBTable: Escape Sequences in \fIlex\fP\fR
  621. .ad l
  622. .TS
  623. center tab(@) box;
  624. cB | cB | cB
  625. cB | cB | cB
  626. lf5 | lw(2.4i) | lw(2.4i).
  627. Escape
  628. Sequence@Description@Meaning
  629. _
  630. \e\fIdigits\fP@T{
  631. A
  632. <backslash>
  633. character followed by the longest sequence of one, two, or three
  634. octal-digit characters (01234567). If all of the digits are 0 (that is,
  635. representation of the NUL character), the behavior is undefined.
  636. T}@T{
  637. The character whose encoding is represented by the one, two, or
  638. three-digit octal integer. Multi-byte characters require
  639. multiple, concatenated escape sequences of this type, including the
  640. leading
  641. <backslash>
  642. for each byte.
  643. T}
  644. _
  645. \ex\fIdigits\fP@T{
  646. A
  647. <backslash>
  648. character followed by the longest sequence of hexadecimal-digit
  649. characters (01234567abcdefABCDEF). If all of the digits are 0 (that is,
  650. representation of the NUL character), the behavior is undefined.
  651. T}@T{
  652. The character whose encoding is represented by the hexadecimal
  653. integer.
  654. T}
  655. _
  656. \ec@T{
  657. A
  658. <backslash>
  659. character followed by any character not described in this
  660. table or in the table in the Base Definitions volume of POSIX.1\(hy2017,
  661. .IR "Chapter 5" ", " "File Format Notation"
  662. (\c
  663. .BR '\e\e' ,
  664. .BR '\ea' ,
  665. .BR '\eb' ,
  666. .BR '\ef' ,
  667. .BR '\en' ,
  668. .BR '\er' ,
  669. .BR '\et' ,
  670. .BR '\ev' ).
  671. T}@T{
  672. The character
  673. .BR 'c' ,
  674. unchanged.
  675. T}
  676. .TE
  677. .ad b
  678. .TP 10
  679. .BR Note:
  680. If a
  681. .BR '\ex'
  682. sequence needs to be immediately followed by a hexadecimal digit
  683. character, a sequence such as
  684. .BR \(dq\ex1\(dq \c
  685. .BR \(dq1\(dq
  686. can be used, which represents a character containing the value 1,
  687. followed by the character
  688. .BR '1' .
  689. .P
  690. .P
  691. The order of precedence given to extended regular expressions for
  692. .IR lex
  693. differs from that specified in the Base Definitions volume of POSIX.1\(hy2017,
  694. .IR "Section 9.4" ", " "Extended Regular Expressions".
  695. The order of precedence for
  696. .IR lex
  697. shall be as shown in the following table, from high to low.
  698. .TP 10
  699. .BR Note:
  700. The escaped characters entry is not meant to imply that these are
  701. operators, but they are included in the table to show their
  702. relationships to the true operators. The start condition, trailing
  703. context, and anchoring notations have been omitted from the table
  704. because of the placement restrictions described in this section; they
  705. can only appear at the beginning or ending of an ERE.
  706. .P
  707. .br
  708. .sp
  709. .ce 1
  710. \fBTable: ERE Precedence in \fIlex\fP\fR
  711. .TS
  712. center tab(@) box;
  713. cB | cB
  714. lf2 | lf5.
  715. Extended Regular Expression@Precedence
  716. _
  717. collation-related bracket symbols@[= =] [: :] [. .]
  718. escaped characters@\e<\fIspecial character\fP>
  719. bracket expression@[ ]
  720. quoting@"..."
  721. grouping@( )
  722. definition@{\fIname\fP}
  723. single-character RE duplication@* + ?
  724. concatenation
  725. interval expression@{m,n}
  726. alternation@|
  727. .TE
  728. .P
  729. The ERE anchoring operators
  730. .BR '\(ha'
  731. and
  732. .BR '$'
  733. do not appear in the table. With
  734. .IR lex
  735. regular expressions, these operators are restricted in their use: the
  736. .BR '\(ha'
  737. operator can only be used at the beginning of an entire regular
  738. expression, and the
  739. .BR '$'
  740. operator only at the end. The operators apply to the entire regular
  741. expression. Thus, for example, the pattern
  742. .BR \(dq(\(haabc)|(def$)\(dq
  743. is undefined; it can instead be written as two separate rules, one with
  744. the regular expression
  745. .BR \(dq\(haabc\(dq
  746. and one with
  747. .BR \(dqdef$\(dq ,
  748. which share a common action via the special
  749. .BR '|'
  750. action (see below). If the pattern were written
  751. .BR \(dq\(haabc|def$\(dq ,
  752. it would match either
  753. .BR \(dqabc\(dq
  754. or
  755. .BR \(dqdef\(dq
  756. on a line by itself.
  757. .P
  758. Unlike the general ERE rules, embedded anchoring is not allowed by most
  759. historical
  760. .IR lex
  761. implementations. An example of embedded anchoring would be for
  762. patterns such as
  763. .BR \(dq(\(ha|\ )foo(\ |$)\(dq
  764. to match
  765. .BR \(dqfoo\(dq
  766. when it exists as a complete word. This functionality can be obtained
  767. using existing
  768. .IR lex
  769. features:
  770. .sp
  771. .RS 4
  772. .nf
  773. \(hafoo/[ \en] |
  774. " foo"/[ \en] /* Found foo as a separate word. */
  775. .fi
  776. .P
  777. .RE
  778. .P
  779. Note also that
  780. .BR '$'
  781. is a form of trailing context (it is equivalent to
  782. .BR \(dq/\en\(dq )
  783. and as such cannot be used with regular expressions containing another
  784. instance of the operator (see the preceding discussion of trailing
  785. context).
  786. .P
  787. The additional regular expressions trailing-context operator
  788. .BR '/'
  789. can be used as an ordinary character if presented within double-quotes,
  790. .BR \(dq/\(dq ;
  791. preceded by a
  792. <backslash>,
  793. .BR \(dq\e/\(dq ;
  794. or within a bracket expression,
  795. .BR \(dq[/]\(dq .
  796. The start-condition
  797. .BR '<'
  798. and
  799. .BR '>'
  800. operators shall be special only in a start condition at the beginning
  801. of a regular expression; elsewhere in the regular expression they shall
  802. be treated as ordinary characters.
  803. .SS "Actions in lex"
  804. .P
  805. The action to be taken when an ERE is matched can be a C program
  806. fragment or the special actions described below; the program fragment
  807. can contain one or more C statements, and can also include special
  808. actions. The empty C statement
  809. .BR ';'
  810. shall be a valid action; any string in the
  811. .BR lex.yy.c
  812. input that matches the pattern portion of such a rule is effectively
  813. ignored or skipped. However, the absence of an action shall not be
  814. valid, and the action
  815. .IR lex
  816. takes in such a condition is undefined.
  817. .P
  818. The specification for an action, including C statements and special
  819. actions, can extend across several lines if enclosed in braces:
  820. .sp
  821. .RS 4
  822. .nf
  823. \fIERE\fP <\fIone or more blanks\fR> { \fIprogram statement
  824. program statement\fP }
  825. .fi
  826. .P
  827. .RE
  828. .P
  829. The program statements shall not contain unbalanced curly brace
  830. preprocessing tokens.
  831. .P
  832. The default action when a string in the input to a
  833. .BR lex.yy.c
  834. program is not matched by any expression shall be to copy the string to
  835. the output. Because the default behavior of a program generated by
  836. .IR lex
  837. is to read the input and copy it to the output, a minimal
  838. .IR lex
  839. source program that has just
  840. .BR \(dq%%\(dq
  841. shall generate a C program that simply copies the input to the output
  842. unchanged.
  843. .P
  844. Four special actions shall be available:
  845. .sp
  846. .RS 4
  847. .nf
  848. | ECHO; REJECT; BEGIN
  849. .fi
  850. .P
  851. .RE
  852. .IP "\fR|\fR" 10
  853. The action
  854. .BR '|'
  855. means that the action for the next rule is the action for this rule.
  856. Unlike the other three actions,
  857. .BR '|'
  858. cannot be enclosed in braces or be
  859. <semicolon>-terminated;
  860. the application shall ensure that it is specified alone, with no other
  861. actions.
  862. .IP "\fBECHO;\fR" 10
  863. Write the contents of the string
  864. .IR yytext
  865. on the output.
  866. .IP "\fBREJECT;\fR" 10
  867. Usually only a single expression is matched by a given string in the
  868. input.
  869. .BR REJECT
  870. means ``continue to the next expression that matches the current
  871. input'', and shall cause whatever rule was the second choice after the
  872. current rule to be executed for the same input. Thus, multiple rules
  873. can be matched and executed for one input string or overlapping input
  874. strings. For example, given the regular expressions
  875. .BR \(dqxyz\(dq
  876. and
  877. .BR \(dqxy\(dq
  878. and the input
  879. .BR \(dqxyz\(dq ,
  880. usually only the regular expression
  881. .BR \(dqxyz\(dq
  882. would match. The next attempted match would start after
  883. .BR z .
  884. If the last action in the
  885. .BR \(dqxyz\(dq
  886. rule is
  887. .BR REJECT ,
  888. both this rule and the
  889. .BR \(dqxy\(dq
  890. rule would be executed. The
  891. .BR REJECT
  892. action may be implemented in such a fashion that flow of control does
  893. not continue after it, as if it were equivalent to a
  894. .BR goto
  895. to another part of
  896. \fIyylex\fR().
  897. The use of
  898. .BR REJECT
  899. may result in somewhat larger and slower scanners.
  900. .IP "\fBBEGIN\fR" 10
  901. The action:
  902. .RS 10
  903. .sp
  904. .RS 4
  905. .nf
  906. BEGIN \fInewstate\fP;
  907. .fi
  908. .P
  909. .RE
  910. .P
  911. switches the state (start condition) to
  912. .IR newstate .
  913. If the string
  914. .IR newstate
  915. has not been declared previously as a start condition in the
  916. .IR Definitions
  917. section, the results are unspecified. The initial state is indicated
  918. by the digit
  919. .BR '0'
  920. or the token
  921. .BR INITIAL .
  922. .RE
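.P
As a hedged sketch of how
.BR BEGIN
and an exclusive start condition can work together (the condition name
.BR COMMENT
is hypothetical), the following
.IR lex
source discards C-style comments and leaves all other input to the
default action:
.sp
.RS 4
.nf
%x COMMENT
%%
"/*"              BEGIN COMMENT;
<COMMENT>"*/"     BEGIN INITIAL;
<COMMENT>.        ;
<COMMENT>\en       ;
.fi
.P
.RE
.P
Because
.BR COMMENT
is declared with
.BR %x ,
the first rule, which names no start condition, is not active while
the scanner is inside a comment.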
  923. .P
  924. The functions or macros described below are accessible to user code
  925. included in the
  926. .IR lex
  927. input. It is unspecified whether they appear in the C code output of
  928. .IR lex ,
  929. or are accessible only through the
  930. .BR "\-l\ l"
  931. operand to
  932. .IR c99
  933. (the
  934. .IR lex
  935. library).
  936. .IP "\fBint\ \fIyylex\fR(\fBvoid\fR)" 6
  937. .br
  938. Performs lexical analysis on the input; this is the primary function
  939. generated by the
  940. .IR lex
  941. utility. The function shall return zero when the end of input is
  942. reached; otherwise, it shall return non-zero values (tokens) determined
  943. by the actions that are selected.
  944. .IP "\fBint\ \fIyymore\fR(\fBvoid\fR)" 6
  945. .br
  946. When called, indicates that when the next input string is recognized,
  947. it is to be appended to the current value of
  948. .IR yytext
  949. rather than replacing it; the value in
  950. .IR yyleng
  951. shall be adjusted accordingly.
  952. .IP "\fBint\ \fIyyless\fR(\fBint\ \fIn\fR)" 6
  953. .br
  954. Retains
  955. .IR n
  956. initial characters in
  957. .IR yytext ,
  958. NUL-terminated, and treats the remaining characters as if they had not
  959. been read; the value in
  960. .IR yyleng
  961. shall be adjusted accordingly.
  962. .IP "\fBint\ \fIinput\fR(\fBvoid\fR)" 6
  963. .br
  964. Returns the next character from the input, or zero on end-of-file. It
  965. shall obtain input from the stream pointer
  966. .IR yyin ,
  967. although possibly via an intermediate buffer. Thus, once scanning has
  968. begun, the effect of altering the value of
  969. .IR yyin
  970. is undefined. The character read shall be removed from the input
  971. stream of the scanner without any processing by the scanner.
  972. .IP "\fBint\ \fIunput\fR(\fBint\ \fIc\fR)" 6
  973. .br
  974. Returns the character
  975. .BR 'c'
  976. to the input;
  977. .IR yytext
  978. and
  979. .IR yyleng
  980. are undefined until the next expression is matched. The result of
  981. using
  982. \fIunput\fR()
  983. for more characters than have been input is unspecified.
  984. .P
  985. The following functions shall appear only in the
  986. .IR lex
  987. library accessible through the
  988. .BR "\-l\ l"
  989. operand; they can therefore be redefined by a conforming application:
  990. .IP "\fBint\ \fIyywrap\fR(\fBvoid\fR)" 6
  991. .br
  992. Called by
  993. \fIyylex\fR()
  994. at end-of-file; the default
  995. \fIyywrap\fR()
  996. shall always return 1. If the application requires
  997. \fIyylex\fR()
  998. to continue processing with another source of input, then the
  999. application can include a function
  1000. \fIyywrap\fR(),
  1001. which associates another file with the external variable
  1002. .BR "FILE *"
  1003. .IR yyin
  1004. and shall return a value of zero.
  1005. .IP "\fBint\ \fImain\fR(\fBint\ \fIargc\fR, \fBchar *\fIargv\fR[\|])" 6
  1006. .br
  1007. Calls
  1008. \fIyylex\fR()
  1009. to perform lexical analysis, then exits. The user code can contain
  1010. \fImain\fR()
  1011. to perform application-specific operations, calling
  1012. \fIyylex\fR()
  1013. as applicable.
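.P
As a minimal sketch of the replacement described above, an application-supplied
\fIyywrap\fR()
could step through a list of input files. The names
.IR pending_files
and
\fIset_pending_files\fR()
are hypothetical and not part of
.IR lex .
The definition of
.IR yyin
is a stand-in so that the fragment compiles on its own; a real scanner would declare it
.BR extern
instead.

```c
#include <stdio.h>

/* Stand-in so the sketch compiles on its own. In a real scanner the
 * generated code provides yyin and this would read "extern FILE *yyin;". */
FILE *yyin;

/* Hypothetical application state: NULL-terminated list of the
 * input file names that have not been scanned yet. */
static char **pending_files;

void set_pending_files(char **files)
{
    pending_files = files;
}

/* Replacement for the library yywrap(): return 0 after associating
 * another file with yyin so that yylex() continues, or 1 when no
 * further input is available. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin) {
        fclose(yyin);
        yyin = NULL;
    }
    while (pending_files != NULL && *pending_files != NULL) {
        yyin = fopen(*pending_files++, "r");
        if (yyin != NULL)
            return 0;   /* more input: keep scanning */
    }
    return 1;           /* all files consumed (or none could be opened) */
}
```

Files that cannot be opened are skipped here; an application might prefer to report them instead.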
  1014. .P
  1015. Except for
  1016. \fIinput\fR(),
  1017. \fIunput\fR(),
  1018. and
  1019. \fImain\fR(),
  1020. all external and static names generated by
  1021. .IR lex
  1022. shall begin with the prefix
  1023. .BR yy
  1024. or
  1025. .BR YY .
  1026. .SH "EXIT STATUS"
  1027. The following exit values shall be returned:
  1028. .IP "\00" 6
  1029. Successful completion.
  1030. .IP >0 6
  1031. An error occurred.
  1032. .SH "CONSEQUENCES OF ERRORS"
  1033. Default.
  1034. .LP
  1035. .IR "The following sections are informative."
  1036. .SH "APPLICATION USAGE"
  1037. Conforming applications are warned that in the
  1038. .IR Rules
  1039. section, an ERE without an action is not acceptable, but need not be
  1040. detected as erroneous by
  1041. .IR lex .
  1042. This may result in compilation or runtime errors.
  1043. .P
  1044. The purpose of
  1045. \fIinput\fR()
  1046. is to take characters off the input stream and discard them as far as
  1047. the lexical analysis is concerned. A common use is to discard the body
  1048. of a comment once the beginning of a comment is recognized.
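.P
The comment-skipping use mentioned above might be sketched as follows. The helper
\fIskip_comment\fR()
and the
.IR demo_
names are hypothetical. The reader is passed as a function pointer so that the fragment can be tried without a generated scanner; in an action it would be called after the opening delimiter has been matched. Since
\fIinput\fR()
may be a macro, a real scanner might need a small wrapper function to pass it this way.

```c
/* Discard characters up to and including the closing star-slash of a
 * C-style comment. next_char stands in for input(), which returns
 * zero at end-of-file. Returns 1 if the comment was closed, or 0 if
 * the input ended first. */
int skip_comment(int (*next_char)(void))
{
    int c;
    int prev = 0;

    while ((c = next_char()) != 0) {
        if (prev == '*' && c == '/')
            return 1;
        prev = c;
    }
    return 0;
}

/* Stand-in input source used only to exercise the sketch. */
static const char *demo_text = "";

void demo_set_text(const char *s)
{
    demo_text = s;
}

int demo_input(void)
{
    return *demo_text ? (unsigned char)*demo_text++ : 0;
}
```

The characters consumed this way never reach the scanner, so the comment body produces no tokens.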
  1049. .P
  1050. The
  1051. .IR lex
  1052. utility is not fully internationalized in its treatment of regular
  1053. expressions in the
  1054. .IR lex
  1055. source code or generated lexical analyzer. It would seem desirable to
  1056. have the lexical analyzer interpret the regular expressions given in
  1057. the
  1058. .IR lex
  1059. source according to the environment specified when the lexical analyzer
  1060. is executed, but this is not possible with the current
  1061. .IR lex
  1062. technology. Furthermore, the very nature of the lexical analyzers
  1063. produced by
  1064. .IR lex
  1065. must be closely tied to the lexical requirements of the input language
  1066. being described, which is frequently locale-specific anyway. (For
  1067. example, writing an analyzer that is used for French text is not
  1068. automatically useful for processing other languages.)
  1069. .SH EXAMPLES
  1070. The following is an example of a
  1071. .IR lex
  1072. program that implements a rudimentary scanner for a Pascal-like
  1073. syntax:
  1074. .sp
  1075. .RS 4
  1076. .nf
  1077. %{
  1078. /* Need this for the calls to atoi() and atof() below. */
  1079. #include <stdlib.h>
  1080. /* Need this for printf(), fopen(), and stdin below. */
  1081. #include <stdio.h>
  1082. %}
  1083. .P
  1084. DIGIT [0-9]
  1085. ID [a-z][a-z0-9]*
  1086. .P
  1087. %%
  1088. .P
  1089. {DIGIT}+ {
  1090.     printf("An integer: %s (%d)\en", yytext,
  1091.         atoi(yytext));
  1092. }
  1093. .P
  1094. {DIGIT}+"."{DIGIT}* {
  1095.     printf("A float: %s (%g)\en", yytext,
  1096.         atof(yytext));
  1097. }
  1098. .P
  1099. if|then|begin|end|procedure|function {
  1100.     printf("A keyword: %s\en", yytext);
  1101. }
  1102. .P
  1103. {ID} printf("An identifier: %s\en", yytext);
  1104. .P
  1105. "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
  1106. .P
  1107. "{"[\(ha}\en]*"}" /* Eat up one-line comments. */
  1108. .P
  1109. [ \et\en]+ /* Eat up white space. */
  1110. .P
  1111. \&. printf("Unrecognized character: %s\en", yytext);
  1112. .P
  1113. %%
  1114. .P
  1115. int main(int argc, char *argv[])
  1116. {
  1117.     ++argv, --argc; /* Skip over program name. */
  1118.     if (argc > 0)
  1119.         yyin = fopen(argv[0], "r");
  1120.     else
  1121.         yyin = stdin;
  1122. .P
  1123.     yylex();
  1124. }
  1125. .fi
  1126. .P
  1127. .RE
  1128. .SH RATIONALE
  1129. Even though the
  1130. .BR \-c
  1131. option and references to the C language are retained in this
  1132. description,
  1133. .IR lex
  1134. may be generalized to other languages, as was done at one time for EFL,
  1135. the Extended FORTRAN Language. Since the
  1136. .IR lex
  1137. input specification is essentially language-independent, versions of
  1138. this utility could be written to produce Ada, Modula-2, or Pascal code,
  1139. and there are known historical implementations that do so.
  1140. .P
  1141. The current description of
  1142. .IR lex
  1143. bypasses the issue of dealing with internationalized EREs in the
  1144. .IR lex
  1145. source code or generated lexical analyzer. If it follows the model used
  1146. by
  1147. .IR awk
  1148. (the source code is assumed to be presented in the POSIX locale, but
  1149. input and output are in the locale specified by the environment
  1150. variables), then the tables in the lexical analyzer produced by
  1151. .IR lex
  1152. would interpret EREs specified in the
  1153. .IR lex
  1154. source in terms of the environment variables specified when
  1155. .IR lex
  1156. was executed. The desired effect would be to have the lexical analyzer
  1157. interpret the EREs given in the
  1158. .IR lex
  1159. source according to the environment specified when the lexical analyzer
  1160. is executed, but this is not possible with the current
  1161. .IR lex
  1162. technology.
  1163. .P
  1164. The description of octal and hexadecimal-digit escape sequences agrees
  1165. with the ISO\ C standard use of escape sequences.
  1166. .P
  1167. Earlier versions of this standard allowed for implementations with
  1168. bytes other than eight bits, but this has been modified in this
  1169. version.
  1170. .P
  1171. There is no detailed output format specification. The observed behavior
  1172. of
  1173. .IR lex
  1174. under four different historical implementations was that none of these
  1175. implementations consistently reported the line numbers for error and
  1176. warning messages. Furthermore, there was a desire that
  1177. .IR lex
  1178. be allowed to output additional diagnostic messages. Leaving message
  1179. formats unspecified avoids these formatting questions and problems with
  1180. internationalization.
  1181. .P
  1182. Although the
  1183. .BR %x
  1184. specifier for
  1185. .IR exclusive
  1186. start conditions is not historical practice, it is believed to be a
  1187. minor change to historical implementations and greatly enhances the
  1188. usability of
  1189. .IR lex
  1190. programs since it permits an application to obtain the expected
  1191. functionality with fewer statements.
  1192. .P
  1193. The
  1194. .BR %array
  1195. and
  1196. .BR %pointer
  1197. declarations were added as a compromise between historical systems.
  1198. The System V-based
  1199. .IR lex
  1200. copies the matched text to a
  1201. .IR yytext
  1202. array. The
  1203. .IR flex
  1204. program, supported in BSD and GNU systems, uses a pointer. In the
  1205. latter case, significant performance improvements are available for
  1206. some scanners. Most historical programs should require no change in
  1207. porting from one system to another because the string being referenced
  1208. is null-terminated in both cases. (The method used by
  1209. .IR flex
  1210. in its case is to null-terminate the token in place by remembering the
  1211. character that used to come right after the token and restoring it
  1212. before continuing on to the next scan.) Multi-file programs with
  1213. external references to
  1214. .IR yytext
  1215. outside the scanner source file should continue to operate on their
  1216. historical systems, but would require one of the new declarations to be
  1217. considered strictly portable.
  1218. .P
  1219. The description of EREs avoids unnecessary duplication of ERE details
  1220. because their meanings within a
  1221. .IR lex
  1222. ERE are the same as those for EREs in this volume of POSIX.1\(hy2017.
  1223. .P
  1224. The reason for the undefined condition associated with text beginning
  1225. with a
  1226. <blank>
  1227. or within
  1228. .BR \(dq%{\(dq
  1229. and
  1230. .BR \(dq%}\(dq
  1231. delimiter lines appearing in the
  1232. .IR Rules
  1233. section is historical practice. Both the BSD and System V
  1234. .IR lex
  1235. copy the indented (or enclosed) input in the
  1236. .IR Rules
  1237. section (except at the beginning) to unreachable areas of the
  1238. \fIyylex\fR()
  1239. function (the code is written directly after a
  1240. .IR break
  1241. statement). In some cases, the System V
  1242. .IR lex
  1243. generates an error message or a syntax error, depending on the form of
  1244. indented input.
  1245. .P
  1246. The intention in breaking the list of functions into those that may
  1247. appear in
  1248. .BR lex.yy.c
  1249. \fIversus\fR those that only appear in
  1250. .BR libl.a
  1251. is that only those functions in
  1252. .BR libl.a
  1253. can be reliably redefined by a conforming application.
  1254. .P
  1255. The descriptions of standard output and standard error are somewhat
  1256. complicated because historical
  1257. .IR lex
  1258. implementations chose to issue diagnostic messages to standard output
  1259. (unless
  1260. .BR \-t
  1261. was given). POSIX.1\(hy2008 allows this behavior, but leaves an opening
  1262. for the more expected behavior of using standard error for diagnostics.
  1263. Also, the System V behavior of writing the statistics when any table
  1264. sizes are given is allowed, while BSD-derived systems can avoid it. The
  1265. programmer can always precisely obtain the desired results by using
  1266. either the
  1267. .BR \-t
  1268. or
  1269. .BR \-n
  1270. options.
  1271. .P
  1272. The OPERANDS section does not mention the use of
  1273. .BR \-
  1274. as a synonym for standard input; not all historical implementations
  1275. support such usage for any of the
  1276. .IR file
  1277. operands.
  1278. .P
  1279. A description of the
  1280. .IR "translation table"
  1281. was deleted from early proposals because of its relatively low usage in
  1282. historical applications.
  1283. .P
  1284. The change to the definition of the
  1285. \fIinput\fR()
  1286. function that allows buffering of input presents the opportunity for
  1287. major performance gains in some applications.
  1288. .P
  1289. The following examples clarify the differences between
  1290. .IR lex
  1291. regular expressions and regular expressions appearing elsewhere in
  1292. \&this volume of POSIX.1\(hy2017. For regular expressions of the form
  1293. .BR \(dqr/x\(dq ,
  1294. the string matching
  1295. .IR r
  1296. is always returned; confusion may arise when the beginning of
  1297. .IR x
  1298. matches the trailing portion of
  1299. .IR r .
  1300. For example, given the regular expression
  1301. .BR \(dqa*b/cc\(dq
  1302. and the input
  1303. .BR \(dqaaabcc\(dq ,
  1304. .IR yytext
  1305. would contain the string
  1306. .BR \(dqaaab\(dq
  1307. on this match. But given the regular expression
  1308. .BR \(dqx*/xy\(dq
  1309. and the input
  1310. .BR \(dqxxxy\(dq ,
  1311. the token
  1312. .BR xxx ,
  1313. not
  1314. .BR xx ,
  1315. is returned by some implementations because
  1316. .BR xxx
  1317. matches
  1318. .BR \(dqx*\(dq .
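.P
The two outcomes for
.BR \(dqx*/xy\(dq
can be modeled with a short sketch. It is an illustration only and does not reproduce the code of any
.IR lex
implementation; the function names are hypothetical. The first function picks the longest run of
.BR x
that is still followed by the trailing context, and the second models an implementation that matches the combined expression and then reports the longest prefix matching
.BR \(dqx*\(dq .

```c
#include <string.h>

/* Length of the leading run of 'x' characters in s. */
static size_t run_of_x(const char *s)
{
    size_t n = 0;

    while (s[n] == 'x')
        n++;
    return n;
}

/* Choose the longest r (a run of 'x') that is immediately followed
 * by the trailing context "xy". Returns the length of r, or -1 if
 * the rule does not match at the start of s. */
int split_strict(const char *s)
{
    size_t n = run_of_x(s);

    for (;;) {
        if (strncmp(s + n, "xy", 2) == 0)
            return (int)n;
        if (n == 0)
            return -1;
        n--;
    }
}

/* Match "x*xy" as one expression, then report the longest prefix of
 * the matched text that matches "x*" on its own. */
int split_greedy(const char *s)
{
    if (split_strict(s) < 0)
        return -1;            /* the combined expression does not match */
    return (int)run_of_x(s);  /* every leading 'x' matches "x*" */
}
```

For the input
.BR \(dqxxxy\(dq ,
the first function yields a token length of 2
.RB ( xx )
and the second a length of 3
.RB ( xxx ).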
  1319. .P
  1320. In the rule
  1321. .BR \(dqab*/bc\(dq ,
  1322. the
  1323. .BR \(dqb*\(dq
  1324. at the end of
  1325. .IR r
  1326. extends
  1327. .IR r 's
  1328. match into the beginning of the trailing context, so the result is
  1329. unspecified. If this rule were
  1330. .BR \(dqab/bc\(dq ,
  1331. however, the rule matches the text
  1332. .BR \(dqab\(dq
  1333. when it is followed by the text
  1334. .BR \(dqbc\(dq .
  1335. In this latter case, the matching of
  1336. .IR r
  1337. cannot extend into the beginning of
  1338. .IR x ,
  1339. so the result is specified.
  1340. .SH "FUTURE DIRECTIONS"
  1341. None.
  1342. .SH "SEE ALSO"
  1343. .IR "\fIc99\fR\^",
  1344. .IR "\fIed\fR\^",
  1345. .IR "\fIyacc\fR\^"
  1346. .P
  1347. The Base Definitions volume of POSIX.1\(hy2017,
  1348. .IR "Chapter 5" ", " "File Format Notation",
  1349. .IR "Chapter 8" ", " "Environment Variables",
  1350. .IR "Chapter 9" ", " "Regular Expressions",
  1351. .IR "Section 12.2" ", " "Utility Syntax Guidelines"
  1352. .\"
  1353. .SH COPYRIGHT
  1354. Portions of this text are reprinted and reproduced in electronic form
  1355. from IEEE Std 1003.1-2017, Standard for Information Technology
  1356. -- Portable Operating System Interface (POSIX), The Open Group Base
  1357. Specifications Issue 7, 2018 Edition,
  1358. Copyright (C) 2018 by the Institute of
  1359. Electrical and Electronics Engineers, Inc and The Open Group.
  1360. In the event of any discrepancy between this version and the original IEEE and
  1361. The Open Group Standard, the original IEEE and The Open Group Standard
  1362. is the referee document. The original Standard can be obtained online at
  1363. http://www.opengroup.org/unix/online.html .
  1364. .PP
  1365. Any typographical or formatting errors that appear
  1366. in this page are most likely
  1367. to have been introduced during the conversion of the source files to
  1368. man page format. To report such errors, see
  1369. https://www.kernel.org/doc/man-pages/reporting_bugs.html .