lex.1p (37027B)
- '\" et
- .TH LEX "1P" 2017 "IEEE/The Open Group" "POSIX Programmer's Manual"
- .\"
- .SH PROLOG
- This manual page is part of the POSIX Programmer's Manual.
- The Linux implementation of this interface may differ (consult
- the corresponding Linux manual page for details of Linux behavior),
- or the interface may not be implemented on Linux.
- .\"
- .SH NAME
- lex
- \(em generate programs for lexical tasks (\fBDEVELOPMENT\fP)
- .SH SYNOPSIS
- .LP
- .nf
- lex \fB[\fR-t\fB] [\fR-n|-v\fB] [\fIfile\fR...\fB]\fR
- .fi
- .SH DESCRIPTION
- The
- .IR lex
- utility shall generate C programs to be used in lexical processing of
- character input, and that can be used as an interface to
- .IR yacc .
- The C programs shall be generated from
- .IR lex
- source code and conform to the ISO\ C standard, without depending on any undefined,
- unspecified, or implementation-defined behavior, except in cases where
- the code is copied directly from the supplied source, or in cases that
- are documented by the implementation. Usually, the
- .IR lex
- utility shall write the program it generates to the file
- .BR lex.yy.c ;
- the state of this file is unspecified if
- .IR lex
- exits with a non-zero exit status. See the EXTENDED DESCRIPTION
- section for a complete description of the
- .IR lex
- input language.
- .SH OPTIONS
- The
- .IR lex
- utility shall conform to the Base Definitions volume of POSIX.1\(hy2017,
- .IR "Section 12.2" ", " "Utility Syntax Guidelines",
- except for Guideline 9.
- .P
- The following options shall be supported:
- .IP "\fB\-n\fP" 10
- Suppress the summary of statistics usually written with the
- .BR \-v
- option. If no table sizes are specified in the
- .IR lex
- source code and the
- .BR \-v
- option is not specified, then
- .BR \-n
- is implied.
- .IP "\fB\-t\fP" 10
- Write the resulting program to standard output instead of
- .BR lex.yy.c .
- .IP "\fB\-v\fP" 10
- Write a summary of
- .IR lex
- statistics to the standard output. (See the discussion of
- .IR lex
- table sizes in
- .IR "Definitions in lex".)
- If the
- .BR \-t
- option is specified and
- .BR \-n
- is not specified, this report shall be written to standard error. If
- table sizes are specified in the
- .IR lex
- source code, and if the
- .BR \-n
- option is not specified, the
- .BR \-v
- option may be enabled.
- .SH OPERANDS
- The following operand shall be supported:
- .IP "\fIfile\fR" 10
- A pathname of an input file. If more than one such
- .IR file
- is specified, all files shall be concatenated to produce a single
- .IR lex
- program. If no
- .IR file
- operands are specified, or if a
- .IR file
- operand is
- .BR '\-' ,
- the standard input shall be used.
- .SH STDIN
- The standard input shall be used if no
- .IR file
- operands are specified, or if a
- .IR file
- operand is
- .BR '\-' .
- See INPUT FILES.
- .SH "INPUT FILES"
- The input files shall be text files containing
- .IR lex
- source code, as described in the EXTENDED DESCRIPTION section.
- .SH "ENVIRONMENT VARIABLES"
- The following environment variables shall affect the execution of
- .IR lex :
- .IP "\fILANG\fP" 10
- Provide a default value for the internationalization variables that are
- unset or null. (See the Base Definitions volume of POSIX.1\(hy2017,
- .IR "Section 8.2" ", " "Internationalization Variables"
- for the precedence of internationalization variables used to determine
- the values of locale categories.)
- .IP "\fILC_ALL\fP" 10
- If set to a non-empty string value, override the values of all the
- other internationalization variables.
- .IP "\fILC_COLLATE\fP" 10
- .br
- Determine the locale for the behavior of ranges, equivalence classes,
- and multi-character collating elements within regular expressions. If
- this variable is not set to the POSIX locale, the results are
- unspecified.
- .IP "\fILC_CTYPE\fP" 10
- Determine the locale for the interpretation of sequences of bytes of
- text data as characters (for example, single-byte as opposed to
- multi-byte characters in arguments and input files), and the behavior
- of character classes within regular expressions. If this variable is
- not set to the POSIX locale, the results are unspecified.
- .IP "\fILC_MESSAGES\fP" 10
- .br
- Determine the locale that should be used to affect the format and
- contents of diagnostic messages written to standard error.
- .IP "\fINLSPATH\fP" 10
- Determine the location of message catalogs for the processing of
- .IR LC_MESSAGES .
- .SH "ASYNCHRONOUS EVENTS"
- Default.
- .SH STDOUT
- If the
- .BR \-t
- option is specified, the text file of C source code output of
- .IR lex
- shall be written to standard output.
- .P
- If the
- .BR \-t
- option is not specified:
- .IP " *" 4
- Implementation-defined informational, error, and warning messages
- concerning the contents of
- .IR lex
- source code input shall be written to either the standard output or
- standard error.
- .IP " *" 4
- If the
- .BR \-v
- option is specified and the
- .BR \-n
- option is not specified,
- .IR lex
- statistics shall also be written to either the standard output or
- standard error, in an implementation-defined format. These
- statistics may also be generated if table sizes are specified with a
- .BR '%'
- operator in the
- .IR Definitions
- section, as long as the
- .BR \-n
- option is not specified.
- .SH STDERR
- If the
- .BR \-t
- option is specified, implementation-defined informational, error, and
- warning messages concerning the contents of
- .IR lex
- source code input shall be written to the standard error.
- .P
- If the
- .BR \-t
- option is not specified:
- .IP " 1." 4
- Implementation-defined informational, error, and warning messages
- concerning the contents of
- .IR lex
- source code input shall be written to either the standard output or
- standard error.
- .IP " 2." 4
- If the
- .BR \-v
- option is specified and the
- .BR \-n
- option is not specified,
- .IR lex
- statistics shall also be written to either the standard output or
- standard error, in an implementation-defined format. These
- statistics may also be generated if table sizes are specified with a
- .BR '%'
- operator in the
- .IR Definitions
- section, as long as the
- .BR \-n
- option is not specified.
- .SH "OUTPUT FILES"
- A text file containing C source code shall be written to
- .BR lex.yy.c ,
- or to the standard output if the
- .BR \-t
- option is present.
- .SH "EXTENDED DESCRIPTION"
- Each input file shall contain
- .IR lex
- source code, which is a table of regular expressions with corresponding
- actions in the form of C program fragments.
- .P
- When
- .BR lex.yy.c
- is compiled and linked with the
- .IR lex
- library (using the
- .BR "\-l\ l"
- operand with
- .IR c99 ),
- the resulting program shall read character input from the standard
- input and shall partition it into strings that match the given
- expressions.
- .br
- .P
- When an expression is matched, these actions shall occur:
- .IP " *" 4
- The input string that was matched shall be left in
- .IR yytext
- as a null-terminated string;
- .IR yytext
- shall either be an external character array or a pointer to a
- character string. As explained in
- .IR "Definitions in lex",
- the type can be explicitly selected using the
- .BR %array
- or
- .BR %pointer
- declarations, but the default is implementation-defined.
- .IP " *" 4
- The external
- .BR int
- .IR yyleng
- shall be set to the length of the matching string.
- .IP " *" 4
- The expression's corresponding program fragment, or action, shall be
- executed.
- .P
- During pattern matching,
- .IR lex
- shall search the set of patterns for the single longest possible
- match. Among rules that match the same number of characters, the rule
- given first shall be chosen.
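- .P
- As an informal illustration of this selection rule (not part of the
- interface), the following C sketch picks the winning rule from the
- match lengths of all rules at the current input position. The
- function name
- .IR select_rule
- and its array-based interface are hypothetical, and the sketch assumes
- that a length of zero means the rule did not match:

```c
#include <stddef.h>

/*
 * Pick the winning rule for one scanning step.
 * match_len[i] is the number of characters rule i matched at the current
 * input position; rules are indexed in source order. A length of 0 is
 * treated as "no match" (an assumption of this sketch).
 * Returns the index of the chosen rule, or -1 if no rule matched.
 */
int select_rule(const size_t *match_len, size_t nrules)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; i++) {
        /*
         * Strictly greater: a later rule with an equal-length match
         * never displaces the rule that was given first.
         */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

- .P
- With the match lengths 2, 3, and 3, the sketch selects the second rule
- (index 1): it ties for the longest match and is given first.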
- .P
- The general format of
- .IR lex
- source shall be:
- .sp
- .RS
- .IR Definitions
- .BR %%
- .IR Rules
- .BR %%
- .IR "User Subroutines"
- .RE
- .P
- The first
- .BR \(dq%%\(dq
- is required to mark the beginning of the rules (regular expressions and
- actions); the second
- .BR \(dq%%\(dq
- is required only if user subroutines follow.
- .P
- Any line in the
- .IR Definitions
- section beginning with a
- <blank>
- shall be assumed to be a C program fragment and shall be copied to the
- external definition area of the
- .BR lex.yy.c
- file. Similarly, anything in the
- .IR Definitions
- section included between delimiter lines containing only
- .BR \(dq%{\(dq
- and
- .BR \(dq%}\(dq
- shall also be copied unchanged to the external definition area of the
- .BR lex.yy.c
- file.
- .P
- Any such input (beginning with a
- <blank>
- or within
- .BR \(dq%{\(dq
- and
- .BR \(dq%}\(dq
- delimiter lines) appearing at the beginning of the
- .IR Rules
- section before any rules are specified shall be written to
- .BR lex.yy.c
- after the declarations of variables for the
- \fIyylex\fR()
- function and before the first line of code in
- \fIyylex\fR().
- Thus, user variables local to
- \fIyylex\fR()
- can be declared here, as well as application code to execute upon entry
- to
- \fIyylex\fR().
- .P
- The action taken by
- .IR lex
- when encountering any input beginning with a
- <blank>
- or within
- .BR \(dq%{\(dq
- and
- .BR \(dq%}\(dq
- delimiter lines appearing in the
- .IR Rules
- section but coming after one or more rules is undefined. The presence
- of such input may result in an erroneous definition of the
- \fIyylex\fR()
- function.
- .P
- C-language code in the input shall not contain C-language trigraphs.
- The C-language code within
- .BR \(dq%{\(dq
- and
- .BR \(dq%}\(dq
- delimiter lines shall not contain any lines consisting only of
- .BR \(dq%}\(dq ,
- or only of
- .BR \(dq%%\(dq .
- .SS "Definitions in lex"
- .P
- .IR Definitions
- appear before the first
- .BR \(dq%%\(dq
- delimiter. Any line in this section not contained between
- .BR \(dq%{\(dq
- and
- .BR \(dq%}\(dq
- lines and not beginning with a
- <blank>
- shall be assumed to define a
- .IR lex
- substitution string. The format of these lines shall be:
- .sp
- .RS 4
- .nf
- \fIname substitute\fR
- .fi
- .P
- .RE
- .P
- If a
- .IR name
- does not meet the requirements for identifiers in the ISO\ C standard, the result
- is undefined. The string
- .IR substitute
- shall replace the string {\c
- .IR name }
- when it is used in a rule. The
- .IR name
- string shall be recognized in this context only when the braces are
- provided and when it does not appear within a bracket expression or
- within double-quotes.
- .P
- In the
- .IR Definitions
- section, any line beginning with a
- <percent-sign>
- (\c
- .BR '%' )
- character and followed by an alphanumeric word beginning with either
- .BR 's'
- or
- .BR 'S'
- shall define a set of start conditions. Any line beginning with a
- .BR '%'
- followed by a word beginning with either
- .BR 'x'
- or
- .BR 'X'
- shall define a set of exclusive start conditions. When the generated
- scanner is in a
- .BR %s
- state, patterns with no state specified shall be also active; in a
- .BR %x
- state, such patterns shall not be active. The rest of the line, after
- the first word, shall be considered to be one or more
- <blank>-separated
- names of start conditions. Start condition names shall be constructed
- in the same way as definition names. Start conditions can be used to
- restrict the matching of regular expressions to one or more states as
- described in
- .IR "Regular Expressions in lex".
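- .P
- The difference between inclusive and exclusive start conditions can be
- sketched in C as a predicate that decides whether a rule is active.
- This is only an illustration; the function name
- .IR rule_is_active
- and the integer encoding of start conditions are hypothetical and do
- not reflect the tables of any particular implementation:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether a rule takes part in matching.
 * rule_states/nstates: the start conditions named in the rule's <...>
 *                      prefix; nstates == 0 means the rule names none.
 * current:             the start condition the scanner is in.
 * current_is_exclusive: true if `current` was declared with %x.
 */
bool rule_is_active(const int *rule_states, size_t nstates,
                    int current, bool current_is_exclusive)
{
    if (nstates == 0) {
        /* Rules with no state are active except in an exclusive state. */
        return !current_is_exclusive;
    }
    for (size_t i = 0; i < nstates; i++) {
        if (rule_states[i] == current)
            return true;
    }
    return false;
}
```

- .P
- A rule that names no start condition is therefore active in every
- .BR %s
- state and inactive in every
- .BR %x
- state, while a rule that names the current state is active in both
- cases.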
- .P
- Implementations shall accept either of the following two
- mutually-exclusive declarations in the
- .IR Definitions
- section:
- .IP "\fB%array\fR" 10
- Declare the type of
- .IR yytext
- to be a null-terminated character array.
- .IP "\fB%pointer\fR" 10
- Declare the type of
- .IR yytext
- to be a pointer to a null-terminated character string.
- .P
- The default type of
- .IR yytext
- is implementation-defined. If an application refers to
- .IR yytext
- outside of the scanner source file (that is, via an
- .BR extern ),
- the application shall include the appropriate
- .BR %array
- or
- .BR %pointer
- declaration in the scanner source file.
- .P
- Implementations shall accept declarations in the
- .IR Definitions
- section for setting certain internal table sizes. The declarations are
- shown in the following table.
- .sp
- .ce 1
- \fBTable: Table Size Declarations in \fIlex\fP\fR
- .TS
- center tab(!) box;
- cB | cB | cB
- l | l | n.
- Declaration!Description!Minimum Value
- _
- %\fBp \fIn\fR!Number of positions!2\|500
- %\fBn \fIn\fR!Number of states!500
- %\fBa \fIn\fR!Number of transitions!2\|000
- %\fBe \fIn\fR!Number of parse tree nodes!1\|000
- %\fBk \fIn\fR!Number of packed character classes!1\|000
- %\fBo \fIn\fR!Size of the output array!3\|000
- .TE
- .P
- In the table,
- .IR n
- represents a positive decimal integer, preceded by one or more
- <blank>
- characters. The exact meaning of these table size numbers is
- implementation-defined. The implementation shall document how these
- numbers affect the
- .IR lex
- utility and how they are related to any output that may be generated by
- the implementation should limitations be encountered during the
- execution of
- .IR lex .
- It shall be possible to determine from this output which of the table
- size values needs to be modified to permit
- .IR lex
- to successfully generate tables for the input language. The values in
- the column Minimum Value represent the lowest values conforming
- implementations shall provide.
- .SS "Rules in lex"
- .P
- The rules in
- .IR lex
- source files are a table in which the left column contains regular
- expressions and the right column contains actions (C program fragments)
- to be executed when the expressions are recognized.
- .sp
- .RS 4
- .nf
- \fIERE action
- ERE action\fP
- \&...
- .fi
- .P
- .RE
- .P
- The extended regular expression (ERE) portion of a row shall be
- separated from
- .IR action
- by one or more
- <blank>
- characters. A regular expression containing
- <blank>
- characters shall be recognized under one of the following conditions:
- .IP " *" 4
- The entire expression appears within double-quotes.
- .IP " *" 4
- The
- <blank>
- characters appear within double-quotes or square brackets.
- .IP " *" 4
- Each
- <blank>
- is preceded by a
- <backslash>
- character.
- .SS "User Subroutines in lex"
- .P
- Anything in the user subroutines section shall be copied to
- .BR lex.yy.c
- following
- \fIyylex\fR().
- .SS "Regular Expressions in lex"
- .P
- The
- .IR lex
- utility shall support the set of extended regular expressions (see the Base Definitions volume of POSIX.1\(hy2017,
- .IR "Section 9.4" ", " "Extended Regular Expressions"),
- with the following additions and exceptions to the syntax:
- .IP "\fR\&\(dq...\&\(dq\fR" 10
- Any string enclosed in double-quotes shall represent the characters
- within the double-quotes as themselves, except that
- <backslash>-escapes
- (which appear in the following table) shall be recognized. Any
- <backslash>-escape
- sequence shall be terminated by the closing quote. For example,
- .BR \(dq\e01\(dq \c
- .BR \(dq1\(dq
- represents a single string: the octal value 1 followed by the
- character
- .BR '1' .
- .IP "<\fIstate\fR>\fIr\fR,\ <\fIstate1,state2,\fR.\|.\|.>\fIr\fR" 10
- .br
- The regular expression
- .IR r
- shall be matched only when the program is in one of the start
- conditions indicated by
- .IR state ,
- .IR state1 ,
- and so on; see
- .IR "Actions in lex".
- (As an exception to the typographical conventions of the rest of this volume of POSIX.1\(hy2017,
- in this case <\fIstate\fP> does not represent a metavariable, but the
- literal angle-bracket characters surrounding a symbol.) The start
- condition shall be recognized as such only at the beginning of a
- regular expression.
- .IP "\fIr\fP/\fIx\fP" 10
- The regular expression
- .IR r
- shall be matched only if it is followed by an occurrence of regular
- expression
- .IR x
- (\c
- .IR x
- is the instance of trailing context, further defined below). The token
- returned in
- .IR yytext
- shall only match
- .IR r .
- If the trailing portion of
- .IR r
- matches the beginning of
- .IR x ,
- the result is unspecified. The
- .IR r
- expression cannot include further trailing context or the
- .BR '$'
- (match-end-of-line) operator;
- .IR x
- cannot include the
- .BR '\(ha'
- (match-beginning-of-line) operator, nor trailing context, nor the
- .BR '$'
- operator. That is, only one occurrence of trailing context is allowed
- in a
- .IR lex
- regular expression, and the
- .BR '\(ha'
- operator only can be used at the beginning of such an expression.
- .IP "{\fIname\fR}" 10
- When
- .IR name
- is one of the substitution symbols from the
- .IR Definitions
- section, the string, including the enclosing braces, shall be replaced
- by the
- .IR substitute
- value. The
- .IR substitute
- value shall be treated in the extended regular expression as if it were
- enclosed in parentheses. No substitution shall occur if {\c
- .IR name }
- occurs within a bracket expression or within double-quotes.
- .P
- Within an ERE, a
- <backslash>
- character shall be considered to begin an escape sequence as specified
- in the table in the Base Definitions volume of POSIX.1\(hy2017,
- .IR "Chapter 5" ", " "File Format Notation"
- (\c
- .BR '\e\e' ,
- .BR '\ea' ,
- .BR '\eb' ,
- .BR '\ef' ,
- .BR '\en' ,
- .BR '\er' ,
- .BR '\et' ,
- .BR '\ev' ).
- In addition, the escape sequences in the following table shall be
- recognized.
- .P
- A literal
- <newline>
- cannot occur within an ERE; the escape sequence
- .BR '\en'
- can be used to represent a
- <newline>.
- A
- <newline>
- shall not be matched by a period operator.
- .br
- .sp
- .ce 1
- \fBTable: Escape Sequences in \fIlex\fP\fR
- .ad l
- .TS
- center tab(@) box;
- cB | cB | cB
- cB | cB | cB
- lf5 | lw(2.4i) | lw(2.4i).
- Escape
- Sequence@Description@Meaning
- _
- \e\fIdigits\fP@T{
- A
- <backslash>
- character followed by the longest sequence of one, two, or three
- octal-digit characters (01234567). If all of the digits are 0 (that is,
- representation of the NUL character), the behavior is undefined.
- T}@T{
- The character whose encoding is represented by the one, two, or
- three-digit octal integer. Multi-byte characters require
- multiple, concatenated escape sequences of this type, including the
- leading
- <backslash>
- for each byte.
- T}
- _
- \ex\fIdigits\fP@T{
- A
- <backslash>
- character followed by the longest sequence of hexadecimal-digit
- characters (01234567abcdefABCDEF). If all of the digits are 0 (that is,
- representation of the NUL character), the behavior is undefined.
- T}@T{
- The character whose encoding is represented by the hexadecimal
- integer.
- T}
- _
- \ec@T{
- A
- <backslash>
- character followed by any character not described in this
- table or in the table in the Base Definitions volume of POSIX.1\(hy2017,
- .IR "Chapter 5" ", " "File Format Notation"
- (\c
- .BR '\e\e' ,
- .BR '\ea' ,
- .BR '\eb' ,
- .BR '\ef' ,
- .BR '\en' ,
- .BR '\er' ,
- .BR '\et' ,
- .BR '\ev' ).
- T}@T{
- The character
- .BR 'c' ,
- unchanged.
- T}
- .TE
- .ad b
- .TP 10
- .BR Note:
- If a
- .BR '\ex'
- sequence needs to be immediately followed by a hexadecimal digit
- character, a sequence such as
- .BR \(dq\ex1\(dq \c
- .BR \(dq1\(dq
- can be used, which represents a character containing the value 1,
- followed by the character
- .BR '1' .
- .P
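- The escape rules in the table above can be summarized by the following
- C sketch, offered only as an illustration. The function name
- .IR decode_escape
- and its return convention are hypothetical. The sketch assumes that a
- .BR '\ex'
- with no hexadecimal digit after it is rejected, a case the table does
- not address, and it reports an all-zero numeric escape as an error
- because that behavior is undefined:

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Decode the single escape sequence that starts at s[0] == '\\'.
 * On success, store the character value in *out and return the number
 * of source characters consumed. Return 0 for input this sketch
 * rejects: no backslash, nothing after it, "\x" without digits (an
 * assumption), or a numeric escape whose digits are all zero.
 */
size_t decode_escape(const char *s, unsigned *out)
{
    size_t i;
    unsigned v = 0;

    if (s[0] != '\\' || s[1] == '\0')
        return 0;

    if (s[1] >= '0' && s[1] <= '7') {
        /* Octal: the longest run of one, two, or three octal digits. */
        for (i = 1; i <= 3 && s[i] >= '0' && s[i] <= '7'; i++)
            v = v * 8 + (unsigned)(s[i] - '0');
        if (v == 0)
            return 0;
    } else if (s[1] == 'x') {
        /* Hexadecimal: the longest run of hex digits, no length limit. */
        for (i = 2; isxdigit((unsigned char)s[i]); i++) {
            int c = tolower((unsigned char)s[i]);
            v = v * 16 + (unsigned)(isdigit(c) ? c - '0' : c - 'a' + 10);
        }
        if (i == 2 || v == 0)
            return 0;
    } else {
        /* Named escapes; any other "\c" is the character c unchanged. */
        switch (s[1]) {
        case 'a': v = '\a'; break;
        case 'b': v = '\b'; break;
        case 'f': v = '\f'; break;
        case 'n': v = '\n'; break;
        case 'r': v = '\r'; break;
        case 't': v = '\t'; break;
        case 'v': v = '\v'; break;
        default:  v = (unsigned char)s[1]; break;
        }
        i = 2;
    }
    *out = v;
    return i;
}
```

- .P
- Because the hexadecimal form consumes every hexadecimal digit that
- follows, the sketch decodes the four source characters
- .BR \ex11
- as the single value 0x11. This is why the preceding note splits the
- text into two quoted strings.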
- .P
- The order of precedence given to extended regular expressions for
- .IR lex
- differs from that specified in the Base Definitions volume of POSIX.1\(hy2017,
- .IR "Section 9.4" ", " "Extended Regular Expressions".
- The order of precedence for
- .IR lex
- shall be as shown in the following table, from high to low.
- .TP 10
- .BR Note:
- The escaped characters entry is not meant to imply that these are
- operators, but they are included in the table to show their
- relationships to the true operators. The start condition, trailing
- context, and anchoring notations have been omitted from the table
- because of the placement restrictions described in this section; they
- can only appear at the beginning or ending of an ERE.
- .P
- .br
- .sp
- .ce 1
- \fBTable: ERE Precedence in \fIlex\fP\fR
- .TS
- center tab(@) box;
- cB | cB
- lf2 | lf5.
- Extended Regular Expression@Precedence
- _
- collation-related bracket symbols@[= =] [: :] [. .]
- escaped characters@\e<\fIspecial character\fP>
- bracket expression@[ ]
- quoting@"..."
- grouping@( )
- definition@{\fIname\fP}
- single-character RE duplication@* + ?
- concatenation
- interval expression@{m,n}
- alternation@|
- .TE
- .P
- The ERE anchoring operators
- .BR '\(ha'
- and
- .BR '$'
- do not appear in the table. With
- .IR lex
- regular expressions, these operators are restricted in their use: the
- .BR '\(ha'
- operator can only be used at the beginning of an entire regular
- expression, and the
- .BR '$'
- operator only at the end. The operators apply to the entire regular
- expression. Thus, for example, the pattern
- .BR \(dq(\(haabc)|(def$)\(dq
- is undefined; it can instead be written as two separate rules, one with
- the regular expression
- .BR \(dq\(haabc\(dq
- and one with
- .BR \(dqdef$\(dq ,
- which share a common action via the special
- .BR '|'
- action (see below). If the pattern were written
- .BR \(dq\(haabc|def$\(dq ,
- it would match either
- .BR \(dqabc\(dq
- or
- .BR \(dqdef\(dq
- on a line by itself.
- .P
- Unlike the general ERE rules, embedded anchoring is not allowed by most
- historical
- .IR lex
- implementations. An example of embedded anchoring would be for
- patterns such as
- .BR \(dq(\(ha|\ )foo(\ |$)\(dq
- to match
- .BR \(dqfoo\(dq
- when it exists as a complete word. This functionality can be obtained
- using existing
- .IR lex
- features:
- .sp
- .RS 4
- .nf
- \(hafoo/[ \en] |
- " foo"/[ \en] /* Found foo as a separate word. */
- .fi
- .P
- .RE
- .P
- Note also that
- .BR '$'
- is a form of trailing context (it is equivalent to
- .BR \(dq/\en\(dq )
- and as such cannot be used with regular expressions containing another
- instance of the operator (see the preceding discussion of trailing
- context).
- .P
- The additional regular expressions trailing-context operator
- .BR '/'
- can be used as an ordinary character if presented within double-quotes,
- .BR \(dq/\(dq ;
- preceded by a
- <backslash>,
- .BR \(dq\e/\(dq ;
- or within a bracket expression,
- .BR \(dq[/]\(dq .
- The start-condition
- .BR '<'
- and
- .BR '>'
- operators shall be special only in a start condition at the beginning
- of a regular expression; elsewhere in the regular expression they shall
- be treated as ordinary characters.
- .SS "Actions in lex"
- .P
- The action to be taken when an ERE is matched can be a C program
- fragment or the special actions described below; the program fragment
- can contain one or more C statements, and can also include special
- actions. The empty C statement
- .BR ';'
- shall be a valid action; any string in the
- .BR lex.yy.c
- input that matches the pattern portion of such a rule is effectively
- ignored or skipped. However, the absence of an action shall not be
- valid, and the action
- .IR lex
- takes in such a condition is undefined.
- .P
- The specification for an action, including C statements and special
- actions, can extend across several lines if enclosed in braces:
- .sp
- .RS 4
- .nf
- \fIERE\fP <\fIone or more blanks\fR> { \fIprogram statement
- program statement\fP }
- .fi
- .P
- .RE
- .P
- The program statements shall not contain unbalanced curly brace
- preprocessing tokens.
- .P
- The default action when a string in the input to a
- .BR lex.yy.c
- program is not matched by any expression shall be to copy the string to
- the output. Because the default behavior of a program generated by
- .IR lex
- is to read the input and copy it to the output, a minimal
- .IR lex
- source program that has just
- .BR \(dq%%\(dq
- shall generate a C program that simply copies the input to the output
- unchanged.
- .P
- Four special actions shall be available:
- .sp
- .RS 4
- .nf
- | ECHO; REJECT; BEGIN
- .fi
- .P
- .RE
- .IP "\fR|\fR" 10
- The action
- .BR '|'
- means that the action for the next rule is the action for this rule.
- Unlike the other three actions,
- .BR '|'
- cannot be enclosed in braces or be
- <semicolon>-terminated;
- the application shall ensure that it is specified alone, with no other
- actions.
- .IP "\fBECHO;\fR" 10
- Write the contents of the string
- .IR yytext
- on the output.
- .IP "\fBREJECT;\fR" 10
- Usually only a single expression is matched by a given string in the
- input.
- .BR REJECT
- means ``continue to the next expression that matches the current
- input'', and shall cause whatever rule was the second choice after the
- current rule to be executed for the same input. Thus, multiple rules
- can be matched and executed for one input string or overlapping input
- strings. For example, given the regular expressions
- .BR \(dqxyz\(dq
- and
- .BR \(dqxy\(dq
- and the input
- .BR \(dqxyz\(dq ,
- usually only the regular expression
- .BR \(dqxyz\(dq
- would match. The next attempted match would start after
- .BR z .
- If the last action in the
- .BR \(dqxyz\(dq
- rule is
- .BR REJECT ,
- both this rule and the
- .BR \(dqxy\(dq
- rule would be executed. The
- .BR REJECT
- action may be implemented in such a fashion that flow of control does
- not continue after it, as if it were equivalent to a
- .BR goto
- to another part of
- \fIyylex\fR().
- The use of
- .BR REJECT
- may result in somewhat larger and slower scanners.
- .IP "\fBBEGIN\fR" 10
- The action:
- .RS 10
- .sp
- .RS 4
- .nf
- BEGIN \fInewstate\fP;
- .fi
- .P
- .RE
- .P
- switches the state (start condition) to
- .IR newstate .
- If the string
- .IR newstate
- has not been declared previously as a start condition in the
- .IR Definitions
- section, the results are unspecified. The initial state is indicated
- by the digit
- .BR '0'
- or the token
- .BR INITIAL .
- .RE
- .P
- The functions or macros described below are accessible to user code
- included in the
- .IR lex
- input. It is unspecified whether they appear in the C code output of
- .IR lex ,
- or are accessible only through the
- .BR "\-l\ l"
- operand to
- .IR c99
- (the
- .IR lex
- library).
- .IP "\fBint\ \fIyylex\fR(\fBvoid\fR)" 6
- .br
- Performs lexical analysis on the input; this is the primary function
- generated by the
- .IR lex
- utility. The function shall return zero when the end of input is
- reached; otherwise, it shall return non-zero values (tokens) determined
- by the actions that are selected.
- .IP "\fBint\ \fIyymore\fR(\fBvoid\fR)" 6
- .br
- When called, indicates that when the next input string is recognized,
- it is to be appended to the current value of
- .IR yytext
- rather than replacing it; the value in
- .IR yyleng
- shall be adjusted accordingly.
- .IP "\fBint\ \fIyyless\fR(\fBint\ \fIn\fR)" 6
- .br
- Retains
- .IR n
- initial characters in
- .IR yytext ,
- NUL-terminated, and treats the remaining characters as if they had not
- been read; the value in
- .IR yyleng
- shall be adjusted accordingly.
- .IP "\fBint\ \fIinput\fR(\fBvoid\fR)" 6
- .br
- Returns the next character from the input, or zero on end-of-file. It
- shall obtain input from the stream pointer
- .IR yyin ,
- although possibly via an intermediate buffer. Thus, once scanning has
- begun, the effect of altering the value of
- .IR yyin
- is undefined. The character read shall be removed from the input
- stream of the scanner without any processing by the scanner.
- .IP "\fBint\ \fIunput\fR(\fBint\ \fIc\fR)" 6
- .br
- Returns the character
- .BR 'c'
- to the input;
- .IR yytext
- and
- .IR yyleng
- are undefined until the next expression is matched. The result of
- using
- \fIunput\fR()
- for more characters than have been input is unspecified.
- .P
- The following functions shall appear only in the
- .IR lex
- library accessible through the
- .BR "\-l\ l"
- operand; they can therefore be redefined by a conforming application:
- .IP "\fBint\ \fIyywrap\fR(\fBvoid\fR)" 6
- .br
- Called by
- \fIyylex\fR()
- at end-of-file; the default
- \fIyywrap\fR()
- shall always return 1. If the application requires
- \fIyylex\fR()
- to continue processing with another source of input, then the
- application can include a function
- \fIyywrap\fR(),
- which associates another file with the external variable
- .BR "FILE *"
- .IR yyin
- and shall return a value of zero.
- .IP "\fBint\ \fImain\fR(\fBint\ \fIargc\fR, \fBchar *\fIargv\fR[\|])" 6
- .br
- Calls
- \fIyylex\fR()
- to perform lexical analysis, then exits. The user code can contain
- \fImain\fR()
- to perform application-specific operations, calling
- \fIyylex\fR()
- as applicable.
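- .P
- As an illustration of an application-supplied
- \fIyywrap\fR(),
- the following C sketch moves on to the next readable file in a list
- and returns zero, or returns 1 when the list is exhausted. The helper
- .IR set_input_files
- and the list variable are hypothetical. The definition of
- .IR yyin
- is a stand-in so that the sketch compiles on its own; in a real
- scanner that variable comes from the generated code and this
- definition would be dropped:

```c
#include <stdio.h>

FILE *yyin;                     /* stand-in for the scanner's input stream */
static const char **next_file;  /* hypothetical: NULL-terminated pathname list */

/* Hypothetical helper: register the files to scan after the current one. */
void set_input_files(const char **files)
{
    next_file = files;
}

/*
 * Called at end-of-file. Returns 0 after pointing yyin at the next file
 * that can be opened, or 1 when no further input is available.
 */
int yywrap(void)
{
    while (next_file != NULL && *next_file != NULL) {
        FILE *fp = fopen(*next_file++, "r");

        if (fp != NULL) {
            /* Release the exhausted stream, but never close stdin. */
            if (yyin != NULL && yyin != stdin)
                fclose(yyin);
            yyin = fp;
            return 0;
        }
        /* Unreadable files are skipped; a real program might report them. */
    }
    return 1;
}
```

- .P
- Returning 1 once the list is empty matches the behavior of the default
- \fIyywrap\fR(),
- so scanning ends in the usual way.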
- .P
- Except for
- \fIinput\fR(),
- \fIunput\fR(),
- and
- \fImain\fR(),
- all external and static names generated by
- .IR lex
- shall begin with the prefix
- .BR yy
- or
- .BR YY .
- .SH "EXIT STATUS"
- The following exit values shall be returned:
- .IP "\00" 6
- Successful completion.
- .IP >0 6
- An error occurred.
- .SH "CONSEQUENCES OF ERRORS"
- Default.
- .LP
- .IR "The following sections are informative."
- .SH "APPLICATION USAGE"
- Conforming applications are warned that in the
- .IR Rules
- section, an ERE without an action is not acceptable, but need not be
- detected as erroneous by
- .IR lex .
- This may result in compilation or runtime errors.
- .P
- The purpose of
- \fIinput\fR()
- is to take characters off the input stream and discard them as far as
- the lexical analysis is concerned. A common use is to discard the body
- of a comment once the beginning of a comment is recognized.
- .P
- The
- .IR lex
- utility is not fully internationalized in its treatment of regular
- expressions in the
- .IR lex
- source code or generated lexical analyzer. It would seem desirable to
- have the lexical analyzer interpret the regular expressions given in
- the
- .IR lex
- source according to the environment specified when the lexical analyzer
- is executed, but this is not possible with the current
- .IR lex
- technology. Furthermore, the very nature of the lexical analyzers
- produced by
- .IR lex
- must be closely tied to the lexical requirements of the input language
- being described, which is frequently locale-specific anyway. (For
- example, writing an analyzer that is used for French text is not
- automatically useful for processing other languages.)
- .SH EXAMPLES
- The following is an example of a
- .IR lex
- program that implements a rudimentary scanner for a Pascal-like
- syntax:
- .sp
- .RS 4
- .nf
- %{
- /* Need this for the calls to atoi() and atof() below. */
- #include <stdlib.h>
- /* Need this for printf(), fopen(), and stdin below. */
- #include <stdio.h>
- %}
- .P
- DIGIT [0-9]
- ID [a-z][a-z0-9]*
- .P
- %%
- .P
- {DIGIT}+ {
- printf("An integer: %s (%d)\en", yytext,
- atoi(yytext));
- }
- .P
- {DIGIT}+"."{DIGIT}* {
- printf("A float: %s (%g)\en", yytext,
- atof(yytext));
- }
- .P
- if|then|begin|end|procedure|function {
- printf("A keyword: %s\en", yytext);
- }
- .P
- {ID} printf("An identifier: %s\en", yytext);
- .P
- "+"|"-"|"*"|"/" printf("An operator: %s\en", yytext);
- .P
- "{"[\(ha}\en]*"}" /* Eat up one-line comments. */
- .P
- [ \et\en]+ /* Eat up white space. */
- .P
- \&. printf("Unrecognized character: %s\en", yytext);
- .P
- %%
- .P
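- int main(int argc, char *argv[])
- {
- ++argv, --argc; /* Skip over program name. */
- if (argc > 0) {
- yyin = fopen(argv[0], "r");
- if (yyin == NULL) {
- perror(argv[0]); /* Report the file that could not be opened. */
- return 1;
- }
- } else
- yyin = stdin;
- .P
- yylex();
- return 0;
- }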
- .fi
- .P
- .RE
- .SH RATIONALE
- Even though references to the C language are retained in this
- description,
- .IR lex
- may be generalized to other languages, as was done at one time for EFL,
- the Extended FORTRAN Language. Since the
- .IR lex
- input specification is essentially language-independent, versions of
- this utility could be written to produce Ada, Modula-2, or Pascal code,
- and there are known historical implementations that do so.
- .P
- The current description of
- .IR lex
- bypasses the issue of dealing with internationalized EREs in the
- .IR lex
- source code or generated lexical analyzer. If it follows the model used
- by
- .IR awk
- (the source code is assumed to be presented in the POSIX locale, but
- input and output are in the locale specified by the environment
- variables), then the tables in the lexical analyzer produced by
- .IR lex
- would interpret EREs specified in the
- .IR lex
- source in terms of the environment variables specified when
- .IR lex
- was executed. The desired effect would be to have the lexical analyzer
- interpret the EREs given in the
- .IR lex
- source according to the environment specified when the lexical analyzer
- is executed, but this is not possible with the current
- .IR lex
- technology.
- .P
- The description of octal and hexadecimal-digit escape sequences agrees
- with the ISO\ C standard use of escape sequences.
- .P
- Earlier versions of this standard allowed for implementations with
- bytes other than eight bits, but this has been modified in this
- version.
- .P
- There is no detailed output format specification. The observed behavior
- of
- .IR lex
- under four different historical implementations was that none of these
- implementations consistently reported the line numbers for error and
- warning messages. Furthermore, there was a desire that
- .IR lex
- be allowed to output additional diagnostic messages. Leaving message
- formats unspecified avoids these formatting questions and problems with
- internationalization.
- .P
- Although the
- .BR %x
- specifier for
- .IR exclusive
- start conditions is not historical practice, it is believed to be a
- minor change to historical implementations and greatly enhances the
- usability of
- .IR lex
- programs since it permits an application to obtain the expected
- functionality with fewer statements.
- .P
- The
- .BR %array
- and
- .BR %pointer
- declarations were added as a compromise between historical systems.
- The System V-based
- .IR lex
- copies the matched text to a
- .IR yytext
- array. The
- .IR flex
- program, supported in BSD and GNU systems, uses a pointer. In the
- latter case, significant performance improvements are available for
- some scanners. Most historical programs should require no change in
- porting from one system to another because the string being referenced
- is null-terminated in both cases. (In the pointer case,
- .IR flex
- null-terminates the token in place: it saves the character that
- immediately follows the token, overwrites it with a null byte, and
- restores it before continuing with the next scan.) Multi-file programs with
- external references to
- .IR yytext
- outside the scanner source file should continue to operate on their
- historical systems, but would require one of the new declarations to be
- considered strictly portable.
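- .P
- The in-place termination described above can be sketched in C as
- follows. This is a minimal illustration under stated assumptions, not
- code taken from
- .IR flex :
- the names
- .IR scan_state ,
- .IR mark_token ,
- and
- .IR restore_held
- are hypothetical, and the buffer is assumed to be writable one
- character past the end of each token.

```c
#include <stddef.h>

/* Hypothetical scanner state; these names are illustrative only. */
struct scan_state {
    char *hold_pos;   /* where a saved character must be put back, or NULL */
    char  hold_char;  /* the character that was overwritten by the null byte */
};

/* Put back the character overwritten for the previous token, if any. */
static void restore_held(struct scan_state *st)
{
    if (st->hold_pos != NULL) {
        *st->hold_pos = st->hold_char;
        st->hold_pos = NULL;
    }
}

/*
 * Present buf[start .. start+len) as a null-terminated string without
 * copying it: save the character just past the token and overwrite it
 * with a null byte.  The returned pointer plays the role of yytext.
 */
static char *mark_token(struct scan_state *st, char *buf,
                        size_t start, size_t len)
{
    restore_held(st);                 /* undo the previous termination */
    st->hold_pos  = buf + start + len;
    st->hold_char = *st->hold_pos;
    *st->hold_pos = 0;                /* null byte terminates the token */
    return buf + start;
}
```

- .P
- Because no copy is made, the pointer returned by the hypothetical
- \fImark_token\fR()
- is valid only until the next call, which mirrors the caution that
- applies to keeping references to
- .IR yytext
- across scans.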
- .P
- The description of EREs avoids unnecessary duplication of ERE details
- because their meanings within a
- .IR lex
- ERE are the same as those for EREs in this volume of POSIX.1\(hy2017.
- .P
- The reason for the undefined condition associated with text beginning
- with a
- <blank>
- or within
- .BR \(dq%{\(dq
- and
- .BR \(dq%}\(dq
- delimiter lines appearing in the
- .IR Rules
- section is historical practice. Both the BSD and System V
- .IR lex
- copy the indented (or enclosed) input in the
- .IR Rules
- section (except at the beginning) to unreachable areas of the
- \fIyylex\fR()
- function (the code is written directly after a
- .IR break
- statement). In some cases, the System V
- .IR lex
- generates an error message or a syntax error, depending on the form of
- indented input.
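- .P
- The placement described above can be pictured with the following C
- sketch. It is a stand-in for the action dispatch inside
- \fIyylex\fR(),
- not actual
- .IR lex
- output: the function name
- .IR run_action
- and the use of a
- .BR switch
- statement are assumptions made for illustration.

```c
/*
 * Illustrative stand-in for the action dispatch in a generated scanner;
 * the layout is an assumption, not real lex output.
 * Returns a count of the statements that were actually executed.
 */
static int run_action(int rule)
{
    int executed = 0;

    switch (rule) {
    case 1:
        executed++;          /* action for rule 1 */
        break;
        executed += 100;     /* copied indented text lands here: unreachable */
    case 2:
        executed++;          /* action for rule 2 */
        break;
    }
    return executed;
}
```

- .P
- The statement between the
- .IR break
- and the next
- .BR case
- label compiles but can never run, which is why text copied to such a
- position has no defined effect.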
- .P
- The intention in breaking the list of functions into those that may
- appear in
- .BR lex.yy.c
- \fIversus\fR those that only appear in
- .BR libl.a
- is that only those functions in
- .BR libl.a
- can be reliably redefined by a conforming application.
- .P
- The descriptions of standard output and standard error are somewhat
- complicated because historical
- .IR lex
- implementations chose to issue diagnostic messages to standard output
- (unless
- .BR \-t
- was given). POSIX.1\(hy2008 allows this behavior, but leaves an opening
- for the more expected behavior of using standard error for diagnostics.
- Also, the System V behavior of writing the statistics when any table
- sizes are given is allowed, while BSD-derived systems can avoid it. The
- programmer can always precisely obtain the desired results by using
- either the
- .BR \-t
- or
- .BR \-n
- options.
- .P
- The OPERANDS section does not mention the use of
- .BR \-
- as a synonym for standard input; not all historical implementations
- support such usage for any of the
- .IR file
- operands.
- .P
- A description of the
- .IR "translation table"
- was deleted from early proposals because of its relatively low usage in
- historical applications.
- .P
- The change to the definition of the
- \fIinput\fR()
- function that allows buffering of input presents the opportunity for
- major performance gains in some applications.
- .P
- The following examples clarify the differences between
- .IR lex
- regular expressions and regular expressions appearing elsewhere in
- this volume of POSIX.1\(hy2017. For regular expressions of the form
- .BR \(dqr/x\(dq ,
- the string matching
- .IR r
- is always returned; confusion may arise when the beginning of
- .IR x
- matches the trailing portion of
- .IR r .
- For example, given the regular expression
- .BR \(dqa*b/cc\(dq
- and the input
- .BR \(dqaaabcc\(dq ,
- .IR yytext
- would contain the string
- .BR \(dqaaab\(dq
- on this match. But given the regular expression
- .BR \(dqx*/xy\(dq
- and the input
- .BR \(dqxxxy\(dq ,
- the token
- .BR xxx ,
- not
- .BR xx ,
- is returned by some implementations because
- .BR xxx
- matches
- .BR \(dqx*\(dq .
- .P
- In the rule
- .BR \(dqab*/bc\(dq ,
- the
- .BR \(dqb*\(dq
- at the end of
- .IR r
- extends
- .IR r 's
- match into the beginning of the trailing context, so the result is
- unspecified. If this rule were
- .BR \(dqab/bc\(dq ,
- however, the rule matches the text
- .BR \(dqab\(dq
- when it is followed by the text
- .BR \(dqbc\(dq .
- In this latter case, the matching of
- .IR r
- cannot extend into the beginning of
- .IR x ,
- so the result is specified.
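- .P
- The two outcomes discussed for
- .BR \(dqx*/xy\(dq
- can be reproduced with a small hand-written C model. This is only a
- sketch: the matchers are hard-coded for that one rule, and the names
- .IR longest_r ,
- .IR has_context ,
- .IR greedy_split ,
- and
- .IR context_split
- are hypothetical rather than part of any
- .IR lex
- implementation.

```c
#include <stddef.h>
#include <string.h>

/* Length of the leading run of 'x': the longest match of r = "x*". */
static size_t longest_r(const char *s)
{
    size_t n = 0;

    while (s[n] == 'x')
        n++;
    return n;
}

/* Nonzero if s begins with the trailing context x = "xy". */
static int has_context(const char *s)
{
    return strncmp(s, "xy", 2) == 0;
}

/* Token length when r is matched greedily, ignoring the context. */
static size_t greedy_split(const char *s)
{
    return longest_r(s);
}

/*
 * Longest token length n such that s[0..n) matches r and the text at
 * s+n begins with the trailing context; (size_t)-1 if there is none.
 */
static size_t context_split(const char *s)
{
    size_t n = longest_r(s);

    for (;;) {
        if (has_context(s + n))
            return n;
        if (n == 0)
            return (size_t)-1;
        n--;
    }
}
```

- .P
- For the input
- .BR \(dqxxxy\(dq ,
- the greedy model yields a token of length 3
- .RB ( xxx ),
- while the context-aware model yields length 2
- .RB ( xx ),
- matching the two results described above.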
- .SH "FUTURE DIRECTIONS"
- None.
- .SH "SEE ALSO"
- .IR "\fIc99\fR\^",
- .IR "\fIed\fR\^",
- .IR "\fIyacc\fR\^"
- .P
- The Base Definitions volume of POSIX.1\(hy2017,
- .IR "Chapter 5" ", " "File Format Notation",
- .IR "Chapter 8" ", " "Environment Variables",
- .IR "Chapter 9" ", " "Regular Expressions",
- .IR "Section 12.2" ", " "Utility Syntax Guidelines"
- .\"
- .SH COPYRIGHT
- Portions of this text are reprinted and reproduced in electronic form
- from IEEE Std 1003.1-2017, Standard for Information Technology
- -- Portable Operating System Interface (POSIX), The Open Group Base
- Specifications Issue 7, 2018 Edition,
- Copyright (C) 2018 by the Institute of
- Electrical and Electronics Engineers, Inc and The Open Group.
- In the event of any discrepancy between this version and the original IEEE and
- The Open Group Standard, the original IEEE and The Open Group Standard
- is the referee document. The original Standard can be obtained online at
- http://www.opengroup.org/unix/online.html .
- .PP
- Any typographical or formatting errors that appear
- in this page are most likely
- to have been introduced during the conversion of the source files to
- man page format. To report such errors, see
- https://www.kernel.org/doc/man-pages/reporting_bugs.html .