ISO/IEC 14882:2003(E)

2 Lexical conventions [lex]

1 The text of the program is kept in units called source files in this International Standard. A source file together with all the headers (17.4.1.2) and source files included (16.2) via the preprocessing directive #include, less any source lines skipped by any of the conditional inclusion (16.1) preprocessing directives, is called a translation unit. [Note: a C++ program need not all be translated at the same time. ]

2 [Note: previously translated translation units and instantiation units can be preserved individually or in libraries. The separate translation units of a program communicate (3.5) by (for example) calls to functions whose identifiers have external linkage, manipulation of objects whose identifiers have external linkage, or manipulation of data files. Translation units can be separately translated and then later linked to produce an executable program (3.5). ]

2.1 Phases of translation [lex.phases]

1 The precedence among the syntax rules of translation is specified by the following phases.13)

    1 Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

    2 Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines.
    If, as a result, a character sequence that matches the syntax of a universal-character-name is produced, the behavior is undefined. If a source file that is not empty does not end in a new-line character, or ends in a new-line character immediately preceded by a backslash character, the behavior is undefined.

    3 The source file is decomposed into preprocessing tokens (2.4) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or partial comment.14) Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined. The process of dividing a source file’s characters into preprocessing tokens is context-dependent. [Example: see the handling of < within a #include preprocessing directive. ]

    4 Preprocessing directives are executed and macro invocations are expanded. If a character sequence that matches the syntax of a universal-character-name is produced by token concatenation (16.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.

    5 Each source character set member, escape sequence, or universal-character-name in character literals and string literals is converted to a member of the execution character set (2.13.2, 2.13.4).

    6 Adjacent ordinary string literal tokens are concatenated. Adjacent wide string literal tokens are concatenated.

    7 White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token (2.6). The resulting tokens are syntactically and semantically analyzed and translated. [Note: Source files, translation units and translated translation units need not necessarily be stored as files, nor need there be any one-to-one correspondence between these entities and any external representation. The description is conceptual only, and does not specify any particular implementation. ]

    8 Translated translation units and instantiation units are combined as follows: [Note: some or all of these may be supplied from a library. ] Each translated translation unit is examined to produce a list of required instantiations. [Note: this may include instantiations which have been explicitly requested (14.7.2). ] The definitions of the required templates are located. It is implementation-defined whether the source of the translation units containing these definitions is required to be available. [Note: an implementation could encode sufficient information into the translated translation unit so as to ensure the source is not required here. ] All the required instantiations are performed to produce instantiation units. [Note: these are similar to translated translation units, but contain no references to uninstantiated templates and no template definitions. ] The program is ill-formed if any instantiation fails.

    9 All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

__________________
13) Implementations must behave as if these separate phases occur, although in practice different phases might be folded together.
14) A partial preprocessing token would arise from a source file ending in the first portion of a multi-character token that requires a terminating sequence of characters, such as a header-name that is missing the closing " or >. A partial comment would arise from a source file ending with an unclosed /* comment.
2.2 Character sets [lex.charset]

1 The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:15)

    a b c d e f g h i j k l m n o p q r s t u v w x y z
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    0 1 2 3 4 5 6 7 8 9
    _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

2 The universal-character-name construct provides a way to name other characters.

    hex-quad:
        hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit

    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad

The character designated by the universal-character-name \UNNNNNNNN is that character whose character short name in ISO/IEC 10646 is NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose character short name in ISO/IEC 10646 is 0000NNNN. If the hexadecimal value for a universal character name is less than 0x20 or in the range 0x7F-0x9F (inclusive), or if the universal character name designates a character in the basic source character set, then the program is ill-formed.

3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
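
The requirement that the decimal digits have consecutive values is what makes the familiar subtraction idiom portable. A minimal sketch follows; the function name digit_value is hypothetical. No comparable guarantee is stated for letters, so the same trick is not assumed for them here.

```cpp
// Converts a decimal digit character to its numeric value, relying only on
// the guarantee that '0'..'9' have consecutive, increasing values.
// Returns -1 for any character that is not a decimal digit.
int digit_value(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

For instance, digit_value('7') is 7 regardless of the values the implementation assigns to the execution character set.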
The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.

__________________
15) The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

2.3 Trigraph sequences [lex.trigraph]

1 Before any other processing takes place, each occurrence of one of the following sequences of three characters (“trigraph sequences”) is replaced by the single character indicated in Table 1.

Table 1: trigraph sequences

    trigraph  replacement    trigraph  replacement    trigraph  replacement
    ??=       #              ??(       [              ??<       {
    ??/       \              ??)       ]              ??>       }
    ??'       ^              ??!       |              ??-       ~

2 [Example:

    ??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)

becomes

    #define arraycheck(a,b) a[b] || b[a]

]

3 No other trigraph sequence exists. Each ? that does not begin one of the trigraphs listed above is not changed.

2.4 Preprocessing tokens [lex.pptoken]

    preprocessing-token:
        header-name
        identifier
        pp-number
        character-literal
        string-literal
        preprocessing-op-or-punc
        each non-white-space character that cannot be one of the above

1 Each preprocessing token that is converted to a token (2.6) shall have the lexical form of a keyword, an identifier, a literal, an operator, or a punctuator.

2 A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6.
The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals, string literals, preprocessing-op-or-punc, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (2.7), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in clause 16, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.

3 If the input stream has been parsed into preprocessing tokens up to a given character, the next preprocessing token is the longest sequence of characters that could constitute a preprocessing token, even if that would cause further lexical analysis to fail.

4 [Example: The program fragment 1Ex is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as the pair of preprocessing tokens 1 and Ex might produce a valid expression (for example, if Ex were a macro defined as +1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name. ]

5 [Example: The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y are of built-in types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression. ]

2.5 Alternative tokens [lex.digraph]

1 Alternative token representations are provided for some operators and punctuators.16)

2 In all respects of the language, each alternative token behaves the same, respectively, as its primary token, except for its spelling.17) The set of alternative tokens is defined in Table 2.

Table 2: alternative tokens

    alternative  primary    alternative  primary    alternative  primary
    <%           {          and          &&         and_eq       &=
    %>           }          bitor        |          or_eq        |=
    <:           [          or           ||         xor_eq       ^=
    :>           ]          xor          ^          not          !
    %:           #          compl        ~          not_eq       !=
    %:%:         ##         bitand       &

2.6 Tokens [lex.token]

    token:
        identifier
        keyword
        literal
        operator
        punctuator

1 There are five kinds of tokens: identifiers, keywords, literals,18) operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments (collectively, “white space”), as described below, are ignored except as they serve to separate tokens.
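
The longest-match rule of 2.4 and the role of white space as a token separator can be seen side by side in a small sketch. The function names below are hypothetical; the fragment only illustrates how the same characters tokenize differently with and without a separating space.

```cpp
// Without white space, the longest-match rule tokenizes "x+++y" as
// x ++ + y, i.e. (x++) + y: x is incremented, y is untouched.
int munch_post_increment(int& x, int y) { return x+++y; }

// A single space changes the token sequence to x + ++ y, i.e. x + (++y):
// y is incremented before the addition, x is untouched.
int spaced_pre_increment(int x, int& y) { return x+ ++y; }
```

With x equal to 1 and y equal to 2, the first function returns 3 and leaves x equal to 2, while the second returns 4 and leaves y equal to 3.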
[Note: Some white space is required to separate otherwise adjacent identifiers, keywords, numeric literals, and alternative tokens containing alphabetic characters. ]

2.7 Comments [lex.comment]

1 The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a formfeed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. ]

__________________
16) These include “digraphs” and additional reserved words. The term “digraph” (token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing-tokens is %:%: and of course several primary tokens contain two characters. Nonetheless, those alternative tokens that aren’t lexical keywords are colloquially known as “digraphs”.
17) Thus the “stringized” values (16.3.2) of [ and <: will be different, maintaining the source spelling, but the tokens can otherwise be freely interchanged.
18) Literals include strings and character and numeric literals.
2.8 Header names [lex.header]

    header-name:
        <h-char-sequence>
        "q-char-sequence"

    h-char-sequence:
        h-char
        h-char-sequence h-char

    h-char:
        any member of the source character set except new-line and >

    q-char-sequence:
        q-char
        q-char-sequence q-char

    q-char:
        any member of the source character set except new-line and "

1 Header name preprocessing tokens shall only appear within a #include preprocessing directive (16.2). The sequences in both forms of header-names are mapped in an implementation-defined manner to headers or to external source file names as specified in 16.2.

2 If either of the characters ' or \, or either of the character sequences /* or // appears in a q-char-sequence or a h-char-sequence, or the character " appears in a h-char-sequence, the behavior is undefined.19)

__________________
19) Thus, sequences of characters that resemble escape sequences cause undefined behavior.

2.9 Preprocessing numbers [lex.ppnumber]

    pp-number:
        digit
        . digit
        pp-number digit
        pp-number nondigit
        pp-number e sign
        pp-number E sign
        pp-number .

1 Preprocessing number tokens lexically include all integral literal tokens (2.13.1) and all floating literal tokens (2.13.3).

2 A preprocessing number does not have a type or a value; it acquires both after a successful conversion (as part of translation phase 7, 2.1) to an integral literal token or a floating literal token.

2.10 Identifiers [lex.name]

    identifier:
        nondigit
        identifier nondigit
        identifier digit