Commit 9b755df

[lex] Better specify whitespace characters
This commit defines a grammar term, _whitespace-character_, and uses it consistently wherever the plain-text term "whitespace character" was used. A whitespace character is defined as one of the five characters listed in the text that comes closest to providing a definition. Unicode character names are (mostly) consistently used to name these characters and, for consistency, similar changes were made throughout [lex] to name Unicode characters rather than render the specified characters in code font. The one exception is backslash, which is retained as-is to avoid creating more issues for P2348. Note that this commit is not a replacement for P2348; it is merely a clearer statement of the existing specification, without any normative changes.
1 parent bf43925 commit 9b755df
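
As an informal illustration of the five-member _whitespace-character_ set that the commit message describes, the following is a minimal sketch and not part of the commit. The function name `is_whitespace_character` is hypothetical, and the sketch assumes that new-line is represented by U+000A, which the diff itself does not state.

```cpp
// Hypothetical helper, not part of the commit: returns true for the five
// members of the whitespace-character grammar term added in [lex.whitechar].
// Assumption: new-line is modeled as U+000A (line feed).
constexpr bool is_whitespace_character(char32_t c) {
    return c == 0x0009   // U+0009 character tabulation
        || c == 0x000A   // new-line (assumed to be U+000A)
        || c == 0x000B   // U+000B line tabulation
        || c == 0x000C   // U+000C form feed
        || c == 0x0020;  // U+0020 space
}

// Usage: the checks below are evaluated at compile time.
static_assert(is_whitespace_character(0x0020), "space is a whitespace-character");
static_assert(!is_whitespace_character(0x005C), "reverse solidus is not");
```

Other space-like code points, such as U+00A0 no-break space, are deliberately not matched, because the grammar term in the diff lists only these five characters.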

1 file changed: +46 -26 lines changed


source/lex.tex

+46-26
@@ -110,9 +110,9 @@
110110
\indextext{line splicing}%
111111
If the first translation character is \unicode{feff}{byte order mark},
112112
it is deleted.
113-
Each sequence of a backslash character (\textbackslash)
113+
Each sequence of a backslash character (\unicode{005c}{reverse solidus})
114114
immediately followed by
115-
zero or more whitespace characters other than new-line followed by
115+
zero or more \grammarterm{whitespace-character}s other than new-line followed by
116116
a new-line character is deleted, splicing
117117
physical source lines to form \defnx{logical source lines}{source line!logical}. Only the last
118118
backslash on any physical source line shall be eligible for being part
@@ -126,9 +126,13 @@
126126
shall be processed as if an additional new-line character were appended
127127
to the file.
128128

129-
\item The source file is decomposed into preprocessing
130-
tokens\iref{lex.pptoken} and sequences of whitespace characters
131-
(including comments). A source file shall not end in a partial
129+
\item
130+
\indextext{whitespace}%
131+
\indextext{comment}%
132+
\indextext{token!preprocessing}%
133+
The source file is decomposed into preprocessing
134+
tokens\iref{lex.pptoken} and whitespace\iref{lex.whitespace} (sequences of \grammarterm{whitespace-character}s
135+
and comments). A source file shall not end in a partial
132136
preprocessing token or in a partial comment.
133137
\begin{footnote}
134138
A partial preprocessing
@@ -140,9 +144,9 @@
140144
would arise from a source file ending with an unclosed \tcode{/*}
141145
comment.
142146
\end{footnote}
143-
Each comment\iref{lex.comment} is replaced by one space character. New-line characters are
144-
retained. Whether each nonempty sequence of whitespace characters other
145-
than new-line is retained or replaced by one space character is
147+
Each comment\iref{lex.comment} is replaced by one \unicode{0020}{space} character. New-line characters are
148+
retained. Whether each nonempty sequence of \grammarterm{whitespace-character}s other
149+
than new-line is retained or replaced by one \unicode{0020}{space} character is
146150
unspecified.
147151
As characters from the source file are consumed
148152
to form the next preprocessing token
@@ -178,7 +182,8 @@
178182
\item
179183
Adjacent \grammarterm{string-literal} tokens are concatenated\iref{lex.string}.
180184

181-
\item Whitespace characters separating tokens are no longer
185+
\item
186+
Any \grammarterm{whitespace-character}s separating tokens are no longer
182187
significant. Each preprocessing token is converted into a
183188
token\iref{lex.token}. The resulting tokens
184189
constitute a \defn{translation unit} and
@@ -467,7 +472,28 @@
467472
None of these names or aliases have leading or trailing spaces.
468473
\end{note}
469474

470-
\rSec1[lex.comment]{Comments}
475+
\rSec1[lex.whitespace]{Whitespace}
476+
\indextext{whitespace|(}%
477+
478+
\rSec2[lex.whitechar]{Whitespace Characters}
479+
480+
\indextext{character!whitespace|(}%
481+
\begin{bnf}
482+
\nontermdef{whitespace-character}\br
483+
\unicode{0009}{character tabulation}\br
484+
\textnormal{new-line}\br
485+
\unicode{000b}{line tabulation}\br
486+
\unicode{000c}{form feed}\br
487+
\unicode{0020}{space}\br
488+
\end{bnf}
489+
490+
\pnum
491+
\begin{note}
492+
Whitespace characters are used to separate elements of the \Cpp grammar.
493+
\end{note}
494+
\indextext{character!whitespace|)}
495+
496+
\rSec2[lex.comment]{Comments}
471497

472498
\pnum
473499
\indextext{comment|(}%
@@ -477,8 +503,8 @@
477503
characters \tcode{*/}. These comments do not nest.
478504
\indextext{comment!\tcode{//}}%
479505
The characters \tcode{//} start a comment, which terminates immediately before the
480-
next new-line character. If there is a form-feed or a vertical-tab
481-
character in such a comment, only whitespace characters shall appear
506+
next new-line character. If there is a \unicode{000c}{form feed} or a \unicode{000b}{line tabulation}
507+
character in such a comment, only \grammarterm{whitespace-character}s shall appear
482508
between it and the new-line that terminates the comment; no diagnostic
483509
is required.
484510
\begin{note}
@@ -489,6 +515,7 @@
489515
\tcode{/*} comment.
490516
\end{note}
491517
\indextext{comment|)}
518+
\indextext{whitespace|)}%
492519

493520
\rSec1[lex.pptoken]{Preprocessing tokens}
494521

@@ -506,7 +533,7 @@
506533
string-literal\br
507534
user-defined-string-literal\br
508535
preprocessing-op-or-punc\br
509-
\textnormal{each non-whitespace character that cannot be one of the above}
536+
\textnormal{each non-\grammarterm{whitespace-character} that cannot be one of the above}
510537
\end{bnf}
511538

512539
\pnum
@@ -520,22 +547,17 @@
520547
(\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
521548
identifiers, preprocessing numbers, character literals (including user-defined character
522549
literals), string literals (including user-defined string literals), preprocessing
523-
operators and punctuators, and single non-whitespace characters that do not lexically
550+
operators and punctuators, and single non-\grammarterm{whitespace-character}s that do not lexically
524551
match the other preprocessing token categories.
525552
If a \unicode{0027}{apostrophe} or a \unicode{0022}{quotation mark} character
526553
matches the last category, the program is ill-formed.
527554
If any character not in the basic character set matches the last category,
528555
the program is ill-formed.
529556
Preprocessing tokens can be separated by
530557
\indextext{whitespace}%
531-
whitespace;
558+
whitespace\iref{lex.whitespace};
532559
\indextext{comment}%
533-
this consists of comments\iref{lex.comment}, or whitespace characters
534-
(\unicode{0020}{space},
535-
\unicode{0009}{character tabulation},
536-
new-line,
537-
\unicode{000b}{line tabulation}, and
538-
\unicode{000c}{form feed}), or both.
560+
this consists of comments, \grammarterm{whitespace-character}s, or both.
539561
As described in \ref{cpp}, in certain
540562
circumstances during translation phase 4, whitespace (or the absence
541563
thereof) serves as more than preprocessing token separation. Whitespace
@@ -826,9 +848,7 @@
826848
\end{footnote}
827849
operators, and other separators.
828850
\indextext{whitespace}%
829-
Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
830-
(collectively, ``whitespace''), as described below, are ignored except
831-
as they serve to separate tokens.
851+
Whitespace\iref{lex.whitespace} is ignored except to separate tokens.
832852
\begin{note}
833853
Whitespace can separate otherwise adjacent identifiers, keywords, numeric
834854
literals, and alternative tokens containing alphabetic characters.
@@ -1790,8 +1810,8 @@
17901810
\begin{bnf}
17911811
\nontermdef{d-char}\br
17921812
\textnormal{any member of the basic character set except:}\br
1793-
\bnfindent\textnormal{\unicode{0020}{space}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis}, \unicode{005c}{reverse solidus},}\br
1794-
\bnfindent\textnormal{\unicode{0009}{character tabulation}, \unicode{000b}{line tabulation}, \unicode{000c}{form feed}, and new-line}
1813+
\bnfindent\textnormal{a \grammarterm{whitespace-character}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis},}\br
1814+
\bnfindent\textnormal{and \unicode{005c}{reverse solidus}}
17951815
\end{bnf}
17961816

17971817
\pnum
