[PATCH] Whitespace issues during preprocessing

Mon Jan 27 22:24:44 PST 2014

Hi Harald,

Harald van Dijk <harald at gigawatt.nl> writes:
> Quite a while ago, I noticed that clang had some preprocessor issues
> with whitespace when macro arguments were empty, and reported that at
> <http://llvm.org/bugs/show_bug.cgi?id=15675>. I've now taken a closer
> look, and found a few more problems. Attached are three patches that
> should address these problems, and one patch that has no functional
> effect but makes the code a bit easier to follow. They include new tests
> and tweak one existing test. All tests pass, and to make sure I am not
> misinterpreting the standard, I've verified that all new and changed
> tests also pass with GCC.
>
> Any thoughts or comments? Can these patches be committed?

This looks pretty reasonable. I have a few minor comments inline.

> Fix whitespace handling in ## operator
>
> In x ## y, where x and y are regular tokens, whitespace between x and ##
> is ignored, and whitespace between ## and y is also ignored. When either
> x or y is a function argument, whitespace was preserved, but it should
> not be. This patch removes the checks for whitespace before ## and
> before y, and in the special case where x is an empty macro argument and
> y is a regular token, actively removes whitespace before y.
>
> One existing test is affected by that change, but as clang's output now
> matches the standard's requirements and that of GCC, I've tweaked the
> testcase.
> ---
>  lib/Lex/TokenLexer.cpp                   | 27 +++++++++++++++------------
>  test/Preprocessor/macro_paste_commaext.c |  4 ++--
>  test/Preprocessor/macro_paste_spacing.c  | 16 +++++++++++++++-
>  3 files changed, 32 insertions(+), 15 deletions(-)
>
> diff --git a/lib/Lex/TokenLexer.cpp b/lib/Lex/TokenLexer.cpp
> index 829c1bd..485db6c 100644
> --- a/lib/Lex/TokenLexer.cpp
> +++ b/lib/Lex/TokenLexer.cpp
> @@ -223,6 +223,13 @@ void TokenLexer::ExpandFunctionArguments() {
>        continue;
>      }
>  
> +    // Find out if there is a paste (##) operator before or after the token.
> +    bool NonEmptyPasteBefore =
> +      !ResultToks.empty() && ResultToks.back().is(tok::hashhash);
> +    bool PasteBefore = i != 0 && Tokens[i-1].is(tok::hashhash);
> +    bool PasteAfter = i+1 != e && Tokens[i+1].is(tok::hashhash);
> +    assert(!NonEmptyPasteBefore || PasteBefore);
> +
>      // Otherwise, if this is not an argument token, just add the token to the
>      // output buffer.
>      IdentifierInfo *II = CurTok.getIdentifierInfo();
> @@ -234,6 +241,8 @@ void TokenLexer::ExpandFunctionArguments() {
>        if (NextTokGetsSpace) {
>          ResultToks.back().setFlag(Token::LeadingSpace);
>          NextTokGetsSpace = false;
> +      } else if (PasteBefore && !NonEmptyPasteBefore) {
> +        ResultToks.back().clearFlag(Token::LeadingSpace);
>        }

The braces aren't necessary for the else if block here.

>        continue;
>      }
> @@ -242,13 +251,7 @@ void TokenLexer::ExpandFunctionArguments() {
>      // input.
>      MadeChange = true;
>  
> -    // Otherwise, this is a use of the argument.  Find out if there is a paste
> -    // (##) operator before or after the argument.
> -    bool NonEmptyPasteBefore =
> -      !ResultToks.empty() && ResultToks.back().is(tok::hashhash);
> -    bool PasteBefore = i != 0 && Tokens[i-1].is(tok::hashhash);
> -    bool PasteAfter = i+1 != e && Tokens[i+1].is(tok::hashhash);
> -    assert(!NonEmptyPasteBefore || PasteBefore);
> +    // Otherwise, this is a use of the argument.
>  
>      // In Microsoft mode, remove the comma before __VA_ARGS__ to ensure there
>      // are no trailing commas if __VA_ARGS__ is empty.
> @@ -358,8 +361,8 @@ void TokenLexer::ExpandFunctionArguments() {
>        // assembler-with-cpp mode, invalid pastes are allowed through: in this
>        // case, we do not want the extra whitespace to be added.  For example,
>        // we want ". ## foo" -> ".foo" not ". foo".
> -      if ((CurTok.hasLeadingSpace() || NextTokGetsSpace) &&
> -          !NonEmptyPasteBefore)
> +      if ((CurTok.hasLeadingSpace() && !PasteBefore) ||
> +          (NextTokGetsSpace && !NonEmptyPasteBefore))
>          ResultToks[ResultToks.size()-NumToks].setFlag(Token::LeadingSpace);

Does the comment need to be updated here?

>        NextTokGetsSpace = false;
> @@ -370,11 +373,11 @@ void TokenLexer::ExpandFunctionArguments() {
>      // 6.10.3.3p2,3) calls for a bunch of placemarker stuff to occur.  We
>      // implement this by eating ## operators when a LHS or RHS expands to
>      // empty.
> -    NextTokGetsSpace |= CurTok.hasLeadingSpace();
> +    if (!PasteBefore)
> +      NextTokGetsSpace |= i != 0 && CurTok.hasLeadingSpace();
>      if (PasteAfter) {
>        // Discard the argument token and skip (don't copy to the expansion
>        // buffer) the paste operator after it.
> -      NextTokGetsSpace |= Tokens[i+1].hasLeadingSpace();
>        ++i;
>        continue;
>      }
> @@ -385,7 +388,7 @@ void TokenLexer::ExpandFunctionArguments() {
>      assert(PasteBefore);
>      if (NonEmptyPasteBefore) {
>        assert(ResultToks.back().is(tok::hashhash));
> -      NextTokGetsSpace |= ResultToks.pop_back_val().hasLeadingSpace();
> +      ResultToks.pop_back();
>      }
>  
>      // If this is the __VA_ARGS__ token, and if the argument wasn't provided,
> diff --git a/test/Preprocessor/macro_paste_commaext.c b/test/Preprocessor/macro_paste_commaext.c
> index 7cfe43d..fdb8f98 100644
> --- a/test/Preprocessor/macro_paste_commaext.c
> +++ b/test/Preprocessor/macro_paste_commaext.c
> @@ -1,8 +1,8 @@
>  // RUN: %clang_cc1 %s -E | grep 'V);'
>  // RUN: %clang_cc1 %s -E | grep 'W, 1, 2);'
>  // RUN: %clang_cc1 %s -E | grep 'X, 1, 2);'
> -// RUN: %clang_cc1 %s -E | grep 'Y, );'
> -// RUN: %clang_cc1 %s -E | grep 'Z, );'
> +// RUN: %clang_cc1 %s -E | grep 'Y,);'
> +// RUN: %clang_cc1 %s -E | grep 'Z,);'
>  
>  #define debug(format, ...) format, ## __VA_ARGS__)
>  debug(V);
> diff --git a/test/Preprocessor/macro_paste_spacing.c b/test/Preprocessor/macro_paste_spacing.c
> index 6498ffc..4ce8e9d 100644
> --- a/test/Preprocessor/macro_paste_spacing.c
> +++ b/test/Preprocessor/macro_paste_spacing.c
> @@ -1,7 +1,21 @@
> -// RUN: %clang_cc1 %s -E | grep "^xy$"
> +// RUN: %clang_cc1 -E %s | FileCheck --strict-whitespace %s
>  
>  #define A  x ## y
>  blah
>  
>  A
> +// CHECK: {{^}}xy{{$}}
>  
> +#define B(x, y) [v ## w] [ v##w] [w ## x] [ w##x] [x ## y] [ x##y] [y ## z] [ y##z]
> +B(x,y)
> +// CHECK: [vw] [ vw] [wx] [ wx] [xy] [ xy] [yz] [ yz]
> +B(x,)
> +// CHECK: [vw] [ vw] [wx] [ wx] [x] [ x] [z] [ z]
> +B(,y)
> +// CHECK: [vw] [ vw] [w] [ w] [y] [ y] [yz] [ yz]
> +B(,)
> +// CHECK: [vw] [ vw] [w] [ w] [] [ ] [z] [ z]

These all deal with leading space, but what about trailing space? Should
there be a [v##w ] test?

> +
> +#define C(x, y, z) [x ## y ## z]
> +C(,,) C(a,,) C(,b,) C(,,c) C(a,b,) C(a,,c) C(,b,c) C(a,b,c)
> +// CHECK: [] [a] [b] [c] [ab] [ac] [bc] [abc]
> -- 
> 1.8.5.2
>

> Fix whitespace handling in empty macro arguments
>
> When a function-like macro definition ends with one of the macro's
> parameters, and the argument is empty, any whitespace before the
> parameter name in the macro definition needs to be preserved. Promoting
> the existing NextTokGetsSpace to a preserved bit-field and checking it
> at the end of the macro expansion allows it to be moved to the first
> token following the macro expansion result.
> ---
>  include/clang/Lex/TokenLexer.h      |  8 ++++++++
>  lib/Lex/TokenLexer.cpp              | 38 +++++++++++++++++++------------------
>  test/Preprocessor/macro_arg_empty.c |  7 +++++++
>  3 files changed, 35 insertions(+), 18 deletions(-)
>  create mode 100644 test/Preprocessor/macro_arg_empty.c
>
> diff --git a/include/clang/Lex/TokenLexer.h b/include/clang/Lex/TokenLexer.h
> index 7c8cfd0..247c07b 100644
> --- a/include/clang/Lex/TokenLexer.h
> +++ b/include/clang/Lex/TokenLexer.h
> @@ -81,6 +81,14 @@ class TokenLexer {
>    bool AtStartOfLine : 1;
>    bool HasLeadingSpace : 1;
>  
> +  // NextTokGetsSpace - When this is true, the next token appended to the
> +  // output list during function argument expansion will get a leading space,
> +  // regardless of whether it had one to begin with or not. This is used for
> +  // placemarker support. If still true after function argument expansion, the
> +  // leading space will be applied to the first token following the macro
> +  // expansion.
> +  bool NextTokGetsSpace : 1;
> +
>    /// OwnsTokens - This is true if this TokenLexer allocated the Tokens
>    /// array, and thus needs to free it when destroyed.  For simple object-like
>    /// macros (for example) we just point into the token buffer of the macro
> diff --git a/lib/Lex/TokenLexer.cpp b/lib/Lex/TokenLexer.cpp
> index 485db6c..921434b 100644
> --- a/lib/Lex/TokenLexer.cpp
> +++ b/lib/Lex/TokenLexer.cpp
> @@ -37,6 +37,7 @@ void TokenLexer::Init(Token &Tok, SourceLocation ELEnd, MacroInfo *MI,
>    ExpandLocEnd = ELEnd;
>    AtStartOfLine = Tok.isAtStartOfLine();
>    HasLeadingSpace = Tok.hasLeadingSpace();
> +  NextTokGetsSpace = false;
>    Tokens = &*Macro->tokens_begin();
>    OwnsTokens = false;
>    DisableMacroExpansion = false;
> @@ -95,6 +96,7 @@ void TokenLexer::Init(const Token *TokArray, unsigned NumToks,
>    ExpandLocStart = ExpandLocEnd = SourceLocation();
>    AtStartOfLine = false;
>    HasLeadingSpace = false;
> +  NextTokGetsSpace = false;
>    MacroExpansionStart = SourceLocation();
>  
>    // Set HasLeadingSpace/AtStartOfLine so that the first token will be
> @@ -122,7 +124,6 @@ void TokenLexer::destroy() {
>  /// Remove comma ahead of __VA_ARGS__, if present, according to compiler dialect
>  /// settings.  Returns true if the comma is removed.
>  static bool MaybeRemoveCommaBeforeVaArgs(SmallVectorImpl<Token> &ResultToks,
> -                                         bool &NextTokGetsSpace,
>                                           bool HasPasteOperator,
>                                           MacroInfo *Macro, unsigned MacroArgNo,
>                                           Preprocessor &PP) {
> @@ -163,8 +164,6 @@ static bool MaybeRemoveCommaBeforeVaArgs(SmallVectorImpl<Token> &ResultToks,
>    if (!ResultToks.empty() && ResultToks.back().is(tok::hashhash))
>      ResultToks.pop_back();
>  
> -  // Never add a space, even if the comma, ##, or arg had a space.
> -  NextTokGetsSpace = false;
>    return true;
>  }

Should MaybeRemoveCommaBeforeVaArgs be made a private method of
TokenLexer rather tahn static? This way you wouldn't need to change this
assignment and duplicate the logic for all of the callers.

> @@ -179,11 +178,6 @@ void TokenLexer::ExpandFunctionArguments() {
>    // we install the newly expanded sequence as the new 'Tokens' list.
>    bool MadeChange = false;
>  
> -  // NextTokGetsSpace - When this is true, the next token appended to the
> -  // output list will get a leading space, regardless of whether it had one to
> -  // begin with or not.  This is used for placemarker support.
> -  bool NextTokGetsSpace = false;
> -
>    for (unsigned i = 0, e = NumTokens; i != e; ++i) {
>      // If we found the stringify operator, get the argument stringified.  The
>      // preprocessor already verified that the following token is a macro name
> @@ -256,10 +250,13 @@ void TokenLexer::ExpandFunctionArguments() {
>      // In Microsoft mode, remove the comma before __VA_ARGS__ to ensure there
>      // are no trailing commas if __VA_ARGS__ is empty.
>      if (!PasteBefore && ActualArgs->isVarargsElidedUse() &&
> -        MaybeRemoveCommaBeforeVaArgs(ResultToks, NextTokGetsSpace,
> +        MaybeRemoveCommaBeforeVaArgs(ResultToks,
>                                       /*HasPasteOperator=*/false,
> -                                     Macro, ArgNo, PP))
> +                                     Macro, ArgNo, PP)) {
> +      // Never add a space, even if the comma or arg had a space.
> +      NextTokGetsSpace = false;
>        continue;
> +    }

If you agree with the above, this change isn't needed (other than the
argument update).

>      // If it is not the LHS/RHS of a ## operator, we must pre-expand the
>      // argument and substitute the expanded tokens into the result.  This is
> @@ -311,9 +308,10 @@ void TokenLexer::ExpandFunctionArguments() {
>                                               NextTokGetsSpace);
>          NextTokGetsSpace = false;
>        } else {
> -        // If this is an empty argument, and if there was whitespace before the
> -        // formal token, make sure the next token gets whitespace before it.
> -        NextTokGetsSpace = CurTok.hasLeadingSpace();
> +        // If this is an empty argument, if there was whitespace before the
> +        // formal token, and this is not the first token in the macro
> +        // definition, make sure the next token gets whitespace before it.
> +        NextTokGetsSpace |= i != 0 && CurTok.hasLeadingSpace();
>        }
>        continue;
>      }
> @@ -395,10 +393,14 @@ void TokenLexer::ExpandFunctionArguments() {
>      // and if the macro had at least one real argument, and if the token before
>      // the ## was a comma, remove the comma.  This is a GCC extension which is
>      // disabled when using -std=c99.
> -    if (ActualArgs->isVarargsElidedUse())
> -      MaybeRemoveCommaBeforeVaArgs(ResultToks, NextTokGetsSpace,
> -                                   /*HasPasteOperator=*/true,
> -                                   Macro, ArgNo, PP);
> +    if (ActualArgs->isVarargsElidedUse() &&
> +        MaybeRemoveCommaBeforeVaArgs(ResultToks,
> +                                     /*HasPasteOperator=*/true,
> +                                     Macro, ArgNo, PP))
> +    {
> +      // Never add a space, even if the comma, ##, or arg had a space.
> +      NextTokGetsSpace = false;
> +    }

Same thing here. If this does stay, the braces should be removed.

>      continue;
>    }
> @@ -428,7 +430,7 @@ bool TokenLexer::Lex(Token &Tok) {
>  
>      Tok.startToken();
>      Tok.setFlagValue(Token::StartOfLine , AtStartOfLine);
> -    Tok.setFlagValue(Token::LeadingSpace, HasLeadingSpace);
> +    Tok.setFlagValue(Token::LeadingSpace, HasLeadingSpace || NextTokGetsSpace);
>      if (CurToken == 0)
>        Tok.setFlag(Token::LeadingEmptyMacro);
>      return PP.HandleEndOfTokenLexer(Tok);
> diff --git a/test/Preprocessor/macro_arg_empty.c b/test/Preprocessor/macro_arg_empty.c
> new file mode 100644
> index 0000000..09b8404
> --- /dev/null
> +++ b/test/Preprocessor/macro_arg_empty.c
> @@ -0,0 +1,7 @@
> +// RUN: %clang_cc1 -E %s | FileCheck --strict-whitespace %s
> +
> +#define FOO(x) x
> +#define BAR(x) x x
> +#define BAZ(x) [x] [ x]
> +[FOO()] [ FOO()] [BAR()] BAZ()
> +// CHECK: [] [ ] [ ] [] [ ]

The question about trailing whitespace test cases applies here too.

> -- 
> 1.8.5.2
>
>
> Fix whitespace handling in empty macro expansions
>
> When a macro expansion does not result in any tokens, and the macro name
> is preceded by whitespace, the whitespace should be passed to the first
> token that follows the macro expansion. Similarly when a macro expansion
> ends with a placemarker token, and that placemarker token is preceded by
> whitespace. This worked already for top-level macro expansions, but is
> now extended to also work for nested macro expansions.
> ---
>  lib/Lex/TokenLexer.cpp          |  8 ++++++--
>  test/Preprocessor/macro_space.c | 36 +++++++++++++++++++++++++++++++++---
>  2 files changed, 39 insertions(+), 5 deletions(-)
>
> diff --git a/lib/Lex/TokenLexer.cpp b/lib/Lex/TokenLexer.cpp
> index 921434b..1769f8e 100644
> --- a/lib/Lex/TokenLexer.cpp
> +++ b/lib/Lex/TokenLexer.cpp
> @@ -484,9 +484,13 @@ bool TokenLexer::Lex(Token &Tok) {
>    if (isFirstToken) {
>      Tok.setFlagValue(Token::StartOfLine , AtStartOfLine);
>      Tok.setFlagValue(Token::LeadingSpace, HasLeadingSpace);
> -    AtStartOfLine = false;
> -    HasLeadingSpace = false;
> +  } else {
> +    // If this is not the first token, we may still need to pass through
> +    // leading whitespace if we've expanded a macro.
> +    if (HasLeadingSpace) Tok.setFlag(Token::LeadingSpace);

This would be clearer as an else if, rather than an if inside an else.

>    }
> +  AtStartOfLine = false;
> +  HasLeadingSpace = false;

Why did you move these assignments out of the if? Changing the behaviour
of AtStartOfLine certainly seems unrelated to the rest of your changes,
which makes me think this might be a mistake.

>  
>    // Handle recursive expansion!
>    if (!Tok.isAnnotation() && Tok.getIdentifierInfo() != 0) {
> diff --git a/test/Preprocessor/macro_space.c b/test/Preprocessor/macro_space.c
> index 8a47a3b..13e531f 100644
> --- a/test/Preprocessor/macro_space.c
> +++ b/test/Preprocessor/macro_space.c
> @@ -1,6 +1,36 @@
>  // RUN: %clang_cc1 -E %s | FileCheck --strict-whitespace %s
>  
> -#define XX
> -! XX,
> +#define FOO1()
> +#define FOO2(x)x
> +#define FOO3(x) x
> +#define FOO4(x)x x
> +#define FOO5(x) x x
> +#define FOO6(x) [x]
> +#define FOO7(x) [ x]
> +#define FOO8(x) [x ]
>  
> -// CHECK: {{^}}! ,{{$}}
> +#define TEST(FOO,x) FOO <FOO()> < FOO()> <FOO ()> <FOO( )> <FOO() > <FOO()x> <FOO() x> < FOO()x>
> +
> +TEST(FOO1,)
> +// CHECK: FOO1 <> < > <> <> < > <> < > < >
> +
> +TEST(FOO2,)
> +// CHECK: FOO2 <> < > <> <> < > <> < > < >
> +
> +TEST(FOO3,)
> +// CHECK: FOO3 <> < > <> <> < > <> < > < >
> +
> +TEST(FOO4,)
> +// CHECK: FOO4 < > < > < > < > < > < > < > < >
> +
> +TEST(FOO5,)
> +// CHECK: FOO5 < > < > < > < > < > < > < > < >
> +
> +TEST(FOO6,)
> +// CHECK: FOO6 <[]> < []> <[]> <[]> <[] > <[]> <[] > < []>
> +
> +TEST(FOO7,)
> +// CHECK: FOO7 <[ ]> < [ ]> <[ ]> <[ ]> <[ ] > <[ ]> <[ ] > < [ ]>
> +
> +TEST(FOO8,)
> +// CHECK: FOO8 <[ ]> < [ ]> <[ ]> <[ ]> <[ ] > <[ ]> <[ ] > < [ ]>
> -- 
> 1.8.5.2
>

> Clean up whitespace checks
>
> In TokenLexer::ExpandFunctionArguments(), CurTok.hasLeadingSpace() is
> checked in multiple locations, each time subtly differently. Checking it
> early, when the token is seen, and using NextTokGetsSpace exclusively
> after that makes the code simpler.
>
> No change in behaviour is intended.

I like this change! I was going to complain about all of the i != 0
checks in the earlier changes before I saw this :)

> ---
>  lib/Lex/TokenLexer.cpp | 16 +++++-----------
>  1 file changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/lib/Lex/TokenLexer.cpp b/lib/Lex/TokenLexer.cpp
> index 1769f8e..23bc2da 100644
> --- a/lib/Lex/TokenLexer.cpp
> +++ b/lib/Lex/TokenLexer.cpp
> @@ -183,6 +183,9 @@ void TokenLexer::ExpandFunctionArguments() {
>      // preprocessor already verified that the following token is a macro name
>      // when the #define was parsed.
>      const Token &CurTok = Tokens[i];
> +    if (i != 0 && !Tokens[i-1].is(tok::hashhash) && CurTok.hasLeadingSpace())
> +      NextTokGetsSpace = true;
> +
>      if (CurTok.is(tok::hash) || CurTok.is(tok::hashat)) {
>        int ArgNo = Macro->getArgumentNum(Tokens[i+1].getIdentifierInfo());
>        assert(ArgNo != -1 && "Token following # is not an argument?");
> @@ -207,7 +210,7 @@ void TokenLexer::ExpandFunctionArguments() {
>  
>        // The stringified/charified string leading space flag gets set to match
>        // the #/#@ operator.
> -      if (CurTok.hasLeadingSpace() || NextTokGetsSpace)
> +      if (NextTokGetsSpace)
>          Res.setFlag(Token::LeadingSpace);
>  
>        ResultToks.push_back(Res);
> @@ -304,14 +307,8 @@ void TokenLexer::ExpandFunctionArguments() {
>          // before the first token should match the whitespace of the arg
>          // identifier.
>          ResultToks[FirstResult].setFlagValue(Token::LeadingSpace,
> -                                             CurTok.hasLeadingSpace() ||
>                                               NextTokGetsSpace);
>          NextTokGetsSpace = false;
> -      } else {
> -        // If this is an empty argument, if there was whitespace before the
> -        // formal token, and this is not the first token in the macro
> -        // definition, make sure the next token gets whitespace before it.
> -        NextTokGetsSpace |= i != 0 && CurTok.hasLeadingSpace();
>        }
>        continue;
>      }
> @@ -359,8 +356,7 @@ void TokenLexer::ExpandFunctionArguments() {
>        // assembler-with-cpp mode, invalid pastes are allowed through: in this
>        // case, we do not want the extra whitespace to be added.  For example,
>        // we want ". ## foo" -> ".foo" not ". foo".
> -      if ((CurTok.hasLeadingSpace() && !PasteBefore) ||
> -          (NextTokGetsSpace && !NonEmptyPasteBefore))
> +      if (NextTokGetsSpace)
>          ResultToks[ResultToks.size()-NumToks].setFlag(Token::LeadingSpace);
>  
>        NextTokGetsSpace = false;
> @@ -371,8 +367,6 @@ void TokenLexer::ExpandFunctionArguments() {
>      // 6.10.3.3p2,3) calls for a bunch of placemarker stuff to occur.  We
>      // implement this by eating ## operators when a LHS or RHS expands to
>      // empty.
> -    if (!PasteBefore)
> -      NextTokGetsSpace |= i != 0 && CurTok.hasLeadingSpace();
>      if (PasteAfter) {
>        // Discard the argument token and skip (don't copy to the expansion
>        // buffer) the paste operator after it.

-- Justin Bogner