[PATCH] Whitespace issues during preprocessing

Wed Feb 5 15:48:39 PST 2014

On 06/02/14 00:13, Richard Smith wrote:
> On Wed, Feb 5, 2014 at 2:36 PM, Harald van Dijk <harald at gigawatt.nl
> <mailto:harald at gigawatt.nl>> wrote:
> 
>     On 04/02/14 22:24, Richard Smith wrote:
>     > On Tue, Feb 4, 2014 at 1:13 PM, Harald van Dijk
>     <harald at gigawatt.nl <mailto:harald at gigawatt.nl>
>     > <mailto:harald at gigawatt.nl <mailto:harald at gigawatt.nl>>> wrote:
>     >
>     >     On 04/02/14 20:25, Justin Bogner wrote:
>     >     > Harald van Dijk <harald at gigawatt.nl
>     <mailto:harald at gigawatt.nl> <mailto:harald at gigawatt.nl
>     <mailto:harald at gigawatt.nl>>>
>     >     writes:
>     >     >> Attached are my updated patches intended to fix various
>     whitespace
>     >     >> issues in clang, modified to take Justin Bogner's comments into
>     >     account.
>     >     >> They are intended to ensure that whitespace is not
>     inappropriately
>     >     >> removed just because a macro or macro argument expansion is
>     >     empty, and
>     >     >> does get removed during token pasting.
>     >     >
>     >     > I've committed the first four patches for you: r200785 through
>     >     r200788.
>     >
>     >     Thanks!
>     >
>     >     >> I have moved the handling of invalid token pastes from
>     >     >> TokenLexer::ExpandFunctionArguments to TokenLexer::Lex, so that
>     >     it works
>     >     >> for both object- and function-like macros, and both when ##'s
>     >     operands
>     >     >> use macro parameters and when they don't.
>     >     >
>     >     > Given that the tests needed to be changed and the behaviour
>     clearly
>     >     > hasn't followed the comment in a while, I'm not entirely
>     convinced
>     >     this
>     >     > is the right thing to do. Could the comment simply be wrong? Are
>     >     people
>     >     > relying on one behaviour or the other for invalid token
>     pastes in
>     >     > assembler-with-cpp?
>     >     >
>     >     > Basically, is there a way to objectively say one of these
>     >     behaviours is
>     >     > better than the other?
>     >
>     >     Having looked closer, the current behaviour is inconsistent in
>     a way
>     >     that cannot really be explained to someone not familiar with a few
>     >     implementation details. Given
>     >
>     >     #define foo(x) (. ## x . ## y)
>     >
>     >     foo(y) actually expands to (.y . y) in assembler-with-cpp
>     mode, with or
>     >     without my first four patches. That first result is what the
>     comment
>     >     refers to; could the fact that the second result is different
>     be an
>     >     oversight? There does not seem to be a reason for this
>     difference, at
>     >     least. Surprisingly though, this is exactly what GCC does too.
>     >
>     >     My last patch would have turned this into (.y .y), removing
>     the second
>     >     space. That makes sense to me. (. y . y) could also be a perfectly
>     >     sensible result.
>     >
>     >
>     > I think (.y .y) is probably the best answer here (consistently remove
>     > whitespace on both sides of ## wherever it appears) -- this also
>     > probably better matches what MSVC does (where token-paste can
>     result in
>     > multiple tokens, and seems to act as if the tokens were merely abutted
>     > with no adjacent whitespace). This would also behave more consistently
>     > when processing languages with a different lexical structure from C --
>     > in a language where an identifier can start with a period, (.y .y)
>     seems
>     > to unambiguously be the right answer.
> 
>     That is a good point about MSVC. There is actually one test case
>     affected by this change, using -fms-extensions, where the behaviour did
>     not match that of MSVC, and does now, at least for the compiler that
>     comes with Visual Studio 2013.
> 
>     >     I suspect, but do not actually know, that nobody really uses .
>     ## foo
>     >     unless foo is a macro argument, because when foo is not a macro
>     >     argument, there is no point in using ## in the first place. It
>     would
>     >     explain why this went unnoticed for so long. And I can imagine
>     a few
>     >     cases where this would be useful: assembler directives that
>     have subtly
>     >     different syntax, depending on the assembler used. x y would be
>     >     insufficient if x is ., if y is size, and if .size is a
>     directive, but .
>     >     size is a syntax error. ## would work here.
>     >
>     >     Even if clang's behaviour should change, though, my patch does
>     not have
>     >     adequate testing for these special cases, so shouldn't be
>     applied as is.
>     >     Should I work on better tests, or do you think it is more
>     likely that
>     >     the behaviour should remain unchanged and get decent testing?
>     >
>     >
>     > There's a risk of breaking someone's assumption by making a change
>     here
>     > (especially since we'd be introducing a difference from GCC) but I
>     think
>     > the new behavior is much more defensible, and there's probably no
>     way to
>     > find out without trying it.
> 
>     All right. I have now taken a closer look at the test cases where the
>     behaviour changes, and noticed that for blargh ## $foo -> blargh $foo it
>     would not be right to simply change the test case to blargh$foo, as that
>     misses the point of the test.
> 
> 
> Right, I see, it's just making sure that -fdollars-in-identifiers allows
> the paste to work. I like your fix to the test here.

Thanks :)

>     There was also already a test for . ## foo, but it only tested the case
>     where foo is a macro argument. I have extended that test to also check
>     what happens when foo is not a macro argument.
> 
> 
> Can you also add a test for the case where the . comes from a macro
> argument? May as well be thorough =)

In both tests, the . comes from a macro argument, so I think you mean
add cases where it doesn't come from a macro argument? Sure.

>     How does the attached patch look? I've re-tested it on sources of today,
>     but on top of my patch in the "r200787 - Fix whitespace handling in
>     empty macro expansions" thread (after your okay for that).
> 
> 
> Checking whether the previous token was a ## worries me a little. What
> about cases like this:
> 
> #define foo(x, y) x####y
> foo(, z)
>
> This expands to "## z" under our current left-to-right pasting strategy,
> but I think your patch drops the space.

In that test, I think dropping the space is correct. Leading and
trailing white space in a macro argument is insignificant, which is why

#define ID(x) x
#define FOO(x,y) ID(x)ID(y)
FOO(  [  ,  ]  )

should (and does) expand to

[]

Since there is no space between the second ## and the y, there should be
no space in the output. That said, I do see your point. If I change your
test to

#define foo(x) x#### y
foo()

then (still assuming left-to-right, otherwise the test won't work at
all) the output should be

## y

and I am now getting

##y

> Amusingly, "##z" appears to be
> the right answer under a right-to-left pasting strategy, so maybe that's
> OK? Both GCC and EDG expand this to just "z"; the standard is not
> spectacularly clear here, but I think we're right, because a ## produced
> by pasting a placemarker token onto a ## is not a token in the
> replacement list (that token is already gone).
> 
> If you want to say that we don't care about this case, I could live with
> that =)

The preprocessor is supposed to change ## from tok::hashhash to
tok::unknown in those cases during macro expansion where it is not a
paste, to avoid this and similar problems. For example, this altered
test does behave as desired:

#define hash #
#define concat_(x, y) x ## y
#define concat(x, y) concat_(x, y)
#define bar() concat(hash, hash) y
bar()

The preprocessor output with my patch is ## y, even though the y follows ##.

This test also works:

#define foo(x) x y
foo(##)

I think the reason it does not happen for x#### y is because when the
LHS of a token paste is an empty macro argument, the RHS is copied
without actually performing any paste, and in that case, the change to
tok::unknown is skipped. It's tempting to say that we don't care, but I
think this affects more than a single space. I will look into this.

Cheers,
Harald van Dijk