[PATCH] Whitespace issues during preprocessing
Harald van Dijk
harald at gigawatt.nl
Thu Feb 6 15:02:26 PST 2014
On 06/02/14 00:48, Harald van Dijk wrote:
> On 06/02/14 00:13, Richard Smith wrote:
>> On Wed, Feb 5, 2014 at 2:36 PM, Harald van Dijk <harald at gigawatt.nl
>> <mailto:harald at gigawatt.nl>> wrote:
>>
>> On 04/02/14 22:24, Richard Smith wrote:
>> > On Tue, Feb 4, 2014 at 1:13 PM, Harald van Dijk
>> <harald at gigawatt.nl <mailto:harald at gigawatt.nl>
>> > <mailto:harald at gigawatt.nl <mailto:harald at gigawatt.nl>>> wrote:
>> >
>> > On 04/02/14 20:25, Justin Bogner wrote:
>> > > Harald van Dijk <harald at gigawatt.nl
>> <mailto:harald at gigawatt.nl> <mailto:harald at gigawatt.nl
>> <mailto:harald at gigawatt.nl>>>
>> > writes:
>> > >> Attached are my updated patches intended to fix various
>> whitespace
>> > >> issues in clang, modified to take Justin Bogner's comments into
>> > account.
>> > >> They are intended to ensure that whitespace is not
>> inappropriately
>> > >> removed just because a macro or macro argument expansion is
>> > empty, and
>> > >> does get removed during token pasting.
>> > >
>> > > I've committed the first four patches for you: r200785 through
>> > r200788.
>> >
>> > Thanks!
>> >
>> > >> I have moved the handling of invalid token pastes from
>> > >> TokenLexer::ExpandFunctionArguments to TokenLexer::Lex, so that
>> > it works
>> > >> for both object- and function-like macros, and both when ##'s
>> > operands
>> > >> use macro parameters and when they don't.
>> > >
>> > > Given that the tests needed to be changed and the behaviour
>> clearly
>> > > hasn't followed the comment in a while, I'm not entirely
>> convinced
>> > this
>> > > is the right thing to do. Could the comment simply be wrong? Are
>> > people
>> > > relying on one behaviour or the other for invalid token
>> pastes in
>> > > assembler-with-cpp?
>> > >
>> > > Basically, is there a way to objectively say one of these
>> > behaviours is
>> > > better than the other?
>> >
>> > Having looked closer, the current behaviour is inconsistent in
>> a way
>> > that cannot really be explained to someone not familiar with a few
>> > implementation details. Given
>> >
>> > #define foo(x) (. ## x . ## y)
>> >
>> > foo(y) actually expands to (.y . y) in assembler-with-cpp
>> mode, with or
>> > without my first four patches. That first result is what the
>> comment
>> > refers to; could the fact that the second result is different
>> be an
>> > oversight? There does not seem to be a reason for this
>> difference, at
>> > least. Surprisingly though, this is exactly what GCC does too.
>> >
>> > My last patch would have turned this into (.y .y), removing
>> the second
>> > space. That makes sense to me. (. y . y) could also be a perfectly
>> > sensible result.
>> >
>> >
>> > I think (.y .y) is probably the best answer here (consistently remove
>> > whitespace on both sides of ## wherever it appears) -- this also
>> > probably better matches what MSVC does (where token-paste can
>> result in
>> > multiple tokens, and seems to act as if the tokens were merely abutted
>> > with no adjacent whitespace). This would also behave more consistently
>> > when processing languages with a different lexical structure from C --
>> > in a language where an identifier can start with a period, (.y .y)
>> seems
>> > to unambiguously be the right answer.
>>
>> That is a good point about MSVC. There is actually one test case
>> affected by this change, using -fms-extensions, where the behaviour did
>> not match that of MSVC, and does now, at least for the compiler that
>> comes with Visual Studio 2013.
>>
>> > I suspect, but do not actually know, that nobody really uses .
>> ## foo
>> > unless foo is a macro argument, because when foo is not a macro
>> > argument, there is no point in using ## in the first place. It
>> would
>> > explain why this went unnoticed for so long. And I can imagine
>> a few
>> > cases where this would be useful: assembler directives that
>> have subtly
>> > different syntax, depending on the assembler used. x y would be
>> > insufficient if x is ., if y is size, and if .size is a
>> directive, but .
>> > size is a syntax error. ## would work here.
>> >
>> > Even if clang's behaviour should change, though, my patch does
>> not have
>> > adequate testing for these special cases, so shouldn't be
>> applied as is.
>> > Should I work on better tests, or do you think it is more
>> likely that
>> > the behaviour should remain unchanged and get decent testing?
>> >
>> >
>> > There's a risk of breaking someone's assumption by making a change
>> here
>> > (especially since we'd be introducing a difference from GCC) but I
>> think
>> > the new behavior is much more defensible, and there's probably no
>> way to
>> > find out without trying it.
>>
>> All right. I have now taken a closer look at the test cases where the
>> behaviour changes, and noticed that for blargh ## $foo -> blargh $foo it
>> would not be right to simply change the test case to blargh$foo, as that
>> misses the point of the test.
>>
>>
>> Right, I see, it's just making sure that -fdollars-in-identifiers allows
>> the paste to work. I like your fix to the test here.
>
> Thanks :)
>
>> There was also already a test for . ## foo, but it only tested the case
>> where foo is a macro argument. I have extended that test to also check
>> what happens when foo is not a macro argument.
>>
>>
>> Can you also add a test for the case where the . comes from a macro
>> argument? May as well be thorough =)
>
> In both tests, the . comes from a macro argument, so I think you mean
> add cases where it doesn't come from a macro argument? Sure.
>
>> How does the attached patch look? I've re-tested it on sources of today,
>> but on top of my patch in the "r200787 - Fix whitespace handling in
>> empty macro expansions" thread (after your okay for that).
>>
>>
>> Checking whether the previous token was a ## worries me a little. What
>> about cases like this:
>>
>> #define foo(x, y) x####y
>> foo(, z)
>>
>> This expands to "## z" under our current left-to-right pasting strategy,
>> but I think your patch drops the space.
>
> In that test, I think dropping the space is correct. Leading and
> trailing white space in a macro argument is insignificant, which is why
>
> #define ID(x) x
> #define FOO(x,y) ID(x)ID(y)
> FOO( [ , ] )
>
> should (and does) expand to
>
> []
>
> Since there is no space between the second ## and the y, there should be
> no space in the output. That said, I do see your point. If I change your
> test to
>
> #define foo(x) x#### y
> foo()
>
> then (still assuming left-to-right, otherwise the test won't work at
> all) the output should be
>
> ## y
>
> and I am now getting
>
> ##y
>
>> Amusingly, "##z" appears to be
>> the right answer under a right-to-left pasting strategy, so maybe that's
>> OK? Both GCC and EDG expand this to just "z"; the standard is not
>> spectacularly clear here, but I think we're right, because a ## produced
>> by pasting a placemarker token onto a ## is not a token in the
>> replacement list (that token is already gone).
>>
>> If you want to say that we don't care about this case, I could live with
>> that =)
>
> The preprocessor is supposed to change ## from tok::hashhash to
> tok::unknown in those cases during macro expansion where it is not a
> paste, to avoid this and similar problems. For example, this altered
> test does behave as desired:
>
> #define hash #
> #define concat_(x, y) x ## y
> #define concat(x, y) concat_(x, y)
> #define bar() concat(hash, hash) y
> bar()
>
> The preprocessor output with my patch is ## y, even though the y follows ##.
>
> This test also works:
>
> #define foo(x) x y
> foo(##)
Ensuring that any remaining ## token is always tok::unknown turned out
to be hugely impractical, in light of the fact that invalid token pastes
may also occur in object-like macro expansions, where the RHS can
equally be a ## token, and there is no ExpandFunctionArgs() or
equivalent where the tokens can be checked before the actual expansion.
Without an additional pass before expansion, checking whether a token is
the RHS of an invalid paste becomes too complicated to be worth the
effort, and remembering invalid pastes instead is easily done.
That said...
> I think the reason it does not happen for x#### y is because when the
> LHS of a token paste is an empty macro argument, the RHS is copied
> without actually performing any paste, and in that case, the change to
> tok::unknown is skipped. It's tempting to say that we don't care, but I
> think this affects more than a single space. I will look into this.
...it does affect more than a single space, and it exposes some
interesting corner cases.
#define foo(x,y) x ## ## y y
foo(,)
No matter which ## is interpreted first, the result should be ##. For
purposes of special handling of empty argument pastes, the first
appearance of y should not be treated as following a paste operator.
At the same time, though, for purposes of pre-expansion of macro
arguments, it should be: in
foo(,foo(,))
(where it now matters that the scan for ## searches from left to right)
the first y follows a ## token, so should not be pre-expanded. The
second y does not, so should be. Normally, this would not matter, except
when the later expansion would give different results, as it does here:
later expansion occurs in the context of the expansion of foo, so the
first nested foo is not expanded. The second is, because the argument
was pre-expanded, so the result should be
## foo(,) ##
I've only been able to get this result by tracking HashHashBefore and
PasteBefore independently, where HashHashBefore indicates whether the
token follows a ## token, and PasteBefore indicates whether that ##
token was interpreted as an operator.
Note that if clang actually used placemarker tokens, this would already
be the natural result without special effort, but I understand that that
may be impractical for other reasons.
If I then take an extra look at my attempt to get whitespace handled
consistently in invalid token pastes, I end up with a similar extra
variable to track whether an attempted token paste has taken place,
instead of checking tok::hashhash, just like you suggested.
However, another problem is that when slightly modifying the example, to
#define foo(x,y) w x ## ## y z
foo(,)
argument substitution should give w ## z, but the actual expansion
should then not see the ## as a token paste operation. So even though
TokenLexer::Lex should not check for tok::hashhash, the fact that ## is
not changed to tok::unknown here is still a problem.
I have almost-working patches that I hope to finish over the weekend.
Cheers,
Harald van Dijk
More information about the cfe-commits
mailing list