[PATCH] Whitespace issues during preprocessing

Sun Feb 9 01:07:32 PST 2014

On 07/02/14 00:02, Harald van Dijk wrote:
> On 06/02/14 00:48, Harald van Dijk wrote:
>> On 06/02/14 00:13, Richard Smith wrote:
>>> On Wed, Feb 5, 2014 at 2:36 PM, Harald van Dijk <harald at gigawatt.nl
>>> <mailto:harald at gigawatt.nl>> wrote:
>>>
>>>     On 04/02/14 22:24, Richard Smith wrote:
>>>     > On Tue, Feb 4, 2014 at 1:13 PM, Harald van Dijk
>>>     <harald at gigawatt.nl <mailto:harald at gigawatt.nl>
>>>     > <mailto:harald at gigawatt.nl <mailto:harald at gigawatt.nl>>> wrote:
>>>     >
>>>     >     On 04/02/14 20:25, Justin Bogner wrote:
>>>     >     > Harald van Dijk <harald at gigawatt.nl
>>>     <mailto:harald at gigawatt.nl> <mailto:harald at gigawatt.nl
>>>     <mailto:harald at gigawatt.nl>>>
>>>     >     writes:
>>>     >     >> Attached are my updated patches intended to fix various
>>>     whitespace
>>>     >     >> issues in clang, modified to take Justin Bogner's comments into
>>>     >     account.
>>>     >     >> They are intended to ensure that whitespace is not
>>>     inappropriately
>>>     >     >> removed just because a macro or macro argument expansion is
>>>     >     empty, and
>>>     >     >> does get removed during token pasting.
>>>     >     >
>>>     >     > I've committed the first four patches for you: r200785 through
>>>     >     r200788.
>>>     >
>>>     >     Thanks!
>>>     >
>>>     >     >> I have moved the handling of invalid token pastes from
>>>     >     >> TokenLexer::ExpandFunctionArguments to TokenLexer::Lex, so that
>>>     >     it works
>>>     >     >> for both object- and function-like macros, and both when ##'s
>>>     >     operands
>>>     >     >> use macro parameters and when they don't.
>>>     >     >
>>>     >     > Given that the tests needed to be changed and the behaviour
>>>     clearly
>>>     >     > hasn't followed the comment in a while, I'm not entirely
>>>     convinced
>>>     >     this
>>>     >     > is the right thing to do. Could the comment simply be wrong? Are
>>>     >     people
>>>     >     > relying on one behaviour or the other for invalid token
>>>     pastes in
>>>     >     > assembler-with-cpp?
>>>     >     >
>>>     >     > Basically, is there a way to objectively say one of these
>>>     >     behaviours is
>>>     >     > better than the other?
>>>     >
>>>     >     Having looked closer, the current behaviour is inconsistent in
>>>     a way
>>>     >     that cannot really be explained to someone not familiar with a few
>>>     >     implementation details. Given
>>>     >
>>>     >     #define foo(x) (. ## x . ## y)
>>>     >
>>>     >     foo(y) actually expands to (.y . y) in assembler-with-cpp
>>>     mode, with or
>>>     >     without my first four patches. That first result is what the
>>>     comment
>>>     >     refers to; could the fact that the second result is different
>>>     be an
>>>     >     oversight? There does not seem to be a reason for this
>>>     difference, at
>>>     >     least. Surprisingly though, this is exactly what GCC does too.
>>>     >
>>>     >     My last patch would have turned this into (.y .y), removing
>>>     the second
>>>     >     space. That makes sense to me. (. y . y) could also be a perfectly
>>>     >     sensible result.
>>>     >
>>>     >
>>>     > I think (.y .y) is probably the best answer here (consistently remove
>>>     > whitespace on both sides of ## wherever it appears) -- this also
>>>     > probably better matches what MSVC does (where token-paste can
>>>     result in
>>>     > multiple tokens, and seems to act as if the tokens were merely abutted
>>>     > with no adjacent whitespace). This would also behave more consistently
>>>     > when processing languages with a different lexical structure from C --
>>>     > in a language where an identifier can start with a period, (.y .y)
>>>     seems
>>>     > to unambiguously be the right answer.
>>>
>>>     That is a good point about MSVC. There is actually one test case
>>>     affected by this change, using -fms-extensions, where the behaviour did
>>>     not match that of MSVC, and does now, at least for the compiler that
>>>     comes with Visual Studio 2013.
>>>
>>>     >     I suspect, but do not actually know, that nobody really uses .
>>>     ## foo
>>>     >     unless foo is a macro argument, because when foo is not a macro
>>>     >     argument, there is no point in using ## in the first place. It
>>>     would
>>>     >     explain why this went unnoticed for so long. And I can imagine
>>>     a few
>>>     >     cases where this would be useful: assembler directives that
>>>     have subtly
>>>     >     different syntax, depending on the assembler used. x y would be
>>>     >     insufficient if x is ., if y is size, and if .size is a
>>>     directive, but .
>>>     >     size is a syntax error. ## would work here.
>>>     >
>>>     >     Even if clang's behaviour should change, though, my patch does
>>>     not have
>>>     >     adequate testing for these special cases, so shouldn't be
>>>     applied as is.
>>>     >     Should I work on better tests, or do you think it is more
>>>     likely that
>>>     >     the behaviour should remain unchanged and get decent testing?
>>>     >
>>>     >
>>>     > There's a risk of breaking someone's assumption by making a change
>>>     here
>>>     > (especially since we'd be introducing a difference from GCC) but I
>>>     think
>>>     > the new behavior is much more defensible, and there's probably no
>>>     way to
>>>     > find out without trying it.
>>>
>>>     All right. I have now taken a closer look at the test cases where the
>>>     behaviour changes, and noticed that for blargh ## $foo -> blargh $foo it
>>>     would not be right to simply change the test case to blargh$foo, as that
>>>     misses the point of the test.
>>>
>>>
>>> Right, I see, it's just making sure that -fdollars-in-identifiers allows
>>> the paste to work. I like your fix to the test here.
>>
>> Thanks :)
>>
>>>     There was also already a test for . ## foo, but it only tested the case
>>>     where foo is a macro argument. I have extended that test to also check
>>>     what happens when foo is not a macro argument.
>>>
>>>
>>> Can you also add a test for the case where the . comes from a macro
>>> argument? May as well be thorough =)
>>
>> In both tests, the . comes from a macro argument, so I think you mean
>> add cases where it doesn't come from a macro argument? Sure.
>>
>>>     How does the attached patch look? I've re-tested it on sources of today,
>>>     but on top of my patch in the "r200787 - Fix whitespace handling in
>>>     empty macro expansions" thread (after your okay for that).
>>>
>>>
>>> Checking whether the previous token was a ## worries me a little. What
>>> about cases like this:
>>>
>>> #define foo(x, y) x####y
>>> foo(, z)
>>>
>>> This expands to "## z" under our current left-to-right pasting strategy,
>>> but I think your patch drops the space.
>>
>> In that test, I think dropping the space is correct. Leading and
>> trailing white space in a macro argument is insignificant, which is why
>>
>> #define ID(x) x
>> #define FOO(x,y) ID(x)ID(y)
>> FOO(  [  ,  ]  )
>>
>> should (and does) expand to
>>
>> []
>>
>> Since there is no space between the second ## and the y, there should be
>> no space in the output. That said, I do see your point. If I change your
>> test to
>>
>> #define foo(x) x#### y
>> foo()
>>
>> then (still assuming left-to-right, otherwise the test won't work at
>> all) the output should be
>>
>> ## y
>>
>> and I am now getting
>>
>> ##y
>>
>>> Amusingly, "##z" appears to be
>>> the right answer under a right-to-left pasting strategy, so maybe that's
>>> OK? Both GCC and EDG expand this to just "z"; the standard is not
>>> spectacularly clear here, but I think we're right, because a ## produced
>>> by pasting a placemarker token onto a ## is not a token in the
>>> replacement list (that token is already gone).
>>>
>>> If you want to say that we don't care about this case, I could live with
>>> that =)
>>
>> The preprocessor is supposed to change ## from tok::hashhash to
>> tok::unknown in those cases during macro expansion where it is not a
>> paste, to avoid this and similar problems. For example, this altered
>> test does behave as desired:
>>
>> #define hash #
>> #define concat_(x, y) x ## y
>> #define concat(x, y) concat_(x, y)
>> #define bar() concat(hash, hash) y
>> bar()
>>
>> The preprocessor output with my patch is ## y, even though the y follows ##.
>>
>> This test also works:
>>
>> #define foo(x) x y
>> foo(##)
> 
> Ensuring that any remaining ## token is always tok::unknown turned out
> to be hugely impractical, in light of the fact that invalid token pastes
> may also occur in object-like macro expansions, where the RHS can
> equally be a ## token, and there is no ExpandFunctionArgs() or
> equivalent where the tokens can be checked before the actual expansion.
> Without an additional pass before expansion, checking whether a token is
> the RHS of an invalid paste becomes too complicated to be worth the
> effort, and remembering invalid pastes instead is easily done.
> 
> That said...
> 
>> I think the reason it does not happen for x#### y is because when the
>> LHS of a token paste is an empty macro argument, the RHS is copied
>> without actually performing any paste, and in that case, the change to
>> tok::unknown is skipped. It's tempting to say that we don't care, but I
>> think this affects more than a single space. I will look into this.
> 
> ...it does affect more than a single space, and it exposes some
> interesting corner cases.
> 
> #define foo(x,y) x ## ## y y
> foo(,)
> 
> No matter which ## is interpreted first, the result should be ##. For
> purposes of special handling of empty argument pastes, the first
> appearance of y should not be treated as following a paste operator.
> 
> At the same time, though, for purposes of pre-expansion of macro
> arguments, it should be: in
> 
> foo(,foo(,))
> 
> (where it now matters that the scan for ## searches from left to right)
> the first y follows a ## token, so should not be pre-expanded. The
> second y does not, so should be. Normally, this would not matter, except
> when the later expansion would give different results, as it does here:
> later expansion occurs in the context of the expansion of foo, so the
> first nested foo is not expanded. The second is, because the argument
> was pre-expanded, so the result should be
> 
> ## foo(,) ##
> 
> I've only been able to get this result by tracking HashHashBefore and
> PasteBefore independently, where HashHashBefore indicates whether the
> token follows a ## token, and PasteBefore indicates whether that ##
> token was interpreted as an operator.
> 
> Note that if clang actually used placemarker tokens, this would already
> be the natural result without special effort, but I understand that that
> may be impractical for other reasons.
> 
> If I then take an extra look at my attempt to get whitespace handled
> consistently in invalid token pastes, I end up with a similar extra
> variable to track whether an attempted token paste has taken place,
> instead of checking tok::hashhash, just like you suggested.
> 
> However, another problem is that when slightly modifying the example, to
> 
> #define foo(x,y) w x ## ## y z
> foo(,)
> 
> argument substitution should give w ## z, but the actual expansion
> should then not see the ## as a token paste operation. So even though
> TokenLexer::Lex should not check for tok::hashhash, the fact that ## is
> not changed to tok::unknown here is still a problem.
> 
> I have almost-working patches that I hope to finish over the weekend.

And here they are. Especially the first of these patches seems at first
glance overly complicated, but as stated, the simpler approaches I was
able to come up with get corner cases wrong.

Note: I tested these on top of my patches in the <r200787 - Fix
whitespace handling in empty macro expansions> thread.

Cheers,
Harald van Dijk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Ensure-does-not-get-misinterpreted.patch
Type: text/x-patch
Size: 6063 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20140209/5f4f67c6/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0004-Fix-handling-of-whitespace-in-invalid-token-pastes.patch
Type: text/x-patch
Size: 7061 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20140209/5f4f67c6/attachment-0001.bin>