[PATCH] Whitespace issues during preprocessing

Tue Feb 4 13:24:02 PST 2014

On Tue, Feb 4, 2014 at 1:13 PM, Harald van Dijk <harald at gigawatt.nl> wrote:

> On 04/02/14 20:25, Justin Bogner wrote:
> > Harald van Dijk <harald at gigawatt.nl> writes:
> >> Attached are my updated patches intended to fix various whitespace
> >> issues in clang, modified to take Justin Bogner's comments into account.
> >> They are intended to ensure that whitespace is not inappropriately
> >> removed just because a macro or macro argument expansion is empty, and
> >> does get removed during token pasting.
> >
> > I've committed the first four patches for you: r200785 through r200788.
>
> Thanks!
>
> >> I have moved the handling of invalid token pastes from
> >> TokenLexer::ExpandFunctionArguments to TokenLexer::Lex, so that it works
> >> for both object- and function-like macros, and both when ##'s operands
> >> use macro parameters and when they don't.
> >
> > Given that the tests needed to be changed and the behaviour clearly
> > hasn't followed the comment in a while, I'm not entirely convinced this
> > is the right thing to do. Could the comment simply be wrong? Are people
> > relying on one behaviour or the other for invalid token pastes in
> > assembler-with-cpp?
> >
> > Basically, is there a way to objectively say one of these behaviours is
> > better than the other?
>
> Having looked closer, the current behaviour is inconsistent in a way
> that cannot really be explained to someone not familiar with a few
> implementation details. Given
>
> #define foo(x) (. ## x . ## y)
>
> foo(y) actually expands to (.y . y) in assembler-with-cpp mode, with or
> without my first four patches. That first result is what the comment
> refers to; could the fact that the second result is different be an
> oversight? There does not seem to be a reason for this difference, at
> least. Surprisingly though, this is exactly what GCC does too.
>
> My last patch would have turned this into (.y .y), removing the second
> space. That makes sense to me. (. y . y) could also be a perfectly
> sensible result.
>

I think (.y .y) is probably the best answer here (consistently remove
whitespace on both sides of ## wherever it appears) -- this also probably
better matches what MSVC does (where token-paste can result in multiple
tokens, and seems to act as if the tokens were merely abutted with no
adjacent whitespace). This would also behave more consistently when
processing languages with a different lexical structure from C -- in a
language where an identifier can start with a period, (.y .y) seems to
unambiguously be the right answer.

> I suspect, but do not actually know, that nobody really uses . ## foo
> unless foo is a macro argument, because when foo is not a macro
> argument, there is no point in using ## in the first place. It would
> explain why this went unnoticed for so long. And I can imagine a few
> cases where this would be useful: assembler directives that have subtly
> different syntax, depending on the assembler used. x y would be
> insufficient if x is ., if y is size, and if .size is a directive, but .
> size is a syntax error. ## would work here.
>
> Even if clang's behaviour should change, though, my patch does not have
> adequate testing for these special cases, so shouldn't be applied as is.
> Should I work on better tests, or do you think it is more likely that
> the behaviour should remain unchanged and get decent testing?

There's a risk of breaking someone's assumption by making a change here
(especially since we'd be introducing a difference from GCC) but I think
the new behavior is much more defensible, and there's probably no way to
find out without trying it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20140204/cde0d294/attachment.html>