[cfe-dev] [PATCH] Automatic detection of compatibility macros for non-portable diagnostics

Alexander Kornienko alexfh at google.com
Thu Aug 2 10:21:30 PDT 2012


On Wed, Aug 1, 2012 at 7:51 PM, Douglas Gregor <dgregor at apple.com> wrote:

>
> On Aug 1, 2012, at 10:28 AM, Matthieu Monrocq <matthieu.monrocq at gmail.com>
> wrote:
> > I suppose that performance wise it may not be acceptable... but would it
> make sense to maintain a reverse mapping tokens => macro ?
> >
> > We already have the macro => tokens mapping (and it survives even when
> Sema is running since it's used by diagnosis), so if we could simply
> maintain another index alongside it then we would be "golden" would not we ?
>
> Well, it hinges on performance.
>
> It does seem plausible to me that we could have the preprocessor keep a
> hold of macros after they have been #undef'd, so that we could look back in
> time and answer the question "which macros were defined when we were at the
> given source location." That would actually make a lot of sense for Clang,
> because it's the way the AST works (e.g., we keep all declarations of a
> function) and would actually benefit some features, such as code
> completion. But we'd have to carefully measure the performance and memory
> impact.
>

>         - Doug
>

I've thought a bit about the problem, and I'll try to summarize the problem
definition and the spectrum of possible solutions I see.

So, the problem is finding the most suitable spelling for a given code
snippet. It can arise when formatting a diagnostic message or a fix-it hint,
or when performing a refactoring. In the first case it is merely desirable to
avoid suggesting a spelling that could be wrong at the specific source
location; in the last two cases it is a must. So we need to be exact even
when this means additional overhead.

"The most suitable spelling" can relate to different things and have
different meaning (e.g. the least qualified type/function/variable name),
but here it's reasonable to limit the task to macros intended to hide
implementation details for compatibility (e.g. with different compilers or
language standards). I would also exclude the cases of function-like macros
and macros defining only a part of a given snippet.
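
To make the target concrete, here's a hypothetical example of the kind of
compatibility macro I have in mind (the name MY_OVERRIDE is made up purely
for illustration):

    #if __cplusplus >= 201103L
    #define MY_OVERRIDE override  // expands to the C++11 keyword
    #else
    #define MY_OVERRIDE           // expands to nothing on older compilers
    #endif

    struct Base { virtual void f(); };
    struct Derived : Base { void f() MY_OVERRIDE; };

If a diagnostic wants to suggest adding 'override', it should suggest
MY_OVERRIDE instead wherever that macro is in effect.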

So the task is to find the last active macro expanding to a specified
sequence of tokens. The "expanding to" part can mean "immediately expanding"
or "recursively expanding", and this choice can have a significant effect on
performance (the latter case would probably require using the TokenLexer or
similar means for each comparison with the sample token sequence). I'm not
sure whether we need to support the more complex case of finding
recursively-expanded macros.
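
To illustrate the difference with a made-up example:

    #define NORETURN_ATTR __attribute__((noreturn))
    #define MY_NORETURN NORETURN_ATTR

MY_NORETURN immediately expands to the single token NORETURN_ATTR, and only
after a further expansion step yields __attribute__((noreturn)). A look-up
that compares only immediate expansions against the sample token sequence
would find NORETURN_ATTR but not MY_NORETURN; matching the latter requires
recursively expanding each candidate.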

The structure of the scopes of macro definitions is rather flat and can
overlap the AST in an arbitrary way, so there's no need to have an analog of
the AST for it or to include this information in the AST. What we need is
just the set of macros (with their expansion token sequences) defined at any
arbitrary source location. It could be indexed by contents or not, but the
information is basically the same. The problem is that we may require this
information for an earlier source location (as happens in
AnalysisBasedWarnings, which processes completely parsed functions, so the
state of the preprocessor at the moment of processing can differ from its
state at any given source location inside the function).
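
A made-up example of why the current preprocessor state isn't enough:

    void f(int x) {
    #define MY_UNREACHABLE __builtin_unreachable()
      if (x) MY_UNREACHABLE;  // the macro is active here...
    #undef MY_UNREACHABLE
      // ...
    }
    // By the time the completely parsed f() is analyzed, the macro has
    // already been #undef'd, so asking "is MY_UNREACHABLE defined now?"
    // gives the wrong answer for the location inside the if.

So whatever mechanism we choose has to answer the question for the original
source location, not for the preprocessor state at look-up time.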

I see several approaches to this problem:
    1. before preprocessing, define the set of token sequences we're
interested in; while preprocessing, build a reverse mapping for the token
sequences of interest; look up the most suitable macro by iterating through
the list of macros mapped to the specific spelling (a rough sketch of such an
index follows the list);
    2. while preprocessing, retain all #undef'ed macros together with the
respective source location information; look up a macro by iterating over all
macros and comparing their expansions with the sample;
    3. avoid any additional actions during preprocessing; when a look-up is
requested, re-run preprocessing from the beginning up to the required source
location, then iterate over all defined macros and find the latest-defined
one with the specific expansion sequence.
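
Just to make variant 1 more tangible, here is a minimal sketch of what the
reverse mapping and its look-up could look like; all names are hypothetical,
and a real implementation would use Clang's SourceLocation and macro
infrastructure rather than plain strings and offsets:

    #include <map>
    #include <string>
    #include <vector>

    // One macro that was seen to expand to an interesting token sequence,
    // together with the range in which that definition was active.
    struct MacroCandidate {
      std::string Name;     // e.g. "MY_OVERRIDE"
      unsigned DefinedAt;   // offset of the #define
      unsigned UndefinedAt; // offset of the matching #undef, or max value
    };

    // Spelling of a token sequence of interest -> macros expanding to it.
    using ReverseMacroIndex =
        std::map<std::string, std::vector<MacroCandidate>>;

    // Pick the last candidate whose definition covers the given location.
    const MacroCandidate *findSuitableMacro(const ReverseMacroIndex &Index,
                                            const std::string &Spelling,
                                            unsigned Loc) {
      auto It = Index.find(Spelling);
      if (It == Index.end())
        return nullptr;
      const MacroCandidate *Best = nullptr;
      for (const MacroCandidate &C : It->second)
        if (C.DefinedAt <= Loc && Loc < C.UndefinedAt)
          Best = &C; // later definitions win
      return Best;
    }

The preprocessing-time hook would append a MacroCandidate whenever a
#define's replacement tokens spell one of the registered sequences, and fill
in UndefinedAt when the corresponding #undef is seen.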

Variant 1 adds non-trivial steps to the preprocessing phase, but allows
faster look-ups. Variant 2 mostly adds memory overhead and shouldn't add any
computational overhead during preprocessing, but it will be slower for
look-ups. Variant 3 adds overhead only during look-up, and that overhead is
quite significant. If we want to optimize for the case where warnings are
rare, variant 3 seems to be quite a good choice. Variant 2 may be a good
trade-off between preprocessing and look-up overheads. Retaining the
#undef'd macros could possibly be useful for other purposes as well (not sure
about this, but maybe someone has an example?).

We could also come up with a variant 4, which would be a combination of 3 and
1: don't do anything additional until the first look-up is needed; once we
need it, re-run the preprocessor from the beginning and build a reverse
mapping for the required sequence (or still make clients register all token
sequences before preprocessing starts, and build a reverse mapping for all of
these sequences). This variant would add significant overhead only for the
first look-up, and all subsequent ones would come at a lower price. When no
look-ups are needed (i.e. the compilation doesn't trigger diagnostics that
require a spelling look-up), there would be no overhead at all, either in
memory or in processing time.
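
A minimal sketch of the lazy part of variant 4, reusing the hypothetical
ReverseMacroIndex/MacroCandidate types from the sketch above; the callback
stands in for re-running the preprocessor:

    #include <functional>
    #include <utility>

    class LazyMacroIndex {
      bool Built = false;
      ReverseMacroIndex Index;
      std::function<ReverseMacroIndex()> Rebuild; // re-runs preprocessing

    public:
      explicit LazyMacroIndex(std::function<ReverseMacroIndex()> R)
          : Rebuild(std::move(R)) {}

      const MacroCandidate *lookup(const std::string &Spelling, unsigned Loc) {
        if (!Built) {
          Index = Rebuild(); // pay the preprocessing cost only once,
          Built = true;      // and only if a look-up is actually needed
        }
        return findSuitableMacro(Index, Spelling, Loc);
      }
    };

If no diagnostic ever asks for a spelling, Rebuild() is never called and the
whole machinery costs nothing.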

-- 
Best regards,
Alexander Kornienko