[cfe-dev] [PATCH] Automatic detection of compatibility macros for non-portable diagnostics

Douglas Gregor dgregor at apple.com
Tue Aug 7 10:07:50 PDT 2012


On Aug 2, 2012, at 10:21 AM, Alexander Kornienko <alexfh at google.com> wrote:

> On Wed, Aug 1, 2012 at 7:51 PM, Douglas Gregor <dgregor at apple.com> wrote:
> 
> On Aug 1, 2012, at 10:28 AM, Matthieu Monrocq <matthieu.monrocq at gmail.com> wrote:
> > I suppose that performance-wise it may not be acceptable... but would it make sense to maintain a reverse mapping tokens => macro?
> >
> > We already have the macro => tokens mapping (and it survives even when Sema is running, since it's used by diagnostics), so if we could simply maintain another index alongside it, we would be "golden", wouldn't we?
> 
> Well, it hinges on performance.
> 
> It does seem plausible to me that we could have the preprocessor keep a hold of macros after they have been #undef'd, so that we could look back in time and answer the question "which macros were defined when we were at the given source location." That would actually make a lot of sense for Clang, because it's the way the AST works (e.g., we keep all declarations of a function) and would actually benefit some features, such as code completion. But we'd have to carefully measure the performance and memory impact.
> 
>         - Doug
> 
> I thought a bit on the problem and now I'll try to summarize the problem definition and the spectrum of possible solutions I see.
> 
> So, the problem is finding the most suitable spelling for a given code snippet. It arises when formatting a diagnostic message, emitting a fix-it hint, or performing a refactoring. In the first case it's merely desirable to avoid suggesting a spelling that could be wrong at the specific source location; in the last two cases it's a must. So we need to be exact, even when that means additional overhead.
> 
> "The most suitable spelling" can relate to different things and have different meaning (e.g. the least qualified type/function/variable name), but here it's reasonable to limit the task to macros intended to hide implementation details for compatibility (e.g. with different compilers or language standards). I would also exclude the cases of function-like macros and macros defining only a part of a given snippet.

Sure. We could probably narrow this further to macros starting with __attribute__ or [[.

> So the task is to find the last active macro expanding to a specified sequence of tokens. The "expanding to" part can mean "immediately expanding" or "recursively expanding", and this choice can have a significant effect on performance (the latter case would probably require using the TokenLexer or similar means for each comparison with the sample token sequence). I'm not sure whether we need to support the more complex case of finding recursively-expanded macros.

I think we can safely ignore that case.

> The structure of scopes of macro definitions is rather flat and can overlap the AST in arbitrary ways, so there's no need to have an analog of the AST for it, or to include this information in the AST. What we need is just the set of macros (with their expansion sequences) defined at any arbitrary source location. It could be indexed by contents or not, but the information is basically the same. The problem is that we can require this information for an earlier source location (as happens in AnalysisBasedWarnings, which processes completely parsed functions, so the state of the preprocessor at the moment of processing can differ from its state at any given source location inside the function).
> 
> I see several approaches to this problem:
>     1. before preprocessing define a set of token sequences we're interested in; while preprocessing build a reverse mapping for token sequences of interest; look up the most suitable macro by iterating through a list of macros with a specific spelling;
>     2. while preprocessing, retain all #undef'ed macros together with their source location information; look up a macro by iterating over all macros and comparing their expansions with the sample;
>     3. avoid any additional actions during preprocessing; when a look-up is requested re-run preprocessing from the beginning to the required source location and then iterate over all defined macros and find the latest defined one with the specific expansion sequence.
> 
> Variant 1 adds non-trivial steps to the preprocessing phase, but allows faster look-ups. Variant 2 mostly adds memory overhead and shouldn't add any computational overhead during preprocessing, but it will be slower for look-ups. Variant 3 only adds overhead during look-up, and that overhead is quite significant. If we want to optimize for the case of very few warnings, variant 3 seems to be quite a good choice.

I'd rather not do variant 3. While it's generally good to avoid paying for information for warnings that you don't use, the cost here is going to be too high in the case where we *do* want to provide a warning.

> Variant 2 may be a good trade-off between preprocessing and look-up overheads. Retaining the #undef'd macros could possibly be useful for other purposes (not sure about this, but maybe someone has an example?). We could also come up with a variant 4, a combination of 3 + 1: don't do anything additional until the first look-up is needed; once we need it, re-run the preprocessor from the beginning and build a reverse mapping for the required sequence (or still make clients register all token sequences before preprocessing starts, and build a reverse mapping for all of them). This variant would add a significant overhead only for the first look-up, and all subsequent ones would come at a lower price. When no look-ups are needed (i.e. the compilation doesn't trigger diagnostics that require a spelling look-up), there would be no overhead at all, in either memory or processing.


As I noted before, the basis of variant 2 is interesting in its own right, because Clang *should* be retaining information about macros even after they are #undef'd (assuming this doesn't have too high an impact on memory usage).

Here's a variant 5 that may strike the right balance: when we see a macro definition, look at just its first token or two to see whether it starts with __attribute__ or [[. If so, mark it as potentially being a configuration macro. I suspect that this will have no measurable impact on performance.

When we do want to find a configuration macro for a given attribute, we look at the potential configuration macros and actually classify them, so that we can (from then on) efficiently answer the question "which configuration macro describes this particular attribute?". This operation should require far less effort than variant 3, while still delaying most of the work.

Note that variant 5 can be implemented independently of whether we keep information about #undef'd macros in the preprocessor. Its behavior will degrade gracefully, e.g., a configuration macro that is later #undef'd wouldn't be found/suggested, but that's not a case we should worry about. Keeping information about #undef'd macros would be orthogonal goodness.

	- Doug

