[cfe-dev] Feature proposal: Compile Configuration Disclosure

Sun Mar 7 10:53:29 PST 2010

On Mar 3, 2010, at 8:20 PM, James Widman <widman at gimpel.com> wrote:

>
> On Mar 3, 2010, at 7:05 PM, Douglas Gregor wrote:
>
>>
>> On Mar 3, 2010, at 2:44 PM, James Widman wrote:
>>
>>> Hi all,
>>>
>>> I'm interested in implementing a feature in Clang.  The basic idea  
>>> is that, given a new command-line switch, like perhaps:
>>> --disclose-config
>>> ...  Clang would output, to a plain text file (which I'll call a  
>>> "disclosure file"), information about the compilation environment  
>>> sufficient for a third-party program to observe the same sequence  
>>> of tokens that Clang saw.
>>>
>>> So basically, providing this dump enables one to perfectly emulate  
>>> not just the compiler but a *specific run* of the compiler.  Note  
>>> that tools based on Clang libraries could make use of information  
>>> in disclosure files.
>>
>> I assume we would also need the reverse operation, taking a  
>> disclosure file and setting Clang's internal options to match the  
>> configuration described?
>
> My motives were so self-serving that this did not occur to me.  (:
>
> I assumed that the driver of a 3rd-party tool would have its own  
> disclosure-file-parser and set things up appropriately.
>
> But for completeness, direct support for it in Clang would seem to  
> make sense.

Yes. I'd hang it off CompilerInvocation, which encapsulates a...
compiler invocation.

>>> If this turns out to be successful with Clang, my next step would  
>>> be to establish a standard specification and try to encourage  
>>> other compiler vendors to implement it.
>>
>> If successful, one should be able to configure compiler A to build  
>> things like compiler B, just by dumping the disclosure file from  
>> one and importing it into the other, no?
>
> Right.
>
>>> If people see this as a desirable feature and reasonable to  
>>> implement in Clang,
>>
>> Seems reasonable to me.
>
> Glad you think so!
>
> I'll ask about implementation details in a separate thread.
>
>>> then the next few questions are naturally raised:
>>>
>>>  1. What information would be output?
>>>  2. Where would it be output?
>>>  3. What format would be used in the output?
>>>
>>> 1) What information would be output?
>>> ====================================
>>> At minimum, a tool *needs* the following information for each  
>>> translation unit:
>>>
>>> List of Needs:
>>> --------------
>>>
>>>  - The complete set of predefined macro definitions;
>>>
>>>  - The ordered sequence of directory pathnames used for
>>> header-searching in #include "" and #include <>;
>>>
>>>  - The ordered sequence of implicit #include directives;
>>
>> The ordering between predefined macro defs and implicit includes  
>> (e.g., based on -include on the command line) can matter.
>
> Important tip; thanks!
>
>>>  - The path name of the primary source file;
>>>
>>>  - The working directory;
>>>
>>>  - Environment variables;
>>
>> How would these affect translation?
>
> Well, some compilers depend on environment variables.  E.g. MSVC  
> uses the variable INCLUDE, which names directories containing  
> headers that ship with the compiler.  And GCC uses several  
> environment variables; see the section 'ENVIRONMENT' in the GCC  
> manual.
>
> Granted, INCLUDE would already be covered in the second bullet point  
> above.  But it's always possible that the environment will contain  
> something that alters a compiler's behavior in some unexpected way.

The environment is harder to control, which is unfortunate. I wonder  
if all of the environment variables have corresponding flags?

> The whole point of this proposal is to bring to light essential (or  
> potentially essential) details that have traditionally caused grief  
> as a result of being hidden.  So if a compiler relies on the  
> environment then the environment is just another component of the  
> configuration; therefore it's necessary in order to reproduce the  
> compiler's behavior (and therefore appropriate to reveal in a
> config-disclosure file).
>
>>>  - The assumed encoding of the source file and the internal  
>>> encoding used during processing;
>>>
>>>  - The name & version of the compiler;
>>>
>>>  - argv; and
>>>
>>>  - sizes of primitive types (except in the case where the main  
>>> output is preprocessor output).
>>>
>>> So those are all of the "needs" as I see them.
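
To make the list of needs concrete, here is a minimal Python sketch of what one per-translation-unit record could hold. Nothing in it is a proposed format: the Disclosure class, its field names, the capture() and dumps() helpers, and the JSON serialization are all assumptions made for illustration, since the output format is still an open question in this thread. Predefined macros and implicit includes are kept in a single ordered list because their relative order can matter.

```python
import json
import os
import sys
from dataclasses import asdict, dataclass, field


@dataclass
class Disclosure:
    """Hypothetical per-translation-unit record; all field names are illustrative."""
    compiler: str            # name & version of the compiler
    argv: list               # command line exactly as received
    working_directory: str
    primary_source: str      # path name of the primary source file
    environment: dict        # environment variables seen by the tool
    # One ordered list for predefined macros and implicit #includes,
    # e.g. ["define", "NAME", "1"] or ["include", "prefix.h"].
    prelude: list = field(default_factory=list)
    quote_search_path: list = field(default_factory=list)  # for #include ""
    angle_search_path: list = field(default_factory=list)  # for #include <>
    source_encoding: str = ""    # assumed encoding of the source file
    internal_encoding: str = ""  # encoding used during processing
    type_sizes: dict = field(default_factory=dict)  # sizes of primitive types


def capture(compiler, primary_source, argv=None, environ=None, cwd=None):
    """Fill in the parts any process can observe about itself."""
    return Disclosure(
        compiler=compiler,
        argv=list(sys.argv if argv is None else argv),
        working_directory=os.getcwd() if cwd is None else cwd,
        primary_source=primary_source,
        environment=dict(os.environ if environ is None else environ),
    )


def dumps(disclosure):
    """Serialize to plain text (JSON is an assumption, not part of the proposal)."""
    return json.dumps(asdict(disclosure), indent=2, sort_keys=True)
```

A tool would fill in the remaining fields (search paths, prelude, encodings, type sizes) from its own internal state before writing the file.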
>>>
>>> Here are some "wants":
>>>
>>> It would be *very* helpful to automatically determine, for a given  
>>> program image (or a dynamic library or a static archive), the full  
>>> set of disclosed configurations (as in the "List of Needs" above),  
>>> because then every tool would be able to completely configure  
>>> itself for a given project with minimal involvement from the user.
>>
>> *That* is going to be tricky, because it's a lot of information to  
>> encode in each object file/library/executable.
>
> Ouch!  I see that I gave the wrong idea about this.
>
> I really need to be careful in expressing this point:  I do *not*  
> propose to encode *anything* within object files!  I only desire for  
> each tool in the tool chain to dump, into a *new*, *separate*,
> plain-text file, information that it already knows about (like names
> of input files and names of output files).
>
> Of course, if files get renamed (or moved across a network), that  
> would make it harder to piece things back together.  Hence the  
> additional "want" about SHA values.  Then a tool can just invoke:
>
>   find some-dir -iname '*.config-disclosure'
>
> ... and with that *alone*, it could see how every stage of the build  
> is connected to every other stage.

Ah, okay.
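
For illustration, here is a rough Python sketch of how a third-party tool might piece the build back together from the files that the find command above would locate. It assumes each disclosure file is JSON with "tool", "inputs" and "outputs" keys, where every input or output entry carries a "sha1" value; that layout and the function names are hypothetical, since no file format has been settled here.

```python
import hashlib
import json
from pathlib import Path


def sha1_of(path):
    """SHA-1 of a file's contents, as a hex string."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def load_disclosures(root):
    """Load every *.config-disclosure file under root (assumed to be JSON).

    Note: rglob matching is case-sensitive on POSIX systems, unlike find -iname.
    """
    records = []
    for path in sorted(Path(root).rglob("*.config-disclosure")):
        records.append(json.loads(path.read_text()))
    return records


def link_stages(records):
    """Return (producer, consumer, sha1) edges between build stages.

    Files are matched by content hash rather than by name, so a file that
    was renamed or moved between stages is still connected.
    """
    producers = {}
    for record in records:
        for entry in record.get("outputs", []):
            producers[entry["sha1"]] = record["tool"]
    edges = []
    for record in records:
        for entry in record.get("inputs", []):
            digest = entry["sha1"]
            if digest in producers:
                edges.append((producers[digest], record["tool"], digest))
    return edges
```

Matching on the hash instead of the path is what lets the renamed-file case work without any extra bookkeeping.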

>> Better to start with the tools dumping/loading their configurations.
>
> Right.
>
>>> If each tool in the tool chain, when given the --disclose-config  
>>> option, also sent info about its inputs and outputs to a text  
>>> file, then this could be achieved.
>>
>> Sure.
>>
>>> So for example, the linker would need only respond to
>>> "--disclose-config" by dumping (again, to a secondary file) its
>>> environment and the names of input files & output files.  Renames
>>> could be detected and accounted for if SHA-1 values were passed
>>> along the way.
>>>
>>> Other "wants" include:
>>>
>>>  - dialect information for the TU
>>>
>>>  - brief info on language extensions
>>>
>>>  - SHA of the preprocessing token sequence
>>>
>>> Keeping an unbroken chain of SHA values (a la git) could have  
>>> other uses, like verifying that a specific build has been  
>>> reproduced exactly (not counting certain things like different  
>>> expansions of __DATE__ and __TIME__).  This could be very handy  
>>> when migrating between build systems or when a user is expected to  
>>> deliver source code along with a program image.
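
As a small illustration of the "SHA of the preprocessing token sequence" item, the Python sketch below hashes a list of token spellings. The function name and the length-prefix framing are assumptions of this sketch, not part of the proposal; a real tool would have to agree on a canonical token spelling and framing first.

```python
import hashlib


def token_sequence_sha1(tokens):
    """SHA-1 over a sequence of token spellings (strings).

    Each spelling is prefixed with its byte length so that token
    boundaries affect the hash: ["ab", "c"] and ["a", "bc"] differ.
    Whitespace between tokens is never fed to the hash.
    """
    h = hashlib.sha1()
    for tok in tokens:
        data = tok.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()
```

This sketch does nothing special for tokens that come from __DATE__ or __TIME__; excluding those from the comparison would need a separate convention.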
>>
>> This feels like you're moving toward a full model of the steps it  
>> takes to compile a "project" (as an IDE or make system would know  
>> about).
>
> Yes; that's where I would like things to go.
>
> James Widman
> -- 
> Gimpel Software
> http://gimpel.com