[cfe-dev] Feature proposal: Compile Configuration Disclosure

Wed Mar 3 18:20:59 PST 2010

On Mar 3, 2010, at 7:05 PM, Douglas Gregor wrote:

> 
> On Mar 3, 2010, at 2:44 PM, James Widman wrote:
> 
>> Hi all,
>> 
>> I'm interested in implementing a feature in Clang.  The basic idea is that, given a new command-line switch, like perhaps:
>> --disclose-config
>> ...  Clang would output, to a plain text file (which I'll call a "disclosure file"), information about the compilation environment sufficient for a third-party program to observe the same sequence of tokens that Clang saw.
>> 
>> So basically, providing this dump enables one to perfectly emulate not just the compiler but a *specific run* of the compiler.  Note that tools based on Clang libraries could make use of information in disclosure files.
> 
> I assume we would also need the reverse operation, taking a disclosure file and setting Clang's internal options to match the configuration described?

My motives were so self-serving that this did not occur to me.  (:

I assumed that the driver of a third-party tool would have its own disclosure-file parser and set things up appropriately.

But for completeness, direct support for it in Clang would seem to make sense.

>> If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.
> 
> If successful, one should be able to configure compiler A to build things like compiler B, just by dumping the disclosure file from one and importing it into the other, no?

Right.

>> If people see this as a desirable feature and reasonable to implement in Clang,
> 
> Seems reasonable to me.

Glad you think so!

I'll ask about implementation details in a separate thread.

>> then the next few questions are naturally raised:
>> 
>>   1. What information would be output?
>>   2. Where would it be output?
>>   3. What format would be used in the output?
>> 
>> 1) What information would be output?
>> ====================================
>> At minimum, a tool *needs* the following information for each translation unit:
>> 
>> List of Needs:
>> --------------
>> 
>>   - The complete set of predefined macro definitions;
>> 
>>   - The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;
>> 
>>   - The ordered sequence of implicit #include directives;
> 
> The ordering between predefined macro defs and implicit includes (e.g., based on -include on the command line) can matter.

Important tip; thanks!
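
To make the ordering concern concrete: one option (just a sketch; nothing here is decided) would be for the disclosure file to record predefined macros and implicit includes as a single ordered sequence rather than as two separate lists.  The Python below assumes hypothetical event tuples and a hypothetical helper, render_preamble, that turns them into the text a third-party tool would preprocess ahead of the primary source file.  The example sequence is made up and does not describe what any particular compiler does.

```python
def render_preamble(events):
    """Render ordered disclosure events as preprocessor source text.

    Each event is a tuple (hypothetical layout, not a settled format):
      ("define", name, body)  -> a predefined macro
      ("include", path)       -> an implicit #include
    The relative order of the events is preserved exactly.
    """
    lines = []
    for event in events:
        kind = event[0]
        if kind == "define":
            _, name, body = event
            # rstrip() drops the trailing space left by an empty body.
            lines.append(f"#define {name} {body}".rstrip())
        elif kind == "include":
            _, path = event
            lines.append(f'#include "{path}"')
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return "\n".join(lines) + "\n"


# Made-up sequence: an implicit include sits between two definitions.
events = [
    ("define", "__STDC__", "1"),
    ("include", "prefix.h"),
    ("define", "NDEBUG", ""),
]
print(render_preamble(events), end="")
```

Keeping one sequence means the consumer never has to guess how two separate lists were interleaved.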

>>   - The path name of the primary source file;
>> 
>>   - The working directory;
>> 
>>   - Environment variables;
> 
> How would these affect translation?

Well, some compilers depend on environment variables.  E.g., MSVC uses the variable INCLUDE, which names directories containing headers that ship with the compiler.  And GCC uses several environment variables; see the 'ENVIRONMENT' section of the GCC man page.

Granted, INCLUDE would already be covered in the second bullet point above.  But it's always possible that the environment will contain something that alters a compiler's behavior in some unexpected way.

The whole point of this proposal is to bring to light essential (or potentially essential) details that have traditionally caused grief by being hidden.  If a compiler relies on the environment, then the environment is just another component of the configuration: it is needed to reproduce the compiler's behavior, and so it belongs in a config-disclosure file.
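
As a rough sketch of what disclosing the environment could look like, the Python below writes every variable as one line in a stable order.  The "env" line prefix, the JSON-style quoting, and the function name environment_lines are all assumptions made for illustration; the output format is still an open question in this proposal.

```python
import json
import os


def environment_lines(environ=None):
    """Return one 'env "NAME" "VALUE"' line per variable, sorted by name.

    json.dumps is used only as a convenient way to quote values that
    contain spaces, backslashes, or newlines; the format is hypothetical.
    """
    env = os.environ if environ is None else environ
    return [
        f"env {json.dumps(name)} {json.dumps(env[name])}"
        for name in sorted(env)
    ]


# Dump the environment of the current process.
for line in environment_lines():
    print(line)
```

Sorting by name keeps the dump stable between runs, so two disclosure files can be compared with an ordinary diff.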

>>   - The assumed encoding of the source file and the internal encoding used during processing;
>> 
>>   - The name & version of the compiler;
>> 
>>   - argv; and
>> 
>>   - sizes of primitive types (except in the case where the main output is preprocessor output).
>> 
>> So those are all of the "needs" as I see them.  
>> 
>> Here are some "wants":
>> 
>> It would be *very* helpful to automatically determine, for a given program image (or a dynamic library or a static archive), the full set of disclosed configurations (as in the "List of Needs" above), because then every tool would be able to completely configure itself for a given project with minimal involvement from the user.
> 
> *That* is going to be tricky, because it's a lot of information to encode in each object file/library/executable.

Ouch!  I see that I gave the wrong idea about this.

I really need to be careful in expressing this point:  I do *not* propose to encode *anything* within object files!  I only want each tool in the tool chain to dump information that it already knows about (like the names of its input files and output files) into a *new*, *separate*, plain-text file.

Of course, if files get renamed (or moved across a network), that would make it harder to piece things back together.  Hence the additional "want" about SHA values.  Then a tool can just invoke:

   find some-dir -iname '*.config-disclosure'

... and with that *alone*, it could see how every stage of the build is connected to every other stage.
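
For illustration, here is a minimal Python sketch of how a tool might piece the stages together from the disclosure files alone.  The line format ("input <sha1> <path>" and "output <sha1> <path>") is purely an assumption, since the format question is still open, and the names sha1_of, parse_disclosure, and link_stages are hypothetical.

```python
import hashlib
from pathlib import Path


def sha1_of(path):
    """Return the hex SHA-1 of a file's contents (what a tool would record)."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def parse_disclosure(text):
    """Parse hypothetical 'input <sha1> <path>' / 'output <sha1> <path>' lines."""
    record = {"input": {}, "output": {}}
    for line in text.splitlines():
        parts = line.split(None, 2)  # the path may itself contain spaces
        if len(parts) == 3 and parts[0] in record:
            kind, digest, name = parts
            record[kind][digest] = name
    return record


def link_stages(root):
    """Return (producer, consumer, sha1) for each output consumed elsewhere.

    Matching is done on SHA-1 values only, so a file that was renamed or
    moved between stages is still connected to the stage that produced it.
    Note: unlike 'find -iname', this glob follows the platform's case rules.
    """
    records = {
        str(p): parse_disclosure(p.read_text())
        for p in sorted(Path(root).rglob("*.config-disclosure"))
    }
    edges = []
    for producer, produced in records.items():
        for consumer, consumed in records.items():
            if producer == consumer:
                continue
            for digest in produced["output"]:
                if digest in consumed["input"]:
                    edges.append((producer, consumer, digest))
    return edges
```

Calling link_stages("some-dir") would then yield one edge per file handed from one stage to the next, e.g. from a compile step to the link step that consumed its object file.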

> Better to start with the tools dumping/loading their configurations.

Right.

>> If each tool in the tool chain, when given the --disclose-config option, also sent info about its inputs and outputs to a text file, then this could be achieved.
> 
> Sure.
> 
>> So for example, the linker would need only respond to "--disclose-config" by dumping (again, to a secondary file) its environment and the names of input files & output files.  Renames could be detected and accounted for if SHA-1 values were passed along the way.
>> 
>> Other "wants" include:
>> 
>>   - dialect information for the TU
>> 
>>   - brief info on language extensions
>> 
>>   - SHA of the preprocessing token sequence
>> 
>> Keeping an unbroken chain of SHA values (a la git) could have other uses, like verifying that a specific build has been reproduced exactly (not counting certain things like different expansions of __DATE__ and __TIME__).  This could be very handy when migrating between build systems or when a user is expected to deliver source code along with a program image.
> 
> This feels like you're moving toward a full model of the steps it takes to compile a "project" (as an IDE or make system would know about).

Yes; that's where I would like things to go.

James Widman  
-- 
Gimpel Software 
http://gimpel.com