[cfe-dev] Feature proposal: Compile Configuration Disclosure

Wed Mar 3 16:05:14 PST 2010

On Mar 3, 2010, at 2:44 PM, James Widman wrote:

> Hi all,
> 
> I'm interested in implementing a feature in Clang.  The basic idea is that, given a new command-line switch, like perhaps:
>  --disclose-config
> ...  Clang would output, to a plain text file (which I'll call a "disclosure file"), information about the compilation environment sufficient for a third-party program to observe the same sequence of tokens that Clang saw.
> 
> So basically, providing this dump enables one to perfectly emulate not just the compiler but a *specific run* of the compiler.  Note that tools based on Clang libraries could make use of information in disclosure files.

I assume we would also need the reverse operation, taking a disclosure file and setting Clang's internal options to match the configuration described?

> If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.

If successful, one should be able to configure compiler A to behave like compiler B just by dumping the disclosure file from B and importing it into A, no?

> If people see this as a desirable feature and reasonable to implement in Clang,

Seems reasonable to me.

> then the next few questions are naturally raised:
> 
>    1. What information would be output?
>    2. Where would it be output?
>    3. What format would be used in the output?
> 
> 1) What information would be output?
> ====================================
> At minimum, a tool *needs* the following information for each translation unit:
> 
> List of Needs:
> --------------
> 
>    - The complete set of predefined macro definitions;
> 
>    - The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;
> 
>    - The ordered sequence of implicit #include directives;

The ordering between predefined macro defs and implicit includes (e.g., based on -include on the command line) can matter.
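
To illustrate the ordering point, here is a minimal sketch in which a disclosure file records macro definitions and implicit includes as one ordered sequence instead of two separate lists. The entry kinds ("define", "include") and the render_preamble helper are hypothetical names made up for this example, not anything Clang provides.

```python
from typing import List, Tuple

# Hypothetical entry: (kind, payload), where kind is "define" or "include".
Entry = Tuple[str, str]


def render_preamble(entries: List[Entry]) -> str:
    """Render the implicit preamble in exactly the recorded order."""
    lines = []
    for kind, payload in entries:
        if kind == "define":
            # "NAME=VALUE" or bare "NAME" (treated like -DNAME, i.e. value 1).
            name, sep, value = payload.partition("=")
            lines.append(f"#define {name} {value if sep else '1'}".rstrip())
        elif kind == "include":
            lines.append(f'#include "{payload}"')
        else:
            raise ValueError(f"unknown entry kind: {kind!r}")
    return "\n".join(lines) + "\n"
```

A consumer that replays this sequence sees the macro before or after the include exactly as recorded, which two independent lists could not guarantee.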

>    - The path name of the primary source file;
> 
>    - The working directory;
> 
>    - Environment variables;

How would environment variables affect translation?

>    - The assumed encoding of the source file and the internal encoding used during processing;
> 
>    - The name & version of the compiler;
> 
>    - argv; and
> 
>    - sizes of primitive types (except in the case where the main output is preprocessor output).
> 
> So those are all of the "needs" as I see them.  
> 
> Here are some "wants":
> 
> It would be *very* helpful to automatically determine, for a given program image (or a dynamic library or a static archive), the full set of disclosed configurations (as in the "List of Needs" above), because then every tool would be able to completely configure itself for a given project with minimal involvement from the user.

*That* is going to be tricky, because it's a lot of information to encode in each object file/library/executable. Better to start with the tools dumping/loading their configurations.
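
As a rough sketch of the dump/load starting point, a per-translation-unit record covering the "List of Needs" could round-trip as below. Every field name is an assumption made for illustration, and JSON is used only because the Python standard library handles it. The thread has not settled on a format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class Disclosure:
    """Hypothetical per-TU record; field names are illustrative only."""
    compiler_name: str
    compiler_version: str
    argv: List[str]
    working_directory: str
    primary_source: str
    source_encoding: str
    internal_encoding: str
    # Kept as ordered lists because search order and definition order matter.
    predefined_macros: List[str] = field(default_factory=list)   # "NAME=VALUE"
    quote_include_dirs: List[str] = field(default_factory=list)  # #include ""
    angle_include_dirs: List[str] = field(default_factory=list)  # #include <>
    implicit_includes: List[str] = field(default_factory=list)
    environment: Dict[str, str] = field(default_factory=dict)
    type_sizes: Dict[str, int] = field(default_factory=dict)     # in bytes


def dump(disclosure: Disclosure) -> str:
    """Serialize the record to text (what the compiler would write out)."""
    return json.dumps(asdict(disclosure), indent=2)


def load(text: str) -> Disclosure:
    """Rebuild the record from text (what a consuming tool would read)."""
    return Disclosure(**json.loads(text))
```

With this shape, load(dump(d)) compares equal to d, which is the property a tool importing another tool's configuration would rely on.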

>  If each tool in the tool chain, when given the --disclose-config option, also sent info about its inputs and outputs to a text file, then this could be achieved.

Sure.

> So for example, the linker would need only respond to "--disclose-config" by dumping (again, to a secondary file) its environment and the names of input files & output files.  Renames could be detected and accounted for if SHA-1 values were passed along the way.
> 
> Other "wants" include:
> 
>    - dialect information for the TU
> 
>    - brief info on language extensions
> 
>    - SHA of the preprocessing token sequence
> 
> Keeping an unbroken chain of SHA values (a la git) could have other uses, like verifying that a specific build has been reproduced exactly (not counting certain things like different expansions of __DATE__ and __TIME__).  This could be very handy when migrating between build systems or when a user is expected to deliver source code along with a program image.

This feels like you're moving toward a full model of the steps it takes to compile a "project" (as an IDE or make system would know about).
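
For the rename-detection part of that chain, a minimal sketch: if each tool discloses a digest alongside every input and output path, a consumer can link steps by digest instead of by name. The function names and the path-to-digest dictionaries are assumptions for this example. Only hashlib.sha1 is a real library call.

```python
import hashlib
from typing import Dict, List, Tuple


def sha1_hex(data: bytes) -> str:
    """SHA-1 digest of a file's contents, as a hex string."""
    return hashlib.sha1(data).hexdigest()


def find_renames(outputs: Dict[str, str],
                 inputs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Match a producer's outputs to a consumer's inputs by digest.

    Both arguments map path -> digest. Returns (produced_as, consumed_as)
    pairs whose contents match but whose paths differ. If two outputs
    share a digest, only the last one recorded is considered.
    """
    by_digest = {digest: path for path, digest in outputs.items()}
    renames = []
    for path, digest in inputs.items():
        produced_as = by_digest.get(digest)
        if produced_as is not None and produced_as != path:
            renames.append((produced_as, path))
    return sorted(renames)
```

An input whose digest matches no disclosed output simply yields no pair, which is where the chain would be considered broken.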

> 
> 2) Where would the configuration disclosure
>   information be output?
> ===========================================
> Personally, I don't care, and I'd be happy to try to follow any request for which there is no obvious contraindication.  But earlier, Doug suggested (in a semi-private email) that each config-disclosure file should be placed beside its associated output file, and that the name simply be something like ${MAIN_OUTPUT_FILE}.config-disclosure.  E.g. for this invocation:
> 
>   clang --disclose-config -c x.cpp -o foo.o
> 
> ...the disclosure file would be placed in 
>    foo.o.config-disclosure
> in the same directory as foo.o (which in this case happens to be the working directory).  (By the way, if anyone is not happy about that filename extension, I'm totally happy to change it.)
> 
> 3) What format would be used in the output?
> ===========================================
> I'm guessing there would be a little flux about this for a while, so XML seems like a natural and universally-grokked choice.  But if people prefer other formats like JSON I'd look at that too. The KISS principle should probably apply though.

I don't have a strong opinion on these details.
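
For reference, the naming rule from (2) amounts to appending a suffix to the main output path. A small sketch, where disclosure_path is a hypothetical helper and the ".config-disclosure" suffix is the one suggested in the thread:

```python
from pathlib import Path

SUFFIX = ".config-disclosure"  # suffix suggested in the thread; may change


def disclosure_path(main_output: str) -> Path:
    """Place the disclosure file beside the main output file."""
    out = Path(main_output)
    # Append to the full file name so "foo.o" becomes "foo.o.config-disclosure".
    return out.with_name(out.name + SUFFIX)
```

For the invocation quoted above, disclosure_path("foo.o") gives foo.o.config-disclosure in the same directory as foo.o.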

	- Doug