[cfe-dev] Feature proposal: Compile Configuration Disclosure

Wed Mar 3 14:44:59 PST 2010

Hi all,

I'm interested in implementing a feature in Clang.  The basic idea is that, given a new command-line switch, like perhaps:
  --disclose-config
...  Clang would output, to a plain text file (which I'll call a "disclosure file"), information about the compilation environment sufficient for a third-party program to observe the same sequence of tokens that Clang saw.

So basically, this dump would let a tool emulate not just the compiler but a *specific run* of the compiler.  Tools based on the Clang libraries could also make use of the information in disclosure files.

If this turns out to be successful with Clang, my next step would be to establish a standard specification and try to encourage other compiler vendors to implement it.

If people see this as a desirable feature and reasonable to implement in Clang, then the next few questions are naturally raised:

    1. What information would be output?
    2. Where would it be output?
    3. What format would be used in the output?

1) What information would be output?
====================================
At minimum, a tool *needs* the following information for each translation unit:

List of Needs:
--------------

    - The complete set of predefined macro definitions;

    - The ordered sequence of directory pathnames used for header-searching in #include "" and #include <>;

    - The ordered sequence of implicit #include directives;

    - The path name of the primary source file;

    - The working directory;

    - Environment variables;

    - The assumed encoding of the source file and the internal encoding used during processing;

    - The name & version of the compiler;

    - argv; and

    - sizes of primitive types (except in the case where the main output is preprocessor output).

So those are all of the "needs" as I see them.  
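
To make the list concrete, here is a minimal sketch of how these items might be grouped into one record per translation unit.  It is written in Python purely for illustration; the class name, the field names and every example value are hypothetical, and none of this is an existing Clang interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TranslationUnitDisclosure:
    """Hypothetical per-TU record; one field per item in the list of needs."""
    predefined_macros: Dict[str, str]   # macro name -> replacement text
    quote_include_dirs: List[str]       # ordered search path for #include ""
    angle_include_dirs: List[str]       # ordered search path for #include <>
    implicit_includes: List[str]        # ordered implicit #include directives
    primary_source_file: str
    working_directory: str
    environment: Dict[str, str]
    source_encoding: str
    internal_encoding: str
    compiler_name: str
    compiler_version: str
    argv: List[str]
    # May stay empty when the main output is preprocessor output.
    type_sizes: Dict[str, int] = field(default_factory=dict)


# Made-up values, only to show the shape of a record.
example = TranslationUnitDisclosure(
    predefined_macros={"__EXAMPLE_MACRO__": "1"},
    quote_include_dirs=["."],
    angle_include_dirs=["/example/include"],
    implicit_includes=[],
    primary_source_file="x.cpp",
    working_directory="/example/project",
    environment={"PATH": "/example/bin"},
    source_encoding="UTF-8",
    internal_encoding="UTF-8",
    compiler_name="clang",
    compiler_version="0.0-example",
    argv=["clang", "--disclose-config", "-c", "x.cpp", "-o", "foo.o"],
    type_sizes={"int": 4},
)
```

Lists rather than sets are used for the search directories and implicit includes because their order is part of the information.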

Here are some "wants":

It would be *very* helpful to be able to determine automatically, for a given program image (or a dynamic library or a static archive), the full set of disclosed configurations (as in the "List of Needs" above) that went into it.  Every tool could then configure itself completely for a given project with minimal involvement from the user.  This could be achieved if each tool in the tool chain, when given the --disclose-config option, also wrote information about its inputs and outputs to a text file.

So, for example, the linker would need only respond to "--disclose-config" by dumping (again, to a secondary file) its environment and the names of its input and output files.  Renames could be detected and accounted for if SHA-1 values of those files were recorded at each step along the way.
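
A minimal sketch of the rename detection, in Python.  It assumes each tool's disclosure can be reduced to a mapping from file name to the SHA-1 of that file's contents; the function names and the sample file names are hypothetical.

```python
import hashlib
from typing import Dict


def sha1_hex(data: bytes) -> str:
    """SHA-1 digest of a file's contents, as a hex string."""
    return hashlib.sha1(data).hexdigest()


def match_renamed_inputs(produced: Dict[str, str],
                         consumed: Dict[str, str]) -> Dict[str, str]:
    """Map each consumed file name to the produced file with the same digest.

    `produced` is what earlier steps (e.g. compiles) said they wrote and
    `consumed` is what a later step (e.g. the link) said it read; both map
    file name -> SHA-1 hex digest.  Files with no match are left out.
    """
    by_digest = {digest: name for name, digest in produced.items()}
    return {name: by_digest[digest]
            for name, digest in consumed.items()
            if digest in by_digest}


# The compiler wrote foo.o; the build system moved it before linking.
object_bytes = b"pretend these are the bytes of foo.o"
produced = {"foo.o": sha1_hex(object_bytes)}
consumed = {"build/renamed.o": sha1_hex(object_bytes),
            "libother.a": sha1_hex(b"something the compiler never wrote")}
renames = match_renamed_inputs(produced, consumed)
```

Here `renames` links build/renamed.o back to foo.o even though the names differ, and libother.a is simply left unmatched.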

Other "wants" include:

    - dialect information for the TU

    - brief info on language extensions

    - SHA of the preprocessing token sequence

Keeping an unbroken chain of SHA values (a la git) could have other uses, like verifying that a specific build has been reproduced exactly (not counting certain things like different expansions of __DATE__ and __TIME__).  This could be very handy when migrating between build systems or when a user is expected to deliver source code along with a program image.
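
A minimal sketch of such a token-sequence hash, in Python.  It assumes the token stream is available as (spelling, originating macro) pairs, so that tokens produced by expanding __DATE__ or __TIME__ can be recognized; that representation and the function name are hypothetical.

```python
import hashlib
from typing import Iterable, Optional, Tuple

# Macros whose expansions are expected to differ between otherwise equal builds.
VOLATILE_MACROS = {"__DATE__", "__TIME__"}


def token_sequence_sha1(tokens: Iterable[Tuple[str, Optional[str]]]) -> str:
    """SHA-1 over a token sequence, ignoring __DATE__/__TIME__ expansions."""
    h = hashlib.sha1()
    for spelling, from_macro in tokens:
        if from_macro in VOLATILE_MACROS:
            # Hash the macro's name in place of its run-specific expansion.
            spelling = from_macro
        data = spelling.encode("utf-8")
        # Length-prefix each token so that token boundaries affect the hash.
        h.update(str(len(data)).encode("ascii") + b":" + data)
    return h.hexdigest()


def make_run(date_text: str):
    """Tokens of `const char *built = __DATE__;` for a given expansion."""
    return [("const", None), ("char", None), ("*", None), ("built", None),
            ("=", None), (date_text, "__DATE__"), (";", None)]


run_a = make_run('"Mar  3 2010"')
run_b = make_run('"Mar  4 2010"')
same_build = token_sequence_sha1(run_a) == token_sequence_sha1(run_b)
```

The two runs differ only in the expansion of __DATE__, so they hash to the same value, while any other change to the token sequence changes the digest.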

2) Where would the configuration disclosure
   information be output?
===========================================
Personally, I don't care, and I'd be happy to try to follow any request for which there is no obvious contraindication.  But earlier, Doug suggested (in a semi-private email) that each config-disclosure file should be placed beside its associated output file, and that the name simply be something like ${MAIN_OUTPUT_FILE}.config-disclosure.  E.g. for this invocation:

   clang --disclose-config -c x.cpp -o foo.o

...the disclosure file would be placed in 
    foo.o.config-disclosure
in the same directory as foo.o (which in this case happens to be the working directory).  (By the way, if anyone is not happy about that filename extension, I'm totally happy to change it.)
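
The naming rule amounts to appending a suffix to the main output path.  A minimal Python sketch follows; the function name is hypothetical, and the suffix is the one suggested above.

```python
import os

DISCLOSURE_SUFFIX = ".config-disclosure"


def disclosure_path(main_output: str) -> str:
    """Path of the disclosure file that sits beside the main output file."""
    return main_output + DISCLOSURE_SUFFIX


beside_foo = disclosure_path("foo.o")
in_subdir = disclosure_path(os.path.join("build", "foo.o"))
```

Appending to the whole path (rather than replacing the ".o" extension) keeps the disclosure file in the same directory as the output and keeps it distinct from the disclosure files of outputs such as foo.s or foo.i.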

3) What format would be used in the output?
===========================================
I'm guessing the format would be in flux for a while, so XML seems like a natural and universally grokked choice.  But if people prefer another format, such as JSON, I'd look at that too.  The KISS principle should probably apply, though.
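
For comparison, here is a minimal Python sketch that writes the same small record as both XML and JSON.  The element names, key names and values are all hypothetical; no schema has been proposed yet.

```python
import json
import xml.etree.ElementTree as ET

# A made-up fragment of a disclosure record.
record = {
    "compiler": {"name": "clang", "version": "0.0-example"},
    "argv": ["clang", "--disclose-config", "-c", "x.cpp", "-o", "foo.o"],
    "include_dirs": ["/example/include", "/example/other"],
}


def to_json(rec: dict) -> str:
    """Serialize the record as JSON; lists keep their order."""
    return json.dumps(rec, indent=2, sort_keys=True)


def to_xml(rec: dict) -> str:
    """Serialize the record as XML; child order carries the sequence."""
    root = ET.Element("config-disclosure")
    ET.SubElement(root, "compiler", dict(rec["compiler"]))
    argv = ET.SubElement(root, "argv")
    for arg in rec["argv"]:
        ET.SubElement(argv, "arg").text = arg
    dirs = ET.SubElement(root, "include-dirs")
    for d in rec["include_dirs"]:
        ET.SubElement(dirs, "dir").text = d
    return ET.tostring(root, encoding="unicode")


json_text = to_json(record)
xml_text = to_xml(record)
```

Either format can represent the ordered sequences (argv, search directories) that the disclosure depends on, so the choice is mostly about what consumers find easiest to parse.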

I have some initial thoughts about how I might start implementing some of the "needs" in Clang, but perhaps I should stop here and wait for feedback.

James Widman  
-- 
Gimpel Software 
http://gimpel.com