[cfe-dev] Feature proposal: Compile Configuration Disclosure
Douglas Gregor
dgregor at apple.com
Sun Mar 7 10:53:29 PST 2010
On Mar 3, 2010, at 8:20 PM, James Widman <widman at gimpel.com> wrote:
>
> On Mar 3, 2010, at 7:05 PM, Douglas Gregor wrote:
>
>>
>> On Mar 3, 2010, at 2:44 PM, James Widman wrote:
>>
>>> Hi all,
>>>
>>> I'm interested in implementing a feature in Clang. The basic idea
>>> is that, given a new command-line switch, like perhaps:
>>> --disclose-config
>>> ... Clang would output, to a plain text file (which I'll call a
>>> "disclosure file"), information about the compilation environment
>>> sufficient for a third-party program to observe the same sequence
>>> of tokens that Clang saw.
>>>
>>> So basically, providing this dump enables one to perfectly emulate
>>> not just the compiler but a *specific run* of the compiler. Note
>>> that tools based on Clang libraries could make use of information
>>> in disclosure files.
>>
>> I assume we would also need the reverse operation, taking a
>> disclosure file and setting Clang's internal options to match the
>> configuration described?
>
> My motives were so self-serving that this did not occur to me. (:
>
> I assumed that the driver of a 3rd-party tool would have its own
> disclosure-file-parser and set things up appropriately.
>
> But for completeness, direct support for it in Clang would seem to
> make sense.
Yes. I'd hang it off CompilerInvocation, which encapsulates a...
compiler invocation.
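
To make the dump/load round trip concrete, here is a minimal sketch in Python (not Clang's C++, and not CompilerInvocation's actual interface). Everything in it is an assumption for illustration: the `Disclosure` class, its field names, and the choice of JSON as the plain-text format are hypothetical. The fields are drawn from the "List of Needs" quoted further down, and predefined macros and implicit includes share one ordered list because their relative order can matter.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Disclosure:
    """Hypothetical in-memory form of one translation unit's disclosure."""
    compiler: str
    working_directory: str
    primary_source: str
    argv: list = field(default_factory=list)
    include_dirs: list = field(default_factory=list)  # ordered search path
    # Ordered prelude entries: ("define", "NAME=VALUE") or ("include", "path").
    # One list, so the interleaving of macros and implicit includes survives.
    prelude: list = field(default_factory=list)


def dump(d: Disclosure) -> str:
    """Serialize a disclosure to plain text (JSON is an assumed format)."""
    return json.dumps(asdict(d), indent=2, sort_keys=True)


def load(text: str) -> Disclosure:
    """Parse text produced by dump() back into a Disclosure."""
    raw = json.loads(text)
    # JSON has no tuples, so restore them for the prelude entries.
    raw["prelude"] = [tuple(entry) for entry in raw["prelude"]]
    return Disclosure(**raw)
```

A tool that can both dump and load this way can check itself: loading what it just dumped should give back an equal configuration.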
>>> If this turns out to be successful with Clang, my next step would
>>> be to establish a standard specification and try to encourage
>>> other compiler vendors to implement it.
>>
>> If successful, one should be able to configure compiler A to build
>> things like compiler B, just by dumping the disclosure file from
>> one and importing it into the other, no?
>
> Right.
>
>>> If people see this as a desirable feature and reasonable to
>>> implement in Clang,
>>
>> Seems reasonable to me.
>
> Glad you think so!
>
> I'll ask about implementation details in a separate thread.
>
>>> then the next few questions are naturally raised:
>>>
>>> 1. What information would be output?
>>> 2. Where would it be output?
>>> 3. What format would be used in the output?
>>>
>>> 1) What information would be output?
>>> ====================================
>>> At minimum, a tool *needs* the following information for each
>>> translation unit:
>>>
>>> List of Needs:
>>> --------------
>>>
>>> - The complete set of predefined macro definitions;
>>>
>>> - The ordered sequence of directory pathnames used for
>>> header-searching in #include "" and #include <>;
>>>
>>> - The ordered sequence of implicit #include directives;
>>
>> The ordering between predefined macro defs and implicit includes
>> (e.g., based on -include on the command line) can matter.
>
> Important tip; thanks!
>
>>> - The path name of the primary source file;
>>>
>>> - The working directory;
>>>
>>> - Environment variables;
>>
>> How would these affect translation?
>
> Well, some compilers depend on environment variables. E.g. MSVC
> uses the variable INCLUDE, which names directories containing
> headers that ship with the compiler. And GCC uses several
> environment variables; see the section 'ENVIRONMENT' in the GCC
> manual.
>
> Granted, INCLUDE would already be covered in the second bullet point
> above. But it's always possible that the environment will contain
> something that alters a compiler's behavior in some unexpected way.
The environment is harder to control, which is unfortunate. I wonder
if all of the environment variables have corresponding flags?
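
One way to keep the environment part manageable is to disclose only variables known to influence a compiler. The sketch below is Python, with hypothetical names (`COMPILER_ENV_VARS`, `disclosed_environment`); the variable list combines INCLUDE (mentioned above for MSVC) with a few entries from the GCC manual's environment section and is illustrative, not exhaustive.

```python
import os

# Illustrative allowlist only; a real tool would need a fuller list.
COMPILER_ENV_VARS = (
    "INCLUDE",             # MSVC: header directories
    "CPATH",               # GCC: extra include directories, all languages
    "C_INCLUDE_PATH",      # GCC: extra include directories for C
    "CPLUS_INCLUDE_PATH",  # GCC: extra include directories for C++
    "LIBRARY_PATH",        # GCC: linker search directories
)


def disclosed_environment(environ=None, names=COMPILER_ENV_VARS):
    """Return the variables from `names` that are actually set in `environ`."""
    if environ is None:
        environ = os.environ
    return {name: environ[name] for name in names if name in environ}
```

The trade-off is the one raised in the quoted text: an allowlist misses any variable that changes behavior unexpectedly, whereas dumping the whole environment catches it at the cost of a noisier file.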
> The whole point of this proposal is to bring to light essential (or
> potentially essential) details that have traditionally caused grief
> as a result of being hidden. So if a compiler relies on the
> environment then the environment is just another component of the
> configuration; therefore it's necessary in order to reproduce the
> compiler's behavior (and therefore appropriate to reveal in a
> config-disclosure file).
>
>>> - The assumed encoding of the source file and the internal
>>> encoding used during processing;
>>>
>>> - The name & version of the compiler;
>>>
>>> - argv; and
>>>
>>> - sizes of primitive types (except in the case where the main
>>> output is preprocessor output).
>>>
>>> So those are all of the "needs" as I see them.
>>>
>>> Here are some "wants":
>>>
>>> It would be *very* helpful to automatically determine, for a given
>>> program image (or a dynamic library or a static archive), the full
>>> set of disclosed configurations (as in the "List of Needs" above),
>>> because then every tool would be able to completely configure
>>> itself for a given project with minimal involvement from the user.
>>
>> *That* is going to be tricky, because it's a lot of information to
>> encode in each object file/library/executable.
>
> Ouch! I see that I gave the wrong idea about this.
>
> I really need to be careful in expressing this point: I do *not*
> propose to encode *anything* within object files! I only desire for
> each tool in the tool chain to dump,
> >>> into a *new*, *separate*, plain-text file, <<<
> information that it already knows about (like names of input files
> and names of output files).
>
> Of course, if files get renamed (or moved across a network), that
> would make it harder to piece things back together. Hence the
> additional "want" about SHA values. Then a tool can just invoke:
>
> find some-dir -iname '*.config-disclosure'
>
> ... and with that *alone*, it could see how every stage of the build
> is connected to every other stage.
Ah, okay.
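
A rough sketch of how a tool might piece the stages together from the separate files alone, written in Python. The on-disk layout is an assumption, since no format has been settled: each disclosure file is taken to be JSON of the form {"tool": ..., "inputs": {name: sha1}, "outputs": {name: sha1}}, and all function names are hypothetical. Matching on SHA-1 rather than on file name is what lets a renamed or moved file still be traced to the stage that produced it.

```python
import hashlib
import json
from pathlib import Path


def sha1_of(path):
    """Hex SHA-1 of a file's contents."""
    return hashlib.sha1(Path(path).read_bytes()).hexdigest()


def find_disclosures(root):
    """Rough counterpart of: find root -iname '*.config-disclosure'."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.lower().endswith(".config-disclosure")
    )


def link_stages(disclosures):
    """Connect each disclosed output to the stages that consume it.

    Returns (producer, consumer) pairs, each side being (tool, file name).
    """
    records = [json.loads(Path(p).read_text()) for p in disclosures]
    # Map each output digest to the stage and file name that produced it.
    producers = {}
    for rec in records:
        for name, digest in rec["outputs"].items():
            producers[digest] = (rec["tool"], name)
    # An input whose digest matches a known output forms an edge.
    edges = []
    for rec in records:
        for name, digest in rec["inputs"].items():
            if digest in producers:
                edges.append((producers[digest], (rec["tool"], name)))
    return edges
```

Calling `link_stages(find_disclosures("some-dir"))` would then yield the edges of the build graph for everything disclosed under that directory.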
>> Better to start with the tools dumping/loading their configurations.
>
> Right.
>
>>> If each tool in the tool chain, when given the --disclose-config
>>> option, also sent info about its inputs and outputs to a text
>>> file, then this could be achieved.
>>
>> Sure.
>>
>>> So for example, the linker would only need to respond to
>>> "--disclose-config" by dumping (again, to a secondary file) its
>>> environment and the names of input files & output files. Renames
>>> could be detected and accounted for if SHA-1 values were passed
>>> along the way.
>>>
>>> Other "wants" include:
>>>
>>> - dialect information for the TU
>>>
>>> - brief info on language extensions
>>>
>>> - SHA of the preprocessing token sequence
>>>
>>> Keeping an unbroken chain of SHA values (a la git) could have
>>> other uses, like verifying that a specific build has been
>>> reproduced exactly (not counting certain things like different
>>> expansions of __DATE__ and __TIME__). This could be very handy
>>> when migrating between build systems or when a user is expected to
>>> deliver source code along with a program image.
>>
>> This feels like you're moving toward a full model of the steps it
>> takes to compile a "project" (as an IDE or make system would know
>> about).
>
> Yes; that's where I would like things to go.
>
> James Widman
> --
> Gimpel Software
> http://gimpel.com
>
>
>