r241620 - Wrap clang modules and pch files in an object file container.

Tue Jul 14 10:13:03 PDT 2015

On Tue, Jul 14, 2015 at 9:55 AM, Adrian Prantl <aprantl at apple.com> wrote:

>
> On Jul 14, 2015, at 8:25 AM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
>
> On Mon, Jul 13, 2015 at 7:25 PM, Richard Smith <richard at metafoo.co.uk>
> wrote:
>
>> On Mon, Jul 13, 2015 at 6:02 PM, Adrian Prantl <aprantl at apple.com> wrote:
>>
>>>
>>> On Jul 13, 2015, at 5:47 PM, Richard Smith <richard at metafoo.co.uk>
>>> wrote:
>>>
>>> On Mon, Jul 13, 2015 at 3:06 PM, Adrian Prantl <aprantl at apple.com>
>>> wrote:
>>>
>>>> > On Jul 13, 2015, at 2:00 PM, Eric Christopher <echristo at gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi Adrian,
>>>> >
>>>> > Finally getting around to looking at some of this and I think it's
>>>> going in slightly the wrong direction. In general I think being -able- to
>>>> put modules in object files to simplify wrapping, use, etc is a good thing.
>>>> I think being required to do so is somewhat problematic.
>>>> >
>>>>
>>>> Let me start by noting that the current infrastructure already allows
>>>> selecting whether you want wrapped modules or not by passing the
>>>> appropriate PCHContainerOperations object to CompilerInstance. Clang
>>>> currently uses an object file wrapper unconditionally; none of
>>>> clang-tools-extra does. We could easily control the behavior of clang
>>>> based on a (new) command line option.
>>>>
>>>> But.. on a platform with a shared module cache you always have to
>>>> assume that a module once built will eventually be used by a client that
>>>> wants to read the debug info. Think llvm-dsymutil — it does not know and
>>>> does not want to know how to build clang modules, but does want to read all
>>>> the debug info from a clang module.
>>>>
>>>> > Imagine, for example, you have a giant distributed build system...
>>>> >
>>>> > You'd want to create a pile of modules (that may
>>>> reference/include/etc other modules) that don't or may not have
>>>> debug information as part of them (because you might want to build without
>>>> it or have the debug info alongside it as a separate compilation). Waiting
>>>> on the full build of the module including debug is going to adversely
>>>> affect your overall build time and so shouldn't be necessary - especially
>>>> if you want to be able to have information separate ultimately.
>>>> >
>>>> > Make sense?
>>>>
>>>> Not sure if you would be saving much by having the debug info
>>>> separately; from what I’ve measured so far the debug info for a module
>>>> makes up less than 10% of the total size. Admittedly, build-time-wise going
>>>> through the backend to emit the object file is a lot more expensive than
>>>> just dumping the raw PCH. [1]
>>>>
>>>> Yeah, I think wanting to be able to control the behavior is reasonable,
>>>> we just need to be careful what the implications for consumers are. If we
>>>> add, e.g., an “-fraw-modules” [2] switch to clang to turn off the
>>>> object file wrapping, I’d strongly suggest that we add the value of this
>>>> switch to the module hash (or add an optional “-g” to the module file
>>>> name after the hash or something like that) to avoid ugly race conditions
>>>> between debug info and non-debug-info builds of the same module. This way
>>>> we’d have essentially two separate module caches, with and without debug
>>>> info.
>>>>
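To make the "two separate module caches" idea concrete, here is a minimal
sketch in Python. It is not Clang's actual hashing code: the function name
module_cache_name, the flag list, and the ".pcm" naming scheme are all
assumptions for illustration. The only point is that once the module format
is mixed into the hash, a raw build and an object-wrapped build of the same
module can never land on the same cache file.

```python
import hashlib


def module_cache_name(module_name, flags, module_format):
    """Return a hypothetical cache file name for a built module.

    The module format ("raw" or "obj") is hashed together with the other
    flags, so the two formats always map to different file names.
    """
    if module_format not in ("raw", "obj"):
        raise ValueError("unknown module format: " + module_format)

    h = hashlib.sha256()
    for flag in flags:
        # A NUL separator keeps ["-a", "b"] distinct from ["-ab"].
        h.update(flag.encode("utf-8"))
        h.update(b"\0")
    # Assumption: the format is treated like any other hashed option.
    h.update(b"format=" + module_format.encode("utf-8"))
    return "{}-{}.pcm".format(module_name, h.hexdigest()[:16])
```

With the same module name and flags, module_cache_name(..., "raw") and
module_cache_name(..., "obj") differ only in the hash part, so a debug-info
consumer and a non-debug-info build never race on one file.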
>>>
>>> That's fine, I think (we don't use a module cache at all in our build
>>> system; it doesn't really make much sense for a distributed build) and most
>>> command-line flag changes already have this effect.
>>>
>>>
>>> Great!
>>>
>>>
>>>
>>>> would that work for you?
>>>> -- adrian
>>>>
>>>> [1] If you want to be serious about building the module debug info in
>>>> parallel to the rest of the build, you could even have a clang-based tool
>>>> import the just-built raw clang module and emit the debug info without
>>>> having to parse the headers again :-)
>>>>
>>>
>>> That is what we intend to do :) (Assuming this turns out to actually be
>>> faster than re-parsing; faulting in the entire contents of a module has
>>> much worse locality than parsing.)
>>>
>>> [2] -fraw-modules, -fmodule-format-raw, -fmodule-debug-info, ...?
>>>>     I would imagine that the driver enables module debug info when
>>>> "-gmodules” is present and by default on Darwin.
>>>
>>>
>>> That seems reasonable to me. For the frontend flag, I think a flag to
>>> turn this on or to select the module format makes more sense than a flag to
>>> switch to the raw format.
>>>
>>>
>>> Okay then let’s narrow this down. Other possibilities in that direction
>>> include (sorted from subjectively best to worst)
>>>
>>> -fmodule-format=obj
>>> -fmodule-debug-info
>>> -ffat-modules
>>> -fmodule-container
>>> -fmodule-container-object
>>>
>>
>> It's a -cc1 flag, so it doesn't really matter much. If this will
>> eventually govern whether we put code for inline functions into the module,
>> then I think we should avoid names like -fmodule-debug-info. Other than
>> that, I don't really have a preference.
>>
>
>
> Unless the “=“ part turns out to be an implementation nightmare, I think
> I’ll be going with -fmodule-format=[raw,obj] then and implicitly emit debug
> info in the obj case. If necessary, we can make this more fine grained
> later.
>
> What you're picturing there is essentially a flag that would indicate if
> we should build all module-related-object-things into the module, or not?
> That seems like a useful broad flag (with an eventual corresponding
> compiler mode where we pass another flag and explicitly pass just the
> module and say "build a separate object with all the
> module-related-object-things" - for use in a non-implicit-cache build).
>
> (Hmm, we're going to have a weird middle ground in here - where the IR for
> the inline functions needs to go in the module itself (as an
> available_externally definition for use in non-LTO compilations of
> dependent object files) and then the
> build-separate-module-related-object-things would turn those into (weak?)
> definitions, compile them (& the debug info) into a separate object file,
> to be linked in at the end)
>
>
> Can you elaborate on this use-case?
>

The use cases I'm referring to here, which have often been bandied about,
are:

1) including inline function IR in the module to be used by each
compilation that depends on the module - so each inline function doesn't
have to be IRGen'd in every /use/ of a module, just once when the module is
built

2) including a single definition of the actual machine code for inline
functions from modules and link that into the final program (so that the
functions in (1) can be available_externally, used for inlining
opportunities during compilation, but never generate machine code in the
object files that depend on the module)
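
As a rough back-of-the-envelope model of what (1) and (2) would save, here
is a small Python sketch. It is a toy count, not a measurement: the names
InlineFunctionWork and inline_function_work are made up for illustration,
and it assumes every dependent compilation actually uses the inline
function and would otherwise both IRGen it and emit machine code for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InlineFunctionWork:
    irgen_runs: int    # times IR is generated for the inline function
    codegen_runs: int  # times machine code is emitted for it


def inline_function_work(num_users, ir_in_module=False,
                         code_in_module_object=False):
    """Count IRGen and codegen runs for one inline function.

    num_users: number of compilations that depend on the module.
    ir_in_module: use case (1), IR is stored in the module once.
    code_in_module_object: use case (2), machine code is emitted once
        into a separate object that is linked into the final program.
    """
    if num_users < 0:
        raise ValueError("num_users must be non-negative")

    # Building the IR once is needed if the module carries it, or if a
    # separate object is compiled from the module.
    module_irgen = 1 if (ir_in_module or code_in_module_object) else 0
    # Without (1), every user has to IRGen the function itself.
    user_irgen = 0 if ir_in_module else num_users
    # Without (2), every user emits its own copy of the machine code.
    codegen = 1 if code_in_module_object else num_users
    return InlineFunctionWork(module_irgen + user_irgen, codegen)
```

For 10 dependent compilations this gives 10 IRGen and 10 codegen runs today,
1 and 10 with only (1), and 1 and 1 with both (1) and (2).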

> Are you saying you’d want a module object file with ast+bitcode and
> another one with bitcode'+debug info built from the first one? Or one raw
> ast file and two object files?
>

What I'm picturing is ast+bitcode in one, and object code (plus debug info,
if it's a debug build) in the other.

>
>
> Should this just be keyed/defaulted off implicit/explicit modules, or
> orthogonal to that choice?
>
>> [One other thing... I think we may have made a mistake by putting the
>>> reader and writer code behind the same interface: it forces tools that want
>>> to read the module format to link against all of LLVM IR, code generation,
>>> and so on, when all they really need is something like libObject.]
>>>
>>>
>>> We can always split it into two implementations of the interface or two
>>> interfaces, that’s not a very big deal. My assumption was that every tool
>>> that wants to read the clang module format also wants to create modules
>>> (because module cache... but as you noted that’s a Darwin-centric view) and
>>> more low-level tools like llvm-bcanalyzer could be piped through
>>> llvm-objdump.
>>>
>>
> -- adrian
>
>