[cfe-dev] [Modules TS] Have the file formats been decided?

Tue Jan 17 15:10:02 PST 2017

On 17 January 2017 at 14:45, Hamza Sood <hamza_sood at me.com> wrote:

> Thanks for clarifying parts of the current implementation. I wasn’t sure
> what’s incomplete and what’s by design.
>
> > Rather than producing two ASTs, it would be preferable to simply export
> > less of the AST into the pcm file. As noted above, this optimization is not
> > yet implemented (along with some of the semantics of the Modules TS).
>
> Since it’s currently possible to generate a complete object file from a
> pcm, I assumed that such an optimisation wouldn't be possible with the
> current format. In fact a fully optimised pcm is pretty much what I was
> trying to describe here, but I wasn’t sure if being able to go from pcm ->
> obj is an essential part of what a pcm is.
>

Our pcm format is not immutable; we are free to make such changes if
necessary. One thing that might not be immediately obvious: in a highly
parallel build, it can be beneficial to avoid blocking downstream compiles
on the step that generates object code from a module interface. That is, we
may want to generate a .pcm file without generating object code, and then
later generate the object code from it, to improve build performance. This
doesn't necessarily mean that the .pcm file must contain all function
definitions -- we could generate the object file for the module interface
by re-parsing the .cppm file -- but there's a tradeoff between parallelism
and total CPU time in doing so.

> Just writing less of the AST to a file is certainly better than producing
> two ASTs, and I attempted that with my original tests. However I wasn’t
> able to find anything in ASTWriter that lets you control which parts of
> the AST are written; all I could get working is producing a second AST from
> the original (with modifications of course) and passing that through to
> ASTWriter. Is there an API that I missed?

We have no real support for this yet, but it doesn't seem especially hard
to add the ability to filter during AST emission. The interesting part will
be determining what can be safely filtered out. Example: an exported
template makes a call to a function with unqualified name 'foo'; can we
still discard any non-exported functions named 'foo' in the module
interface? Those functions might be found by ADL.
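
A minimal sketch of the ADL concern, written as ordinary C++ with a namespace
standing in for the module interface. The names lib, Widget, Gadget, describe,
and the helper functions are made up for illustration, and the "exported" /
"not exported" labels are only comments, not Modules TS syntax:

```cpp
#include <cassert>
#include <string>

namespace lib {  // stands in for a hypothetical module interface

struct Widget { int id; };

// Imagine this overload is NOT exported from the module.
inline std::string foo(const Widget&) { return "lib::foo(Widget)"; }

// Imagine this template IS exported. The call to foo is unqualified and
// depends on T, so argument-dependent lookup runs at instantiation time.
template <typename T>
std::string describe(const T& value) { return foo(value); }

}  // namespace lib

namespace user {

struct Gadget {};

// Found only through ADL on Gadget; lib::describe never names it.
inline std::string foo(const Gadget&) { return "user::foo(Gadget)"; }

}  // namespace user

// Instantiating with a lib type still needs the non-exported lib::foo.
std::string describe_widget() { return lib::describe(lib::Widget{7}); }

// Instantiating with a user type picks up user::foo through ADL.
std::string describe_gadget() { return lib::describe(user::Gadget{}); }

// Small self-check; call it from main() to exercise both cases.
void check_adl_examples() {
    assert(describe_widget() == "lib::foo(Widget)");
    assert(describe_gadget() == "user::foo(Gadget)");
}
```

If the serialized module dropped lib::foo because it is not exported, the
describe_widget case would have nothing left to call.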

Also note that this affects linkage: even internal, non-exported functions
in the module interface might be called that way, and if so, we need some
way to link the symbol references in those template instantiations to the
code we emitted for the module interface.

> > Clang's module files are explicitly not a distribution format. You are
> > expected to ship your module interface files, not a precompiled form of
> > them.
>
> Would library developers want to ship their module interfaces considering
> they could potentially contain a lot of code?
> Microsoft for example have come up with a distributable binary format so
> that library developers don’t have to ship their module interface files.

Considering that the module interface can, and often will, contain code
that is in some way conditional on the environment (for instance, on the
size of 'int', or on whether certain headers or functions are provided by
the environment, or on certain details of their standard library
implementation -- and so on), it is not clear that Microsoft's approach is
feasible for a non-single-vendor environment. Even trivial concerns, such as
whether assert(X) is enabled in an inline function or template in a module
interface, require different precompiled module interfaces for the same
.cppm file. At this
point, the idea of a redistributable binary module interface format seems
misguided, but we'll have to see how usage patterns develop and whether
they ever start to make sense.
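
To make the environment dependence concrete, here is a small sketch of the kind
of code that could appear in a module interface. The names checked_index,
counter_t, and asserts_enabled are hypothetical, and this is plain C++ rather
than Modules TS syntax:

```cpp
#include <cassert>
#include <climits>

// Hypothetical inline function that might live in a module interface.
// With NDEBUG defined, assert expands to nothing, so the function body
// differs between builds that enable assertions and builds that do not.
inline int checked_index(int i, int size) {
    assert(i >= 0 && i < size);
    return i;
}

// A declaration whose meaning depends on the range of 'int' on the target.
#if INT_MAX > 32767
using counter_t = int;
#else
using counter_t = long;
#endif

// Reports which of the two assert configurations this translation unit saw.
constexpr bool asserts_enabled() {
#ifdef NDEBUG
    return false;
#else
    return true;
#endif
}
```

Each combination of NDEBUG and target 'int' range gives a different set of
declarations and bodies from the same source text, so a single precompiled
form could not stand in for all of them.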

> > Even with modules, large codebases will still want to maintain an
> > interface / implementation separation discipline, in order to avoid every
> > change to a low-level library's implementation triggering unnecessary
> > recompilation of dependent code. (Keep in mind that a change that affects
> > line numbers in a low-level library could affect the debug information
> > generated for any transitive dependency, so we can't necessarily bail out
> > of the compilation if the abstract interface of the module is unchanged.)
>
> That brings up the question of how a module based build system would look,
> which I don’t think I’ve seen mentioned anywhere. Should the compiler be in
> charge by seeking out imported modules based on search paths and
> automatically building them if needed? Or should it be more like the
> dependency file generation that occurs with headers, which leaves a tool
> such as GNU make in charge?

Historically, Clang's approach has been to provide a mode that requires no
changes to build systems, in order to make the transition to modules, and
sharing code between a modules build and a non-modules build,
straightforward. But that introduces many problems (particularly with
parallel and distributed builds), and with the Modules TS we are already
making a break with the past, so we should simply treat the act of building
a module as a first-class action performed by a build. The compiler should
not become a build system.

This does mean that build systems will need to track interface dependencies
in a way they didn't before (you need to know which module interfaces
should be built before which other module interfaces), and that information
will either need to be provided or detected by the build system. If a build
system wishes to automate this, it would not be dissimilar to the #include
scanning that some existing build systems already perform.
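
As a rough sketch of what "which module interfaces should be built before which
others" amounts to, a build system could order interfaces with a topological
sort over the import graph. The ImportGraph type and build_order function below
are hypothetical, not part of Clang or of any existing build tool, and the
import information is assumed to have been provided or scanned already:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Maps each module name to the names of the modules it imports.
using ImportGraph = std::map<std::string, std::vector<std::string>>;

// Kahn's algorithm: returns an order in which every module appears after
// all of the modules it imports, or an empty vector on an import cycle.
std::vector<std::string> build_order(const ImportGraph& imports) {
    std::map<std::string, std::size_t> pending;  // imports not yet built
    std::map<std::string, std::vector<std::string>> importers;

    for (const auto& entry : imports) {
        pending[entry.first];  // make sure every module has an entry
        for (const auto& dep : entry.second) {
            pending[dep];      // modules that only appear as imports
            ++pending[entry.first];
            importers[dep].push_back(entry.first);
        }
    }

    std::queue<std::string> ready;
    for (const auto& entry : pending)
        if (entry.second == 0) ready.push(entry.first);

    std::vector<std::string> order;
    while (!ready.empty()) {
        std::string mod = ready.front();
        ready.pop();
        order.push_back(mod);
        // Building 'mod' unblocks every module that was waiting on it.
        for (const auto& user : importers[mod])
            if (--pending[user] == 0) ready.push(user);
    }

    if (order.size() != pending.size()) order.clear();  // import cycle
    return order;
}
```

Modules that are in the ready queue at the same time have no ordering
constraint between them, so a parallel build could compile their interfaces
concurrently.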