[cfe-dev] [RFC] C++20 modules dependency discovery

Tue Aug 13 13:33:37 PDT 2019

This is likely going to be a bit weird since I just subscribed and don't
have the original email(s) to reply to, so apologies if my
reconstruction is incorrect.

On Mon, Aug 12, 2019 at 18:37:05 PDT, Michael Spencer wrote:
> For explicit modules we only need to know the direct dependencies, as the
> build system will handle the transitive set.

Correct, though `import` declarations found in `#include`d files still
need to be reported.

> For preprocessing we still need to import header units (but only their
> preprocessor state), but not normal modules.  For this case it’s ok if `-E
> -MD` fails to find a module.  But it does still need to be able to find
> header units and module maps.  Additionally the normal Make output syntax
> is not sufficient to represent the needed information unless the driver
> decides how modules and header units should be built and where intermediate
> files should go.  There’s currently a json format working its way through
> the tooling subgroup of the standards committee that I think we should
> adopt for this.
> 
> I think we need separate modes in clang for these along with support for
> scanning through header units without actually building a clang module for
> them. clang-scan-deps will make use of the explicit mode.  The question I
> have is how should we select this mode, and what clang options do we need
> to add?
> 
> Proposal
> ========
> 
> As a rough idea I propose the following:
> 
> * `-M?` means output the json format which can correctly represent
> dependencies on a module for which we don’t know what the final file path
> will be.

[ I'm the author of the paper specifying the mentioned format. ]

For my GCC patch, I've spelled the flags for the output in the following
way:

  - `-fdep-format=trtbd`: Selects the output format. Needed so that old
    format versions can still be requested (the "trtbd" part is in
    search of a much better name :) ).
  - `-fdep-output=<PATH>`: The path that will be passed to the `-o` flag
    when compiling the TU being scanned. This is needed to hook up which
    scan result goes with which compilation rule (it can't be associated
    with the source because a single source path may be compiled
    multiple times within a build; the output object file does need to
    be unique however).
  - `-fdep-file=<PATH>`: Where to write the output for the format.

I avoided the `-M` flag family because that means "make". This is not
make syntax, so it doesn't belong there. In addition, the existing `-M`
flags are still useful because the "should I rerun this rule" logic for
the scan step itself can be satisfied with the `-M` flags here.
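
As a rough sketch of how a build tool might assemble a scan command from
these flags (Python; `scan_command`, the `.ddi` suffixes, and the paths
are hypothetical, and the `-fdep-*` spellings are just the ones listed
above, which may well change):

```python
def scan_command(compiler, source, obj, flags=()):
    """Build a scan command line for one translation unit.

    `obj` is the path the real compile will pass to `-o`; it is what
    ties a scan result back to its compile rule.
    """
    return [
        compiler, *flags,
        "-E", source,
        "-o", obj + ".ddi.i",
        # Make-style depfile so the scan rule itself knows when to rerun.
        "-MD", "-MF", obj + ".ddi.d",
        # Module dependency output (flag names as spelled above).
        "-fdep-format=trtbd",
        "-fdep-output=" + obj,
        "-fdep-file=" + obj + ".ddi",
    ]

# The same source compiled twice yields two distinct scans, keyed by object.
debug = scan_command("g++", "a.cpp", "debug/a.o", ["-DDEBUG"])
release = scan_command("g++", "a.cpp", "release/a.o")
```

Keying everything on the object path rather than the source path is the
point: `a.cpp` appears in both commands, but the outputs never collide.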

> * `clang++ -std=c++20 -E -MD -fimplicit-header-units` should implicitly
> find header unit sources, but not modules (as we've not given it any way to
> look up how to build modules).
>     * This means that the dep file will contain a bunch of `.h`s,
> `.modulemap`s, and any `.pcm`s explicitly listed on the command line.
>     * This also means erroring on unknown imported modules as we don't know
> what to put in the dep file for them.

Sounds reasonable. Matching GCC's output for them might be a viable
option, but that is going to make non-make parsers of the `.d` files
choke (since that output involves appending to make variables).

> * `clang++ -std=c++20 -E -MD -fimplicit-header-units
> -fimplicit-module-lookup=?`  should do the same as the above, except that
> it does know how to find modules, and should list all of the transitive
> dependencies of any modules it finds.
> * `clang++ -std=c++20 -E -MD` should fail if it hits a module or header
> unit, and should never do implicit lookup.
> * `clang++ -std=c++20 -E -M?` should scan through header units without
> actually building clang modules for them (to get the macros it needs), and
> should note all module imports.
>     * This means that the dep file will contain only `.h`s that it
> includes, and use the json representation of header units and modules.
>     * It will also be shallow, with only direct dependencies.

Sounds good.

> Additionally, we should (eventually) make:
> 
> `$ clang++ -std=c++20 a.cpp b.cpp c.cpp a.cppm -o program`
> 
> Work without a build system, even in the presence of modules.  To do this
> we will need to prescan the files to determine the module dependencies
> between them and then build them in dependency order.  This does mean
> adding a (simple) build system to the driver (maybe [llbuild](
> https://github.com/apple/swift-llbuild)?), but I think it’s worth it to
> make simple cases simple.  It may also make sense to actually push this
> work out to a real build system.  For example have clang write a temporary
> ninja file and invoke ninja to perform the build.

This sounds like what a Meson developer is expecting in this blog post:

    https://nibblestew.blogspot.com/2019/08/building-c-modules-take-n1.html

I don't know how well their compilation model can be forced into the
"simple" case that would be provided here. I'm also not sure how much a
nested ninja would be appreciated (there's no notion of a jobserver for
ninja-under-ninja to propagate things like `-l` or `-j` flags down).
Pool information may also be useful there. There is a patchset for
ninja-under-make to obey jobserver information, but that doesn't help
Meson at all.
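
For reference, the prescan-then-build-in-order step in the quoted
proposal amounts to a topological sort over the scan results. A minimal
sketch in Python, assuming the scan has already produced two
hypothetical maps (`provides` and `imports`); this is not what any
driver actually does today:

```python
from graphlib import TopologicalSorter

def build_order(provides, imports):
    """Order sources so every module is built before its importers.

    provides: source -> module name it exports (or None)
    imports:  source -> list of module names it imports
    """
    provider = {mod: src for src, mod in provides.items() if mod is not None}
    graph = {}
    for src in provides:
        deps = set()
        for mod in imports.get(src, []):
            if mod not in provider:
                raise LookupError(f"{src}: no source provides module {mod!r}")
            deps.add(provider[mod])
        graph[src] = deps
    # static_order() raises graphlib.CycleError on circular imports.
    return list(TopologicalSorter(graph).static_order())

order = build_order(
    provides={"a.cppm": "a", "a.cpp": None, "b.cpp": None, "c.cpp": None},
    imports={"a.cpp": ["a"], "b.cpp": ["a"]},
)
```

Here `a.cppm` always lands before `a.cpp` and `b.cpp`; `c.cpp` imports
nothing and may appear anywhere in the order.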

On Tue, Aug 13, 2019 at 02:08:42 PDT, Michael Spencer wrote:
> On Tue, Aug 13, 2019 at  01:52:46 PDT, Finkel, Hal J. wrote:
> > I don't object to supporting the json format, but are there defaults
> > that would make sense? Maybe using the preprocessor state implied by
> > the current command-line options and putting intermediate files /
> > interface files in the current directory, or in
> > TMDIR/.clang/<hash of path>, or something else? We'd need defaults
> > for your `-M?` below anyway?

I think that defaults for `-M?` (or the `-fdep-*` flags) are unnecessary.
The flags are only really meaningful to a build system sophisticated
enough to understand module dependencies anyways, so just requiring at
least `-fdep-format=` and `-fdep-file=` to be set sounds OK to me
(`-fdep-output=` being unset means the build tool knows what it's
doing, I guess). I suppose `-fdep-file=` could have a default too, but
that sounds like a build system being too trusting of cross-version
compatibility to me.

> The json format doesn't include pcm paths.

It doesn't require them, but there is a slot for the scan tool to say
something. In CMake's implementation, I take the filename of the pcm
path placed there, but relocate it to a target-specific directory. If it
is missing, I create my own filepath based on the logical name of the
module. This is communicated to the actual build by creating a file for
GCC's module mapper to locate it (which is used for import and export
locations). If clang wants a response file, that can be done too (with
the flag just being spelled as `@` instead of `-fmodule-mapper=`).
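
Roughly, the path selection described above looks like the following
Python sketch. The function names, the `.bmi` extension, the `:`
replacement, and the one-`<module> <path>`-pair-per-line mapper layout
are all assumptions made for illustration, not a description of CMake's
or GCC's actual behavior:

```python
import posixpath

def bmi_path(logical_name, target_dir, scanner_hint=None):
    """Pick where a module's BMI lives inside a target-specific directory."""
    if scanner_hint:
        # Keep only the filename the scanner suggested; relocate it.
        filename = posixpath.basename(scanner_hint)
    else:
        # No hint: derive a name from the logical module name.
        # ':' (partitions) is replaced since it is awkward in paths;
        # the '.bmi' extension is an arbitrary choice for this sketch.
        filename = logical_name.replace(":", "-") + ".bmi"
    return posixpath.join(target_dir, filename)

def mapper_file(modules, target_dir):
    """Render one '<module-name> <path>' line per module (assumed layout)."""
    lines = [
        f"{name} {bmi_path(name, target_dir, hint)}"
        for name, hint in sorted(modules.items())
    ]
    return "\n".join(lines) + "\n"

text = mapper_file(
    {"foo": "/tmp/scan/foo.pcm", "foo:part": None},
    "build/tgt.dir",
)
```

The same file can serve both import and export lookups, since the build
tool decides every BMI location up front.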

> It just says which source
> files provide which modules, and what modules and header units each
> source file imports.  It's up to the build system to construct an actual
> build.

Yep.
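
To make "construct an actual build" concrete, here is a small Python
sketch of the collation step a build system would run over the scan
results. The field names used (`rules`, `primary-output`, `provides`,
`requires`, `logical-name`) are assumptions for illustration and may not
match what the format ends up using:

```python
import json

def object_dependencies(scan_results):
    """Map each object to the objects that must be compiled before it."""
    rules = [rule for result in scan_results for rule in result["rules"]]
    # First pass: which object produces each module.
    producer = {}
    for rule in rules:
        for provided in rule.get("provides", []):
            producer[provided["logical-name"]] = rule["primary-output"]
    # Second pass: resolve each import to its producing object.
    deps = {}
    for rule in rules:
        needed = set()
        for required in rule.get("requires", []):
            name = required["logical-name"]
            if name in producer:
                needed.add(producer[name])
            # Unknown names are left for the build system to resolve
            # (e.g. modules provided by another target).
        deps[rule["primary-output"]] = sorted(needed)
    return deps

scan = json.loads("""
{"rules": [
  {"primary-output": "a.o", "provides": [{"logical-name": "a"}]},
  {"primary-output": "b.o",
   "requires": [{"logical-name": "a"}, {"logical-name": "ext"}]}
]}
""")
deps = object_dependencies([scan])
```

Note that no BMI paths are needed for this step; only logical names and
the object each rule produces.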

> The other issue with -MD is that I believe tools that use `.d`
> files wouldn't even be able to handle a `.d` that included actual
> commands.

Correct. Ninja tries to handle the barest of syntax for these files
(basically what is seen in the wild).
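
As an illustration of how little such a parser accepts, here is a
deliberately minimal depfile reader in Python. It is a sketch of the
general idea, not Ninja's actual parser; `parse_depfile` is a
hypothetical name, and things like escaped spaces or Windows drive
letters are ignored:

```python
def parse_depfile(text):
    """Parse 'target: dep dep ...' lines with backslash-newline continuations."""
    joined = text.replace("\\\n", " ")
    result = {}
    for line in joined.splitlines():
        line = line.strip()
        if not line:
            continue
        target, sep, deps = line.partition(":")
        if not sep or "=" in target or deps.startswith("="):
            # Variable assignments, recipes, etc. are not understood.
            raise ValueError(f"unsupported depfile line: {line!r}")
        result.setdefault(target.strip(), []).extend(deps.split())
    return result

deps = parse_depfile("a.o: a.cpp \\\n  a.h\n")
```

Anything beyond plain rules, such as a line appending to a make
variable or a recipe line, raises an error here rather than being
interpreted.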

> > Also, does finding a module involve matching a cppm file with
> > compatible preprocessor state, or is it just by name?
> >
> It's just by name.  The assumption here is that you have a compilation
> database or similar and thus know the command line options passed to
> every source file.

In CMake, mismatched preprocessor state is expected to be detected by
the compiler (something like "-D flags change the interpretation of the
BMI") or linker (as `_ITERATOR_DEBUG_LEVEL` is handled on Windows).

--Ben