[cfe-dev] Module build - tokenized form of intermediate source stream

Sean Silva via cfe-dev cfe-dev at lists.llvm.org
Tue Oct 20 01:38:05 PDT 2015


On Mon, Oct 19, 2015 at 11:02 PM, Serge Pavlov <sepavloff at gmail.com> wrote:

> 2015-10-20 10:01 GMT+06:00 Sean Silva <chisophugis at gmail.com>:
>
>>
>>
>> On Mon, Oct 19, 2015 at 10:34 AM, Serge Pavlov <sepavloff at gmail.com>
>> wrote:
>>
>>> 2015-10-15 5:27 GMT+06:00 Richard Smith <richard at metafoo.co.uk>:
>>>
>>>> On Tue, Oct 13, 2015 at 5:55 AM, Serge Pavlov <sepavloff at gmail.com>
>>>> wrote:
>>>>
>>>>> 2015-10-13 8:52 GMT+06:00 Sean Silva <chisophugis at gmail.com>:
>>>>>
>>>>>> On Mon, Oct 12, 2015 at 12:13 PM, Richard Smith via cfe-dev <
>>>>>> cfe-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> On Mon, Oct 12, 2015 at 11:33 AM, Serge Pavlov via cfe-dev <
>>>>>>> cfe-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Currently, building a module involves creating an intermediate source
>>>>>>>> stream that includes/imports each header composing the module. This
>>>>>>>> source stream is then parsed as if it were a source file. So building a
>>>>>>>> module requires several transformations:
>>>>>>>> - The module map is parsed to produce module objects (clang::Module),
>>>>>>>> - The module objects are used to build a source stream
>>>>>>>> (llvm::MemoryBuffer) that contains include directives,
>>>>>>>> - The source stream is parsed to produce the module content.
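>>>>>>>>
>>>>>>>> For illustration, for a module Foo built from headers a.h and b.h (all
>>>>>>>> names made up), the synthesized buffer would be roughly:
>>>>>>>>
>>>>>>>>     /* in-memory buffer, no counterpart on disk */
>>>>>>>>     #include "a.h"
>>>>>>>>     #include "b.h"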
>>>>>>>>
>>>>>>>> The build process could be simpler if, instead of a textual source
>>>>>>>> stream, we prepared a sequence of annotation tokens: annot_module_begin,
>>>>>>>> annot_module_end and some new token, say annot_module_header, which
>>>>>>>> would represent a header of a module. It would be something like a
>>>>>>>> pretokenized header, but without a counterpart in the file system.
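>>>>>>>>
>>>>>>>> A minimal sketch of how such a stream could be built, assuming a new
>>>>>>>> tok::annot_module_header kind were added to TokenKinds.def (the function
>>>>>>>> name and parameters below are made up for illustration):
>>>>>>>>
>>>>>>>>     #include "clang/Basic/FileManager.h"
>>>>>>>>     #include "clang/Basic/Module.h"
>>>>>>>>     #include "clang/Lex/Token.h"
>>>>>>>>     #include "llvm/ADT/ArrayRef.h"
>>>>>>>>     #include <vector>
>>>>>>>>
>>>>>>>>     // Build the module's input as annotation tokens instead of a
>>>>>>>>     // textual buffer (sketch only).
>>>>>>>>     std::vector<clang::Token>
>>>>>>>>     buildModuleTokens(clang::Module *Mod,
>>>>>>>>                       llvm::ArrayRef<const clang::FileEntry *> Headers,
>>>>>>>>                       clang::SourceLocation Loc) {
>>>>>>>>       std::vector<clang::Token> Toks;
>>>>>>>>       clang::Token Tok;
>>>>>>>>       Tok.startToken();
>>>>>>>>       Tok.setKind(clang::tok::annot_module_begin);
>>>>>>>>       Tok.setLocation(Loc);
>>>>>>>>       Tok.setAnnotationValue(Mod);
>>>>>>>>       Toks.push_back(Tok);
>>>>>>>>       for (const clang::FileEntry *FE : Headers) {
>>>>>>>>         clang::Token HTok;
>>>>>>>>         HTok.startToken();
>>>>>>>>         HTok.setKind(clang::tok::annot_module_header); // the proposed new kind
>>>>>>>>         HTok.setLocation(Loc);
>>>>>>>>         HTok.setAnnotationValue(const_cast<clang::FileEntry *>(FE));
>>>>>>>>         Toks.push_back(HTok);
>>>>>>>>       }
>>>>>>>>       Tok.setKind(clang::tok::annot_module_end);
>>>>>>>>       Toks.push_back(Tok);
>>>>>>>>       return Toks;
>>>>>>>>     }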
>>>>>>>>
>>>>>>>> Such a redesign would help solve the performance degradation
>>>>>>>> reported in PR24667 ([Regression] Quadratic module build time due to
>>>>>>>> Preprocessor::LeaveSubmodule). The root of the problem is leaving the
>>>>>>>> submodule after each header, even if the next header belongs to the
>>>>>>>> same module.
>>>>>>>>
>>>>>>>
>>>>>>> We generally recommend that each header goes in its own submodule,
>>>>>>> so optimizing for this case doesn't address the problem for a lot of cases.
>>>>>>>
>>>>>>
>>>>> These are different use cases, and there is nothing wrong with solving
>>>>> the problem by different means. If a user follows this recommendation
>>>>> and puts each header into a separate module, he won't be affected by the
>>>>> tokenized form of the intermediate input stream. If the user chooses to put
>>>>> many headers into one module, this change can solve the problem. The cited
>>>>> PR refers to exactly the latter case.
>>>>>
>>>>
>>>> I think you're missing my point. We seem to have a choice between a
>>>> general solution that addresses the problem in all cases, and a solution
>>>> that only helps for the "one big module with no submodules" case (which is
>>>> not the case that you get for, say, an umbrella directory module / umbrella
>>>> header / libc++ / Darwin's libc / ...). If these solutions don't have
>>>> drastically different technical complexity, the former seems like the
>>>> better choice.
>>>>
>>>> I'm not opposed to providing a token sequence rather than text for the
>>>> synthesized module umbrella header, but we'd need a reasonably strong
>>>> argument to justify the added complexity, especially as we still need our
>>>> current mode to handle umbrella headers on the file system, #includes
>>>> within modular headers, and so on. If we want something like that, a
>>>> simpler approach might be to add a pragma for starting / ending a module,
>>>> and emit that into the header file we synthesize, and then teach
>>>> PPLexerChange not to do the extra work when switching modules if the source
>>>> and destination module are actually the same.
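>>>>
>>>> For instance, the synthesized header could then look something like this
>>>> (pragma spelling and module/header names are purely hypothetical):
>>>>
>>>>     #pragma clang module begin Foo.A   /* hypothetical spelling */
>>>>     #include "a.h"
>>>>     #pragma clang module end
>>>>     #pragma clang module begin Foo.B
>>>>     #include "b.h"
>>>>     #pragma clang module end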
>>>>
>>>>
>>>>> The "one huge submodule" approach with no local visibility is actually
>>>>>> very useful to have because it (for better or for worse) is very close to
>>>>>> the semantics of PCH (which are very simple). This makes it a nice
>>>>>> incremental step away from PCH and very easy to understand.
>>>>>>
>>>>>> Also, I think "we generally recommend" is a bit strong considering
>>>>>> that this isn't documented anywhere to my knowledge. In fact, the
>>>>>> documentation I've written internally for my customers recommends the exact
>>>>>> opposite for the reason described above.
>>>>>>
>>>>>>
>>>>> This is very convenient for users. Usually it is much simpler to write
>>>>> something like #include "clang.h" instead of listing a dozen includes.
>>>>> When an API is spread across many headers, a user must first determine
>>>>> where the necessary piece is declared. In the pre-module era, splitting an
>>>>> API was an unavoidable evil, as it reduced compile time. With modules we
>>>>> can enable more convenient solutions.
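>>>>>
>>>>> For example, an umbrella header (contents here are illustrative only) lets
>>>>> a user pull in the whole API with a single directive:
>>>>>
>>>>>     /* clang.h - illustrative umbrella header */
>>>>>     #include "clang/AST/AST.h"
>>>>>     #include "clang/Basic/Diagnostic.h"
>>>>>     #include "clang/Lex/Preprocessor.h"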
>>>>>
>>>>
>>>> I agree, but it seems to me that this should be the choice of the user
>>>> of the API. If they want to import all of the Clang API, that should
>>>> work (and if you add an umbrella "clang.h" header, it will work), but if
>>>> they just #include some small part of that interface, should they really
>>>> get the whole thing?
>>>>
>>>> -- Sean Silva
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Leaving the submodule after the last header would be a solution, but
>>>>>>>> it is hard to tell whether the header just parsed is the last one -
>>>>>>>> there is no way to look ahead to the next include directive. With
>>>>>>>> tokenized input, module ends would be marked easily.
>>>>>>>>
>>>>>>>
>>>>>>> I have a different approach in mind for that case: namely, to
>>>>>>> produce a separate submodule state for distinct submodules even when not in
>>>>>>> local visibility mode, and lazily populate its Macros map when identifiers
>>>>>>> are queried. That way, the performance is linear in the number of macros
>>>>>>> the submodule actually defines or uses, not in the total number defined or
>>>>>>> used by the top-level module.
>>>>>>>
>>>>>>
>>> So we need to maintain an object of type SubmoduleState for each
>>> module in all modes. SubmoduleState is extended with a new field that
>>> represents a map from IdentifierInfo* to ModuleMacro, which is populated
>>> when the preprocessor checks whether an identifier used in the source is a
>>> macro. LeaveSubmodule no longer builds ModuleMacros. Instead, just
>>> before the module is serialized, SubmoduleState::Macros is scanned, and
>>> ModuleMacros are created for identifiers that do not yet have one.
>>>
>>> We probably also need to introduce a new flag in IdentifierInfo, something
>>> like 'NotAMacro', to mark identifiers that were checked for being macro
>>> names and found not to be. That would avoid extra look-ups. If the
>>> HasMacro flag is set, this flag is cleared.
>>>
>>> It looks like we have to use a fairly complex procedure because we need to
>>> support the case where one header defines a macro and another only uses it.
>>> In that case the macro state must be kept somewhere if LeaveSubmodule is
>>> called between the headers.
>>>
>>> What about an implementation along these lines? A rough sketch follows.
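>>>
>>> The sketch below is a toy model only, not actual Clang code - every name in
>>> it is made up; it just illustrates the lazy population plus the negative
>>> "NotAMacro" cache:
>>>
>>>     #include <functional>
>>>     #include <string>
>>>     #include <unordered_map>
>>>     #include <unordered_set>
>>>
>>>     struct ToyModuleMacro { std::string Name; };
>>>
>>>     struct ToySubmoduleState {
>>>       // Created on demand, only for identifiers this submodule actually
>>>       // defines or uses as macros.
>>>       std::unordered_map<std::string, ToyModuleMacro> LazyMacros;
>>>       // Plays the role of the proposed "NotAMacro" flag: identifiers
>>>       // already checked and found not to name a macro.
>>>       std::unordered_set<std::string> KnownNotAMacro;
>>>     };
>>>
>>>     // Query path: cost is proportional to the identifiers actually asked
>>>     // about, not to the total number of macros in the top-level module.
>>>     ToyModuleMacro *
>>>     findMacro(ToySubmoduleState &State, const std::string &Id,
>>>               const std::function<bool(const std::string &)> &DefinedHere) {
>>>       if (State.KnownNotAMacro.count(Id))
>>>         return nullptr;                        // negative cache hit
>>>       auto It = State.LazyMacros.find(Id);
>>>       if (It != State.LazyMacros.end())
>>>         return &It->second;                    // already materialized
>>>       if (!DefinedHere(Id)) {
>>>         State.KnownNotAMacro.insert(Id);       // remember the miss
>>>         return nullptr;
>>>       }
>>>       // Materialize the macro record lazily, on first successful query.
>>>       return &State.LazyMacros.emplace(Id, ToyModuleMacro{Id}).first->second;
>>>     }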
>>>
>>
>> That seems pretty invasive. I'm not sure it is worth it; the case that I
>> reduced to PR24667 was fairly extreme (all headers for a large project
>> (~size of LLVM) in a single top-level module). I'm not sure how likely it
>> is that this will be run into in practice. It's definitely worth fixing on
>> the principle of avoiding quadratic behavior, but it isn't (currently)
>> blocking a real-world use case, so I hesitate to do very invasive changes.
>>
>
> Modules should be most valuable precisely for large projects, where the
> compile time savings can be substantial. So the problem described in PR24667
> should be solved anyway, by this approach or another.
>

Large projects are composed of many small sub-parts. I don't think any real
project has a "module" with >1000 headers. I just tested the performance,
varying the number of headers in the module and the number of macros per
header. My results (plotted with Mathematica) can be seen here:
http://i.imgur.com/E6g0g0M.png (testit_formathematica.py attached)
Even for the case of 128 headers with 500 macros each (in a single
top-level module with no submodules) the slowdown is less than 3x vs. clang
3.6. So this isn't the end of the world (it's not like compilations won't
finish; in the case I ran into in "practice" in my experiments with >1000
headers in a single top-level module with no submodules, the module was
taking ~60 seconds to build).

Like I said, this is worth fixing on the principle of avoiding quadratic
behavior. The most likely case I can think of where this quadratic behavior
would really bite in practice is when initially modularizing an entire SDK
top-level include directory that has a bunch of stuff in it; for both Mac and
PS4 this is a couple hundred headers. With 512 headers and 100 macros per
header (testing this with a modified version of my Mathematica notebook), this
gives a 9x slowdown, which is a lot, but this sort of situation is rare. Once
the modularization is done, though, there are submodules at much smaller
granularity, so the problem disappears.

I think Richard's suggestion of having a #pragma for starting/stopping a
submodule is a really good idea. Among other things, it would make module
maps nothing but "syntax sugar" for something that can be done directly in
the language. It fixes this issue, and I think it may also make it easier
for users to understand what is happening. It also might make it very easy
to migrate from PCH to modules. In practice I have actually explained the
way module maps work in terms of the synthesized header file (and it seems
to be a good way to describe it), so it is natural to allow the synthesized
header file to actually be written (even if at first we use __reserved names
for the pragma so it isn't available to users; someday we might open it up
if this proves useful).

-- Sean Silva


>
> --Serge
>
>
>>
>> -- Sean Silva
>>
>>
>>>
>>>
>>>>
>>>>>>>
>>>>>>>> Is there any reason why the textual form of the intermediate source
>>>>>>>> stream should be kept? Does implementing a tokenized form of it make sense?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> --Serge
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testit_formathematica.py
Type: text/x-python-script
Size: 1175 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20151020/d73de59c/attachment.bin>

