[cfe-dev] Module build - tokenized form of intermediate source stream

Thu Oct 22 09:43:32 PDT 2015

Here is an attempt to implement a minimally invasive fix for the compile-time
problem: http://reviews.llvm.org/D13987

Thanks,
--Serge

2015-10-21 18:27 GMT+06:00 Serge Pavlov <sepavloff at gmail.com>:

> 2015-10-20 14:38 GMT+06:00 Sean Silva <chisophugis at gmail.com>:
>
>>
>>
>> On Mon, Oct 19, 2015 at 11:02 PM, Serge Pavlov <sepavloff at gmail.com>
>> wrote:
>>
>>> 2015-10-20 10:01 GMT+06:00 Sean Silva <chisophugis at gmail.com>:
>>>
>>>>
>>>>
>>>> On Mon, Oct 19, 2015 at 10:34 AM, Serge Pavlov <sepavloff at gmail.com>
>>>> wrote:
>>>>
>>>>> 2015-10-15 5:27 GMT+06:00 Richard Smith <richard at metafoo.co.uk>:
>>>>>
>>>>>> On Tue, Oct 13, 2015 at 5:55 AM, Serge Pavlov <sepavloff at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> 2015-10-13 8:52 GMT+06:00 Sean Silva <chisophugis at gmail.com>:
>>>>>>>
>>>>>>>> On Mon, Oct 12, 2015 at 12:13 PM, Richard Smith via cfe-dev <
>>>>>>>> cfe-dev at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>>> On Mon, Oct 12, 2015 at 11:33 AM, Serge Pavlov via cfe-dev <
>>>>>>>>> cfe-dev at lists.llvm.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Currently, building a module involves creating an intermediate source
>>>>>>>>>> stream that includes/imports each header composing the module. This
>>>>>>>>>> source stream is then parsed as if it were a source file. So to build a
>>>>>>>>>> module, several transformations must be done:
>>>>>>>>>> - The module map is parsed to produce module objects (clang::Module),
>>>>>>>>>> - The module objects are used to build a source stream
>>>>>>>>>> (llvm::MemoryBuffer), which contains include directives,
>>>>>>>>>> - The source stream is parsed to produce the module content.
>>>>>>>>>>
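The second step in the list above (module objects to a textual source stream) can be sketched roughly as follows. This is only an illustration: ModuleSketch and buildIntermediateSource are hypothetical names, not Clang's API, and the real implementation works on clang::Module and llvm::MemoryBuffer and handles more cases (for example imports and umbrella directories).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for clang::Module: a name plus the headers
// listed for it in the module map.
struct ModuleSketch {
  std::string Name;
  std::vector<std::string> Headers;
};

// Builds the text of the intermediate source stream: one include
// directive per header, in the order the headers are listed.
std::string buildIntermediateSource(const ModuleSketch &M) {
  std::string Text;
  for (const std::string &H : M.Headers) {
    Text += "#include \"";
    Text += H;
    Text += "\"\n";
  }
  return Text;
}
```

The resulting string is what would then be handed to the parser as if it were an ordinary source file.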
>>>>>>>>>> The build process could be simpler if, instead of a textual source
>>>>>>>>>> stream, we prepared a sequence of annotation tokens: annot_module_begin,
>>>>>>>>>> annot_module_end and some new token, say annot_module_header, which
>>>>>>>>>> would represent a header of a module. It would be something like a
>>>>>>>>>> pretokenized header, but without a counterpart in the file system.
>>>>>>>>>>
>>>>>>>>>> Such a redesign would help in solving the performance degradation
>>>>>>>>>> reported in PR24667 ([Regression] Quadratic module build time due to
>>>>>>>>>> Preprocessor::LeaveSubmodule). The cause of the problem is that the
>>>>>>>>>> module is left after each header, even if the next header belongs to
>>>>>>>>>> the same module.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We generally recommend that each header goes in its own submodule,
>>>>>>>>> so optimizing for this case doesn't address the problem for a lot of cases.
>>>>>>>>>
>>>>>>>>
>>>>>>> These are different use cases, and there is nothing wrong with solving
>>>>>>> the problem by different means. If a user follows this recommendation
>>>>>>> and puts each header into a separate module, they won't suffer from the
>>>>>>> tokenized form of the intermediate input stream. If the user chooses to
>>>>>>> put many headers into one module, this change can solve the problem.
>>>>>>> The cited PR refers to just the latter case.
>>>>>>>
>>>>>>
>>>>>> I think you're missing my point. We seem to have a choice between a
>>>>>> general solution that addresses the problem in all cases, and a solution
>>>>>> that only helps for the "one big module with no submodules" case (which is
>>>>>> not the case that you get for, say, an umbrella directory module / umbrella
>>>>>> header / libc++ / Darwin's libc / ...). If these solutions don't have
>>>>>> drastically different technical complexity, the former seems like the
>>>>>> better choice.
>>>>>>
>>>>>> I'm not opposed to providing a token sequence rather than text for
>>>>>> the synthesized module umbrella header, but we'd need a reasonably strong
>>>>>> argument to justify the added complexity, especially as we still need our
>>>>>> current mode to handle umbrella headers on the file system, #includes
>>>>>> within modular headers, and so on. If we want something like that, a
>>>>>> simpler approach might be to add a pragma for starting / ending a module,
>>>>>> and emit that into the header file we synthesize, and then teach
>>>>>> PPLexerChange not to do the extra work when switching modules if the source
>>>>>> and destination module are actually the same.
>>>>>>
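The last idea in the paragraph above (skip the extra work when the source and destination module are the same) can be illustrated with a small sketch. Everything here is hypothetical: SubmoduleTracker and countExpensiveSwitches are illustration names, and the counter merely stands in for the work that Preprocessor::LeaveSubmodule does on a real module switch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tracker: counts how often the expensive "switch module"
// work would run while walking the headers of a synthesized header file.
struct SubmoduleTracker {
  std::string Current;        // empty means "not inside any module yet"
  int ExpensiveSwitches = 0;  // stands in for the real leave/enter work

  void enterHeaderOf(const std::string &Module) {
    if (Module == Current)
      return;                 // same module: nothing to tear down or set up
    ++ExpensiveSwitches;
    Current = Module;
  }
};

// OwnerOfEachHeader[i] names the module that owns the i-th header.
int countExpensiveSwitches(const std::vector<std::string> &OwnerOfEachHeader) {
  SubmoduleTracker T;
  for (const std::string &Owner : OwnerOfEachHeader)
    T.enterHeaderOf(Owner);
  return T.ExpensiveSwitches;
}
```

With all headers in one module the expensive path runs once; with one submodule per header it still runs once per header, which is why this shortcut only helps the "one big module with no submodules" case.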
>>>>>>
>>>>>>>> The "one huge submodule" approach with no local visibility is
>>>>>>>> actually very useful to have because it (for better or for worse) is very
>>>>>>>> close to the semantics of PCH (which are very simple). This makes it a nice
>>>>>>>> incremental step away from PCH and very easy to understand.
>>>>>>>>
>>>>>>>> Also, I think "we generally recommend" is a bit strong considering
>>>>>>>> that this isn't documented anywhere to my knowledge. In fact, the
>>>>>>>> documentation I've written internally for my customers recommends the exact
>>>>>>>> opposite for the reason described above.
>>>>>>>>
>>>>>>>>
>>>>>>> This is very convenient for users. Usually it is much simpler to write
>>>>>>> something like #include "clang.h" instead of listing a dozen includes.
>>>>>>> When an API is spread over many headers, a user must first determine
>>>>>>> where the necessary piece is declared. In the pre-module era, splitting
>>>>>>> an API was an unavoidable evil, as it reduced compile time. With modules
>>>>>>> we can enable more convenient solutions.
>>>>>>>
>>>>>>
>>>>>> I agree, but it seems to me that this should be the choice of the
>>>>>> user of the API. If they want to import all of the Clang API, that should
>>>>>> work (and if you add an umbrella "clang.h" header, it will work), but if
>>>>>> they just #include some small part of that interface, should they really
>>>>>> get the whole thing?
>>>>>>
>>>>>>>> -- Sean Silva
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Leaving the module after the last header would be a solution, but it
>>>>>>>>>> is difficult to determine whether the header just parsed is the last
>>>>>>>>>> one: there is no such thing as look-ahead to the next include
>>>>>>>>>> directive. Using tokenized input would make it easy to mark module ends.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have a different approach in mind for that case: namely, to
>>>>>>>>> produce a separate submodule state for distinct submodules even when not in
>>>>>>>>> local visibility mode, and lazily populate its Macros map when identifiers
>>>>>>>>> are queried. That way, the performance is linear in the number of macros
>>>>>>>>> the submodule actually defines or uses, not in the total number defined or
>>>>>>>>> used by the top-level module.
>>>>>>>>>
>>>>>>>>
>>>>> That is, we need to maintain an object of type SubmoduleState for each
>>>>> module in all modes. The SubmoduleState is extended with a new field that
>>>>> represents a map from IdentifierInfo* to ModuleMacro, which is populated
>>>>> when the preprocessor checks whether an identifier used in the source is
>>>>> a macro. LeaveSubmodule does not build ModuleMacros anymore. Instead, just
>>>>> before the module is serialized, SubmoduleState::Macro is scanned, and a
>>>>> ModuleMacro is created for each identifier that does not yet have one.
>>>>>
>>>>> We probably need to introduce a new flag in IdentifierInfo, something
>>>>> like 'NotAMacro', to mark identifiers that were checked and found not to
>>>>> be macro names. It would allow us to avoid extra look-ups. If the flag
>>>>> HasMacro is set, this flag is cleared.
>>>>>
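The lazily populated per-submodule map and the 'NotAMacro' negative cache described above could look roughly like this. It is a sketch under simplifying assumptions: identifiers and macro bodies are plain strings, MacroTable stands in for the preprocessor's macro state, and LazySubmoduleState is a hypothetical illustration type, not Clang's Preprocessor::SubmoduleState.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the macro state shared by the whole
// top-level module: identifier -> macro body.
using MacroTable = std::map<std::string, std::string>;

struct LazySubmoduleState {
  MacroTable &Shared;              // macros known to the top-level module
  MacroTable Touched;              // macros this submodule defined or used
  std::set<std::string> NotAMacro; // identifiers checked and found not macros

  explicit LazySubmoduleState(MacroTable &S) : Shared(S) {}

  // Defining a macro records it and clears any stale negative entry.
  void define(const std::string &Id, const std::string &Body) {
    Shared[Id] = Body;
    Touched[Id] = Body;
    NotAMacro.erase(Id);
  }

  // Lookup populates Touched (or NotAMacro) on first query only.
  bool isMacro(const std::string &Id) {
    if (Touched.count(Id))
      return true;
    if (NotAMacro.count(Id))
      return false;
    auto It = Shared.find(Id);
    if (It == Shared.end()) {
      NotAMacro.insert(Id);
      return false;
    }
    Touched.emplace(Id, It->second);
    return true;
  }
};
```

When such a submodule is finalized, only the entries in Touched need to be visited, so the cost depends on the macros the submodule defines or uses and not on everything in the shared table. The sketch assumes headers are processed one after another, so a negative entry cannot be invalidated by another submodule while this one is active.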
>>>>> It looks like we have to use a complex procedure because we need to
>>>>> support the case when one header defines a macro and another only uses it.
>>>>> In this case the macro state must be kept somewhere if LeaveSubmodule is
>>>>> called between headers.
>>>>>
>>>>> What do you think about such an implementation?
>>>>>
>>>>
>>>> That seems pretty invasive. I'm not sure it is worth it; the case that
>>>> I reduced to PR24667 was fairly extreme (all headers for a large project
>>>> (~size of LLVM) in a single top-level module). I'm not sure how likely it
>>>> is that this will be run into in practice. It's definitely worth fixing on
>>>> the principle of avoiding quadratic behavior, but it isn't (currently)
>>>> blocking a real-world use case, so I hesitate to do very invasive changes.
>>>>
>>>
>>> Modules should be most valuable precisely for large projects, where the
>>> compile-time savings can be substantial. So the problem described in
>>> PR24667 should be solved anyway, with this approach or another.
>>>
>>
>> Large projects are composed of many small sub-parts. I don't think any
>> real project has a "module" with >1000 headers. I just tested the
>> performance varying the number of headers in the module and the number of
>> macros per header. My results in Mathematica can be seen here:
>> http://i.imgur.com/E6g0g0M.png (testit_formathematica.py attached)
>> Even for the case of 128 headers with 500 macros each (in a single
>> top-level module with no submodules) the slowdown is less than 3x vs. clang
>> 3.6. So this isn't the end of the world (it's not like compilations won't
>> finish; in the case I ran into in "practice" in my experiments with >1000
>> headers in a single top-level module with no submodules, the module was
>> taking ~60 seconds to build).
>>
>> Like I said, this is worth fixing on principle of avoiding quadratic
>> behavior. The most likely case that I can think of that would occur in
>> practice where this quadratic would really bite is when initially
>> modularizing an entire SDK top-level include directory that has a bunch of
>> stuff in it; for both Mac and PS4 this is a couple hundred headers. With
>> 512 headers and 100 macros per header (testing this with a modified version
>> of my Mathematica notebook) this gives 9x slowdown, which is a lot, but
>> this sort of situation is rare. Once the modularization is done though,
>> there are submodules at much smaller granularity, so the problem disappears.
>>
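To see where the quadratic growth comes from, here is a toy cost model. It assumes that leaving a submodule after header number i visits every macro defined so far (i times the macros per header), while a lazy scheme visits only the macros of the header just processed. The function names are hypothetical, and the model counts macro visits only; it is not meant to reproduce the measured slowdowns quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Eager model: after the i-th header, all macros seen so far are visited.
std::uint64_t eagerVisits(std::uint64_t Headers, std::uint64_t MacrosPerHeader) {
  std::uint64_t Total = 0;
  for (std::uint64_t I = 1; I <= Headers; ++I)
    Total += I * MacrosPerHeader;
  return Total;
}

// Lazy model: each header only pays for the macros it defines itself.
std::uint64_t lazyVisits(std::uint64_t Headers, std::uint64_t MacrosPerHeader) {
  return Headers * MacrosPerHeader;
}
```

In this model, 128 headers with 500 macros each give 4,128,000 visits for the eager scheme against 64,000 for the lazy one, and the gap grows linearly with the number of headers.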
>
> Interesting investigation. Having read this, I got the impression that
> quadratic compile time is more a matter of design perfection than a user
> headache. Indeed, a large number of includes is more typical of big
> projects, where it is possible to spend some effort on modularization.
> I guess that even with degraded performance of module compilation, end users
> still gain in compile time for their applications. So in choosing a way to
> fix the problem we should weigh the trade-offs between a good design and a
> quick solution.
>
>
>>
>> I think Richard's suggestion for having a #pragma for starting/stopping a
>> submodule is a really good idea. Among other things, this would make module
>> maps nothing but "syntax sugar" for something that can be done directly in
>> the language. It fixes this issue and I think it may also make it easier
>> for users to understand what is happening. It also might make it very easy
>> to migrate from PCH to modules. I have in practice actually explained the
>> way module maps work in terms of the synthesized header file (and it seems
>> to be a good way to describe it), so it is natural to allow the
>> synthesized header file to actually be written (even if we use
>> __reserved names for the pragma at first so it isn't available to users;
>> someday we might open it up if this proves useful).
>>
>
> Do you think introducing the #pragma is a temporary solution, because we
> do not have a general one now? Or is it a help for people doing
> modularization that is worth having even if the general solution is
> implemented?
>
>
>>
>> -- Sean Silva
>>
>>
>>>
>>> --Serge
>>>
>>>
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>>>> Is there any reason why the textual form of the intermediate source
>>>>>>>>>> stream should be kept? Does implementing a tokenized form of it make sense?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> --Serge
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> cfe-dev mailing list
>>>>>>>>>> cfe-dev at lists.llvm.org
>>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>