[cfe-dev] Module build - tokenized form of intermediate source stream

Mon Oct 19 23:02:56 PDT 2015

2015-10-20 10:01 GMT+06:00 Sean Silva <chisophugis at gmail.com>:

>
>
> On Mon, Oct 19, 2015 at 10:34 AM, Serge Pavlov <sepavloff at gmail.com>
> wrote:
>
>> 2015-10-15 5:27 GMT+06:00 Richard Smith <richard at metafoo.co.uk>:
>>
>>> On Tue, Oct 13, 2015 at 5:55 AM, Serge Pavlov <sepavloff at gmail.com>
>>> wrote:
>>>
>>>> 2015-10-13 8:52 GMT+06:00 Sean Silva <chisophugis at gmail.com>:
>>>>
>>>>> On Mon, Oct 12, 2015 at 12:13 PM, Richard Smith via cfe-dev <
>>>>> cfe-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> On Mon, Oct 12, 2015 at 11:33 AM, Serge Pavlov via cfe-dev <
>>>>>> cfe-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Currently, building a module involves creating an intermediate source
>>>>>>> stream that includes/imports each header composing the module. This
>>>>>>> source stream is then parsed as if it were a source file. So to build a
>>>>>>> module several transformations must be done:
>>>>>>> - The module map is parsed to produce module objects (clang::Module),
>>>>>>> - The module objects are used to build a source stream
>>>>>>> (llvm::MemoryBuffer), which contains include directives,
>>>>>>> - The source stream is parsed to produce the module content.
>>>>>>>
>>>>>>> The build process could be simpler if, instead of a textual source
>>>>>>> stream, we prepared a sequence of annotation tokens: annot_module_begin,
>>>>>>> annot_module_end and some new token, say annot_module_header, which
>>>>>>> would represent a header of a module. It would be something like a
>>>>>>> pretokenized header, but without a counterpart in the file system.
>>>>>>>
>>>>>>> Such a redesign would help solve the performance degradation reported
>>>>>>> in PR24667 ([Regression] Quadratic module build time due to
>>>>>>> Preprocessor::LeaveSubmodule). The cause of the problem is that the
>>>>>>> module is left after each header, even if the next header belongs to
>>>>>>> the same module.
>>>>>>>
>>>>>>
>>>>>> We generally recommend that each header goes in its own submodule, so
>>>>>> optimizing for this case doesn't address the problem for a lot of cases.
>>>>>>
>>>>>
>>>> These are different use cases, and there is nothing wrong with solving
>>>> the problem by different means. If a user follows this recommendation and
>>>> puts each header into a separate module, he won't suffer from the
>>>> tokenized form of the intermediate input stream. If the user chooses to put
>>>> many headers into one module, this change can solve the problem. The cited
>>>> PR refers to just the latter case.
>>>>
>>>
>>> I think you're missing my point. We seem to have a choice between a
>>> general solution that addresses the problem in all cases, and a solution
>>> that only helps for the "one big module with no submodules" case (which is
>>> not the case that you get for, say, an umbrella directory module / umbrella
>>> header / libc++ / Darwin's libc / ...). If these solutions don't have
>>> drastically different technical complexity, the former seems like the
>>> better choice.
>>>
>>> I'm not opposed to providing a token sequence rather than text for the
>>> synthesized module umbrella header, but we'd need a reasonably strong
>>> argument to justify the added complexity, especially as we still need our
>>> current mode to handle umbrella headers on the file system, #includes
>>> within modular headers, and so on. If we want something like that, a
>>> simpler approach might be to add a pragma for starting / ending a module,
>>> and emit that into the header file we synthesize, and then teach
>>> PPLexerChange not to do the extra work when switching modules if the source
>>> and destination module are actually the same.
>>>
>>>
>>>>> The "one huge submodule" approach with no local visibility is actually
>>>>> very useful to have because it (for better or for worse) is very close to
>>>>> the semantics of PCH (which are very simple). This makes it a nice
>>>>> incremental step away from PCH and very easy to understand.
>>>>>
>>>>> Also, I think "we generally recommend" is a bit strong considering
>>>>> that this isn't documented anywhere to my knowledge. In fact, the
>>>>> documentation I've written internally for my customers recommends the exact
>>>>> opposite for the reason described above.
>>>>>
>>>>>
>>>> This is very convenient for users. Usually it is much simpler to write
>>>> something like #include "clang.h" instead of listing a dozen includes.
>>>> When an API is spread across many headers, a user must first determine
>>>> where the necessary piece is declared. In the pre-module era, splitting an
>>>> API was an unavoidable evil, as it reduced compile time. With modules we
>>>> can enable more convenient solutions.
>>>>
>>>
>>> I agree, but it seems to me that this should be the choice of the user
>>> of the API. If they want to import all of the Clang API, that should work
>>> (and if you add an umbrella "clang.h" header, it will work), but if they
>>> just #include some small part of that interface, should they really get the
>>> whole thing?
>>>
>>>>> -- Sean Silva
>>>>>
>>>>>
>>>>>>
>>>>>>> Leaving the module after the last header would be a solution, but it
>>>>>>> is hard to tell whether the header just parsed is the last one: there is
>>>>>>> no way to look ahead to the next include directive. Using tokenized
>>>>>>> input would make module ends easy to mark.
>>>>>>>
>>>>>>
>>>>>> I have a different approach in mind for that case: namely, to produce
>>>>>> a separate submodule state for distinct submodules even when not in local
>>>>>> visibility mode, and lazily populate its Macros map when identifiers are
>>>>>> queried. That way, the performance is linear in the number of macros the
>>>>>> submodule actually defines or uses, not in the total number defined or used
>>>>>> by the top-level module.
>>>>>>
>>>>>
>> That is, we need to maintain an object of type SubmoduleState for each
>> module in all modes. SubmoduleState is extended with a new field that
>> represents a map from IdentifierInfo* to ModuleMacro, which is populated
>> when the preprocessor checks whether an identifier used in the source is a
>> macro. LeaveSubmodule does not build ModuleMacros anymore. Instead, just
>> before the module is serialized, SubmoduleState::Macro is scanned and, for
>> identifiers that do not have an associated ModuleMacro, one is created.
>>
>> We probably need to introduce a new flag in IdentifierInfo, something like
>> 'NotAMacro', to mark identifiers that were checked and found not to be
>> macro names. It would allow us to avoid extra look-ups. If the flag
>> HasMacro is set, this flag is cleared.
>>
>> It looks like we have to use a complex procedure because we need to
>> support the case where one header defines a macro and another only uses
>> it. In this case the macro state must be kept somewhere if LeaveSubmodule
>> is called between headers.
>>
>> What about such an implementation?
>>
>
> That seems pretty invasive. I'm not sure it is worth it; the case that I
> reduced to PR24667 was fairly extreme (all headers for a large project
> (~size of LLVM) in a single top-level module). I'm not sure how likely it
> is that this will be ran into in practice. It's definitely worth fixing on
> the principle of avoiding quadratic behavior, but it isn't (currently)
> blocking a real-world use case, so I hesitate to do very invasive changes.
>

Modules should be most valuable precisely for large projects, where the
compile-time savings can be substantial. So the problem described in PR24667
should be solved anyway, with this approach or another.
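
To make the cost argument concrete, here is a rough toy model, not Clang
code. MacroTable, LazySubmoduleState, eagerLeaveCost and simulate are made-up
names, and the "cost" is just a count of macro entries visited. It assumes each
header defines one macro and queries two identifiers, and compares walking all
macros of the top-level module on every leave with touching only the
identifiers the submodule actually queried.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the macro state of the top-level module.
struct MacroTable {
  std::unordered_map<std::string, std::string> Defined; // name -> body
};

// Hypothetical per-submodule state, populated only for queried identifiers.
class LazySubmoduleState {
public:
  explicit LazySubmoduleState(const MacroTable &Global) : Global(Global) {}

  // The first query does the real lookup; later queries hit the cache.
  // A cached 'false' plays the role of the proposed 'NotAMacro' flag.
  bool isMacro(const std::string &Name) {
    auto It = Cache.find(Name);
    if (It != Cache.end())
      return It->second;
    bool Found = Global.Defined.count(Name) != 0;
    Cache.emplace(Name, Found);
    return Found;
  }

  // Called when Name becomes a macro: drop any cached negative answer.
  void noteDefined(const std::string &Name) { Cache.erase(Name); }

  // Number of identifiers that must be visited when the submodule is left.
  std::size_t touched() const { return Cache.size(); }

private:
  const MacroTable &Global;
  std::unordered_map<std::string, bool> Cache;
};

// Eager variant: every leave walks all macros known to the top-level module.
inline std::size_t eagerLeaveCost(const MacroTable &Global) {
  return Global.Defined.size();
}

struct Costs {
  std::size_t Eager = 0;
  std::size_t Lazy = 0;
};

// Each header defines one macro, uses it, and uses one ordinary identifier.
inline Costs simulate(std::size_t NumHeaders) {
  MacroTable Global;
  Costs Total;
  for (std::size_t I = 0; I < NumHeaders; ++I) {
    LazySubmoduleState State(Global);
    std::string Name = "MACRO_" + std::to_string(I);
    Global.Defined[Name] = "1";
    bool IsMacro = State.isMacro(Name);
    assert(IsMacro);
    (void)IsMacro;
    State.isMacro("not_a_macro");
    Total.Eager += eagerLeaveCost(Global); // grows with every header
    Total.Lazy += State.touched();         // constant per header
  }
  return Total;
}
```

In this model the eager total grows as N(N+1)/2 for N headers, while the lazy
total grows as 2N. That is the quadratic versus linear difference under
discussion, although the real bookkeeping in the preprocessor is of course more
involved.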

--Serge

>
> -- Sean Silva
>
>
>>
>>
>>>
>>>>>>
>>>>>>> Is there any reason why textual form of the intermediate source
>>>>>>> stream should be kept? Does implementing tokenized form of it make sense?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> --Serge
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> cfe-dev mailing list
>>>>>>> cfe-dev at lists.llvm.org
>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>