[cfe-dev] [RFC] Clang SourceLocation overflow

Nico Weber via cfe-dev cfe-dev at lists.llvm.org
Tue Nov 5 07:41:18 PST 2019


(Looks like the patch converts a few files to dos-style line endings, which
makes it larger than it needs to be.)

On Sun, Nov 3, 2019 at 3:51 PM Matt Asplund via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> I have attached a patch for the changes to update SourceLocation to
> internally use an unsigned long long (I tried to put it on phabricator, but
> it was not having my changes). The change is as isolated as I could make it
> to allow a majority of clang to continue to use the old 32 bit ids whenever
> possible and only convert locations that I was personally hitting overflows
> (when trying to convert down) during large pcm lazy loading. Note: There
> may be a few random C++20 Modules fixes and/or hacks sprinkled in, this was
> entirely to unblock my own investigations into a C++ Modules build system.
>
> -Matt
>
> On Thu, Oct 10, 2019 at 7:59 AM Christof Douma <Christof.Douma at arm.com>
> wrote:
>
>> Hi Matt.
>>
>>
>>
>> Thanks for the offer. Whenever you’re back and have a moment is fine by
>> me, I don’t think we’re in a massive hurry.
>>
>>
>>
>> Thanks,
>>
>> Christof
>>
>>
>>
>> *From:* Matt Asplund <mwasplund at gmail.com>
>> *Sent:* 10 October 2019 14:09
>> *To:* Christof Douma <Christof.Douma at arm.com>
>> *Cc:* Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <
>> Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <
>> nd at arm.com>
>> *Subject:* Re: [cfe-dev] [RFC] Clang SourceLocation overflow
>>
>>
>>
>> Sure, I can share my changes. In general my changes are very localized to
>> the code paths I was hitting oveflow issues and tried to keep as much as
>> possible using 32bit encodings. Is there an ETA for when you will start
>> investigating, I am out of town for the next week but if needed can get
>> access to my desktop.
>>
>>
>>
>> -Matt
>>
>>
>>
>> On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com>
>> wrote:
>>
>> Hi Richard.
>>
>>
>>
>> Thanks for the clarification, I certainly had not realized that location
>> tracking was needed for correctness of clang. Decoupling this sounds like a
>> painful processes, and the only thing we get is errors without any location
>> information. Sounds like not a great trade-off. We’ll go experiment a bit
>> with the size impact of using 64-bits and will attempt to take the route of
>> a cmake configuration option.
>>
>>
>>
>> If there are people that have already done some experiments on the size
>> impact of 64-bits location pointers, we’re welcome your insights. @Matt
>> Asplund <mwasplund at gmail.com> I believe you said that you’ve done some
>> prototyping, is there something you can share?
>>
>>
>>
>> Thanks,
>>
>> Christof
>>
>>
>>
>> *From:* Richard Smith <richard at metafoo.co.uk>
>> *Sent:* 08 October 2019 19:53
>> *To:* Christof Douma <Christof.Douma at arm.com>
>> *Cc:* Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <
>> cfe-dev at lists.llvm.org>; nd <nd at arm.com>
>> *Subject:* Re: [cfe-dev] [RFC] Clang SourceLocation overflow
>>
>>
>>
>> On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <
>> cfe-dev at lists.llvm.org> wrote:
>>
>> Hi Richard, Paul and other.
>>
>>
>>
>> Thanks for the input so far. I wanted to point out that it’s not our
>> code-base. Rather, we’re seeing more use of the LLVM technology in the
>> automotive market and as usual we’re faced with existing code bases that
>> are tried and tested with other toolchains (gcc or others) and when LLVM
>> comes along things don’t always work directly.
>>
>>
>>
>> We’ve suggested better ways of structuring their code and your
>> suggestions are certainly good input. However, legacy code is especially
>> sticky in any market that has to handle ‘safety’ concerns, like automotive,
>> aerospace and medical markets. Code changes are pretty expensive in those
>> fields. So while I hope that over time we see more sensible coding
>> structures, I don’t expect that to happen any time soon. In the mean time,
>> we’re searching for a solution for this coding pattern that doesn’t play
>> well with clang.
>>
>>
>>
>> Hope that gave some more background of where this question comes from.
>>
>>
>>
>> Do all options that were suggested by Mikhail really require fundamental
>> restructuring of major parts of clang? This surprised me, I had expected
>> that the option 2 to be possible without a complete overhaul. (2 is “Track
>> until an overflow occurs after that make the lexer output the <invalid
>> location> special value for all subsequent tokens.”)
>>
>> Clang uses source locations as part of the semantic representation of the
>> AST in some cases (simple example: some forms of initialization might use
>> parens or not, with different semantics, and we distinguish between them
>> based on whether we have paren locations; there are also some checks that
>> look at which of two source locations came first when determining which
>> warnings or errors to produce, and so on). Maybe we could go through and
>> fix all of those, but that's still a very large task and it'd be hard to
>> check we got them all.
>>
>> Not nice user experience but maybe doable? I was hoping there was
>> something slightly better that still works without a major restructuring
>> (maybe something that at least gives a rough location or something that
>> only gives the location of the error and not the include stack under an
>> option or using some kind of heuristic to detect that things go haywire).
>>
>>
>>
>> As an alternative, I was curious if it would be possible and acceptable
>> to make the switch between 32-bit and 64-bit location tracking a
>> build-time/cmake decision? I’ve not done any estimation on the memory size
>> growth, so maybe this is a dead end.
>>
>> If someone is prepared to do the work to add and maintain this build
>> mode, I think this might be a feasible option.
>>
>> Thanks,
>>
>> Christof
>>
>>
>>
>> *From:* cfe-dev <cfe-dev-bounces at lists.llvm.org> *On Behalf Of *Richard
>> Smith via cfe-dev
>> *Sent:* 07 October 2019 20:36
>> *To:* Mikhail Maltsev <Mikhail.Maltsev at arm.com>
>> *Cc:* nd <nd at arm.com>; cfe-dev at lists.llvm.org
>> *Subject:* Re: [cfe-dev] [RFC] Clang SourceLocation overflow
>>
>>
>>
>> On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <
>> cfe-dev at lists.llvm.org> wrote:
>>
>> Hi all,
>>
>> We are experiencing a problem with Clang SourceLocation overflow.
>> Currently source locations are 32-bit values, one bit is a flag, which
>> gives
>> a source location space of 2^31 characters.
>>
>> When the Clang lexer processes an #include directive it reserves the
>> total size
>> of the file being included in the source location space. An overflow can
>> occur
>> if a large file (which does not have include guards by design) is
>> included many
>> times into a single TU.
>>
>> The pattern of including a file multiple times is for example required by
>> the AUTOSAR standard [1], which is widely used in the automotive industry.
>> Specifically the pattern is described in the Specification of Memory
>> Mapping [2]:
>>
>> Section 8.2.1, MEMMAP003:
>> "The start and stop symbols for section control are configured with
>> section
>> identifiers defined in MemMap.h [...] For instance:
>>
>> #define EEP_START_SEC_VAR_16BIT
>> #include "MemMap.h"
>> static uint16 EepTimer;
>> static uint16 EepRemainingBytes;
>> #define EEP_STOP_SEC_VAR_16BIT
>> #include "MemMap.h""
>>
>> Section 8.2.2, MEMMAP005:
>> "The file MemMap.h shall provide a mechanism to select different code,
>> variable
>> or constant sections by checking the definition of the module specific
>> memory
>> allocation key words for starting a section [...]"
>>
>> In practice MemMap.h can reach several MBs and can be included several
>> thousand
>> times causing an overflow in the source location space.
>>
>> The problem does not occur with GCC because it tracks line numbers rather
>> than
>> file offsets. Column numbers are tracked separately and are optional.
>> I.e., in
>> GCC a source location can be either a (line+column) tuple packed into 32
>> bits or
>> (when the line number exceeds a certain threshold) a 32-bit line number.
>>
>> We are looking for an acceptable way of resolving the problem and propose
>> the
>> following approaches for discussion:
>> 1. Use 64 bits for source location tracking.
>> 2. Track until an overflow occurs after that make the lexer output
>>    the <invalid location> special value for all subsequent tokens.
>> 3. Implement an approach similar to the one used by GCC and start
>> tracking line
>>    numbers instead of file offsets after a certain threshold. Resort to
>> (2)
>>    when even line numbers overflow.
>> 4. (?) Detect the multiple inclusion pattern and track it differently
>> (for now
>>    we don't have specific ideas on how to implement this)
>>
>> Is any of these approaches viable? What caveats should we expect? (we
>> already
>> know about static_asserts guarding the sizes of certain class fields
>> which start
>> failing in the first approach).
>>
>> Other suggestions are welcome.
>>
>>
>>
>> I don't think any of the above approaches are reasonable; they would all
>> require fundamental restructuring of major parts of Clang, an efficiency or
>> memory size hit for all other users of Clang, or some combination of those.
>>
>>
>>
>> Your code pattern seems unreasonable; including a multi-megabyte file
>> thousands of times is not a good idea. Can you split out parts of MemMap.h
>> into a separate header that is only included once, and keep only the parts
>> that actually change on repeated inclusion in MemMap.h itself?
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20191105/960efc74/attachment.html>


More information about the cfe-dev mailing list