[cfe-dev] [RFC] Clang SourceLocation overflow
Matt Asplund via cfe-dev
cfe-dev at lists.llvm.org
Sat Nov 2 10:00:24 PDT 2019
I have attached a patch for the changes to update SourceLocation to
internally use an unsigned long long (I tried to put it on phabricator, but
it was not having my changes). The change is as isolated as I could make it
to allow a majority of clang to continue to use the old 32 bit ids whenever
possible and only convert locations that I was personally hitting overflows
(when trying to convert down) during large pcm lazy loading. Note: There
may be a few random C++20 Modules fixes and/or hacks sprinkled in, this was
entirely to unblock my own investigations into a C++ Modules build system.
-Matt
On Thu, Oct 10, 2019 at 7:59 AM Christof Douma <Christof.Douma at arm.com>
wrote:
> Hi Matt.
>
>
>
> Thanks for the offer. Whenever you’re back and have a moment is fine by
> me, I don’t think we’re in a massive hurry.
>
>
>
> Thanks,
>
> Christof
>
>
>
> *From:* Matt Asplund <mwasplund at gmail.com>
> *Sent:* 10 October 2019 14:09
> *To:* Christof Douma <Christof.Douma at arm.com>
> *Cc:* Richard Smith <richard at metafoo.co.uk>; Mikhail Maltsev <
> Mikhail.Maltsev at arm.com>; Clang Dev <cfe-dev at lists.llvm.org>; nd <
> nd at arm.com>
> *Subject:* Re: [cfe-dev] [RFC] Clang SourceLocation overflow
>
>
>
> Sure, I can share my changes. In general my changes are very localized to
> the code paths I was hitting oveflow issues and tried to keep as much as
> possible using 32bit encodings. Is there an ETA for when you will start
> investigating, I am out of town for the next week but if needed can get
> access to my desktop.
>
>
>
> -Matt
>
>
>
> On Thu, Oct 10, 2019, 2:13 AM Christof Douma <Christof.Douma at arm.com>
> wrote:
>
> Hi Richard.
>
>
>
> Thanks for the clarification, I certainly had not realized that location
> tracking was needed for correctness of clang. Decoupling this sounds like a
> painful processes, and the only thing we get is errors without any location
> information. Sounds like not a great trade-off. We’ll go experiment a bit
> with the size impact of using 64-bits and will attempt to take the route of
> a cmake configuration option.
>
>
>
> If there are people that have already done some experiments on the size
> impact of 64-bits location pointers, we’re welcome your insights. @Matt
> Asplund <mwasplund at gmail.com> I believe you said that you’ve done some
> prototyping, is there something you can share?
>
>
>
> Thanks,
>
> Christof
>
>
>
> *From:* Richard Smith <richard at metafoo.co.uk>
> *Sent:* 08 October 2019 19:53
> *To:* Christof Douma <Christof.Douma at arm.com>
> *Cc:* Mikhail Maltsev <Mikhail.Maltsev at arm.com>; Clang Dev <
> cfe-dev at lists.llvm.org>; nd <nd at arm.com>
> *Subject:* Re: [cfe-dev] [RFC] Clang SourceLocation overflow
>
>
>
> On Tue, 8 Oct 2019, 10:42 Christof Douma via cfe-dev, <
> cfe-dev at lists.llvm.org> wrote:
>
> Hi Richard, Paul and other.
>
>
>
> Thanks for the input so far. I wanted to point out that it’s not our
> code-base. Rather, we’re seeing more use of the LLVM technology in the
> automotive market and as usual we’re faced with existing code bases that
> are tried and tested with other toolchains (gcc or others) and when LLVM
> comes along things don’t always work directly.
>
>
>
> We’ve suggested better ways of structuring their code and your suggestions
> are certainly good input. However, legacy code is especially sticky in any
> market that has to handle ‘safety’ concerns, like automotive, aerospace and
> medical markets. Code changes are pretty expensive in those fields. So
> while I hope that over time we see more sensible coding structures, I don’t
> expect that to happen any time soon. In the mean time, we’re searching for
> a solution for this coding pattern that doesn’t play well with clang.
>
>
>
> Hope that gave some more background of where this question comes from.
>
>
>
> Do all options that were suggested by Mikhail really require fundamental
> restructuring of major parts of clang? This surprised me, I had expected
> that the option 2 to be possible without a complete overhaul. (2 is “Track
> until an overflow occurs after that make the lexer output the <invalid
> location> special value for all subsequent tokens.”)
>
> Clang uses source locations as part of the semantic representation of the
> AST in some cases (simple example: some forms of initialization might use
> parens or not, with different semantics, and we distinguish between them
> based on whether we have paren locations; there are also some checks that
> look at which of two source locations came first when determining which
> warnings or errors to produce, and so on). Maybe we could go through and
> fix all of those, but that's still a very large task and it'd be hard to
> check we got them all.
>
> Not nice user experience but maybe doable? I was hoping there was
> something slightly better that still works without a major restructuring
> (maybe something that at least gives a rough location or something that
> only gives the location of the error and not the include stack under an
> option or using some kind of heuristic to detect that things go haywire).
>
>
>
> As an alternative, I was curious if it would be possible and acceptable to
> make the switch between 32-bit and 64-bit location tracking a
> build-time/cmake decision? I’ve not done any estimation on the memory size
> growth, so maybe this is a dead end.
>
> If someone is prepared to do the work to add and maintain this build mode,
> I think this might be a feasible option.
>
> Thanks,
>
> Christof
>
>
>
> *From:* cfe-dev <cfe-dev-bounces at lists.llvm.org> *On Behalf Of *Richard
> Smith via cfe-dev
> *Sent:* 07 October 2019 20:36
> *To:* Mikhail Maltsev <Mikhail.Maltsev at arm.com>
> *Cc:* nd <nd at arm.com>; cfe-dev at lists.llvm.org
> *Subject:* Re: [cfe-dev] [RFC] Clang SourceLocation overflow
>
>
>
> On Wed, 2 Oct 2019 at 09:26, Mikhail Maltsev via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
> Hi all,
>
> We are experiencing a problem with Clang SourceLocation overflow.
> Currently source locations are 32-bit values, one bit is a flag, which
> gives
> a source location space of 2^31 characters.
>
> When the Clang lexer processes an #include directive it reserves the total
> size
> of the file being included in the source location space. An overflow can
> occur
> if a large file (which does not have include guards by design) is included
> many
> times into a single TU.
>
> The pattern of including a file multiple times is for example required by
> the AUTOSAR standard [1], which is widely used in the automotive industry.
> Specifically the pattern is described in the Specification of Memory
> Mapping [2]:
>
> Section 8.2.1, MEMMAP003:
> "The start and stop symbols for section control are configured with section
> identifiers defined in MemMap.h [...] For instance:
>
> #define EEP_START_SEC_VAR_16BIT
> #include "MemMap.h"
> static uint16 EepTimer;
> static uint16 EepRemainingBytes;
> #define EEP_STOP_SEC_VAR_16BIT
> #include "MemMap.h""
>
> Section 8.2.2, MEMMAP005:
> "The file MemMap.h shall provide a mechanism to select different code,
> variable
> or constant sections by checking the definition of the module specific
> memory
> allocation key words for starting a section [...]"
>
> In practice MemMap.h can reach several MBs and can be included several
> thousand
> times causing an overflow in the source location space.
>
> The problem does not occur with GCC because it tracks line numbers rather
> than
> file offsets. Column numbers are tracked separately and are optional.
> I.e., in
> GCC a source location can be either a (line+column) tuple packed into 32
> bits or
> (when the line number exceeds a certain threshold) a 32-bit line number.
>
> We are looking for an acceptable way of resolving the problem and propose
> the
> following approaches for discussion:
> 1. Use 64 bits for source location tracking.
> 2. Track until an overflow occurs after that make the lexer output
> the <invalid location> special value for all subsequent tokens.
> 3. Implement an approach similar to the one used by GCC and start tracking
> line
> numbers instead of file offsets after a certain threshold. Resort to (2)
> when even line numbers overflow.
> 4. (?) Detect the multiple inclusion pattern and track it differently (for
> now
> we don't have specific ideas on how to implement this)
>
> Is any of these approaches viable? What caveats should we expect? (we
> already
> know about static_asserts guarding the sizes of certain class fields which
> start
> failing in the first approach).
>
> Other suggestions are welcome.
>
>
>
> I don't think any of the above approaches are reasonable; they would all
> require fundamental restructuring of major parts of Clang, an efficiency or
> memory size hit for all other users of Clang, or some combination of those.
>
>
>
> Your code pattern seems unreasonable; including a multi-megabyte file
> thousands of times is not a good idea. Can you split out parts of MemMap.h
> into a separate header that is only included once, and keep only the parts
> that actually change on repeated inclusion in MemMap.h itself?
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20191102/59612750/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: new-feature.patch
Type: application/octet-stream
Size: 80405 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20191102/59612750/attachment.obj>
More information about the cfe-dev
mailing list