[cfe-dev] [RFC] Clang SourceLocation overflow

Aaron Ballman via cfe-dev cfe-dev at lists.llvm.org
Tue Oct 8 10:49:38 PDT 2019


On Mon, Oct 7, 2019 at 2:45 PM Reid Kleckner via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>
> The increase in memory usage would be a darn shame. Most of the complexity of the source location machinery exists just to save these four bytes. If we used 64-bit values, source locations could easily be pointers into mapped files or something simple.
>
> Can we detect files with pathologically many source locations, and collapse them to one source location, or something like that?
> For a 2+GB file, it's not unreasonable to say "error, somewhere in this file, you figure it out, it's your own fault for generating source code this large".
> I'm sure this would break some invariants, but it's worth exploring before increasing memory usage.

I'm not keen on this approach because source locations aren't just for
humans. For instance, source location information is output as part of
the static analyzer and may be dumped to SARIF or a plist file to be
consumed by other tools (sometimes through automation). I doubt those
tools would expect this behavior (especially if there's a spec, like
for SARIF), and it would be unfortunate to expect them to work around
Clang's bug.

That said, I am also super wary of increasing the memory usage --
source locations are *everywhere*.

~Aaron

> Also, what is the current failure mode when source locations are exhausted? Does clang print a useful error message instructing the user to reduce the size of their pre-processed input? That seems like a good place to start.
>
> On Mon, Oct 7, 2019 at 2:54 AM Oliver Stannard via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>
>> I think that we can rule out option 2, given that you've found a pattern used in real-world code which hits this limit. We'd like to avoid giving <invalid location> to any real users.
>>
>> I'm not familiar with how GCC does source locations, but this sounds like it's just delaying the problem by a factor of the average line length? If so, adding two different thresholds at which line number accuracy is degraded doesn't seem worth it.
>>
>> I also don't think option 4 is viable, as it would increase complexity a lot, and would only work for this exact pattern. I've also seen bug reports from customers trying to pre-process (mechanically generated) assembly files >2GB, which this solution wouldn't help with.
>>
>> That leaves option 1, which looks like the best solution to me. The obvious concern with this one is the increase in memory consumption, since there will be a large number of these objects (one per token/AST node?). @Matt: do you have any memory consumption numbers from your prototype which could help here?
>>
>> Oliver
>>
>> On Thu, 3 Oct 2019 at 17:23, Matt Asplund via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>>
>>> I don't want to distract from your question, but wanted to add that I have been seeing source location overflow issues for many months when using Clang's implementation of C++20 modules. I have a personal branch where I have made a partial conversion over to 64-bit source locations for test purposes.
>>>
>>> -Matt
>>>
>>> On Wed, Oct 2, 2019, 9:26 AM Mikhail Maltsev via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are experiencing a problem with Clang SourceLocation overflow.
>>>> Currently source locations are 32-bit values; one bit is a flag, which leaves
>>>> a source location space of 2^31 characters.
>>>>
>>>> When the Clang lexer processes an #include directive it reserves the total size
>>>> of the file being included in the source location space. An overflow can occur
>>>> if a large file (which does not have include guards by design) is included many
>>>> times into a single TU.
>>>>
>>>> The pattern of including a file multiple times is for example required by
>>>> the AUTOSAR standard [1], which is widely used in the automotive industry.
>>>> Specifically the pattern is described in the Specification of Memory Mapping [2]:
>>>>
>>>> Section 8.2.1, MEMMAP003:
>>>> "The start and stop symbols for section control are configured with section
>>>> identifiers defined in MemMap.h [...] For instance:
>>>>
>>>> #define EEP_START_SEC_VAR_16BIT
>>>> #include "MemMap.h"
>>>> static uint16 EepTimer;
>>>> static uint16 EepRemainingBytes;
>>>> #define EEP_STOP_SEC_VAR_16BIT
>>>> #include "MemMap.h""
>>>>
>>>> Section 8.2.2, MEMMAP005:
>>>> "The file MemMap.h shall provide a mechanism to select different code, variable
>>>> or constant sections by checking the definition of the module specific memory
>>>> allocation key words for starting a section [...]"
>>>>
>>>> In practice MemMap.h can reach several MBs and can be included several thousand
>>>> times causing an overflow in the source location space.
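
As a rough sanity check of the numbers above, here is a minimal sketch of the
arithmetic. It is a simplified model, not Clang's SourceManager code: it
assumes each inclusion reserves exactly the file size in a 2^31 space, and the
names kLocationSpace, maxInclusions and overflows are made up for illustration.

```cpp
#include <cstdint>

// Simplified model (assumption): a 32-bit location with one flag bit
// leaves 2^31 usable offsets, and each #include reserves `fileSize` of them.
constexpr std::uint64_t kLocationSpace = std::uint64_t{1} << 31;

// Number of times a file of `fileSize` bytes can be included before the
// reserved offsets no longer fit in the location space.
constexpr std::uint64_t maxInclusions(std::uint64_t fileSize) {
  return fileSize == 0 ? 0 : kLocationSpace / fileSize;
}

// True if `count` inclusions of a `fileSize`-byte file need more than
// kLocationSpace offsets. Written with a division to avoid overflow in
// the check itself.
constexpr bool overflows(std::uint64_t fileSize, std::uint64_t count) {
  return fileSize != 0 && count > kLocationSpace / fileSize;
}
```

Under this model a 2 MiB MemMap.h fits 1024 times, so "several MBs" included
"several thousand times" exceeds the space by a wide margin.
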
>>>>
>>>> The problem does not occur with GCC because it tracks line numbers rather than
>>>> file offsets. Column numbers are tracked separately and are optional. I.e., in
>>>> GCC a source location can be either a (line+column) tuple packed into 32 bits or
>>>> (when the line number exceeds a certain threshold) a 32-bit line number.
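
To make the two-mode scheme concrete, here is a minimal sketch of such an
encoding. It only illustrates the idea described above and is not GCC's actual
line-map implementation: the 7-bit column width, the threshold value, and the
names encode, decode and Decoded are assumptions chosen for the example.

```cpp
#include <cstdint>

// Illustrative layout (assumption, not GCC's real encoding):
//   value <  kThreshold : (line << kColumnBits) | column
//   value >= kThreshold : line only, column information dropped
constexpr std::uint32_t kColumnBits = 7;
constexpr std::uint32_t kMaxColumn = (1u << kColumnBits) - 1;
constexpr std::uint32_t kThreshold = 0x60000000u;
constexpr std::uint32_t kMaxPackedLine = (kThreshold >> kColumnBits) - 1;
constexpr std::uint32_t kMaxLine =
    kMaxPackedLine + 1 + (0xFFFFFFFFu - kThreshold);
constexpr std::uint32_t kInvalid = 0;  // lines are 1-based, so 0 is free

struct Decoded {
  std::uint32_t line;
  std::uint32_t column;  // 0 means "column unknown"
};

inline std::uint32_t encode(std::uint32_t line, std::uint32_t column) {
  if (line <= kMaxPackedLine) {
    // Columns that do not fit in kColumnBits are recorded as unknown.
    const std::uint32_t col = column <= kMaxColumn ? column : 0;
    return (line << kColumnBits) | col;
  }
  if (line > kMaxLine) return kInvalid;  // even line numbers exhausted
  return kThreshold + (line - kMaxPackedLine - 1);
}

inline Decoded decode(std::uint32_t loc) {
  if (loc < kThreshold) return {loc >> kColumnBits, loc & kMaxColumn};
  return {kMaxPackedLine + 1 + (loc - kThreshold), 0};
}
```

With these constants, encode(10, 5) round-trips to line 10, column 5, while a
location on line 20000000 keeps its line number but loses its column.
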
>>>>
>>>> We are looking for an acceptable way of resolving the problem and propose the
>>>> following approaches for discussion:
>>>> 1. Use 64 bits for source location tracking.
>>>> 2. Track until an overflow occurs; after that, make the lexer output
>>>>    the <invalid location> special value for all subsequent tokens.
>>>> 3. Implement an approach similar to the one used by GCC and start tracking line
>>>>    numbers instead of file offsets after a certain threshold. Resort to (2)
>>>>    when even line numbers overflow.
>>>> 4. (?) Detect the multiple inclusion pattern and track it differently (for now
>>>>    we don't have specific ideas on how to implement this)
>>>>
>>>> Is any of these approaches viable? What caveats should we expect? (we already
>>>> know about static_asserts guarding the sizes of certain class fields which start
>>>> failing in the first approach).
>>>>
>>>> Other suggestions are welcome.
>>>>
>>>> [1]. https://www.autosar.org
>>>> [2]. https://www.autosar.org/fileadmin/user_upload/standards/classic/3-0/AUTOSAR_SWS_MemoryMapping.pdf
>>>>
>>>> --
>>>> Regards,
>>>>   Mikhail Maltsev
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


