[PATCH] D37491: [Preamble] Fixed preamble breaking with BOM presence (and particularly, fluctuating BOM presence)

Ilya Biryukov via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Sep 7 01:13:50 PDT 2017


ilya-biryukov added a comment.

In https://reviews.llvm.org/D37491#862160, @cameron314 wrote:

> > Are there potential problems we may run into because of the changing offsets? Could we add tests checking changing the offsets does not matter?
>
> That's a good point; I've looked into it and the PCH for the preamble is parsed using just the buffer slice that contains the preamble, excluding any BOM. That means that when we resume parsing later on a main buffer with a BOM, the `SourceLocation`s within the preamble itself will be off. However, normally this doesn't matter since the only things in the preamble are preprocessor directives, whose positions are very rarely used. (I should note at this point that we've been using a variant of this patch in production for a few years without any problem.) So, we have two choices: Either parse the preamble with the BOM and throw out the preamble/PCH when the BOM presence changes from the main buffer, or slice the buffer when using a preamble PCH so that it never has a BOM during parsing. I'm leaning towards the second option, since it's a little cleaner and lets the preamble be reused more easily; the only downside is that an external consumer would not be able to use any absolute offsets from the AST (note that line/column offsets would be identical) in the original buffer if it has a BOM -- but in any case, absolute offsets are usually useless without the buffer itself, which if obtained from clang would always be the correct buffer.


Maybe there's  a third option option to remove the BOM from the buffer before passing it to clang?
Could you elaborate on your use-case a little more? Is there no way to consistently always pass buffers either with or without BOM?

Out of two options you mention discarding preamble on BOM changes seems like an easy option that is both correct and won't make a difference in performance since BOM rarely changes.
Looking at your use-case, it sounds like you'll only have 1 extra reparse of preamble, which is probably fine.

>> Should we add checks that BOM was removed or added, but not changed? I would not expect preamble to be reusable "as is" if BOM (and therefore, input encoding) changed.
> 
> I'm not sure I understand this point. Clang only understands UTF-8; the BOM is either present or not, but the encoding never changes. (And the BOM itself is always the same byte sequence too.) It has no impact on the file contents.

Sure, it's not something clang supports, it's an edge-case when clang receives "malformed" input. Does lexer only skip utf-8 BOM, but not other versions of BOM?
But you're right, it's highly unlikely anything will break in that case.



================
Comment at: unittests/Frontend/PchPreambleTest.cpp:190
+
+  ASSERT_TRUE(ReparseAST(AST));
+  ASSERT_FALSE(AST->getDiagnostics().hasErrorOccurred());
----------------
cameron314 wrote:
> ilya-biryukov wrote:
> > We're not really testing that preamble was reused.
> > Maybe return a flag from `ASTUnit::Reparse` to indicate if preamble was reused and check it here?
> We are; if it wasn't reused, the header would have been opened again and the last assert on `GetFileReadCount` below would fail.
Missed that, thanks. Looks good.
Maybe add a comment explicitly noting that?


https://reviews.llvm.org/D37491





More information about the cfe-commits mailing list