[PATCH] D37491: [Preamble] Fixed preamble breaking with BOM presence (and particularly, fluctuating BOM presence)

Cameron via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Sep 6 07:41:25 PDT 2017


cameron314 added a comment.

Thanks for the response!

> How are various preprocessor offests (and SourceLocation offsets) are calculated? Do they account for BOM presence and ignore it?

Everything is in byte offsets; the `SourceLocation` after the BOM is not the same as before the BOM. The lexer automatically skips the BOM at the beginning of the file if it sees one (`Lexer::InitLexer`), and everything else works normally after that. The start of the first line is after the BOM, if any, which means it doesn't affect line/column numbers.

> Are there potential problems we may run into because of the changing offsets? Could we add tests checking changing the offsets does not matter?

That's a good point; I've looked into it and the PCH for the preamble is parsed using just the buffer slice that contains the preamble, excluding any BOM. That means that when we resume parsing later on a main buffer with a BOM, the `SourceLocation`s within the preamble itself will be off. However, normally this doesn't matter since the only things in the preamble are preprocessor directives, whose positions are very rarely used. (I should note at this point that we've been using a variant of this patch in production for a few years without any problem.) So, we have two choices: Either parse the preamble with the BOM and throw out the preamble/PCH when the BOM presence changes from the main buffer, or slice the buffer when using a preamble PCH so that it never has a BOM during parsing. I'm leaning towards the second option, since it's a little cleaner and lets the preamble be reused more easily; the only downside is that an external consumer would not be able to use any absolute offsets from the AST (note that line/column offsets would be identical) in the original buffer if it has a BOM -- but in any case, absolute offsets are usually useless without the buffer itself, which if obtained from clang would always be the correct buffer.

> Should we add checks that BOM was removed or added, but not changed? I would not expect preamble to be reusable "as is" if BOM (and therefore, input encoding) changed.

I'm not sure I understand this point. Clang only understands UTF-8; the BOM is either present or not, but the encoding never changes. (And the BOM itself is always the same byte sequence too.) It has no impact on the file contents.



================
Comment at: include/clang/Frontend/PrecompiledPreamble.h:102
+  /// appears or disappears between parses).
+  void UpdateOffset(unsigned NewOffset);
+
----------------
ilya-biryukov wrote:
> Let's leave this class's interface immutable. It is used concurrently in clangd and having a mutable method like this would break the code.
> 
> Passing new `PreambleBounds` to `AddImplicitPreamble` and setting the offsets accordingly would do the trick, leave the interface immutable and make the fact that offsets might change more evident.
Fair point, I'll change this.


================
Comment at: include/clang/Frontend/PrecompiledPreamble.h:191
   bool PreambleEndsAtStartOfLine;
+  /// The start offset of the bounds used to build the preamble.
+  unsigned PreambleOffset;
----------------
ilya-biryukov wrote:
> Let's store original `PreambleBounds` instead of `PreambleEndsAtStartOfLine` and `PreambleOffset`.
> It would make the code easier to read.
Again, good point, I'll change this.


================
Comment at: include/clang/Lex/Lexer.h:50
+  /// \brief Start offset of the preamble, in bytes.
+  unsigned StartOffset;
+  /// \brief Size of the preamble in bytes.
----------------
ilya-biryukov wrote:
> Maybe pick a name that clearly states that it's a `BOM` size? 
> Or add a comment indicating that it's a `BOM` offset.
I can see how this might be confusing. I'll add a comment.


================
Comment at: include/clang/Lex/Lexer.h:639
 
-  void SkipBytes(unsigned Bytes, bool StartOfLine);
+  void SetByteOffset(unsigned Offset, bool StartOfLine);
 
----------------
ilya-biryukov wrote:
> Maybe leave the old name? Doesn't `SkipBytes` captures the new semantics just as good?
`SkipBytes` moves relative to the current position, but the lexer skips the BOM implicitly on construction; I don't want to skip it twice. `SetByteOffset` is absolute, which makes it simple and clear to use without having to reason about implicit past state.


================
Comment at: lib/Frontend/PrecompiledPreamble.cpp:195
 
 PreambleBounds clang::ComputePreambleBounds(const LangOptions &LangOpts,
                                             llvm::MemoryBuffer *Buffer,
----------------
ilya-biryukov wrote:
> Could you inline usages of this function and remove it?
> 
I could; I think it makes sense to leave the wrapper, though, since the `ASTUnit` deals with the `PrecompiledPreamble` at its level of abstraction, and the `PrecompiledPreamble` deals with the lexer at its level of abstraction.


================
Comment at: unittests/Frontend/PchPreambleTest.cpp:190
+
+  ASSERT_TRUE(ReparseAST(AST));
+  ASSERT_FALSE(AST->getDiagnostics().hasErrorOccurred());
----------------
ilya-biryukov wrote:
> We're not really testing that preamble was reused.
> Maybe return a flag from `ASTUnit::Reparse` to indicate if preamble was reused and check it here?
We are; if it wasn't reused, the header would have been opened again and the last assert on `GetFileReadCount` below would fail.


https://reviews.llvm.org/D37491





More information about the cfe-commits mailing list