[clang] [clang][Diagnostics] Highlight code snippets (PR #66514)

Wed Sep 20 10:48:31 PDT 2023

Timm Bäder <tbaeder at redhat.com>
Message-ID:
In-Reply-To: <llvm/llvm-project/pull/66514/clang at github.com>


================
@@ -0,0 +1,77 @@
+
+#include "clang/Frontend/CodeSnippetHighlighter.h"
+#include "clang/Basic/DiagnosticOptions.h"
+#include "clang/Basic/SourceManager.h"
+#include "clang/Lex/Lexer.h"
+#include "clang/Lex/Preprocessor.h"
+#include "clang/Lex/PreprocessorOptions.h"
+#include "llvm/Support/raw_ostream.h"
+
+using namespace clang;
+
+static SourceManager createTempSourceManager() {
+  FileSystemOptions FileOpts;
+  FileManager FileMgr(FileOpts);
+  llvm::IntrusiveRefCntPtr<DiagnosticIDs> DiagIDs(new DiagnosticIDs());
+  llvm::IntrusiveRefCntPtr<DiagnosticOptions> DiagOpts(new DiagnosticOptions());
+  DiagnosticsEngine diags(DiagIDs, DiagOpts);
+  return SourceManager(diags, FileMgr);
+}
+
+static Lexer createTempLexer(llvm::MemoryBufferRef B, SourceManager &FakeSM,
+                             const LangOptions &LangOpts) {
+  return Lexer(FakeSM.createFileID(B), B, FakeSM, LangOpts);
+}
+
+std::vector<StyleRange> CodeSnippetHighlighter::highlightLine(
+    StringRef SourceLine, const Preprocessor *PP, const LangOptions &LangOpts) {
+  if (!PP)
+    return {};
+  constexpr raw_ostream::Colors CommentColor = raw_ostream::BLACK;
+  constexpr raw_ostream::Colors LiteralColor = raw_ostream::GREEN;
+  constexpr raw_ostream::Colors KeywordColor = raw_ostream::YELLOW;
+
+  SourceManager FakeSM = createTempSourceManager();
+  const auto MemBuf = llvm::MemoryBuffer::getMemBuffer(SourceLine);
+  Lexer L = createTempLexer(MemBuf->getMemBufferRef(), FakeSM, LangOpts);
+  L.SetKeepWhitespaceMode(true);
----------------
zygoloid wrote:

While I think re-lexing the input to find the tokens is the right approach, starting with the source line in isolation is going to do the wrong thing in a lot of cases. For example, a format string warning inside a multi-line raw string literal will get bad highlighting due to not taking the initial lexing state for the line into account. But equally, re-lexing the entire file seems like it's going to be problematic from a performance perspective. I can think of a few alternatives here:

1) We could make the regular lexing process keep track of some of the lines where the lexer is in its "normal" state at the start of the line: whenever we're in the normal lexing state at the start of a line, add the line number to a per-file list if it's been "long enough" (maybe >1K of program text?) since we last did so. Then, when emitting diagnostics, we can find the most recent recorded line at or before the diagnostic location and lex forward from there to drive syntax highlighting.

2) We could make the diagnostics layer keep a cache of the tokenized forms of buffers for which we emit diagnostics. We'd still re-lex an entire file if we emit diagnostics within it, but we'd only do so *once*, and we don't need to store the full list of tokens, only a list of (offset, color) pairs for transitions between token kinds.
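
To make the two options a bit more concrete, here is a rough, standalone sketch of the bookkeeping each one would need. Everything in it is hypothetical: `LexCheckpoints`, `ColorTransition`, `colorAt`, and the `Color` enum are illustrative names, not existing clang APIs, and it records byte offsets rather than line numbers purely to keep the example short. The 1024-byte gap stands in for the ">1K of program text" guess above.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical color classes; a real version would map to raw_ostream colors.
enum class Color { Default, Comment, Literal, Keyword };

// Option 1: sparse list of line-start offsets where the lexer was in its
// "normal" state (not inside a raw string, block comment, etc.).
struct LexCheckpoints {
  static constexpr std::size_t MinGap = 1024; // assumed threshold
  std::vector<std::size_t> Offsets;           // kept sorted by construction

  // Called during regular lexing, in increasing offset order.
  void noteNormalLineStart(std::size_t Off) {
    if (Offsets.empty() || Off - Offsets.back() >= MinGap)
      Offsets.push_back(Off);
  }

  // Closest checkpoint at or before Off; 0 (start of buffer) if none.
  std::size_t restartFor(std::size_t Off) const {
    auto It = std::upper_bound(Offsets.begin(), Offsets.end(), Off);
    return It == Offsets.begin() ? 0 : *std::prev(It);
  }
};

// Option 2: per-buffer cache of transitions. Text from Offset up to the
// next entry's Offset is drawn in color C.
struct ColorTransition {
  std::size_t Offset;
  Color C;
};

// Color in effect at Off, given transitions sorted by Offset.
Color colorAt(const std::vector<ColorTransition> &Transitions,
              std::size_t Off) {
  auto It = std::upper_bound(
      Transitions.begin(), Transitions.end(), Off,
      [](std::size_t O, const ColorTransition &T) { return O < T.Offset; });
  return It == Transitions.begin() ? Color::Default : std::prev(It)->C;
}
```

With option 1, the highlighter would call `restartFor` on the offset of the line being printed and re-lex only from there. With option 2, it would build the transition list once per buffer and then query `colorAt` (or walk the list) for each snippet line. Both lookups are binary searches, so the per-diagnostic cost stays small either way.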

Thoughts?

https://github.com/llvm/llvm-project/pull/66514