[clang] [clang][Diagnostics] Highlight code snippets (PR #66514)

Wed Sep 20 12:53:45 PDT 2023

Timm Bäder <tbaeder at redhat.com>
In-Reply-To: <llvm/llvm-project/pull/66514/clang at github.com>


================
@@ -0,0 +1,77 @@
+
+#include "clang/Frontend/CodeSnippetHighlighter.h"
+#include "clang/Basic/DiagnosticOptions.h"
+#include "clang/Basic/SourceManager.h"
+#include "clang/Lex/Lexer.h"
+#include "clang/Lex/Preprocessor.h"
+#include "clang/Lex/PreprocessorOptions.h"
+#include "llvm/Support/raw_ostream.h"
+
+using namespace clang;
+
+static SourceManager createTempSourceManager() {
+  FileSystemOptions FileOpts;
+  FileManager FileMgr(FileOpts);
+  llvm::IntrusiveRefCntPtr<DiagnosticIDs> DiagIDs(new DiagnosticIDs());
+  llvm::IntrusiveRefCntPtr<DiagnosticOptions> DiagOpts(new DiagnosticOptions());
+  DiagnosticsEngine diags(DiagIDs, DiagOpts);
+  return SourceManager(diags, FileMgr);
+}
+
+static Lexer createTempLexer(llvm::MemoryBufferRef B, SourceManager &FakeSM,
+                             const LangOptions &LangOpts) {
+  return Lexer(FakeSM.createFileID(B), B, FakeSM, LangOpts);
+}
+
+std::vector<StyleRange> CodeSnippetHighlighter::highlightLine(
+    StringRef SourceLine, const Preprocessor *PP, const LangOptions &LangOpts) {
+  if (!PP)
+    return {};
+  constexpr raw_ostream::Colors CommentColor = raw_ostream::BLACK;
+  constexpr raw_ostream::Colors LiteralColor = raw_ostream::GREEN;
+  constexpr raw_ostream::Colors KeywordColor = raw_ostream::YELLOW;
+
+  SourceManager FakeSM = createTempSourceManager();
+  const auto MemBuf = llvm::MemoryBuffer::getMemBuffer(SourceLine);
+  Lexer L = createTempLexer(MemBuf->getMemBufferRef(), FakeSM, LangOpts);
+  L.SetKeepWhitespaceMode(true);
----------------
zygoloid wrote:

Yes, I think those are the three cases we can (currently) encounter.

For multi-line comments: all our `-Wdoxygen` warnings will fire in the middle of multi-line comments. I don't think we want to turn off the highlighting in those cases. We also can't tell, from a line containing `foo /* bar */ baz` alone, that `foo` is outside a block comment: it'd be valid for there to be an unclosed `/*` on a previous line. We do detect and warn on `/*` within a `/*...*/` comment, and we could perhaps keep track of the places where that happens. I'm not sure we warn on `//` within a `/*...*/` comment, which has similar issues.
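
To make the ambiguity concrete, here is a minimal standalone sketch (not Clang code; `classifyLine` is a hypothetical helper invented for illustration) that marks which characters of a single line fall inside a block comment. The only input besides the line is whether a `/*` from an earlier line is still open, which is exactly the piece of state that lexing one line in isolation does not have. Line comments and string literals are ignored to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Clang. Returns one mark per character:
// 'C' = inside a block comment, '.' = ordinary code.
// `startsInsideComment` says whether a `/*` from a previous line is still
// open when this line begins.
std::string classifyLine(const std::string &Line, bool startsInsideComment) {
  std::string Marks(Line.size(), '.');
  bool InComment = startsInsideComment;
  for (std::size_t I = 0; I < Line.size();) {
    if (!InComment && Line.compare(I, 2, "/*") == 0) {
      // Opening delimiter: both of its characters belong to the comment.
      InComment = true;
      Marks[I] = Marks[I + 1] = 'C';
      I += 2;
    } else if (InComment && Line.compare(I, 2, "*/") == 0) {
      // Closing delimiter: still part of the comment, then back to code.
      Marks[I] = Marks[I + 1] = 'C';
      InComment = false;
      I += 2;
    } else {
      Marks[I] = InComment ? 'C' : '.';
      ++I;
    }
  }
  return Marks;
}
```

With `startsInsideComment == false`, the line `foo /* bar */ baz` is marked `....CCCCCCCCC....`; with `true`, it is marked `CCCCCCCCCCCCC....`, so `foo` is comment text and the inner `/*` is the nested opener we already warn about. Both results are valid readings of the same source line.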

It might be reasonable to require that any time a diagnostic is produced with a caret location within a comment or a string literal, the diagnostic must be informed of that fact. Possibly we could require that the caret location is either a raw token location (that is, it points to a location that we know we can lex forward from), or a raw token location plus an offset from the start of the token (for diagnostics within comments and strings)? That would at least allow us to highlight reliably from the caret location forwards, but scanning backwards to find the start of a comment or string would still not really be possible in general. We could approximate it with heuristics, but that's imperfect.
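
A rough sketch of what "raw token location plus an offset" could look like, purely as an illustration: `CaretLocation`, `TokenRange`, and `blockCommentAt` are hypothetical names, not existing Clang types or functions, and plain buffer offsets stand in for `SourceLocation`. The point is that once the start of the enclosing token is known, the extent of the comment can be found by scanning forwards only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical: a caret expressed as the start of a raw token plus an
// offset into that token (both are plain offsets into the buffer).
struct CaretLocation {
  std::size_t TokenStart;
  std::size_t OffsetInToken;
  std::size_t caret() const { return TokenStart + OffsetInToken; }
};

// Half-open range [Begin, End); Valid is false when nothing was found.
struct TokenRange {
  bool Valid;
  std::size_t Begin;
  std::size_t End;
};

// Hypothetical helper: if a block comment starts at Loc.TokenStart and the
// caret lies inside it, return the comment's extent. No backwards scan.
TokenRange blockCommentAt(const std::string &Buf, CaretLocation Loc) {
  TokenRange None = {false, 0, 0};
  if (Loc.TokenStart > Buf.size() || Buf.compare(Loc.TokenStart, 2, "/*") != 0)
    return None;
  std::size_t Close = Buf.find("*/", Loc.TokenStart + 2);
  // An unterminated comment runs to the end of the buffer.
  std::size_t End = Close == std::string::npos ? Buf.size() : Close + 2;
  if (Loc.caret() >= End)
    return None;
  TokenRange Found = {true, Loc.TokenStart, End};
  return Found;
}
```

For the buffer `a /* b c */ d`, a caret given as token start 2 plus offset 5 (pointing at `c`) yields the range [2, 11), covering the whole comment. A caret that only carried the absolute offset 7 would give no such guarantee.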

So I suppose part of what we need to decide here is how much imperfection we're OK with. I think this highlighting will become important feedback to developers to help them see how Clang is interpreting their code, so I think it's important that the highlighting is reliable. If the highlighting is weird / wrong, the developer will assume that Clang is interpreting the code in that weird / wrong way.

https://github.com/llvm/llvm-project/pull/66514