[PATCH] D109701: [clang] Emit SARIF Diagnostics: Create `clang::SarifDocumentWriter` interface
Aaron Ballman via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Mon Jun 27 11:22:56 PDT 2022
aaron.ballman added inline comments.
Comment at: clang/include/clang/Basic/Sarif.h:296-297
+/// The SourceLocation provided must point at the beginning of the token.
+typedef std::function<unsigned int(const FullSourceLoc &, const LangOptions &)>
I worry about the performance aspects of using a callback for this. Calls across DSO boundaries can be slower (due to having to call through a thunk), and it necessitates calling back to the user to calculate information that the user could simply pass in directly. That's on top of `std::function` already being pretty heavy-weight.
Comment at: clang/lib/Basic/Sarif.cpp:161
+ Region["endColumn"] = adjustColumnPos(
+ R.getEnd(), Lexer::MeasureTokenLength(R.getEnd().getLocWithOffset(0),
+ R.getEnd().getManager(), LO));
> aaron.ballman wrote:
> > vaibhav.y wrote:
> > > aaron.ballman wrote:
> > > > I didn't catch this during the review -- but this is a layering violation that caused link errors on some of the build bots. Lexer can call into Basic, but Basic cannot call into Lexer. So we'll need to find a different way to handle this.
> > > Would it work to move the code to Support and have it depend on Basic & Lex?
> > I don't think so -- Support is supposed to be a pretty low-level interface; it currently only relies on LLVM's Support library. I think the layering is supposed to be: Support -> Basic -> Lex.
> > As I see it, there are a couple of options as to where this could live. It could live in the Frontend library, as that's where all the diagnostic consumer code in Clang lives. But that library might be a bit "heavy" to pull into other tools (maybe? I don't know). It could also live in AST -- that already links in Basic and Lex. But that feels like a somewhat random place for this to live as this has very little to do with the AST itself.
> > Another approach, which might be better, is to require the user of this interface to pass in the token length calculation themselves in the places where it's necessary. e.g., `json::Object whatever(SourceLocation Start, SourceLocation End, unsigned EndLen)` and then you can remove the reliance on the lexer entirely while keeping the interface in Basic. I'm not certain how obnoxious this suggestion is, but I think it's my preferred approach for the moment (but not a strongly held position yet). WDYT of this approach?
> I think the approach of injecting the function is better here. I've tried to make the smallest change possible by passing in a function whose interface is almost identical to `Lexer::MeasureTokenLength`. The intent was to hint at this being the "canonical metric" for token lengths (with an example in the tests for the same).
> I tried passing start/end locations, but couldn't find a strong use case yet, since `end` would likely always be: `Lexer::getLocForEndOfToken(start, 0)`
I'm not convinced that the less obtrusive change is a good design in this case. But I also agree that we should not use start/end *locations* either. `SourceLocation` traditionally points to the *start* of a token, so it would be super easy to get the `end` location wrong by forgetting to pass the location for the end of the token.
My suggestion was to continue to pass the start of the starting token, the start of the ending token, and the length of the ending token. With the callback approach, you have to call through the callback to eventually call `Lexer::MeasureTokenLength()`; with the direct approach, you skip needing to call through a callback (which means at least one less function call on every source location operation).