[PATCH] D109701: [clang] Emit SARIF Diagnostics: Create `clang::SarifDocumentWriter` interface

Mon Jun 27 12:45:21 PDT 2022

vaibhav.y added inline comments.

================
Comment at: clang/lib/Basic/Sarif.cpp:161
+    Region["endColumn"] = adjustColumnPos(
+        R.getEnd(), Lexer::MeasureTokenLength(R.getEnd().getLocWithOffset(0),
+                                              R.getEnd().getManager(), LO));
----------------
aaron.ballman wrote:
> vaibhav.y wrote:
> > aaron.ballman wrote:
> > > vaibhav.y wrote:
> > > > aaron.ballman wrote:
> > > > > vaibhav.y wrote:
> > > > > > aaron.ballman wrote:
> > > > > > > I didn't catch this during the review -- but this is a layering violation that caused link errors on some of the build bots. Lexer can call into Basic, but Basic cannot call into Lexer. So we'll need to find a different way to handle this.
> > > > > > Would moving the code to Support, having it depend on Basic & Lex work?
> > > > > I don't think so -- Support is supposed to be a pretty low-level interface; it currently only relies on LLVM's Support library. I think the layering is supposed to be: Support -> Basic -> Lex.
> > > > > 
> > > > > As I see it, there are a couple of options as to where this could live. It could live in the Frontend library, as that's where all the diagnostic consumer code in Clang lives. But that library might be a bit "heavy" to pull into other tools (maybe? I don't know). It could also live in AST -- that already links in Basic and Lex. But that feels like a somewhat random place for this to live as this has very little to do with the AST itself.
> > > > > 
> > > > > Another approach, which might be better, is to require the user of this interface to pass in the token length calculation themselves in the places where it's necessary. e.g., `json::Object whatever(SourceLocation Start, SourceLocation End, unsigned EndLen)` and then you can remove the reliance on the lexer entirely while keeping the interface in Basic. I'm not certain how obnoxious this suggestion is, but I think it's my preferred approach for the moment (but not a strongly held position yet). WDYT of this approach?
> > > > I think the approach to injecting the function is better here. I've tried to make the smallest change possiblew with passing in a function whose interface is almost identical to `Lexer::MeasureTokenLength`. The intent was to hint at this being the "canonical metric" for token lengths (with an example in the tests for the same).
> > > > 
> > > > I tried passing start, end locs but couldn't find a strong use case yet since `end` would likely always be: `Lexer::getLocForEndOfToken(start, 0)`
> > > I'm not convinced that the less obtrusive change is a good design in this case. But I also agree that we should not use start/end *locations* either. `SourceLocation` traditionally points to the *start* of a token, so it would be super easy to get the `end` location wrong by forgetting to pass the location for the end of the token.
> > > 
> > > My suggestion was to continue to pass the start of the starting token, the start of the ending token, and the length of the ending token. With the callback approach, you have to call through the callback to eventually call `Lexer::MeasureTokenLength()`; with the direct approach, you skip needing to call through a callback (which means at least one less function call on every source location operation).
> > Ah, I think I misunderstood your initial suggestion (`json::Object whatever(SourceLocation Start, SourceLocation End, unsigned EndLen)`) seemed like a function call to me, when it seems the suggested change was to pass in an object? Apologies, will fix that up.
> Sorry for the confusion! Just to make sure we're on the same page -- my suggestion was to change the function interfaces like `SarficDocumentWriter::createPhysicalLocation()` so that they would take an additional `unsigned EndLen` parameter.
> 
> However, now that I dig around a bit, it seems like `CharSourceRange` is what you'd want to use there -- then you can assert that what you're given is a char range and not a token range. So you won't need the `unsigned EndLen` parameter after all!
Interesting!

Asking for my understanding: If a `CharSourceRange` is a valid character range,  then the `End` SLoc points to the last character in the range (even if it is mid token)? (Unlike slocs where it the first character of the last token).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109701/new/

https://reviews.llvm.org/D109701