[PATCH] D109701: [clang] Emit SARIF Diagnostics: Create `clang::SarifDocumentWriter` interface

Aaron Ballman via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Jan 6 10:17:52 PST 2022


aaron.ballman added a comment.

Thank you for your patience! I finally had the chance to go through this a bit more. I identified a bunch of tiny nits (the review feedback may look scary, but most should be trivial changes), but a few larger things about the design as well that are worth thinking about.

One thing that I'm still a bit worried about is validating the output document. The unit tests are a good start and the sort of thing we need for this functionality, but I'm also worried we won't find substantial design issues until we finally get SARIF results out of Clang or the static analyzer so we can see whether tools can actually *import* the SARIF we produce. I don't think we need in-tree tests for that sort of thing, but it'd definitely make me feel a lot more comfortable if we had external evidence that we're producing valid SARIF given that this is an exchange format. That said, I also don't want this to become a cumbersome patch for you to work on or for us to review, so I'm not certain what the best answer is here yet.



================
Comment at: clang/include/clang/Basic/Sarif.h:46
+
+using namespace llvm;
+
----------------
Having thought about this a bit more, I think this line should be removed because it's within a header file, so all includes of this header file will be impacted and likely without knowing it. I'm very sorry for the churn, but I think this should go and the `llvm::` prefixes should be brought back within the header file. Within the source file, it's fine to use the `using`.


================
Comment at: clang/include/clang/Basic/Sarif.h:112
+
+  explicit SarifArtifact(const SarifArtifactLocation &Loc) : Location(Loc) {}
+
----------------
Should we delete the default ctor?


================
Comment at: clang/include/clang/Basic/Sarif.h:162
+public:
+  static ThreadFlow create() { return {}; }
+
----------------
One question this raises for me is whether we should be enforcing invariants from the SARIF spec as part of this interface or not. Currently, you can create a thread flow that has no importance or a rule without a name/id, etc. That means it's pretty easy to create SARIF that won't validate properly. One possible way to alleviate this would be for the `create()` methods/ctors to require these be set when creating the objects. However, I can imagine there will be times when that is awkward due to following the builder pattern with the interface. Another option would be to have some validation methods on each of the interfaces and the whole tree gets validated after construction, but this could have performance impacts.

What are your thoughts?


================
Comment at: clang/include/clang/Basic/Sarif.h:248
+  // chosen because it has to be non-negative, and because the JSON encoder
+  // used requires this be a type that can be safely promoted to \c int64_t
+  uint32_t RuleIdx;
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:339
+public:
+  /// Create a new empty SARIF document
+  SarifDocumentWriter() : Closed(true){};
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:340
+  /// Create a new empty SARIF document
+  SarifDocumentWriter() : Closed(true){};
+
----------------
Once you use the default ctor, there's no way to associate language options with the document writer. Is it wise to expose this constructor? (If we didn't, then we could store the `LangOptions` as a const reference instead of making a copy in the other constructor. Given that we never mutate the options, I think that's a win.)


================
Comment at: clang/include/clang/Basic/Sarif.h:342
+
+  /// Create a new empty SARIF document with the given language options
+  SarifDocumentWriter(const LangOptions &LangOpts)
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:344
+  SarifDocumentWriter(const LangOptions &LangOpts)
+      : LangOpts(LangOpts), Closed(true) {}
+
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:346
+
+  /// Release resources held by this SARIF document
+  ~SarifDocumentWriter() = default;
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:400
+private:
+  /// Langauge options to use for the current SARIF document
+  const LangOptions LangOpts;
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:406
+  /// This could be a document that is freshly created, or has recently
+  /// finished writing to a previous run
+  bool Closed;
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:407
+  /// finished writing to a previous run
+  bool Closed;
+
----------------



================
Comment at: clang/include/clang/Basic/Sarif.h:80-82
+  static SarifArtifactLocation create(StringRef URI) {
+    return SarifArtifactLocation{URI};
+  }
----------------
vaibhav.y wrote:
> aaron.ballman wrote:
> > One thing that's worth calling out is that `StringRef` is non-owning which means that the argument passed to create the `SarifArtifactLocation` has to outlive the returned object to avoid dangling references. This makes the class a bit more dangerous to use because `Twine` or automatic `std::string` objects may cause lifetime concerns.
> > 
> > Should these classes be storing a `std::string` so that the memory is owned by SARIF?
> Good point! Will change it to `std::string` to start with.
> 
> Some fields such as MimeType, RuleID would probably do better with `SmallString`, I haven't been able to find any good measurements on how the lengths of those two are distributed, can it be made into `SmallString<N>` in a future PR?
> Some fields such as MimeType, RuleID would probably do better with SmallString, I haven't been able to find any good measurements on how the lengths of those two are distributed, can it be made into SmallString<N> in a future PR?

Yeah, I think that can be done in a follow-up if we find a performance benefit from it.


================
Comment at: clang/lib/Basic/Sarif.cpp:54-56
+/// \param C Character to encode to RFC3986
+///
+/// \return The RFC3986 representation of \c C
----------------



================
Comment at: clang/lib/Basic/Sarif.cpp:72-74
+/// \param Filename The filename to be represented as URI
+///
+/// \return RFC3986 URI representing the input file name
----------------



================
Comment at: clang/lib/Basic/Sarif.cpp:113-120
+/// \brief Calculate the column position expressed in the number of UTF-8 code
+/// points from column start to the source location
+///
+/// \param Loc The source location whose column needs to be calculated
+/// \param TokenLen Optional hint for when the token is multiple bytes long
+///
+/// \return The column number as a UTF-8 aware byte offset from column start to
----------------



================
Comment at: clang/lib/Basic/Sarif.cpp:219
+    // If inserted, ensure the original iterator points to the newly inserted
+    // element, so it can be used downstream
+    if (StatusIter.second)
----------------



================
Comment at: clang/lib/Basic/Sarif.cpp:235-236
+
+  // Since Closed = false here, expect there to be at least 1 Run, anything
+  // else is an invalid state
+  assert(!Runs.empty() && "There are no runs associated with the document!");
----------------



================
Comment at: clang/lib/Basic/Sarif.cpp:248
+void SarifDocumentWriter::endRun() {
+  // Exit early if trying to close a closed Document
+  if (Closed) {
----------------
(At this point, I'll stop commenting on these -- can you go through the patch and make sure that all comments have appropriate terminating punctuation?)


================
Comment at: clang/lib/Basic/Sarif.cpp:199
+    const SarifArtifactLocation &Location =
+        SarifArtifactLocation::create(FileURI).setIndex(Idx);
+    const SarifArtifact &Artifact = SarifArtifact::create(Location)
----------------
vaibhav.y wrote:
> aaron.ballman wrote:
> > This seems like it'll be a use-after-free because the local `std::string` will be destroyed before the lifetime of the `SarifArtifactLocation` ends.
> Will run it through a pass of asan & msan, is the best way to add: `-fsanitize=memory -fsanitize=address` to the test CMakeLists.txt & run them?
> 
> I've changed all strings to `std::string`, so this one should no longer be a problem but I wonder if there's any others I have missed as well.
> Will run it through a pass of asan & msan, is the best way to add: -fsanitize=memory -fsanitize=address to the test CMakeLists.txt & run them?

Yup! The critical part will be test coverage -- code paths that aren't executed won't get issues reported with them.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109701/new/

https://reviews.llvm.org/D109701



More information about the cfe-commits mailing list