[PATCH] D53814: Allow the analyzer to output to a SARIF file

Mon Oct 29 10:07:24 PDT 2018

aaron.ballman added inline comments.

================
Comment at: Analysis/diagnostics/sarif-check.py:22
+passes = 0
+with open(testfile) as testfh:
+    lineno = 0
----------------
george.karpenkov wrote:
> Wow, this is super neat!
> Since you are quite active in LLVM community, would you think it's better to have this tool in `llvm/utils` next to FileCheck? The concept is very general, and I think a lot of people really feel FileCheck limitations.
The concept was pulled from test\TableGen\JSON-check.py, so it likely could be generalized. However, I suspect that each JSON test may want to expose different helper capabilities, so perhaps it won't be super general? I don't know enough about good Python design to know.

================
Comment at: Analysis/diagnostics/sarif-diagnostics-taint-test.c:18
+// Test the basics for sanity.
+// CHECK: sarifLog['version'].startswith("2.0.0")
+// CHECK: sarifLog['runs'][0]['tool']['fullName'] == "clang static analyzer"
----------------
george.karpenkov wrote:
> Would it make more sense to just use `diff` + json pretty-formatter to write a test?
> With this test I can't even quite figure out how the output should look like.
I'm not super comfortable with that approach, but perhaps I'm thinking of something different than what you're proposing. The reason I went with this approach is because diff would be fragile (depends heavily on field ordering, which the JSON support library doesn't give control over) and the physical layout of the file isn't what needs to be tested anyway. SARIF has a fair amount of optional data that can be provided as well, so using a purely textual diff worried me that exporting additional optional data in the future would require extensive unrelated changes to all SARIF diffs in the test directory.

The goal for this test was to demonstrate that we can validate that the interesting bits of information are present in the output without worrying about the details.

Also, the python approach allows us to express relationships between data items more easily than a textual diff tool would. I've not used that here, but I could imagine a test where we want to check that each code location has a corresponding file entry in another list.

================
Comment at: StaticAnalyzer/Core/SarifDiagnostics.cpp:74
+  // Add the rest of the path components, encoding any reserved characters.
+  std::for_each(std::next(sys::path::begin(Filename)), sys::path::end(Filename),
+                [&Ret](StringRef Component) {
----------------
george.karpenkov wrote:
> Nitpicking style, but I don't see why for-each loop, preferably with a range wrapping the iterators would not be more readable.
I tend to prefer using algorithms when the logic is simple -- it makes it more clear that the loop is an unimportant detail. I don't have strong opinions on this particular loop, however.

================
Comment at: StaticAnalyzer/Core/SarifDiagnostics.cpp:88
+                  // URI encode the part.
+                  for (char C : Component) {
+                    // RFC 3986 claims alpha, numeric, and this handful of
----------------
ZaMaZaN4iK wrote:
> I don't know about const corectness policy in LLVM. But I prefer here const char C .
We typically do not put top-level `const` on locals, so I'd prefer to leave it off here rather than be inconsistent.

================
Comment at: StaticAnalyzer/Core/SarifDiagnostics.cpp:182
+  case PathDiagnosticPiece::Kind::Note:
+    // FIXME: What should be reported here?
+    break;
----------------
george.karpenkov wrote:
> "Note" are notes which do not have to be attached to a particular path element.
Good to know!

================
Comment at: StaticAnalyzer/Core/SarifDiagnostics.cpp:243
+
+  llvm::for_each(Diags, [&](const PathDiagnostic *D) {
+    Results.push_back(createResult(*D, Files));
----------------
george.karpenkov wrote:
> I like closures, but what's wrong with just using a `for` loop here?
Same as above: clarity of exposition. This one I'd feel pretty strongly about keeping as an algorithm given how trivial the loop body is.

================
Comment at: StaticAnalyzer/Core/SarifDiagnostics.cpp:254
+    std::vector<const PathDiagnostic *> &Diags, FilesMade *) {
+  // FIXME: if the file already exists, do we overwrite it with a single run,
+  // or do we append a run into the file if it's a valid SARIF log?
----------------
george.karpenkov wrote:
> Usually we overwrite the file and note that on stderr in such cases.
We took the decision internally to overwrite as well, but the SARIF format allows for multiple runs within the same output file (so you can compare analysis results for the same project over time). I think this will eventually have to be user-controlled because I can see some users wanting to append to the run and others wanting to overwrite. However, these log files can become quite large in practice (GBs of data), so "read in the JSON and add to it" may be implausible, hence why I punted for now.

I'll update the comment so it's not a FIXME.

https://reviews.llvm.org/D53814