[llvm] [workflows] Add post-commit job that periodically runs the clang static analyzer (PR #94106)

Tue Jun 4 19:31:58 PDT 2024

================
@@ -0,0 +1,69 @@
+name: Post-Commit Static Analyzer
+
+permissions:
+  contents: read
+
+on:
+  push:
+    branches:
+      - 'release/**'
+    paths:
+      - 'llvm/**'
+      - '.github/workflows/ci-post-commit-analyzer.yml'
+  pull_request:
+    paths:
+      - '.github/workflows/ci-post-commit-analyzer.yml'
+  schedule:
+    - cron: '30 0 * * *'
+
+concurrency:
+  group: >-
+    llvm-project-${{ github.workflow }}-${{ github.event_name == 'pull_request' &&
+      ( github.event.pull_request.number || github.ref) }}
+  cancel-in-progress: ${{ startsWith(github.ref, 'refs/pull/') }}
+
+jobs:
+  post-commit-analyzer:
+    if: >-
+      github.repository_owner == 'llvm' &&
+      github.event.action != 'closed'
+    runs-on: ubuntu-22.04
+    env:
+      LLVM_VERSION: 18
+    steps:
+      - name: Checkout Source
+        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
+
+      - name: Install Dependencies
+        run: |
+          sudo echo "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main" | sudo tee -a /etc/apt/sources.list.d/llvm.list
+          wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
+          sudo apt-get update
+          sudo apt-get install \
+            cmake \
+            ninja-build \
+            perl \
+            clang-tools-$LLVM_VERSION \
+            clang-$LLVM_VERSION
+
+      - name: Configure
+        run: |
+          scan-build-$LLVM_VERSION \
+              --use-c++=clang++ \
+              --use-cc=clang \
+              cmake -B build -S llvm -G Ninja \
----------------
haoNoQ wrote:

Oh.

Hmm.

Yes, the static analyzer can be tweaked for performance. In fact, the entire idea is that the static analyzer tries to "explore all possible execution paths", which fundamentally takes an infinite amount of time. So the only reason it terminates at all is that it implements a bunch of "budgets", hard cutoffs that limit the exploration time: a limit on the total number of simulation steps across the entire graph of paths, a limit on the number of iterations spent in a loop, a limit on the number of nested function calls. (To be fair, it also tries to do its best within the budget, e.g. it prioritizes the paths that lead to previously unexplored lines of code. So increasing those limits yields rapidly diminishing returns.)

There's the "user-friendly" "deep"/"shallow" analysis mode setting that tweaks a bunch of budgets at once to produce two "supported" configurations; the shallow mode is faster but less precise. Unfortunately I'm afraid that the shallow mode was historically fine-tuned for Objective-C so I wouldn't really recommend it for any other purposes.

I think the easiest/safest thing to do is to simply decrease `-analyzer-config max-nodes=225000` to a smaller value. That's the total number of simulation steps. It affects analysis time non-linearly because many "entry-point" functions are fully simulated in fewer steps than that, so they never reach the limit anyway. But either way it's the simplest option to tweak, because when this budget runs out the analysis just terminates. It will miss some deeper bugs but it shouldn't introduce a lot of new false positives. Every other option means that the static analyzer gets more time to explore a different part of the code, so the results become significantly more chaotic.

So, can you try
```
scan-build ... -analyzer-config max-nodes=150000
```
or
```
scan-build ... -analyzer-config max-nodes=75000
```
to see if you can fit into the 6-hour limit this way?
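To compare the budgets side by side, something like the following could generate the candidate command lines. This is only a sketch: the `scan_build_cmd` helper, the `SCAN_BUILD` and `BUILD_DIR` variables, and the `ninja -C build` build step are my assumptions, not part of the workflow in this PR. It prints the commands instead of running them, since each run takes hours.

```shell
#!/bin/sh
# Hypothetical helper: print the scan-build command line for one
# max-nodes budget. SCAN_BUILD and BUILD_DIR are illustrative names.
scan_build_cmd() {
  nodes=$1
  printf '%s -analyzer-config max-nodes=%s ninja -C %s\n' \
    "${SCAN_BUILD:-scan-build}" "$nodes" "${BUILD_DIR:-build}"
}

# The current budget first, then the two smaller values suggested above.
for nodes in 225000 150000 75000; do
  scan_build_cmd "$nodes"
done
```

Each printed command can then be run under `time` to see which budget fits into the limit.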

----

Also, well, it's sad that we have to sacrifice thoroughness in order to cover more ground. This means that we'll need to sacrifice most of the other sub-projects, unless they're so self-contained that we can set up a new job with just them.

I hope that eventually we'll either get faster machines, or figure out how to split analysis into several smaller jobs.

We should probably consider `-DLLVM_ENABLE_MODULES=1` as well. Fundamentally the static analyzer isn't affected by modules so it should be a pure win. Not by a lot: path exploration time would probably outweigh compilation time anyway. But it could buy us an hour or so.

There's also the part where the code is compiled normally during analysis, and the static analyzer is invoked in a separate process from normal compilation, so all the code actually gets compiled twice. At least up to the AST, and then it goes in different directions. The output produced by normal compilation isn't used for the analysis so this is fundamentally unnecessary and could in theory be cut. Except we still need to compile the binaries that we need for compiling the rest of the code, such as `TableGen`. So in the case of LLVM this is hard to avoid.

----

> Can I use ccache?

That's an excellent question, I've somehow never even considered this. There's nothing fundamentally impossible about this: the static analyzer is fully deterministic so it'll produce the same output given the same translation unit and flags as input. But it'll probably be very annoying to marry ccache and scan-build, because scan-build comes with its own compiler wrapper. Also I'm not sure ccache would understand where our output is, because what scan-build passes to clang as `-o` is a directory.

So let me try and see if there's a straightforward solution, but if not, I probably won't have time to implement a proper solution.

It may also be a good idea to drop `scan-build` entirely and implement the whole thing with a tiny custom `CMAKE_CXX_COMPILER_LAUNCHER`. That's the whole point of `scan-build` anyway, it's just a hacky way to intercept compiler invocations from arbitrary build systems and append `--analyze` to them. A hand-crafted integration into your build system is almost always preferable if you have time for it.
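To make the launcher idea a bit more concrete, here's a rough sketch of what such a script might look like. I haven't tried it on LLVM. The script name, the `analyze_then_compile` function, and the `ANALYZER_REPORT_DIR` / `ANALYZER_MAX_NODES` variables are all made up for illustration. It assumes the compiler is clang and that the last `-o` on the command line is the one that takes effect, which is worth verifying.

```shell
#!/bin/sh
# analyze-launcher.sh (hypothetical name). CMake would invoke it as:
#   analyze-launcher.sh <compiler> <original compile arguments...>

analyze_then_compile() {
  # Illustrative environment variables, not an existing interface.
  report_dir=${ANALYZER_REPORT_DIR:-/tmp/analyzer-reports}
  max_nodes=${ANALYZER_MAX_NODES:-225000}
  mkdir -p "$report_dir"

  # Pass 1: the same command line plus --analyze. The trailing -o is
  # meant to keep the analyzer output away from the object file
  # (assumption: the last -o wins). Analysis failures are ignored so
  # they don't break the build.
  "$@" --analyze \
    -Xclang -analyzer-config -Xclang "max-nodes=$max_nodes" \
    -o "$report_dir/report-$$.plist" || true

  # Pass 2: the unmodified compile, so that tools needed later in the
  # build (such as TableGen) still get built.
  "$@"
}

# Only act when called with a compiler command line.
if [ "$#" -gt 0 ]; then
  analyze_then_compile "$@"
fi
```

It would be hooked up by configuring with `-DCMAKE_C_COMPILER_LAUNCHER=/path/to/analyze-launcher.sh` and `-DCMAKE_CXX_COMPILER_LAUNCHER=/path/to/analyze-launcher.sh`. It still compiles everything twice, same as `scan-build`, but without the extra compiler wrapper in between.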

https://github.com/llvm/llvm-project/pull/94106