[llvm] [AMDGPU] IGLP: Fix static variables (PR #137549)

Robert Imschweiler via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 9 03:40:38 PDT 2026


ro-i wrote:

> @ro-i If you do not see a significant performance advantage obtained by the caching, perhaps you could fix the correctness issue first (recomputing the analysis data) and then try to improve the performance?

That may make sense, although the tests I conducted on the .ll sources you sent me (optimized with `-O3` before compiling with `llc`) suggest there would be a slight regression. Have a look at the AI summary of the performance data:

---

## Per-Pass Timing Analysis (20 runs, `-time-passes` data)

### Pre-RA Scheduler (where IGLP analysis runs)

This is the key pass. The IGLP `analyzeDAG` recomputation happens here during the pre-RA reentry phase.

| Benchmark | W/O Cache | W/ Cache | Trunk | W/O vs Trunk | W/ vs Trunk | W/ vs W/O |
|-----------|-----------|----------|-------|--------------|-------------|-----------|
| km_kn_mn  | 37.8 ms   | 36.8 ms  | 37.4 ms | **+1.2%** | -1.4% | **-2.6%** |
| km_nk_mn  | 29.2 ms   | 28.3 ms  | 28.8 ms | **+1.4%** | -2.0% | **-3.4%** |
| mk_kn_mn  | 30.9 ms   | 30.0 ms  | 30.4 ms | **+1.3%** | -1.8% | **-3.0%** |
| mk_nk_mn  | 24.7 ms   | 23.6 ms  | 24.3 ms | **+1.7%** | -2.5% | **-4.1%** |

All differences are **statistically significant at p < 0.001** (Welch's t-test, n=20 per group, extremely tight standard deviations of ~0.2ms).
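For reference, Welch's t-test can be reproduced with a short stdlib-only sketch. The `welch_t` helper and the samples below are hypothetical: the post only gives means, n=20, and sd ~0.2 ms, so the per-run measurements here are synthetic stand-ins shaped like the km_kn_mn row.

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb                        # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Synthetic samples shaped like the km_kn_mn row: means ~37.8 ms vs ~37.4 ms,
# sd ~0.2 ms, n=20 per group (the real per-run measurements are not in the post).
without_cache = [37.8 + 0.2 * (-1) ** i for i in range(20)]
trunk = [37.4 + 0.2 * (-1) ** i for i in range(20)]
t, df = welch_t(without_cache, trunk)
# With means 0.4 ms apart and sd ~0.2 ms, t comfortably exceeds the
# two-tailed p < 0.001 critical value (~3.57 at df ~38).
```

With such tight standard deviations, even a 0.4 ms shift in the mean is many standard errors wide, which is why a consistent significance verdict at p < 0.001 is plausible across all four benchmarks.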

### PostRA Scheduler

| Benchmark | W/O Cache | W/ Cache | Trunk | W/O vs Trunk |
|-----------|-----------|----------|-------|--------------|
| km_kn_mn  | 24.8 ms   | 24.7 ms  | 24.8 ms | +0.2% |
| km_nk_mn  | 18.5 ms   | 18.6 ms  | 18.8 ms | -1.6% |
| mk_kn_mn  | 19.0 ms   | 19.0 ms  | 19.2 ms | -1.0% |
| mk_nk_mn  | 14.0 ms   | 14.0 ms  | 14.2 ms | -1.4% |

PostRA shows essentially no impact from cache vs. no-cache, as expected (PostRA always recomputes).

### Total Compilation Time

| Benchmark | W/O Cache | W/ Cache | Trunk | W/O vs Trunk | W/ vs Trunk |
|-----------|-----------|----------|-------|--------------|-------------|
| km_kn_mn  | 430.2 ms  | 424.4 ms | 431.0 ms | -0.2% | **-1.5%** |
| km_nk_mn  | 405.3 ms  | 400.2 ms | 406.1 ms | -0.2% | **-1.5%** |
| mk_kn_mn  | 406.1 ms  | 400.9 ms | 406.8 ms | -0.2% | **-1.5%** |
| mk_nk_mn  | 382.2 ms  | 376.8 ms | 382.5 ms | -0.1% | **-1.5%** |

The scheduler passes (pre-RA + post-RA) account for **10-15%** of total compilation time.

---

## Key Takeaways

1. **"Without cache" (recompute every time) is ~1.2-1.7% slower in the pre-RA scheduler than trunk** (which used the old static variable caching). This is a consistent, statistically significant regression of **~0.4 ms** per pass invocation.

2. **"With cache" (serializable cache) was actually ~1.4-2.5% faster than trunk** in the pre-RA scheduler, suggesting the new cache implementation was more efficient than the old static variable approach.

3. **The total compilation impact of removing the cache is negligible**: only ~0.1-0.2% of total wall-clock time, because the pre-RA scheduler is only ~7-9% of the total and the regression within that pass is ~1.5%.

4. **The absolute cost of recomputation is ~0.4 ms per benchmark** in the pre-RA scheduler. On a ~400ms compilation, this is lost in the noise of wall-clock measurements, which explains why the earlier `time`-based analysis couldn't reliably detect it.

5. **PostRA shows a small unexplained improvement** (~1-1.6%) for "without cache" vs trunk on `km_nk_mn`, `mk_kn_mn`, `mk_nk_mn`. This is likely due to other code changes in the patch rather than to the caching itself.

**Bottom line**: Removing the cache causes a real but tiny regression (~0.4 ms, ~1.5% of the pre-RA scheduler, ~0.1% of total compilation time). Whether this matters depends on your reviewer's appetite for complexity vs. marginal performance.
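The "negligible total impact" claim follows directly from the table data; as a sketch, here is the back-of-envelope arithmetic using the km_kn_mn row (values taken from the tables above):

```python
# Back-of-envelope check using the km_kn_mn row.
pre_ra_without = 37.8e-3   # pre-RA scheduler, without cache (seconds)
pre_ra_trunk = 37.4e-3     # pre-RA scheduler, trunk (seconds)
total_trunk = 431.0e-3     # total compilation, trunk (seconds)

delta = pre_ra_without - pre_ra_trunk   # absolute per-invocation cost of recomputing
in_pass = delta / pre_ra_trunk          # regression relative to the pass itself
overall = delta / total_trunk           # regression relative to total compile time

print(f"{delta * 1e3:.1f} ms | {in_pass:.1%} of the pass | {overall:.2%} of total")
```

A ~1% regression confined to a pass that is under a tenth of the pipeline dilutes to roughly a tenth of a percent overall, which is why the earlier wall-clock `time` measurements could not resolve it.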

https://github.com/llvm/llvm-project/pull/137549

