[all-commits] [llvm/llvm-project] 303a78: [GreedyRA] Improve RA for nested loop induction va...

David Green via All-commits all-commits at lists.llvm.org
Sat Nov 18 01:55:33 PST 2023


  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 303a7835ff833278a0de20cf5a70085b2ae8fee1
      https://github.com/llvm/llvm-project/commit/303a7835ff833278a0de20cf5a70085b2ae8fee1
  Author: David Green <david.green at arm.com>
  Date:   2023-11-18 (Sat, 18 Nov 2023)

  Changed paths:
    M llvm/lib/CodeGen/RegAllocGreedy.cpp
    M llvm/lib/CodeGen/SplitKit.cpp
    M llvm/lib/CodeGen/SplitKit.h
    M llvm/test/CodeGen/AArch64/nested-iv-regalloc.mir
    M llvm/test/DebugInfo/MIR/InstrRef/memory-operand-folding-tieddef.mir

  Log Message:
  -----------
  [GreedyRA] Improve RA for nested loop induction variables (#72093)

Imagine a loop of the form:
```
  preheader:
    %r = def
  header:
    bcc latch, inner1
  inner1:
    ..
  inner2:
    b latch
  latch:
    %r = subs %r
    bcc header
```

Code can spend a significant amount of time in the header<->latch loop,
only rarely entering the inner part of the loop. The greedy register
allocator can nevertheless prefer to spill _around_ %r, adding spills
around the subs in the loop, which can be very detrimental to
performance. (The case I am looking at is actually a very deeply nested
set of loops that repeat the header<->latch pattern at multiple
different levels.)
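As a rough illustration of why this placement matters, the sketch below
weights each spill or reload by the frequency of the block it lands in.
The struct, the function names and the frequencies (latch 100, inner
blocks 10) are all made up for illustration; this is not the cost model
used by the allocator.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: each spill or reload instruction costs the relative
// execution frequency of the block it is inserted into.
struct SpillPoint {
  double BlockFreq;   // assumed relative frequency of the containing block
  unsigned NumInstrs; // number of spill/reload instructions placed there
};

// Sum the frequency-weighted cost of all inserted spill code.
double weightedSpillCost(const std::vector<SpillPoint> &Points) {
  double Cost = 0.0;
  for (const SpillPoint &P : Points) {
    assert(P.BlockFreq >= 0.0 && "block frequencies are non-negative");
    Cost += P.BlockFreq * P.NumInstrs;
  }
  return Cost;
}

// Illustrative numbers only: a reload and a store around the subs in the
// latch, which is assumed to run on every iteration.
double costSpillAroundIV() {
  std::vector<SpillPoint> Points = {SpillPoint{100.0, 2u}};
  return weightedSpillCost(Points);
}

// The same two instructions placed in the inner blocks, which are assumed
// to run on one iteration in ten.
double costSpillInInner() {
  std::vector<SpillPoint> Points = {SpillPoint{10.0, 2u}};
  return weightedSpillCost(Points);
}
```

With these assumed frequencies, spilling around the IV is ten times as
expensive as spilling inside the inner blocks.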

The greedy RA applies a preference to spill the IV, because it is live
through the header block. This patch adds a heuristic that prevents that
preference for variables that look like IVs, similar to the extra spill
weight already given to variables that look like IVs because they are
expensive to spill. As a result, spills are more likely to be pushed
into the inner blocks, where they are executed less often and cost less
than spills around the IV.
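One way to picture a "looks like an IV" check is sketched below. It is a
toy model and not the code from this patch: the BlockUse struct, the
looksLikeIV function and the exact conditions (two use blocks, one of
them a loop latch where the value is live-in, redefined and live-out)
are assumptions chosen to match the example loop above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-block summary of how one virtual register is used.
struct BlockUse {
  bool IsLoopLatch; // the block is the latch of some loop
  bool LiveIn;      // the register is live on entry to the block
  bool LiveOut;     // the register is live on exit from the block
  bool HasDef;      // the register is (re)defined inside the block
};

// Assumed heuristic: the register is touched in exactly two blocks (the
// initial def and the update), and one of them is a latch that updates
// the value in place, as "%r = subs %r" does in the example loop.
bool looksLikeIV(const std::vector<BlockUse> &UseBlocks) {
  if (UseBlocks.size() != 2)
    return false;
  for (const BlockUse &BU : UseBlocks)
    if (BU.IsLoopLatch && BU.LiveIn && BU.LiveOut && BU.HasDef)
      return true;
  return false;
}

// %r from the example: defined in the preheader, updated in the latch.
std::vector<BlockUse> exampleIV() {
  return {BlockUse{false, false, true, true},  // preheader: %r = def
          BlockUse{true, true, true, true}};   // latch: %r = subs %r
}

// An ordinary temporary: defined in one block, read in another.
std::vector<BlockUse> exampleTemp() {
  return {BlockUse{false, false, true, true},
          BlockUse{false, true, false, false}};
}
```

In this sketch an allocator would skip the spill preference for a
live-through block whenever looksLikeIV returns true for the register
being split.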

This gives an 8% speedup in the exchange benchmark from spec2017 when
compiled with flang-new, whilst importantly stabilising the scores so
that they are less sensitive to other changes. Running ctmark showed no
difference in compile time. I ran a range of performance benchmarks,
most of which were relatively flat, with no large differences. One
matrix multiply case improved by 21.3% due to the removal of a cascading
chain of spills, and some other knock-on effects usually cause small
differences in the scores.



