[PATCH] D98230: [LSR] Add reconciliation of unfoldable offsets

Fri Feb 25 16:24:21 PST 2022

jonpa updated this revision to Diff 411550.
jonpa added reviewers: qcolombet, efriedma, eli.friedman.
jonpa added a comment.

I saw the need for this patch again recently when needing the DAGCombiner to handle multiple out-of-range offsets, which it did not. The problem is that we "lie" in SystemZ::isLegalAddressingMode() currently by saying that a vector type memory access accepts a big offset, which is not true. If I change this, the DAGCombiner does a good job, however LSR then produces worse code in some cases, which makes me come back to this.

For some reason or other, "lying" to LSR about these displacements is sometimes better. For instance with this loop:

  define void @fun(i8* %arg) {
  bb:
    %i = getelementptr inbounds i8, i8* %arg, i64 27548
    %i1 = bitcast i8* %i to [12 x [16 x i16]]*
    %i2 = getelementptr inbounds i8, i8* %arg, i64 28028
    %i3 = bitcast i8* %i2 to [12 x [16 x i16]]*
    br label %bb4

  bb4:                                              ; preds = %bb4, %bb
    %i5 = phi i64 [ %i10, %bb4 ], [ 0, %bb ]
    %i6 = getelementptr inbounds [12 x [16 x i16]], [12 x [16 x i16]]* %i1, i64 0, i64 3, i64 %i5
    %i7 = bitcast i16* %i6 to <8 x i16>*
    store <8 x i16> zeroinitializer, <8 x i16>* %i7
    %i8 = getelementptr inbounds [12 x [16 x i16]], [12 x [16 x i16]]* %i3, i64 0, i64 3, i64 %i5
    %i9 = bitcast i16* %i8 to <16 x i8>*
    store <16 x i8> zeroinitializer, <16 x i8>* %i9
    %i10 = add nuw i64 %i5, 8
    br label %bb4
  }

trunk:

  	vgbm	%v0, 0
  	aghi	%r2, 27644              ### Single increment
  .LBB0_1:                                # %bb4
                                          # =>This Inner Loop Header: Depth=1
  	vst	%v0, 0(%r2), 3
  	vst	%v0, 480(%r2), 3
  	la	%r2, 16(%r2)
  	j	.LBB0_1

if rejecting big offsets for the Vector STores:

  	vgbm	%v0, 0
  	lay	%r1, 28124(%r2)         ### Two separate regs
  	lay	%r2, 27644(%r2)
  	lghi	%r3, 0
  .LBB0_1:                                # %bb4
                                          # =>This Inner Loop Header: Depth=1
  	vst	%v0, 0(%r3,%r2), 3
  	vst	%v0, 0(%r3,%r1), 3
  	la	%r3, 16(%r3)
  	j	.LBB0_1

This is a small example, but it illustrates that LSR recognizes that the two offsets are out of range, and it then puts them in two separate registers. In this case it seems that trunk LSR managed to chose the better formula even when "thinking" the huge offsets were in range. Maybe the huge offset was added before the loop as a general heuristic to reduce the offsets in the fixups...? In real code, those LAYs are more plentiful and also inside the loop... :-/

So my simple idea here is to have LSR group these fixups together under one reg even though they are initially unfoldable: Together they form a group which can have foldable offsets if the right immediate is added to the formula before the loop. It seems to me that this would make sense and is kind of simply missing - unless there is some other reason for not doing this?

The first step is to make sure the fixup groups are formed under one LSRUse when they are all out of range, but within range between themselves.

When rejecting the addressing modes properly for the vector-type, I realized I also had to adjust these new groups (second step), which is done in adjustInitialFormulaeForOffsets(). This is what happens:

  Fixup #1:   Offset = 28124  => Initial formula:  reg(A) + 28124
  Fixup #2:   Offset = 27644  => reg(A) - 480.

reconcileNewOffset() checked that the offset range between #1 and #2 is in range, which it is. However the offset -480 is not legal, so adjustInitialFormulaeForOffsets() adjusts by subtracting 480 to the Formula and adds it to the fixup offsets:

  Initial formula, adjusted: reg(A) + 27644
  Fixup #1:   reg(A) + 480
  Fixup #2:   reg(A) +   0

In English: The SystemZ::isLegalAddressingMode() returns true for a VectorType only if the offset is within 0 - 4095, so all fixups can only use such an offset against the formula.

The reason this adjustment is made after the fact (of creating the initial formula) is that it can only be done after the group of fixups have been found. But maybe this could be gotten for free if the user instructions where sorted by offsets somehow so that the lowest offset would be seen first, which should give the same result...?

I left some "EXPERIMENTAL" (/SystemZ) options in the patch for now, so that anyone who wanted could look into this. To investigate the small loop above:

trunk:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o -

reject large displacements for vector type (as should be), with worse results:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o - -legalam-vec

to handle the problem with this patch:
llc -mtriple=s390x-linux-gnu -mcpu=z15 -O3 ./tc_LSR.ll -o - -legalam-vec -lsr-unfolded-offs

Is this making sense? Is there some other way of handling this that would be even better? I am thinking that of course those addresses could be somewhat cleaned up later, but it would be really best to handle the loop addressing in LSR as loops are after all very important.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98230/new/

https://reviews.llvm.org/D98230

Files:
  llvm/include/llvm/Analysis/TargetTransformInfo.h
  llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  llvm/lib/Analysis/TargetTransformInfo.cpp
  llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
  llvm/lib/Target/SystemZ/SystemZISelLowering.h
  llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
  llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
  llvm/lib/Transforms/Scalar/LoopStrengthReduce.cpp
  llvm/test/CodeGen/SystemZ/loop-01.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D98230.411550.patch
Type: text/x-patch
Size: 14184 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220226/386f8223/attachment-0001.bin>