[PATCH] D35049: LSR tunings for SystemZ, with some minor common code changes

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Jul 20 02:15:39 PDT 2017


jonpa added a comment.

> The idea was that if NumBaseAdds is the same, ImmCost is used as a tie breaker.
>  If I understand correctly you're saying not using this as a tie breaker generates better code. I found that concerning.
> 
> Could you dig into that before moving forward?



1. If I move back one step and update the ImmCost like this:

     if (LU.Kind == LSRUse::Address && TTI.LSRWithInstrQueries()) {
       if (!TTI.isFoldableMemAccessOffset(Fixup.UserInst, Offset))
         C.NumBaseAdds++;
  -  } else
  -    C.ImmCost += APInt(64, Offset, true).getMinSignedBits();
  +  }
  +
  +  C.ImmCost += APInt(64, Offset, true).getMinSignedBits();
   }
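
Just to spell out what the re-added term counts (the worked values below are my own, not from the patch): every fixup is charged the minimum number of signed bits needed to represent its offset, so wider offsets cost more:

  // Illustration only:
  //   Offset = 0    -> APInt(64, 0, true).getMinSignedBits()    == 1
  //   Offset = 8    -> APInt(64, 8, true).getMinSignedBits()    == 5
  //   Offset = -8   -> APInt(64, -8, true).getMinSignedBits()   == 4
  //   Offset = 4096 -> APInt(64, 4096, true).getMinSignedBits() == 14
  C.ImmCost += APInt(64, Offset, true).getMinSignedBits();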

I get that on SPEC 2006, 717 loops now become bigger (as expected), while 361 become smaller, so this affects only ~2% of the loops.

Without ImmCost on address fixups:
717 loops better

  205 better because live-out PHI duplicated and modified slightly in other version  (#0)
  177 better because of more spilling in other version (#1)
  280 better because more COPYs in other version (#2)
  132 better because of fewer loads of constants in loops (#3)
   56 better because of fewer immediate adds
      42 before a compare using same IV (#4)
      14 before a memory instruction
   83 better because more fused compare / branches (#5)
  --- left unclassified:
   (27 loops better)

With ImmCost on address fixups:
361 loops better

    5 better because live-out PHI duplicated and modified slightly in other version   (#0)
  173 better because of more spilling in other version (#1)
  165 better because more COPYs in other version (#2)
   25 better because of fewer loads of constants in loops (#3)
   99 better because of fewer immediate adds
      89 before a compare using same IV (#4)
      10 before a memory instruction
   26 better because more fused compare / branches (#5)
  --- left unclassified:
   (22 loops better)

#0: It seems that LSR is not aware of whether a live-out PHI is reused or not. A formula that reuses the live-out PHI does not need an extra PHI, which should mean a cost difference of one add in the loop. So if LSR comes up with a formula that offsets the original PHI by some amount that lowers the total offset bits of the fixups, it will insert a new PHI even though the original one must be kept for the live-out use.

Since this is clearly bad (205/5), my guess is that LSR most of the time manages to use an existing PHI first: without the ImmCost on the fixups, the formulas have the same cost, and the formula with the original PHI remains.
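
As a hypothetical illustration of the kind of loop this concerns (my own example, not taken from the benchmarks): the IV below is live out, so the original PHI has to survive no matter which formula the address fixups are rewritten against, and an offset-PHI formula just adds a second IV plus one more add per iteration:

  // The live-out use of "i" keeps the original PHI alive; rewriting the
  // address fixups against a new PHI for "i + 1" (to shrink the per-fixup
  // immediates) cannot remove it.
  long f(long *A, long n) {
    long i = 0;
    for (; i < n; ++i)
      A[i + 1] += 1;  // address fixup with a nonzero offset
    return i;         // live-out use of the IV
  }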

#1: These loops got a randomly different number of spills/reloads in them. This is a known regalloc flaw, and Wei Mi has a patch in progress to handle it (https://bugs.llvm.org/show_bug.cgi?id=32722). It happens about evenly (177/173) with or without ImmCost, so this is not an LSR issue.

#2: The coalescing isn't perfect; I have for instance seen cases where, due to the pre-RA scheduler, the IV increment gets scheduled above the last use of the IV, which prohibits coalescing the IV and leaves a needless COPY after the increment. I am not sure why there are more COPYs with ImmCost.

#3: Not sure why, but it seems that a lot fewer loads of constants got hoisted out of loops with ImmCost. Maybe the register allocation differs for some reason?

#4: It seems that ImmCost did somewhat reduce the number of immediate adds before compares (42/89).

#5: A few more compares got fused into the branch without ImmCost (83/26).

So there may be a thing or two here that could be improved, but I am not sure whether these details affect more than the ~2% of the loops seen above...

I think a first step for SystemZ might be to just skip the ImmCost in the cost function, since that by itself only affects 18 loops; that is surely an easier way forward if this patch cannot be accepted as is. In that case, I would revert the LoopStrengthReduce changes of this patch related to ImmCost.
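
As a rough sketch of what I mean, assuming the TTI::isLSRCostLess() hook is the right place for it (illustration only, not a finished patch), SystemZ could compare costs the way the default implementation does but leave the ImmCost term out:

  // Sketch: same lexicographic comparison as the default isLSRCostLess(),
  // minus ImmCost.
  bool SystemZTTIImpl::isLSRCostLess(TargetTransformInfo::LSRCost &C1,
                                     TargetTransformInfo::LSRCost &C2) {
    return std::tie(C1.NumRegs, C1.AddRecCost, C1.NumIVMuls, C1.NumBaseAdds,
                    C1.ScaleCost, C1.SetupCost) <
           std::tie(C2.NumRegs, C2.AddRecCost, C2.NumIVMuls, C2.NumBaseAdds,
                    C2.ScaleCost, C2.SetupCost);
  }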

> This does not answer the question why we should guard this check with LSRWithInstrQueries given it wasn't guarded previously. My concern here is that we change a fairly high weighted piece of the formulae rating and given all other target would return false for LSRWithInstrQueries, I am afraid it will affect their performance across the board.

I was just seeing that only SystemZ is using isFoldableMemAccessOffset(), but I suppose there are out-of-tree targets that you are concerned about?


https://reviews.llvm.org/D35049
