[PATCH] D32422: LoopVectorizer: let target prefer scalar addressing computations (+ minor improvements in SystemZTTI)

Jonas Paulsson via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Thu May 18 08:52:36 PDT 2017


jonpa marked an inline comment as done.
jonpa added a comment.

Elena, thanks for the nice review! I'll wait for your comments before updating the patch.

I have tried all your points and explain my findings inline below.

> In general, I think that one-level-up scalarization is good for all targets.

Do you mean to handle just the case where we know this only changes a single instruction's decision to CM_Scalarize?



================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7301
+
+  // Start with all scalar pointer uses.
+  SmallPtrSet<Instruction *, 8> AddrDefs;
----------------
delena wrote:
> But interleaved loads are also here. You don't need to scalarize their pointers.
> The attempt to scalarize the Ptr should be done only if the decision taken is "Scalarize".
I tried

```
      if (PtrDef && TheLoop->contains(PtrDef) &&
          getWideningDecision(&I, VF) == CM_Scalarize)
        AddrDefs.insert(PtrDef);
```

This was a minor change: only 17 loops came out differently. This seems to relate to cases of mixed context (a value used both as an address and otherwise), where instead of extracting an address we now do inserts.

Not sure exactly if this matters, but from what I can see it seemed slightly better in terms of final MIs to scalarize "anything" (not just the CM_Scalarize cases), perhaps because of the big benefit of LSR working on scalar addresses.

If you don't have a strong opinion, I guess I would be most happy not to change anything here.



================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7305
+    for (Instruction &I : *BB) {
+      Instruction *PtrDef = nullptr;
+      if (auto *LoadI = dyn_cast<LoadInst>(&I))
----------------
delena wrote:
> You will find the getPointerOperand(Inst) utility in this file. It retrieves the pointer from loads and stores.
Aah, thanks.
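For other readers, the helper is essentially this (a sketch; see the actual definition in LoopVectorize.cpp, which may differ in detail):

```
static Value *getPointerOperand(Value *I) {
  // Return the pointer operand of a load or store, or nullptr for
  // anything else.
  if (auto *LI = dyn_cast<LoadInst>(I))
    return LI->getPointerOperand();
  if (auto *SI = dyn_cast<StoreInst>(I))
    return SI->getPointerOperand();
  return nullptr;
}
```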


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7322
+      if (auto *InstOp = dyn_cast<Instruction>(Op))
+        if (TheLoop->contains(InstOp) && !isa<PHINode>(InstOp) &&
+            AddrDefs.insert(InstOp).second == true)
----------------
delena wrote:
> delena wrote:
> > In your examples you show only one level up. I don't believe that we have too many real cases of multiple address redirections, but theoretically it may not be so profitable if you go many levels up.
> The scalarized instruction should belong to the same basic block and, probably, have only one user.
> In your examples you show only one level up. I don't believe that we have too many real cases of multiple address redirections, but theoretically it may not be so profitable if you go many levels up.

In my experience, if there are many levels of address computation it is still wise to scalarize them, let the vector cost become "too expensive", and let LSR do its magic, rather than vectorize complex addressing. In practice, benchmarks show that about the same number of loops still get vectorized, so this shouldn't be common.
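For reference, the multi-level walk we are discussing is essentially a worklist-based transitive closure over in-loop operands; a minimal sketch (using the names from the snippets above, not the exact patch code):

```
// Grow AddrDefs to all in-loop, non-PHI instructions that transitively
// feed the collected pointer operands; this is what allows the walk to
// go arbitrarily many levels up.
SmallVector<Instruction *, 8> Worklist(AddrDefs.begin(), AddrDefs.end());
while (!Worklist.empty()) {
  Instruction *I = Worklist.pop_back_val();
  for (Value *Op : I->operands())
    if (auto *InstOp = dyn_cast<Instruction>(Op))
      if (TheLoop->contains(InstOp) && !isa<PHINode>(InstOp) &&
          AddrDefs.insert(InstOp).second)
        Worklist.push_back(InstOp);
}
```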

I still think it is a pity to see vectorized code that gets no help from LSR -- feel free to join the discussion at https://reviews.llvm.org/D32166 and share any thoughts...

> The scalarized instruction should belong to the same basic block
Changing the check from "contained in loop" to "contained in same BB" actually gave identical results...

> and, probably, have only one user.
I have experimented with something like this and found a few cases where an address computation was also used in another context. What I did then was iteratively prune AddrDefs of any instruction used outside of AddrDefs, along the lines of the sketch below. It was quite rare and did not seem to matter much, and I left it like this with the idea that maybe it is best to help the memory accesses be fast, but this is not confirmed.
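Roughly like this (reconstructed from memory, not the exact code I ran; exempting loads/stores stands in for "used as an address", so a store using the value as data would slip through here):

```
// Repeatedly drop from AddrDefs any instruction that also has a user
// outside the set which is not itself a memory access, i.e. the value
// is needed in vector (non-address) form anyway.
bool Changed = true;
while (Changed) {
  Changed = false;
  SmallVector<Instruction *, 8> ToRemove;
  for (Instruction *I : AddrDefs)
    for (User *U : I->users())
      if (auto *UI = dyn_cast<Instruction>(U))
        if (!AddrDefs.count(UI) && !isa<LoadInst>(UI) && !isa<StoreInst>(UI))
          ToRemove.push_back(I);
  for (Instruction *I : ToRemove)
    Changed |= AddrDefs.erase(I);
}
```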

I tried this now -- just adding "InstOp->hasOneUse() &&" here -- and found that on SPEC a set of 73 loops were affected. This seems quite clearly to be for the worse, with just more extracts, so if you do not have a very strong opinion, I would like to leave this alone...



================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7331
+        // Scalarize a load of address
+        setWideningDecision(I, VF, CM_Scalarize,
+                            (VF * getMemoryInstructionCost(I, 1)));
----------------
delena wrote:
> At this point you change the widening decision of an already analyzed load inst. Make sure that this inst is not part of an interleave group.
I tried to only change the widening decision if it was CM_Widen, and reran SPEC. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got a slightly worse result. So this isn't a big deal, but I would still insist that there isn't a good reason to interleave loads of addresses here. They are 64 bits, which means 2 per vector register, so there isn't much room for vectorization anyway.
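Concretely, that experiment amounted to guarding the demotion, something like:

```
// Only demote a load that would otherwise be widened; members of an
// interleave group keep their original decision.
if (isa<LoadInst>(I) && getWideningDecision(I, VF) == CM_Widen)
  // Scalarize a load of an address.
  setWideningDecision(I, VF, CM_Scalarize,
                      (VF * getMemoryInstructionCost(I, 1)));
```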

Here's an example:

Loop before vectorize pass:


```
for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
  %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
  %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
  %20 = load i32, i32* %next65, align 4, !tbaa !11
  %idxprom66 = sext i32 %20 to i64
  %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
  %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
  store i32 %21, i32* %next65, align 4, !tbaa !11
  %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
  %cmp59 = icmp slt i64 %indvars.iv81, %16
  br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit

```

PATCH UNMODIFIED:

Loop after vectorize pass:

```
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %20 = add i64 %index, 0
  %21 = add i64 %index, 1
  %22 = add i64 %index, 2
  %23 = add i64 %index, 3
  %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
  %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
  %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
  %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
  %28 = load i32, i32* %24, align 4, !tbaa !11
  %29 = load i32, i32* %25, align 4, !tbaa !11
  %30 = load i32, i32* %26, align 4, !tbaa !11
  %31 = load i32, i32* %27, align 4, !tbaa !11
  %32 = sext i32 %28 to i64
  %33 = sext i32 %29 to i64
  %34 = sext i32 %30 to i64
  %35 = sext i32 %31 to i64
  %36 = getelementptr inbounds i32, i32* %1, i64 %32
  %37 = getelementptr inbounds i32, i32* %1, i64 %33
  %38 = getelementptr inbounds i32, i32* %1, i64 %34
  %39 = getelementptr inbounds i32, i32* %1, i64 %35
  %40 = load i32, i32* %36, align 4, !tbaa !7
  %41 = load i32, i32* %37, align 4, !tbaa !7
  %42 = load i32, i32* %38, align 4, !tbaa !7
  %43 = load i32, i32* %39, align 4, !tbaa !7
  store i32 %40, i32* %24, align 4, !tbaa !11
  store i32 %41, i32* %25, align 4, !tbaa !11
  store i32 %42, i32* %26, align 4, !tbaa !11
  store i32 %43, i32* %27, align 4, !tbaa !11
  %index.next = add i64 %index, 4
  %44 = icmp eq i64 %index.next, %n.vec
  br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12

```

->

Final Header:

```
BB#22: derived from LLVM BB %vector.body
    Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
    Predecessors according to CFG: BB#21 BB#23
        %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
        %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
        %R5D<def> = SLLG %R5D<kill>, %noreg, 2
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
        %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
        %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
        %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R11D<def> = SLLG %R11D<kill>, %noreg, 2
        %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
        %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
        STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
        STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
        STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
        STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
        CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>

```

instr-count: 17

MODIFIED TO NOT TOUCH INTERLEAVED LOADS:

Loop after vectorize pass:

```
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
  %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
  %20 = add i64 %index, 0
  %21 = add i64 %index, 1
  %22 = add i64 %index, 2
  %23 = add i64 %index, 3
  %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
  %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
  %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
  %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
  %28 = getelementptr i32, i32* %24, i32 0
  %29 = bitcast i32* %28 to <8 x i32>*
  %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
  %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  %30 = extractelement <4 x i32> %strided.vec, i32 0
  %31 = sext i32 %30 to i64
  %32 = extractelement <4 x i32> %strided.vec, i32 1
  %33 = sext i32 %32 to i64
  %34 = extractelement <4 x i32> %strided.vec, i32 2
  %35 = sext i32 %34 to i64
  %36 = extractelement <4 x i32> %strided.vec, i32 3
  %37 = sext i32 %36 to i64
  %38 = getelementptr inbounds i32, i32* %1, i64 %31
  %39 = getelementptr inbounds i32, i32* %1, i64 %33
  %40 = getelementptr inbounds i32, i32* %1, i64 %35
  %41 = getelementptr inbounds i32, i32* %1, i64 %37
  %42 = load i32, i32* %38, align 4, !tbaa !7
  %43 = load i32, i32* %39, align 4, !tbaa !7
  %44 = load i32, i32* %40, align 4, !tbaa !7
  %45 = load i32, i32* %41, align 4, !tbaa !7
  store i32 %42, i32* %24, align 4, !tbaa !11
  store i32 %43, i32* %25, align 4, !tbaa !11
  store i32 %44, i32* %26, align 4, !tbaa !11
  store i32 %45, i32* %27, align 4, !tbaa !11
  %index.next = add i64 %index, 4
  %46 = icmp eq i64 %index.next, %n.vec
  br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12

```

Final Header:

```
BB#22: derived from LLVM BB %vector.body
    Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
    Predecessors according to CFG: BB#21 BB#23
        %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
        %R5D<def> = VLGVF %V0, %noreg, 2
        %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
        %R5D<def> = SLLG %R5D<kill>, %noreg, 2
        %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
        %R14D<def> = VLGVF %V0<kill>, %noreg, 0
        %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
        %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
        %R14D<def> = VLGVF %V0, %noreg, 0
        %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
        %R14D<def> = SLLG %R14D<kill>, %noreg, 2
        %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
        %R11D<def> = VLGVF %V0<kill>, %noreg, 2
        %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
        %R11D<def> = SLLG %R11D<kill>, %noreg, 2
        %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
        STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
        ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
        STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
        ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
        CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>

```

instr-count: 23

I experimented with scalarizing the whole interleave group instead, like:


```
  for (auto *I : AddrDefs) {
    if (isa<LoadInst>(I)) {
      if (getWideningDecision(I, VF) == CM_Widen)
        // Scalarize a widened load of address.
        setWideningDecision(I, VF, CM_Scalarize,
                            (VF * getMemoryInstructionCost(I, 1)));
      else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
        // Scalarize an interleave group of address loads.
        for (unsigned Idx = 0; Idx < Group->getFactor(); ++Idx) {
          if (Instruction *Member = Group->getMember(Idx))
            setWideningDecision(Member, VF, CM_Scalarize,
                                (VF * getMemoryInstructionCost(Member, 1)));
        }
      }
    } else
      // Make sure I gets scalarized and gets a cost estimate without
      // scalarization overhead.
      ForcedScalars[VF].insert(I);
  }

```

This was also very close to the original patch -- just 5 new loops now became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.

So to summarize, this really doesn't matter much on SPEC, but it seems that the best thing to do is to scalarize the whole interleave group; second best is to just scalarize the load (the unmodified patch); third is to interleave loads of addresses, which is not so good since we then have to do extracts...



https://reviews.llvm.org/D32422




