[PATCH] D32422: LoopVectorizer: let target prefer scalar addressing computations (+ minor improvements in SystemZTTI)

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat May 20 00:32:30 PDT 2017


jonpa marked an inline comment as done.
jonpa added inline comments.


================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:394
+  /// Return true if target doesn't mind addresses in vectors.
+  bool prefersVectorizedAddressing() const;
+
----------------
delena wrote:
> delena wrote:
> > jonpa wrote:
> > > rengolin wrote:
> > > > Can't you check for scatter/gather support directly?
> > > I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this.
> > > I guess I could, instead of adding this new hook, check whether getGatherScatterOpCost() returns INT_MAX.
> > > 
> > > I am not sure, however, whether it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?
> > > 
> > > 
> >   "Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?"
> > Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.
> "I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this."
> The API could be extended. We can add isLegalGather()/isLegalScatter() and it will be consistent and clear.
> Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.
I meant that this could be applied to cases *other* than gather/scatter.


================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:394
+  /// Return true if target doesn't mind addresses in vectors.
+  bool prefersVectorizedAddressing() const;
+
----------------
jonpa wrote:
> delena wrote:
> > delena wrote:
> > > jonpa wrote:
> > > > rengolin wrote:
> > > > > Can't you check for scatter/gather support directly?
> > > > I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this.
> > > > I guess I could, instead of adding this new hook, check whether getGatherScatterOpCost() returns INT_MAX.
> > > > 
> > > > I am not sure, however, whether it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?
> > > > 
> > > > 
> > >   "Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?"
> > > Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.
> > "I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this."
> > The API could be extended. We can add isLegalGather()/isLegalScatter() and it will be consistent and clear.
> > Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.
> I meant that this could be applied to cases *other* than gather/scatter.
I suppose that "prefersVectorizedAddressing()" could be named something else, but the point is that it is kept separate from the widening decision itself. Even though it currently only enables this for SystemZ, it could also become a way for other targets to get scalarized addressing for *non* gather/scatter accesses (or perhaps only for CM_Scalarize).

So first the widening decision is made, which includes calling isLegalScatter/Gather or similar. Once all memory accesses have been mapped to such a decision, we ask whether, for the non-gather/scatter accesses, the target wants the default vectorized addressing or prefers to keep it scalar.
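
To make the intended flow a bit more concrete, roughly (this is only a sketch, not the exact patch code; the SystemZ override is shown here just as an illustration of how a target would opt in):

```
// TargetTransformInfoImpl.h: the default keeps today's behavior.
bool prefersVectorizedAddressing() { return true; }

// Sketch of a SystemZ override: addresses end up in GPRs for address
// generation anyway, so prefer to keep addressing computations scalar.
bool SystemZTTIImpl::prefersVectorizedAddressing() { return false; }

// Cost model (simplified): only after all widening decisions have been
// made do we ask the target, and gather/scatter accesses are not touched.
if (!TTI.prefersVectorizedAddressing()) {
  // Collect the pointer operands of the non-gather/scatter accesses and
  // force their defining instructions (AddrDefs) to stay scalar.
}
```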



================
Comment at: include/llvm/Analysis/TargetTransformInfoImpl.h:234
 
+  bool prefersVectorizedAddressing() { return true; }
+
----------------
delena wrote:
> I assume that X86 would prefer to scalarize addressing for scalarized memops. But it is not clear how to use this interface.
OK - so far this patch only works for SystemZ. As already discussed, we will have to improve it somehow if X86 has an interest.


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2107
+  /// scalarized.
+  DenseMap<unsigned, SmallPtrSet<Instruction *, 4>> ForcedScalars;
+
----------------
delena wrote:
> May be we can add these instructions to the widening decisions map? This map contains only memops now, right?
They are not simply CM_Scalarize, because they don't get any scalarization overhead.
(With more intelligent handling of scalarized instruction sequences, I guess the two would become equivalent.)

Also, ForcedScalars contains not just memory instructions, but any other instruction leading to an address.
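
Roughly, the idea is that the per-instruction cost query checks the map first (a sketch only; the helper name is hypothetical and the real signatures are simplified):

```
// Sketch: an instruction in ForcedScalars[VF] is costed as VF scalar
// copies, with no insert/extract (scalarization) overhead added on top.
auto It = ForcedScalars.find(VF);
if (VF > 1 && It != ForcedScalars.end() && It->second.count(I))
  return VF * getScalarCost(I); // getScalarCost is a hypothetical helper
```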


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7301
+
+  // Start with all scalar pointer uses.
+  SmallPtrSet<Instruction *, 8> AddrDefs;
----------------
delena wrote:
> jonpa wrote:
> > delena wrote:
> > > But interleave loads are also here. You don't need to scalarize their pointers.
> > > attempt to scalarize the Ptr should be done if the taken decision is "Scalarize".
> > I tried
> > 
> > ```
> >       if (PtrDef && TheLoop->contains(PtrDef) &&
> >           getWideningDecision(&I, VF) == CM_Scalarize)
> >         AddrDefs.insert(PtrDef);
> > ```
> > 
> > This was a minor change: only 17 loops became different. This seems to relate to the cases of mixed context (used both as address and otherwise), which means that instead of extracting an address we now do inserts instead.
> > 
> > Not sure exactly whether this matters, but from what I can see, it seemed slightly better in terms of final MIs to scalarize "anything" (not just the scalarized case), perhaps because of the big benefit of LSR working on scalar addresses.
> > 
> > If you don't have a strong opinion, I guess I would be most happy not to change anything here.
> > 
> I just do not understand why you need to scalarize addressing if the instruction itself is not scalarized.
> Can you give me an IR example?
Again, "This seems to relate to the cases of mixed context (used both as address and otherwise)"...

For example:

```
for.body.i:                                       ; preds = %for.body.i, %for.body.preheader.i
  %indvars.iv241.i = phi i64 [ %47, %for.body.preheader.i ], [ %indvars.iv.next242.i, %for.body.i ]
  %indvars.iv.next242.i = add nsw i64 %indvars.iv241.i, -1
  %arrayidx.i.i = getelementptr inbounds i32, i32* %42, i64 %indvars.iv.next242.i
  %48 = trunc i64 %indvars.iv.next242.i to i32
  store i32 %48, i32* %arrayidx.i.i, align 4, !tbaa !47
  %tobool.i = icmp eq i32 %48, 0
  br i1 %tobool.i, label %for.end.i.loopexit, label %for.body.i
```

This is a mixed use, where the truncated IV is stored at the address indexed by the IV. Due to the non-address use, the IV will be contained in a vector.


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7331
+        // Scalarize a load of address
+        setWideningDecision(I, VF, CM_Scalarize,
+                            (VF * getMemoryInstructionCost(I, 1)));
----------------
delena wrote:
> delena wrote:
> > jonpa wrote:
> > > delena wrote:
> > > > At this point you change the widening decision of an already analyzed Load inst. Make sure that this inst is not part of an interleave group.
> > > I tried to only change the widening decision if it was CM_Widen, and reran spec. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got a slightly worse result. So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. They are 64 bits, which means 2 per vector register, so there isn't that much room for vectorization anyway.
> > > 
> > > Here's an example:
> > > 
> > > Loop before vectorize pass:
> > > 
> > > 
> > > ```
> > > for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
> > >   %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
> > >   %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
> > >   %20 = load i32, i32* %next65, align 4, !tbaa !11
> > >   %idxprom66 = sext i32 %20 to i64
> > >   %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
> > >   %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
> > >   store i32 %21, i32* %next65, align 4, !tbaa !11
> > >   %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
> > >   %cmp59 = icmp slt i64 %indvars.iv81, %16
> > >   br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit
> > > 
> > > PATCH UNMODIFIED:
> > > 
> > > Loop after vectorize pass:
> > > 
> > > vector.body:                                      ; preds = %vector.body, %vector.ph
> > >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> > >   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
> > >   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
> > >   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
> > >   %20 = add i64 %index, 0
> > >   %21 = add i64 %index, 1
> > >   %22 = add i64 %index, 2
> > >   %23 = add i64 %index, 3
> > >   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
> > >   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
> > >   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
> > >   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
> > >   %28 = load i32, i32* %24, align 4, !tbaa !11
> > >   %29 = load i32, i32* %25, align 4, !tbaa !11
> > >   %30 = load i32, i32* %26, align 4, !tbaa !11
> > >   %31 = load i32, i32* %27, align 4, !tbaa !11
> > >   %32 = sext i32 %28 to i64
> > >   %33 = sext i32 %29 to i64
> > >   %34 = sext i32 %30 to i64
> > >   %35 = sext i32 %31 to i64
> > >   %36 = getelementptr inbounds i32, i32* %1, i64 %32
> > >   %37 = getelementptr inbounds i32, i32* %1, i64 %33
> > >   %38 = getelementptr inbounds i32, i32* %1, i64 %34
> > >   %39 = getelementptr inbounds i32, i32* %1, i64 %35
> > >   %40 = load i32, i32* %36, align 4, !tbaa !7
> > >   %41 = load i32, i32* %37, align 4, !tbaa !7
> > >   %42 = load i32, i32* %38, align 4, !tbaa !7
> > >   %43 = load i32, i32* %39, align 4, !tbaa !7
> > >   store i32 %40, i32* %24, align 4, !tbaa !11
> > >   store i32 %41, i32* %25, align 4, !tbaa !11
> > >   store i32 %42, i32* %26, align 4, !tbaa !11
> > >   store i32 %43, i32* %27, align 4, !tbaa !11
> > >   %index.next = add i64 %index, 4
> > >   %44 = icmp eq i64 %index.next, %n.vec
> > >   br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12
> > > 
> > > ->
> > > 
> > > Final Header: 
> > > BB#22: derived from LLVM BB %vector.body
> > >     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
> > >     Predecessors according to CFG: BB#21 BB#23
> > >         %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
> > >         %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
> > >         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
> > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > >         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
> > >         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
> > >         %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
> > >         %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
> > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > >         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
> > >         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
> > >         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
> > >         STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
> > >         STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
> > >         STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
> > >         STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
> > >         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> > > 
> > > instr-count:  17
> > > 
> > > MODIFIED TO NOT TOUCH INTERLEAVED LOADS:
> > > 
> > > Loop after vectorize pass:
> > > 
> > > vector.body:                                      ; preds = %vector.body, %vector.ph
> > >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> > >   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
> > >   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
> > >   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
> > >   %20 = add i64 %index, 0
> > >   %21 = add i64 %index, 1
> > >   %22 = add i64 %index, 2
> > >   %23 = add i64 %index, 3
> > >   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
> > >   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
> > >   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
> > >   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
> > >   %28 = getelementptr i32, i32* %24, i32 0
> > >   %29 = bitcast i32* %28 to <8 x i32>*
> > >   %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
> > >   %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
> > >   %30 = extractelement <4 x i32> %strided.vec, i32 0
> > >   %31 = sext i32 %30 to i64
> > >   %32 = extractelement <4 x i32> %strided.vec, i32 1
> > >   %33 = sext i32 %32 to i64
> > >   %34 = extractelement <4 x i32> %strided.vec, i32 2
> > >   %35 = sext i32 %34 to i64
> > >   %36 = extractelement <4 x i32> %strided.vec, i32 3
> > >   %37 = sext i32 %36 to i64
> > >   %38 = getelementptr inbounds i32, i32* %1, i64 %31
> > >   %39 = getelementptr inbounds i32, i32* %1, i64 %33
> > >   %40 = getelementptr inbounds i32, i32* %1, i64 %35
> > >   %41 = getelementptr inbounds i32, i32* %1, i64 %37
> > >   %42 = load i32, i32* %38, align 4, !tbaa !7
> > >   %43 = load i32, i32* %39, align 4, !tbaa !7
> > >   %44 = load i32, i32* %40, align 4, !tbaa !7
> > >   %45 = load i32, i32* %41, align 4, !tbaa !7
> > >   store i32 %42, i32* %24, align 4, !tbaa !11
> > >   store i32 %43, i32* %25, align 4, !tbaa !11
> > >   store i32 %44, i32* %26, align 4, !tbaa !11
> > >   store i32 %45, i32* %27, align 4, !tbaa !11
> > >   %index.next = add i64 %index, 4
> > >   %46 = icmp eq i64 %index.next, %n.vec
> > >   br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12
> > > 
> > > Final Header: 
> > > BB#22: derived from LLVM BB %vector.body
> > >     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
> > >     Predecessors according to CFG: BB#21 BB#23
> > >         %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
> > >         %R5D<def> = VLGVF %V0, %noreg, 2
> > >         %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
> > >         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
> > >         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
> > >         %R14D<def> = VLGVF %V0<kill>, %noreg, 0
> > >         %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
> > >         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
> > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > >         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
> > >         %R14D<def> = VLGVF %V0, %noreg, 0
> > >         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
> > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > >         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
> > >         %R11D<def> = VLGVF %V0<kill>, %noreg, 2
> > >         %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
> > >         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
> > >         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
> > >         STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
> > >         ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
> > >         STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
> > >         ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
> > >         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> > > 
> > > instr-count:  23
> > > 
> > > ```
> > > 
> > > I experimented with scalarizing the whole interleave group instead, like:
> > > 
> > > 
> > > ```
> > >   for (auto *I : AddrDefs) {
> > >     if (isa<LoadInst>(I)) {
> > >       if (getWideningDecision(I, VF) == CM_Widen)
> > >         // Scalarize a widened load of address.
> > >         setWideningDecision(I, VF, CM_Scalarize,
> > >                             (VF * getMemoryInstructionCost(I, 1)));
> > >       else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
> > >         // Scalarize an interleave group of address loads.
> > >         for (unsigned I = 0; I < Group->getFactor(); ++I) {
> > >           if (Instruction *Member = Group->getMember(I))
> > >             setWideningDecision(Member, VF, CM_Scalarize,
> > >                                 (VF * getMemoryInstructionCost(Member, 1)));
> > >         }
> > >       }
> > >     } else
> > >       // Make sure I gets scalarized and a cost estimate without
> > >       // scalarization overhead.
> > >       ForcedScalars[VF].insert(I);
> > >   }
> > > 
> > > ```
> > > 
> > > This was also very close to the original patch - just 5 new loops became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.
> > > 
> > > So to summarize, this really doesn't matter much (on spec), but it seems that the best thing to do is to scalarize the whole interleave group, second best is to just scalarize the address load (unmodified patch), and third is to interleave loads of addresses, which is not so good since we then have to do extracts...
> > > 
> > You can't change the decision for just one load from the interleave group - either all or nothing.
>   > "So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. "
> This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.
> You can't change the decision for just one load from the interleave group - either all or nothing.
OK - I will then update the patch to scalarize the whole interleave group, which also seemed best.


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7331
+        // Scalarize a load of address
+        setWideningDecision(I, VF, CM_Scalarize,
+                            (VF * getMemoryInstructionCost(I, 1)));
----------------
jonpa wrote:
> delena wrote:
> > delena wrote:
> > > jonpa wrote:
> > > > delena wrote:
> > > > > At this point you change the widening decision of an already analyzed Load inst. Make sure that this inst is not part of an interleave group.
> > > > I tried to only change the widening decision if it was CM_Widen, and reran spec. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got a slightly worse result. So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. They are 64 bits, which means 2 per vector register, so there isn't that much room for vectorization anyway.
> > > > 
> > > > Here's an example:
> > > > 
> > > > Loop before vectorize pass:
> > > > 
> > > > 
> > > > ```
> > > > for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
> > > >   %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
> > > >   %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
> > > >   %20 = load i32, i32* %next65, align 4, !tbaa !11
> > > >   %idxprom66 = sext i32 %20 to i64
> > > >   %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
> > > >   %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
> > > >   store i32 %21, i32* %next65, align 4, !tbaa !11
> > > >   %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
> > > >   %cmp59 = icmp slt i64 %indvars.iv81, %16
> > > >   br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit
> > > > 
> > > > PATCH UNMODIFIED:
> > > > 
> > > > Loop after vectorize pass:
> > > > 
> > > > vector.body:                                      ; preds = %vector.body, %vector.ph
> > > >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> > > >   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
> > > >   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
> > > >   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
> > > >   %20 = add i64 %index, 0
> > > >   %21 = add i64 %index, 1
> > > >   %22 = add i64 %index, 2
> > > >   %23 = add i64 %index, 3
> > > >   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
> > > >   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
> > > >   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
> > > >   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
> > > >   %28 = load i32, i32* %24, align 4, !tbaa !11
> > > >   %29 = load i32, i32* %25, align 4, !tbaa !11
> > > >   %30 = load i32, i32* %26, align 4, !tbaa !11
> > > >   %31 = load i32, i32* %27, align 4, !tbaa !11
> > > >   %32 = sext i32 %28 to i64
> > > >   %33 = sext i32 %29 to i64
> > > >   %34 = sext i32 %30 to i64
> > > >   %35 = sext i32 %31 to i64
> > > >   %36 = getelementptr inbounds i32, i32* %1, i64 %32
> > > >   %37 = getelementptr inbounds i32, i32* %1, i64 %33
> > > >   %38 = getelementptr inbounds i32, i32* %1, i64 %34
> > > >   %39 = getelementptr inbounds i32, i32* %1, i64 %35
> > > >   %40 = load i32, i32* %36, align 4, !tbaa !7
> > > >   %41 = load i32, i32* %37, align 4, !tbaa !7
> > > >   %42 = load i32, i32* %38, align 4, !tbaa !7
> > > >   %43 = load i32, i32* %39, align 4, !tbaa !7
> > > >   store i32 %40, i32* %24, align 4, !tbaa !11
> > > >   store i32 %41, i32* %25, align 4, !tbaa !11
> > > >   store i32 %42, i32* %26, align 4, !tbaa !11
> > > >   store i32 %43, i32* %27, align 4, !tbaa !11
> > > >   %index.next = add i64 %index, 4
> > > >   %44 = icmp eq i64 %index.next, %n.vec
> > > >   br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12
> > > > 
> > > > ->
> > > > 
> > > > Final Header: 
> > > > BB#22: derived from LLVM BB %vector.body
> > > >     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
> > > >     Predecessors according to CFG: BB#21 BB#23
> > > >         %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
> > > >         %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
> > > >         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
> > > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > > >         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
> > > >         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
> > > >         %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
> > > >         %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
> > > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > > >         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
> > > >         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
> > > >         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
> > > >         STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
> > > >         STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
> > > >         STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
> > > >         STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
> > > >         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> > > > 
> > > > instr-count:  17
> > > > 
> > > > MODIFIED TO NOT TOUCH INTERLEAVED LOADS:
> > > > 
> > > > Loop after vectorize pass:
> > > > 
> > > > vector.body:                                      ; preds = %vector.body, %vector.ph
> > > >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> > > >   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
> > > >   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
> > > >   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
> > > >   %20 = add i64 %index, 0
> > > >   %21 = add i64 %index, 1
> > > >   %22 = add i64 %index, 2
> > > >   %23 = add i64 %index, 3
> > > >   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
> > > >   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
> > > >   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
> > > >   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
> > > >   %28 = getelementptr i32, i32* %24, i32 0
> > > >   %29 = bitcast i32* %28 to <8 x i32>*
> > > >   %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
> > > >   %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
> > > >   %30 = extractelement <4 x i32> %strided.vec, i32 0
> > > >   %31 = sext i32 %30 to i64
> > > >   %32 = extractelement <4 x i32> %strided.vec, i32 1
> > > >   %33 = sext i32 %32 to i64
> > > >   %34 = extractelement <4 x i32> %strided.vec, i32 2
> > > >   %35 = sext i32 %34 to i64
> > > >   %36 = extractelement <4 x i32> %strided.vec, i32 3
> > > >   %37 = sext i32 %36 to i64
> > > >   %38 = getelementptr inbounds i32, i32* %1, i64 %31
> > > >   %39 = getelementptr inbounds i32, i32* %1, i64 %33
> > > >   %40 = getelementptr inbounds i32, i32* %1, i64 %35
> > > >   %41 = getelementptr inbounds i32, i32* %1, i64 %37
> > > >   %42 = load i32, i32* %38, align 4, !tbaa !7
> > > >   %43 = load i32, i32* %39, align 4, !tbaa !7
> > > >   %44 = load i32, i32* %40, align 4, !tbaa !7
> > > >   %45 = load i32, i32* %41, align 4, !tbaa !7
> > > >   store i32 %42, i32* %24, align 4, !tbaa !11
> > > >   store i32 %43, i32* %25, align 4, !tbaa !11
> > > >   store i32 %44, i32* %26, align 4, !tbaa !11
> > > >   store i32 %45, i32* %27, align 4, !tbaa !11
> > > >   %index.next = add i64 %index, 4
> > > >   %46 = icmp eq i64 %index.next, %n.vec
> > > >   br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12
> > > > 
> > > > Final Header: 
> > > > BB#22: derived from LLVM BB %vector.body
> > > >     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
> > > >     Predecessors according to CFG: BB#21 BB#23
> > > >         %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
> > > >         %R5D<def> = VLGVF %V0, %noreg, 2
> > > >         %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
> > > >         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
> > > >         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
> > > >         %R14D<def> = VLGVF %V0<kill>, %noreg, 0
> > > >         %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
> > > >         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
> > > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > > >         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
> > > >         %R14D<def> = VLGVF %V0, %noreg, 0
> > > >         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
> > > >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> > > >         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
> > > >         %R11D<def> = VLGVF %V0<kill>, %noreg, 2
> > > >         %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
> > > >         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
> > > >         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
> > > >         STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
> > > >         ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
> > > >         STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
> > > >         ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
> > > >         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> > > > 
> > > > instr-count:  23
> > > > 
> > > > ```
> > > > 
> > > > I experimented with scalarizing the whole interleave group instead, like:
> > > > 
> > > > 
> > > > ```
> > > >   for (auto *I : AddrDefs) {
> > > >     if (isa<LoadInst>(I)) {
> > > >       if (getWideningDecision(I, VF) == CM_Widen)
> > > >         // Scalarize a widened load of address.
> > > >         setWideningDecision(I, VF, CM_Scalarize,
> > > >                             (VF * getMemoryInstructionCost(I, 1)));
> > > >       else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
> > > >         // Scalarize an interleave group of address loads.
> > > >         for (unsigned I = 0; I < Group->getFactor(); ++I) {
> > > >           if (Instruction *Member = Group->getMember(I))
> > > >             setWideningDecision(Member, VF, CM_Scalarize,
> > > >                                 (VF * getMemoryInstructionCost(Member, 1)));
> > > >         }
> > > >       }
> > > >     } else
> > > >       // Make sure I gets scalarized and a cost estimate without
> > > >       // scalarization overhead.
> > > >       ForcedScalars[VF].insert(I);
> > > >   }
> > > > 
> > > > ```
> > > > 
> > > > This was also very close to the original patch - just 5 new loops became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.
> > > > 
> > > > So to summarize, this really doesn't matter much (on spec), but it seems that the best thing to do is to scalarize the whole interleave group, second best is to just scalarize the address load (unmodified patch), and third is to interleave loads of addresses, which is not so good since we then have to do extracts...
> > > > 
> > > You can't change the decision for just one load from the interleave group - either all or nothing.
> >   > "So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. "
> > This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.
> > You can't change the decision for just one load from the interleave group - either all or nothing.
> OK - I will then update the patch to scalarize the whole interleave group, which also seemed best.
> This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.
It would have been nice to instead just return a very high value from getInterleavedMemoryOpCost() for loads of addresses, but I found that there are cases where an index is loaded, so catching this is not just a matter of checking whether the passed type is a pointer type.

An alternative would then be to pass the instruction pointer into that method and do a search there to see whether the loaded value is involved in addressing. That way the widening decision for such a load would always be CM_Scalarize before this patch starts to work. This is, however, just extra work compared to the current patch, which already finds those instructions.

Perhaps yet another alternative would be an extra boolean argument to that method and to getMemoryOpCost(), something like IsInvolvedInAddressing, so that this new patch could call them again to check whether they return a very high value. This would then be called at the point where the patch is scalarizing things.
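
Just to illustrate what I mean (the IsInvolvedInAddressing argument and the threshold are hypothetical, not existing API):

```
// Hypothetical extension of the existing hook (sketch only):
int getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
                    unsigned AddressSpace,
                    bool IsInvolvedInAddressing = false);

// At the point where this patch scalarizes the address defs, it could
// then re-query the cost and only force the load scalar if the target
// reports a prohibitively high cost for the vectorized form:
if (TTI.getMemoryOpCost(I->getOpcode(), Ty, Alignment, AS,
                        /*IsInvolvedInAddressing=*/true) >= VeryHighCost)
  setWideningDecision(I, VF, CM_Scalarize,
                      VF * getMemoryInstructionCost(I, 1));
```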



https://reviews.llvm.org/D32422




