[PATCH] D32422: LoopVectorizer: let target prefer scalar addressing computations (+ minor improvements in SystemZTTI)

Elena Demikhovsky via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Fri May 19 00:56:45 PDT 2017


delena added inline comments.


================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:394
+  /// Return true if target doesn't mind addresses in vectors.
+  bool prefersVectorizedAddressing() const;
+
----------------
jonpa wrote:
> rengolin wrote:
> > Can't you check for scatter/gather support directly?
> I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this.
> I guess that instead of this new hook I could check if getGatherScatterOpCost() returns INT_MAX.
> 
> I am not sure however if it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?
> 
> 
  "Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?"
Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.


================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:394
+  /// Return true if target doesn't mind addresses in vectors.
+  bool prefersVectorizedAddressing() const;
+
----------------
delena wrote:
> jonpa wrote:
> > rengolin wrote:
> > > Can't you check for scatter/gather support directly?
> > I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this.
> > I guess that instead of this new hook I could check if getGatherScatterOpCost() returns INT_MAX.
> > 
> > I am not sure however if it will always be true that targets want to keep it this simple. Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?
> > 
> > 
>   "Couldn't it be that a target with such support actually wants scalarized addressing for scalarized/vectorized memory accesses, while still doing gather/scatter whenever possible?"
> Scalarizing memory addressing before gather/scatter does not make sense on X86. I can't say anything about other targets.
"I didn't want to use isLegalMaskedScatter() / isLegalMaskedGather(), because "masked" has nothing to do with this."
The API could be extended: we could add isLegalGather()/isLegalScatter(), which would be consistent and clear.
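For illustration only, such hooks might look roughly like the existing masked variants in TargetTransformInfo.h (a sketch following the isLegalMaskedGather()/isLegalMaskedScatter() pattern, not part of this patch):

```
/// Return true if the target supports (non-masked) gather load
/// instructions for the given data type.
bool isLegalGather(Type *DataType) const;

/// Return true if the target supports (non-masked) scatter store
/// instructions for the given data type.
bool isLegalScatter(Type *DataType) const;
```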


================
Comment at: include/llvm/Analysis/TargetTransformInfoImpl.h:234
 
+  bool prefersVectorizedAddressing() { return true; }
+
----------------
I assume that X86 would prefer to scalarize addressing for scalarized memops. But it is not clear how to use this interface for that.
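For reference, the intended use seems to be a per-target override of this default; a minimal sketch, assuming the SystemZTTIImpl class that the SystemZ part of this patch touches:

```
// Sketch only: opt out of vectorized address computations so that
// addressing stays scalar and LSR can work on the scalar address chains.
bool SystemZTTIImpl::prefersVectorizedAddressing() { return false; }
```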


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2107
+  /// scalarized.
+  DenseMap<unsigned, SmallPtrSet<Instruction *, 4>> ForcedScalars;
+
----------------
Maybe we could add these instructions to the widening decisions map? That map contains only memops now, right?
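A rough sketch of what that alternative would mean, assuming the widening-decision interface from this patch (setWideningDecision/CM_Scalarize) could also accept the non-memory address computations, which is exactly the open question here:

```
// Instead of a separate ForcedScalars map, record the forced-scalar
// address computations through the existing widening-decision
// interface (sketch only; the cost of 0 is just a placeholder).
for (Instruction *I : AddrDefs)
  setWideningDecision(I, VF, CM_Scalarize, /*Cost=*/0);
```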


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7301
+
+  // Start with all scalar pointer uses.
+  SmallPtrSet<Instruction *, 8> AddrDefs;
----------------
jonpa wrote:
> delena wrote:
> > But interleave loads are also here. You don't need to scalarize their pointers.
> > attempt to scalarize the Ptr should be done if the taken decision is "Scalarize".
> I tried
> 
> ```
>       if (PtrDef && TheLoop->contains(PtrDef) &&
>           getWideningDecision(&I, VF) == CM_Scalarize)
>         AddrDefs.insert(PtrDef);
> ```
> 
> This was a minor change: only 17 loops became different. This seems to relate to cases of mixed context (used both as an address and otherwise), which means that instead of extracting an address we now do inserts.
> 
> Not sure exactly if this matters, but from what I can see, it seemed slightly better in terms of final MIs to scalarize "anything" (not just the scalarized case), perhaps because of the big benefit of LSR working on scalar addresses?
> 
> If you don't have a strong opinion, I guess I would be most happy not to change anything here.
> 
I just do not understand why you need to scalarize addressing if the instruction itself is not scalarized.
Can you give me an IR example?


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7331
+        // Scalarize a load of address
+        setWideningDecision(I, VF, CM_Scalarize,
+                            (VF * getMemoryInstructionCost(I, 1)));
----------------
jonpa wrote:
> delena wrote:
> > At this point you change widening decision of already analyzed Load inst. Make sure that this inst is not a part of interleave group.
> I tried to only change the widening decision if it was CM_Widen, and reran spec. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got slightly worse results. So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. They are 64 bits, which means 2 per vector register, so there isn't that much room for vectorization anyway.
> 
> Here's an example:
> 
> Loop before vectorize pass:
> 
> 
> ```
> for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
>   %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
>   %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
>   %20 = load i32, i32* %next65, align 4, !tbaa !11
>   %idxprom66 = sext i32 %20 to i64
>   %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
>   %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
>   store i32 %21, i32* %next65, align 4, !tbaa !11
>   %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
>   %cmp59 = icmp slt i64 %indvars.iv81, %16
>   br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit
> 
> PATCH UNMODIFIED:
> 
> Loop after vectorize pass:
> 
> vector.body:                                      ; preds = %vector.body, %vector.ph
>   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
>   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
>   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
>   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
>   %20 = add i64 %index, 0
>   %21 = add i64 %index, 1
>   %22 = add i64 %index, 2
>   %23 = add i64 %index, 3
>   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
>   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
>   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
>   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
>   %28 = load i32, i32* %24, align 4, !tbaa !11
>   %29 = load i32, i32* %25, align 4, !tbaa !11
>   %30 = load i32, i32* %26, align 4, !tbaa !11
>   %31 = load i32, i32* %27, align 4, !tbaa !11
>   %32 = sext i32 %28 to i64
>   %33 = sext i32 %29 to i64
>   %34 = sext i32 %30 to i64
>   %35 = sext i32 %31 to i64
>   %36 = getelementptr inbounds i32, i32* %1, i64 %32
>   %37 = getelementptr inbounds i32, i32* %1, i64 %33
>   %38 = getelementptr inbounds i32, i32* %1, i64 %34
>   %39 = getelementptr inbounds i32, i32* %1, i64 %35
>   %40 = load i32, i32* %36, align 4, !tbaa !7
>   %41 = load i32, i32* %37, align 4, !tbaa !7
>   %42 = load i32, i32* %38, align 4, !tbaa !7
>   %43 = load i32, i32* %39, align 4, !tbaa !7
>   store i32 %40, i32* %24, align 4, !tbaa !11
>   store i32 %41, i32* %25, align 4, !tbaa !11
>   store i32 %42, i32* %26, align 4, !tbaa !11
>   store i32 %43, i32* %27, align 4, !tbaa !11
>   %index.next = add i64 %index, 4
>   %44 = icmp eq i64 %index.next, %n.vec
>   br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12
> 
> ->
> 
> Final Header: 
> BB#22: derived from LLVM BB %vector.body
>     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
>     Predecessors according to CFG: BB#21 BB#23
>         %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
>         %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
>         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
>         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
>         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
>         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
>         %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
>         %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
>         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
>         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
>         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
>         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
>         STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
>         STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
>         STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
>         STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
>         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> 
> instr-count:  17
> 
> MODIFIED TO NOT TOUCH INTERLEAVED LOADS:
> 
> Loop after vectorize pass:
> 
> vector.body:                                      ; preds = %vector.body, %vector.ph
>   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
>   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
>   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
>   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
>   %20 = add i64 %index, 0
>   %21 = add i64 %index, 1
>   %22 = add i64 %index, 2
>   %23 = add i64 %index, 3
>   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
>   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
>   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
>   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
>   %28 = getelementptr i32, i32* %24, i32 0
>   %29 = bitcast i32* %28 to <8 x i32>*
>   %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
>   %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
>   %30 = extractelement <4 x i32> %strided.vec, i32 0
>   %31 = sext i32 %30 to i64
>   %32 = extractelement <4 x i32> %strided.vec, i32 1
>   %33 = sext i32 %32 to i64
>   %34 = extractelement <4 x i32> %strided.vec, i32 2
>   %35 = sext i32 %34 to i64
>   %36 = extractelement <4 x i32> %strided.vec, i32 3
>   %37 = sext i32 %36 to i64
>   %38 = getelementptr inbounds i32, i32* %1, i64 %31
>   %39 = getelementptr inbounds i32, i32* %1, i64 %33
>   %40 = getelementptr inbounds i32, i32* %1, i64 %35
>   %41 = getelementptr inbounds i32, i32* %1, i64 %37
>   %42 = load i32, i32* %38, align 4, !tbaa !7
>   %43 = load i32, i32* %39, align 4, !tbaa !7
>   %44 = load i32, i32* %40, align 4, !tbaa !7
>   %45 = load i32, i32* %41, align 4, !tbaa !7
>   store i32 %42, i32* %24, align 4, !tbaa !11
>   store i32 %43, i32* %25, align 4, !tbaa !11
>   store i32 %44, i32* %26, align 4, !tbaa !11
>   store i32 %45, i32* %27, align 4, !tbaa !11
>   %index.next = add i64 %index, 4
>   %46 = icmp eq i64 %index.next, %n.vec
>   br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12
> 
> Final Header: 
> BB#22: derived from LLVM BB %vector.body
>     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
>     Predecessors according to CFG: BB#21 BB#23
>         %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
>         %R5D<def> = VLGVF %V0, %noreg, 2
>         %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
>         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
>         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
>         %R14D<def> = VLGVF %V0<kill>, %noreg, 0
>         %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
>         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
>         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
>         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
>         %R14D<def> = VLGVF %V0, %noreg, 0
>         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
>         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
>         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
>         %R11D<def> = VLGVF %V0<kill>, %noreg, 2
>         %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
>         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
>         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
>         STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
>         ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
>         STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
>         ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
>         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> 
> instr-count:  23
> 
> ```
> 
> I experimented with scalarizing the whole interleave group instead, like:
> 
> 
> ```
>   for (auto *I : AddrDefs) {
>     if (isa<LoadInst>(I)) {
>       if (getWideningDecision(I, VF) == CM_Widen)
>         // Scalarize a widened load of address.
>         setWideningDecision(I, VF, CM_Scalarize,
>                             (VF * getMemoryInstructionCost(I, 1)));
>       else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
>         // Scalarize an interleave group of address loads.
>         for (unsigned I = 0; I < Group->getFactor(); ++I) {
>           if (Instruction *Member = Group->getMember(I))
>             setWideningDecision(Member, VF, CM_Scalarize,
>                                 (VF * getMemoryInstructionCost(Member, 1)));
>         }
>       }
>     } else
>       // Make sure I gets scalarized and a cost estimate without
>       // scalarization overhead.
>       ForcedScalars[VF].insert(I);
>   }
> 
> ```
> 
> This was also very close to the original patch - just 5 new loops now became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.
> 
> So to summarize, this really doesn't matter that much (on spec), but it seems that (on spec) the best thing to do is to scalarize the whole interleave group, second best is to just scalarize (unmodified patch), third place is to interleave loads of addresses, which is not so good as we then have to do extracts...
> 
You can't change the decision for just one load of an interleave group - it's all or nothing.
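In other words, a condensed sketch of that rule, using the same helpers as the snippet quoted above (here `I` is an address load taken from AddrDefs): if the load belongs to an interleave group, either leave the whole group's decision untouched or re-decide every member together.

```
// "All or nothing": never change the widening decision of a single
// member of an interleave group in isolation.
if (auto *Group = Legal->getInterleavedAccessGroup(I)) {
  for (unsigned J = 0; J < Group->getFactor(); ++J)
    if (Instruction *Member = Group->getMember(J))
      setWideningDecision(Member, VF, CM_Scalarize,
                          VF * getMemoryInstructionCost(Member, 1));
}
```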


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:7331
+        // Scalarize a load of address
+        setWideningDecision(I, VF, CM_Scalarize,
+                            (VF * getMemoryInstructionCost(I, 1)));
----------------
delena wrote:
> jonpa wrote:
> > delena wrote:
> > > At this point you change widening decision of already analyzed Load inst. Make sure that this inst is not a part of interleave group.
> > I tried to only change the widening decision if it was CM_Widen, and reran spec. 17 loops (0.6% of vectorized loops) now got different output, and all but 2 of them got slightly worse results. So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. They are 64 bits, which means 2 per vector register, so there isn't that much room for vectorization anyway.
> > 
> > Here's an example:
> > 
> > Loop before vectorize pass:
> > 
> > 
> > ```
> > for.body61:                                       ; preds = %for.body61, %for.body61.lr.ph
> >   %indvars.iv81 = phi i64 [ %indvars.iv.next82, %for.body61 ], [ 0, %for.body61.lr.ph ]
> >   %next65 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %indvars.iv81, i32 1
> >   %20 = load i32, i32* %next65, align 4, !tbaa !11
> >   %idxprom66 = sext i32 %20 to i64
> >   %arrayidx67 = getelementptr inbounds i32, i32* %1, i64 %idxprom66
> >   %21 = load i32, i32* %arrayidx67, align 4, !tbaa !7
> >   store i32 %21, i32* %next65, align 4, !tbaa !11
> >   %indvars.iv.next82 = add nuw nsw i64 %indvars.iv81, 1
> >   %cmp59 = icmp slt i64 %indvars.iv81, %16
> >   br i1 %cmp59, label %for.body61, label %for.cond75.preheader.loopexit
> > 
> > PATCH UNMODIFIED:
> > 
> > Loop after vectorize pass:
> > 
> > vector.body:                                      ; preds = %vector.body, %vector.ph
> >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> >   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
> >   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
> >   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
> >   %20 = add i64 %index, 0
> >   %21 = add i64 %index, 1
> >   %22 = add i64 %index, 2
> >   %23 = add i64 %index, 3
> >   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
> >   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
> >   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
> >   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
> >   %28 = load i32, i32* %24, align 4, !tbaa !11
> >   %29 = load i32, i32* %25, align 4, !tbaa !11
> >   %30 = load i32, i32* %26, align 4, !tbaa !11
> >   %31 = load i32, i32* %27, align 4, !tbaa !11
> >   %32 = sext i32 %28 to i64
> >   %33 = sext i32 %29 to i64
> >   %34 = sext i32 %30 to i64
> >   %35 = sext i32 %31 to i64
> >   %36 = getelementptr inbounds i32, i32* %1, i64 %32
> >   %37 = getelementptr inbounds i32, i32* %1, i64 %33
> >   %38 = getelementptr inbounds i32, i32* %1, i64 %34
> >   %39 = getelementptr inbounds i32, i32* %1, i64 %35
> >   %40 = load i32, i32* %36, align 4, !tbaa !7
> >   %41 = load i32, i32* %37, align 4, !tbaa !7
> >   %42 = load i32, i32* %38, align 4, !tbaa !7
> >   %43 = load i32, i32* %39, align 4, !tbaa !7
> >   store i32 %40, i32* %24, align 4, !tbaa !11
> >   store i32 %41, i32* %25, align 4, !tbaa !11
> >   store i32 %42, i32* %26, align 4, !tbaa !11
> >   store i32 %43, i32* %27, align 4, !tbaa !11
> >   %index.next = add i64 %index, 4
> >   %44 = icmp eq i64 %index.next, %n.vec
> >   br i1 %44, label %middle.block, label %vector.body, !llvm.loop !12
> > 
> > ->
> > 
> > Final Header: 
> > BB#22: derived from LLVM BB %vector.body
> >     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
> >     Predecessors according to CFG: BB#21 BB#23
> >         %R5D<def> = LGF %R4D, -32, %noreg; mem:LD4[%scevgep112](tbaa=!21)
> >         %R14D<def> = LGF %R4D, -40, %noreg; mem:LD4[%scevgep113](tbaa=!21)
> >         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
> >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> >         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%41](tbaa=!2)
> >         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%40](tbaa=!2)
> >         %R14D<def> = LGF %R4D, -48, %noreg; mem:LD4[%scevgep114](tbaa=!21)
> >         %R11D<def> = LGF %R4D, -56, %noreg; mem:LD4[%scevgep115](tbaa=!21)
> >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> >         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
> >         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%39](tbaa=!2)
> >         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%38](tbaa=!2)
> >         STFH %R14H<kill>, %R4D, -56, %noreg; mem:ST4[%scevgep115](tbaa=!21)
> >         STY %R14L<kill>, %R4D, -48, %noreg; mem:ST4[%scevgep114](tbaa=!21)
> >         STFH %R5H<kill>, %R4D, -40, %noreg; mem:ST4[%scevgep113](tbaa=!21)
> >         STY %R5L<kill>, %R4D, -32, %noreg; mem:ST4[%scevgep112](tbaa=!21)
> >         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> > 
> > instr-count:  17
> > 
> > MODIFIED TO NOT TOUCH INTERLEAVED LOADS:
> > 
> > Loop after vectorize pass:
> > 
> > vector.body:                                      ; preds = %vector.body, %vector.ph
> >   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> >   %broadcast.splatinsert = insertelement <4 x i64> undef, i64 %index, i32 0
> >   %broadcast.splat = shufflevector <4 x i64> %broadcast.splatinsert, <4 x i64> undef, <4 x i32> zeroinitializer
> >   %induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
> >   %20 = add i64 %index, 0
> >   %21 = add i64 %index, 1
> >   %22 = add i64 %index, 2
> >   %23 = add i64 %index, 3
> >   %24 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %20, i32 1
> >   %25 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %21, i32 1
> >   %26 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %22, i32 1
> >   %27 = getelementptr inbounds %struct.attrib, %struct.attrib* %15, i64 %23, i32 1
> >   %28 = getelementptr i32, i32* %24, i32 0
> >   %29 = bitcast i32* %28 to <8 x i32>*
> >   %wide.vec = load <8 x i32>, <8 x i32>* %29, align 4, !tbaa !11
> >   %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
> >   %30 = extractelement <4 x i32> %strided.vec, i32 0
> >   %31 = sext i32 %30 to i64
> >   %32 = extractelement <4 x i32> %strided.vec, i32 1
> >   %33 = sext i32 %32 to i64
> >   %34 = extractelement <4 x i32> %strided.vec, i32 2
> >   %35 = sext i32 %34 to i64
> >   %36 = extractelement <4 x i32> %strided.vec, i32 3
> >   %37 = sext i32 %36 to i64
> >   %38 = getelementptr inbounds i32, i32* %1, i64 %31
> >   %39 = getelementptr inbounds i32, i32* %1, i64 %33
> >   %40 = getelementptr inbounds i32, i32* %1, i64 %35
> >   %41 = getelementptr inbounds i32, i32* %1, i64 %37
> >   %42 = load i32, i32* %38, align 4, !tbaa !7
> >   %43 = load i32, i32* %39, align 4, !tbaa !7
> >   %44 = load i32, i32* %40, align 4, !tbaa !7
> >   %45 = load i32, i32* %41, align 4, !tbaa !7
> >   store i32 %42, i32* %24, align 4, !tbaa !11
> >   store i32 %43, i32* %25, align 4, !tbaa !11
> >   store i32 %44, i32* %26, align 4, !tbaa !11
> >   store i32 %45, i32* %27, align 4, !tbaa !11
> >   %index.next = add i64 %index, 4
> >   %46 = icmp eq i64 %index.next, %n.vec
> >   br i1 %46, label %middle.block, label %vector.body, !llvm.loop !12
> > 
> > Final Header: 
> > BB#22: derived from LLVM BB %vector.body
> >     Live Ins: %R0D %R1D %R2D %R3D %R4D %R12D %R13D
> >     Predecessors according to CFG: BB#21 BB#23
> >         %V0<def> = VL %R4D, 16, %noreg; mem:LD16[%lsr.iv106113+16](align=4)(tbaa=!21)
> >         %R5D<def> = VLGVF %V0, %noreg, 2
> >         %R5D<def> = LGFR %R5L<kill>, %R5D<imp-use,kill>
> >         %R5D<def> = SLLG %R5D<kill>, %noreg, 2
> >         %R5L<def> = L %R12D, 0, %R5D<kill>; mem:LD4[%44](tbaa=!2)
> >         %R14D<def> = VLGVF %V0<kill>, %noreg, 0
> >         %V0<def> = VL %R4D, 0, %noreg; mem:LD16[%lsr.iv106113](align=4)(tbaa=!21)
> >         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
> >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> >         %R5H<def> = LFH %R12D, 0, %R14D<kill>; mem:LD4[%43](tbaa=!2)
> >         %R14D<def> = VLGVF %V0, %noreg, 0
> >         %R14D<def> = LGFR %R14L<kill>, %R14D<imp-use,kill>
> >         %R14D<def> = SLLG %R14D<kill>, %noreg, 2
> >         %R14L<def> = L %R12D, 0, %R14D<kill>; mem:LD4[%41](tbaa=!2)
> >         %R11D<def> = VLGVF %V0<kill>, %noreg, 2
> >         %R11D<def> = LGFR %R11L<kill>, %R11D<imp-use,kill>
> >         %R11D<def> = SLLG %R11D<kill>, %noreg, 2
> >         %R14H<def> = LFH %R12D, 0, %R11D<kill>; mem:LD4[%42](tbaa=!2)
> >         STFH %R14H<kill>, %R4D, 8, %noreg; mem:ST4[%scevgep117](tbaa=!21)
> >         ST %R14L<kill>, %R4D, 0, %noreg; mem:ST4[%lsr.iv106108](tbaa=!21)
> >         STFH %R5H<kill>, %R4D, 16, %noreg; mem:ST4[%scevgep116](tbaa=!21)
> >         ST %R5L<kill>, %R4D, 24, %noreg; mem:ST4[%scevgep115](tbaa=!21)
> >         CGIJ %R3D, 0, 8, <BB#24>, %CC<imp-def,dead>
> > 
> > instr-count:  23
> > 
> > ```
> > 
> > I experimented with scalarizing the whole interleave group instead, like:
> > 
> > 
> > ```
> >   for (auto *I : AddrDefs) {
> >     if (isa<LoadInst>(I)) {
> >       if (getWideningDecision(I, VF) == CM_Widen)
> >         // Scalarize a widened load of address.
> >         setWideningDecision(I, VF, CM_Scalarize,
> >                             (VF * getMemoryInstructionCost(I, 1)));
> >       else if (auto Group = Legal->getInterleavedAccessGroup(I)) {
> >         // Scalarize an interleave group of address loads.
> >         for (unsigned I = 0; I < Group->getFactor(); ++I) {
> >           if (Instruction *Member = Group->getMember(I))
> >             setWideningDecision(Member, VF, CM_Scalarize,
> >                                 (VF * getMemoryInstructionCost(Member, 1)));
> >         }
> >       }
> >     } else
> >       // Make sure I gets scalarized and a cost estimate without
> >       // scalarization overhead.
> >       ForcedScalars[VF].insert(I);
> >   }
> > 
> > ```
> > 
> > This was also very close to the original patch - just 5 new loops now became vectorized, which seems to be a slight improvement in those loops. This is because the cost is now accurate for the scalarized loads.
> > 
> > So to summarize, this really doesn't matter that much (on spec), but it seems that (on spec) the best thing to do is to scalarize the whole interleave group, second best is to just scalarize (unmodified patch), third place is to interleave loads of addresses, which is not so good as we then have to do extracts...
> > 
> You can't change the decision for just one load of an interleave group - it's all or nothing.
  > "So this isn't a big deal, but still I would insist that there isn't a good reason here to interleave loads of addresses. "
This is a cost model decision. Interleaving of 64-bit loads is profitable on X86.


https://reviews.llvm.org/D32422




