[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Thu Jan 16 09:26:41 PST 2014

Hi Diego, 

It looks like the problem is with the code in the vectorizer that tries to estimate the most profitable vectorization factor:

> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
> i64* %state, align 8, !dbg !58, !tbaa !61
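
For reference, the totals in Diego's log (quoted below) are consistent with summing the per-instruction costs and dividing by the VF. The short C sketch below only reproduces that arithmetic. The helper name and the integer division are my assumptions about how the printed totals come about, not code taken from the vectorizer.

```c
#include <stddef.h>

/* Sum the per-instruction costs and normalize by the vectorization factor.
   Integer division is an assumption that matches the totals in the log. */
static int loop_cost_per_iteration(const int *costs, size_t n, int vf) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += costs[i];
    return total / vf;
}

/* Per-instruction costs copied from the quoted log, in order:
   phi, getelementptr, load, xor, store, add, trunc, icmp, br. */
static const int cost_vf1[] = {0, 0, 1, 1, 1, 1, 0, 1, 0};
static const int cost_vf2[] = {0, 0, 6, 1, 6, 1, 0, 1, 0};
```

With these numbers VF 1 gives 5 and VF 2 gives 15 / 2 = 7, so VF 1 is selected. The two costs of 6 for the load and the store account for the difference: if each were costed at 1 instead (a hypothetical figure), the VF 2 total would be 5 / 2 = 2.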

This looks like a cost model problem.  The vectorizer thinks that the load of %3 (above) is non-consecutive and would require scatter/gather.  Is that correct? I wonder what SCEV is reporting. Is there an index overflow problem that is preventing us from loading consecutive elements?
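
To make the question concrete: the getelementptr in the log indexes an array of structs and selects field 1, which in C corresponds roughly to the sketch below. Only the i64 field and the i32 loop bound are visible in the IR; the type name, the field names, and the layout of field 0 are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the IR only shows that field 1 is an i64.
   The name and type of field 0 are assumptions. */
typedef struct {
    float amplitude[2];   /* assumed field 0 */
    uint64_t state;       /* field 1, the i64 loaded as %3 */
} node_t;

/* Byte distance between the state fields of node[i] and node[i + 1]. */
static size_t state_stride_bytes(void) {
    return sizeof(node_t);
}

/* Scalar loop with the shape of for.body: load, xor, store on field 1,
   with a 32-bit signed bound as in the icmp slt i32. */
static void flip_bit(node_t *node, int size, int target) {
    assert(target >= 0 && target < 64);
    uint64_t mask = (uint64_t)1 << target;   /* loop-invariant, like %shl */
    for (int i = 0; i < size; i++)
        node[i].state ^= mask;
}
```

If field 0 has nonzero size, successive state fields are sizeof(node_t) bytes apart rather than 8, so the accesses are strided rather than consecutive.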

Thanks,
Nadav

On Jan 16, 2014, at 8:16 AM, Diego Novillo <dnovillo at google.com> wrote:

> On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
> 
>> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?  I
>> wonder if 'size' is typically high or low.
> 
> No. The vectorizer stated that it wasn't going to bother with the loop
> because it wasn't profitable. Specifically:
> 
> LV: Checking a loop in "quantum_sigma_x"
> LV: Found a loop: for.body
> LV: Found an induction variable.
> LV: Found a write-only loop!
> LV: We can vectorize this loop!
> LV: Found trip count: 0
> LV: The Widest type: 64 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
> %for.body ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %state =
> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
> %indvars.iv, i32 1, !dbg !58
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load
> i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor =
> xor i64 %3, %shl, !dbg !58
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %4 =
> trunc i64 %indvars.iv.next to i32, !dbg !52
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp =
> icmp slt i32 %4, %1, !dbg !52
> LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
> LV: Scalar loop costs: 5.
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
> %for.body ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %state =
> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
> %indvars.iv, i32 1, !dbg !58
> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
> i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor =
> xor i64 %3, %shl, !dbg !58
> LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64
> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %4 =
> trunc i64 %indvars.iv.next to i32, !dbg !52
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp =
> icmp slt i32 %4, %1, !dbg !52
> LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
> LV: Vector loop of width 2 costs: 7.
> LV: Selecting VF = : 1.
> LV: The target has 16 vector registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #5 Interval # 2
> LV(REG): At #6 Interval # 2
> LV(REG): At #7 Interval # 2
> LV(REG): Found max usage: 3
> LV(REG): Found invariant usage: 3
> LV(REG): LoopSize: 9
> LV: Found a vectorizable loop (1) in gates.ll
> LV: Unroll Factor is 1
> LV: Vectorization is possible but not beneficial.
> 
> 
> I poked briefly at the vectorizer code to see if there is anything
> that the profile data could've told it, but this loop did not meet the
> requirements for unrolling. And even if it did, the trip count is not
> constant and the unroll factor used by the vectorizer is pretty low.
> So, even if we vectorized it (or parts of it) I don't think the
> speedup would be significant.
> 
> What really helps this loop is to peel it a few times and do the
> remaining iterations in the loop.
> 
> 
> Diego.
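
For illustration, the peeling Diego describes could look roughly like the C sketch below. The struct layout, the function name, and the peel count of 2 are assumptions; in practice the count would presumably come from the profiled trip count.

```c
#include <stdint.h>

/* Hypothetical layout, as inferred from the IR: field 1 is an i64. */
typedef struct {
    float amplitude[2];   /* assumed field 0 */
    uint64_t state;       /* field 1 */
} node_t;

static void flip_bit_peeled(node_t *node, int size, int target) {
    uint64_t mask = (uint64_t)1 << target;
    int i = 0;

    /* Peeled iterations: straight-line copies of the body, each guarded
       by the original exit test. */
    if (i < size) { node[i].state ^= mask; i++; }
    if (i < size) { node[i].state ^= mask; i++; }

    /* Remaining iterations, if any, run in the original loop. */
    for (; i < size; i++)
        node[i].state ^= mask;
}
```

When the trip count is usually at or below the peel count, the loop itself is rarely entered, which is where the profile data would help.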