[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Thu Jan 16 09:29:31 PST 2014

On Jan 16, 2014, at 9:26 AM, Nadav Rotem <nrotem at apple.com> wrote:

> Hi Diego, 
> 
> It looks like the problem is with the code in the vectorizer that tries to estimate the most profitable vectorization factor:
> 
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
> 
> 
> It looks like a cost model problem.  The vectorizer thinks that loading %3 (above) is non consecutive and would require scatter/gather.  Is that correct? I wonder that SCEV is reporting. Is there an index overflow problem that is preventing us from loading consecutive elements?
> 

Oh, yes, yes.

reg->node[i].state ^=  ...

We are accessing a struct with multiple members. This would require strided access because we need to skip the other members of the struct. 

Thanks,
Nadav 

> 
> On Jan 16, 2014, at 8:16 AM, Diego Novillo <dnovillo at google.com> wrote:
> 
>> On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
>> 
>>> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?  I
>>> wonder if 'size’ is typically high or low.
>> 
>> No. The vectorizer stated that it wasn't going to bother with the loop
>> because it wasn't profitable. Specifically:
>> 
>> LV: Checking a loop in "quantum_sigma_x"
>> LV: Found a loop: for.body
>> LV: Found an induction variable.
>> LV: Found a write-only loop!
>> LV: We can vectorize this loop!
>> LV: Found trip count: 0
>> LV: The Widest type: 64 bits.
>> LV: Found an estimated cost of 0 for VF 1 For instruction:
>> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
>> %for.body ]
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %state =
>> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
>> %indvars.iv, i32 1, !dbg !58
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor =
>> xor i64 %3, %shl, !dbg !58
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
>> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 1 For instruction:
>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %4 =
>> trunc i64 %indvars.iv.next to i32, !dbg !52
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp =
>> icmp slt i32 %4, %1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
>> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
>> LV: Scalar loop costs: 5.
>> LV: Found an estimated cost of 0 for VF 2 For instruction:
>> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
>> %for.body ]
>> LV: Found an estimated cost of 0 for VF 2 For instruction:   %state =
>> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
>> %indvars.iv, i32 1, !dbg !58
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor =
>> xor i64 %3, %shl, !dbg !58
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64
>> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 2 For instruction:
>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 2 For instruction:   %4 =
>> trunc i64 %indvars.iv.next to i32, !dbg !52
>> LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp =
>> icmp slt i32 %4, %1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
>> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
>> LV: Vector loop of width 2 costs: 7.
>> LV: Selecting VF = : 1.
>> LV: The target has 16 vector registers
>> LV(REG): Calculating max register usage:
>> LV(REG): At #0 Interval # 0
>> LV(REG): At #1 Interval # 1
>> LV(REG): At #2 Interval # 2
>> LV(REG): At #3 Interval # 3
>> LV(REG): At #5 Interval # 2
>> LV(REG): At #6 Interval # 2
>> LV(REG): At #7 Interval # 2
>> LV(REG): Found max usage: 3
>> LV(REG): Found invariant usage: 3
>> LV(REG): LoopSize: 9
>> LV: Found a vectorizable loop (1) in gates.ll
>> LV: Unroll Factor is 1
>> LV: Vectorization is possible but not beneficial.
>> 
>> 
>> I poked briefly at the vectorizer code to see if there is anything
>> that the profile data could've told it, but this loop did not meet the
>> requirements for unrolling. And even if it did, the trip count is not
>> constant and the unroll factor used by the vectorizer is pretty low.
>> So, even if we vectorized it (or parts of it) I don't think the
>> speedup would be significant.
>> 
>> What really helps this loop is to peel it a few times and do the
>> remaining iterations in the loop.
>> 
>> 
>> Diego.
>