[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Thu Jan 16 08:16:59 PST 2014

On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?  I
> wonder if 'size’ is typically high or low.

No. The vectorizer stated that it wasn't going to bother with the loop
because it wasn't profitable. Specifically:

LV: Checking a loop in "quantum_sigma_x"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: Found a write-only loop!
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Widest type: 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
%for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %state =
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
%indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor =
xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
%xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction:   %4 =
trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp =
icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
%cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Scalar loop costs: 5.
LV: Found an estimated cost of 0 for VF 2 For instruction:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
%for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %state =
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
%indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor =
xor i64 %3, %shl, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64
%xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction:   %4 =
trunc i64 %indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp =
icmp slt i32 %4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
%cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Vector loop of width 2 costs: 7.
LV: Selecting VF = : 1.
LV: The target has 16 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #5 Interval # 2
LV(REG): At #6 Interval # 2
LV(REG): At #7 Interval # 2
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 3
LV(REG): LoopSize: 9
LV: Found a vectorizable loop (1) in gates.ll
LV: Unroll Factor is 1
LV: Vectorization is possible but not beneficial.

I poked briefly at the vectorizer code to see if there is anything
that the profile data could've told it, but this loop did not meet the
requirements for unrolling. And even if it did, the trip count is not
constant and the unroll factor used by the vectorizer is pretty low.
So, even if we vectorized it (or parts of it) I don't think the
speedup would be significant.

What really helps this loop is to peel it a few times and do the
remaining iterations in the loop.

Diego.