[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence

Thu Oct 6 11:40:00 PDT 2016

Hi Jonas,

It does look like we should be able to simplify this. Would you mind filing
a bug? Looking at the code after InstCombine, the four vector ands are
trivially redundant (I think EarlyCSE should already be able to remove them).
I think we could then teach InstructionSimplify to simplify the remaining
shuffles, similar to the way it already handles extracts.
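
To make the redundancy concrete, here is a rough Python model of the shuffle
chain in the IR after InstCombine. This is only a sketch, not LLVM code: the
names `shufflevector` and `roundtrip` are made up for illustration, undef
lanes are modeled as None, and the masks are copied from the IR quoted below.

```python
def shufflevector(v1, v2, mask):
    """Model of LLVM's shufflevector: each mask index selects a lane
    from the concatenation of v1 and v2."""
    both = list(v1) + list(v2)
    return [both[i] for i in mask]


def roundtrip(x):
    """Apply the de-interleave / re-interleave chain to an 8-lane vector."""
    undef = [None] * 8  # stand-in for the undef second operand
    # Four stride-4 extracts (the %180, %182, %184, %186 shuffles).
    s0 = shufflevector(x, undef, [0, 4])
    s1 = shufflevector(x, undef, [1, 5])
    s2 = shufflevector(x, undef, [2, 6])
    s3 = shufflevector(x, undef, [3, 7])
    # Concatenate pairs (the %187 and %188 shuffles).
    lo = shufflevector(s0, s1, [0, 1, 2, 3])
    hi = shufflevector(s2, s3, [0, 1, 2, 3])
    # Final interleaving shuffle (%interleaved.vec470).
    return shufflevector(lo, hi, [0, 2, 4, 6, 1, 3, 5, 7])


print(roundtrip(list(range(8))))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Every lane ends up where it started, so once the duplicate ands are merged
the whole chain should fold to the single <8 x i64> and feeding the store.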

-- Matt

On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
> Hi,
>
> I have experimented with enabling the LoopVectorizer for SystemZ. I have
> come across a loop which, when vectorized, seems to have been poorly
> generated. In short, there seems to be a completely unnecessary sequence of
> shufflevector instructions that doesn't get optimized away anywhere. In
> other words, the shuffles just lead back to the original vector:
>
>        [0 1 2 3 4 5 6 7]
>
>  [0 4]   [1 5]   [2 6]   [3 7]
>
>    [0 4 1 5]       [2 6 3 7]
>
>        [0 1 2 3 4 5 6 7]
>
> Is this something the instruction combiner, or perhaps the
> InterleavedAccess pass, should handle? Even though I suspect that many
> SystemZ target hooks currently return bad values, this seems like
> something that the optimizers should handle regardless. The result is
> unnecessary target instructions, as can be seen at the bottom.
>
> I would appreciate any input on this, and if needed I can supply a test
> case.
>
> /Jonas
>
>
> Loop before vectorize pass:
>
> while.body320:                                    ; preds = %while.body320.preheader, %while.body320
>   %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
>   %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
>   %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
>   %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
>   %dec = add nsw i32 %len.0288, -1
>   %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>   %176 = load i64, i64* %ll.0290, align 8
>   %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>   %177 = load i64, i64* %rl.0289, align 8
>   %and322 = and i64 %177, %176
>   %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>   store i64 %and322, i64* %dl.0291, align 8
>   %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>   %178 = load i64, i64* %incdec.ptr, align 8
>   %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>   %179 = load i64, i64* %incdec.ptr321, align 8
>   %and326 = and i64 %179, %178
>   %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>   store i64 %and326, i64* %incdec.ptr323, align 8
>   %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>   %180 = load i64, i64* %incdec.ptr324, align 8
>   %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>   %181 = load i64, i64* %incdec.ptr325, align 8
>   %and330 = and i64 %181, %180
>   %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>   store i64 %and330, i64* %incdec.ptr327, align 8
>   %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>   %182 = load i64, i64* %incdec.ptr328, align 8
>   %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>   %183 = load i64, i64* %incdec.ptr329, align 8
>   %and334 = and i64 %183, %182
>   %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>   store i64 %and334, i64* %incdec.ptr331, align 8
>   %tobool319 = icmp eq i32 %dec, 0
>   br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
>
>
> Vectorizing:
>
> LV: Checking a loop in "Perl_do_vop" from do_vop.bc
> LV: Loop hints: force=? width=0 unroll=0
> LV: Found a loop: while.body320
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Did not find one integer induction var.
> LV: We can vectorize this loop (with a runtime bound check)!
> LV: Analyzing interleaved accesses...
> LV: Creating an interleave group with:  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted:  store i64 %and330, i64* %incdec.ptr327, align 8
>     into the interleave group with  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted:  store i64 %and326, i64* %incdec.ptr323, align 8
>     into the interleave group with  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted:  store i64 %and322, i64* %dl.0291, align 8
>     into the interleave group with  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Creating an interleave group with:  %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Inserted:  %181 = load i64, i64* %incdec.ptr325, align 8
>     into the interleave group with  %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Inserted:  %179 = load i64, i64* %incdec.ptr321, align 8
>     into the interleave group with  %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Inserted:  %177 = load i64, i64* %rl.0289, align 8
>     into the interleave group with  %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Creating an interleave group with:  %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Inserted:  %180 = load i64, i64* %incdec.ptr324, align 8
>     into the interleave group with  %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Inserted:  %178 = load i64, i64* %incdec.ptr, align 8
>     into the interleave group with  %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Inserted:  %176 = load i64, i64* %ll.0290, align 8
>     into the interleave group with  %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Found uniform instruction:   %tobool319 = icmp eq i32 %dec, 0
> LV: Found uniform instruction:   %incdec.ptr324 = getelementptr inbounds
> i64, i64* %ll.0290, i64 2
> LV: Found uniform instruction:   %incdec.ptr329 = getelementptr inbounds
> i64, i64* %rl.0289, i64 3
> LV: Found uniform instruction:   %incdec.ptr323 = getelementptr inbounds
> i64, i64* %dl.0291, i64 1
> LV: Found uniform instruction:   %incdec.ptr328 = getelementptr inbounds
> i64, i64* %ll.0290, i64 3
> LV: Found uniform instruction:   %incdec.ptr321 = getelementptr inbounds
> i64, i64* %rl.0289, i64 1
> LV: Found uniform instruction:   %incdec.ptr327 = getelementptr inbounds
> i64, i64* %dl.0291, i64 2
> LV: Found uniform instruction:   %incdec.ptr325 = getelementptr inbounds
> i64, i64* %rl.0289, i64 2
> LV: Found uniform instruction:   %incdec.ptr331 = getelementptr inbounds
> i64, i64* %dl.0291, i64 3
> LV: Found uniform instruction:   %incdec.ptr = getelementptr inbounds i64,
> i64* %ll.0290, i64 1
> LV: Found uniform instruction:   %dl.0291 = phi i64* [ %incdec.ptr335,
> %while.body320 ], [ %73, %while.body320.preheader ]
> LV: Found uniform instruction:   %incdec.ptr335 = getelementptr inbounds
> i64, i64* %dl.0291, i64 4
> LV: Found uniform instruction:   %ll.0290 = phi i64* [ %incdec.ptr332,
> %while.body320 ], [ %74, %while.body320.preheader ]
> LV: Found uniform instruction:   %incdec.ptr332 = getelementptr inbounds
> i64, i64* %ll.0290, i64 4
> LV: Found uniform instruction:   %rl.0289 = phi i64* [ %incdec.ptr333,
> %while.body320 ], [ %75, %while.body320.preheader ]
> LV: Found uniform instruction:   %incdec.ptr333 = getelementptr inbounds
> i64, i64* %rl.0289, i64 4
> LV: Found uniform instruction:   %len.0288 = phi i32 [ %dec,
> %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found uniform instruction:   %dec = add nsw i32 %len.0288, -1
> LV: Found trip count: 0
> LV: The Smallest and Widest types: 64 / 64 bits.
> LV: The Widest register is: 128 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %dl.0291 =
> phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %ll.0290 =
> phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %rl.0289 =
> phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %len.0288 =
> phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %dec = add
> nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %incdec.ptr =
> getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %176 = load
> i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %177 = load
> i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and322 = and
> i64 %177, %176
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %178 = load
> i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %179 = load
> i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and326 = and
> i64 %179, %178
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %180 = load
> i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %181 = load
> i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and330 = and
> i64 %181, %180
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %182 = load
> i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %183 = load
> i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and334 = and
> i64 %183, %182
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %tobool319 =
> icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Scalar loop costs: 18.
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %dl.0291 =
> phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %ll.0290 =
> phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %rl.0289 =
> phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %len.0288 =
> phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %dec = add
> nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %incdec.ptr =
> getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction:   %176 = load
> i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction:   %177 = load
> i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and322 = and
> i64 %177, %176
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64
> %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %178 = load
> i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %179 = load
> i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and326 = and
> i64 %179, %178
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64
> %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %180 = load
> i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %181 = load
> i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and330 = and
> i64 %181, %180
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64
> %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %182 = load
> i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %183 = load
> i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and334 = and
> i64 %183, %182
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 4 for VF 2 For instruction:   store i64
> %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %tobool319 =
> icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Vector loop of width 2 costs: 9.
> LV: Selecting VF: 2.
> LV: The target has 32 registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #4 Interval # 4
> LV(REG): At #5 Interval # 4
> LV(REG): At #6 Interval # 5
> LV(REG): At #7 Interval # 6
> LV(REG): At #8 Interval # 7
> LV(REG): At #9 Interval # 8
> LV(REG): At #10 Interval # 7
> LV(REG): At #12 Interval # 7
> LV(REG): At #13 Interval # 8
> LV(REG): At #14 Interval # 8
> LV(REG): At #15 Interval # 9
> LV(REG): At #16 Interval # 9
> LV(REG): At #17 Interval # 8
> LV(REG): At #19 Interval # 7
> LV(REG): At #20 Interval # 8
> LV(REG): At #21 Interval # 8
> LV(REG): At #22 Interval # 9
> LV(REG): At #23 Interval # 9
> LV(REG): At #24 Interval # 8
> LV(REG): At #26 Interval # 7
> LV(REG): At #27 Interval # 7
> LV(REG): At #28 Interval # 7
> LV(REG): At #29 Interval # 7
> LV(REG): At #30 Interval # 7
> LV(REG): At #31 Interval # 6
> LV(REG): At #33 Interval # 5
> LV(REG): VF = 2
> LV(REG): Found max usage: 2
> LV(REG): Found invariant usage: 4
> LV(REG): LoopSize: 35
> LV: Loop cost is 18
> LV: Interleaving to reduce branch cost.
> LV: Interleaving is not beneficial.
> LV: Found a vectorizable loop (2) in do_vop.bc
> LV: Interleaving disabled by the pass manager
> LV: Scalarizing:  %dec = add nsw i32 %len.0288, -1
> LV: Scalarizing:  %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290,
> i64 1
> LV: Scalarizing:  %incdec.ptr321 = getelementptr inbounds i64, i64*
> %rl.0289, i64 1
> LV: Scalarizing:  %incdec.ptr323 = getelementptr inbounds i64, i64*
> %dl.0291, i64 1
> LV: Scalarizing:  %incdec.ptr324 = getelementptr inbounds i64, i64*
> %ll.0290, i64 2
> LV: Scalarizing:  %incdec.ptr325 = getelementptr inbounds i64, i64*
> %rl.0289, i64 2
> LV: Scalarizing:  %incdec.ptr327 = getelementptr inbounds i64, i64*
> %dl.0291, i64 2
> LV: Scalarizing:  %incdec.ptr328 = getelementptr inbounds i64, i64*
> %ll.0290, i64 3
> LV: Scalarizing:  %incdec.ptr329 = getelementptr inbounds i64, i64*
> %rl.0289, i64 3
> LV: Scalarizing:  %incdec.ptr331 = getelementptr inbounds i64, i64*
> %dl.0291, i64 3
> LV: Scalarizing:  %incdec.ptr332 = getelementptr inbounds i64, i64*
> %ll.0290, i64 4
> LV: Scalarizing:  %incdec.ptr333 = getelementptr inbounds i64, i64*
> %rl.0289, i64 4
> LV: Scalarizing:  %incdec.ptr335 = getelementptr inbounds i64, i64*
> %dl.0291, i64 4
> LV: Scalarizing:  %tobool319 = icmp eq i32 %dec, 0
>
> vectorized loop (vectorization width: 2, interleaved count: 1)
>
> Loop after vectorize pass:
>
> vector.body419:                                   ; preds = %vector.body419, %vector.ph440
>   %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
>   %184 = add i64 %index441, 0
>   %185 = shl i64 %184, 2
>   %next.gep453 = getelementptr i64, i64* %73, i64 %185
>   %186 = add i64 %index441, 0
>   %187 = shl i64 %186, 2
>   %next.gep454 = getelementptr i64, i64* %74, i64 %187
>   %188 = add i64 %index441, 0
>   %189 = shl i64 %188, 2
>   %next.gep455 = getelementptr i64, i64* %75, i64 %189
>   %190 = trunc i64 %index441 to i32
>   %offset.idx456 = sub i32 %conv316, %190
>   %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
>   %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
>   %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
>   %191 = add i32 %offset.idx456, 0
>   %192 = add nsw i32 %191, -1
>   %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
>   %194 = getelementptr i64, i64* %next.gep454, i32 0
>   %195 = bitcast i64* %194 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
>   %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
>   %197 = getelementptr i64, i64* %next.gep455, i32 0
>   %198 = bitcast i64* %197 to <8 x i64>*
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
>   %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %199 = and <2 x i64> %strided.vec466, %strided.vec461
>   %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
>   %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
>   %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
>   %203 = and <2 x i64> %strided.vec467, %strided.vec462
>   %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
>   %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
>   %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
>   %207 = and <2 x i64> %strided.vec468, %strided.vec463
>   %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
>   %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
>   %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
>   %211 = and <2 x i64> %strided.vec469, %strided.vec464
>   %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
>   %213 = getelementptr i64, i64* %208, i32 -3
>   %214 = bitcast i64* %213 to <8 x i64>*
>   %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
>   %218 = icmp eq i32 %192, 0
>   %index.next442 = add i64 %index441, 2
>   %219 = icmp eq i64 %index.next442, %n.vec425
>   br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Loop after instruction combining:
>
> vector.body419:                                   ; preds = %vector.body419, %vector.body419.preheader
>   %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
>   %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
>   %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
>   %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
>   %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
>   %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
>   %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
>   %179 = and <8 x i64> %wide.vec465, %wide.vec460
>   %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %181 = and <8 x i64> %wide.vec465, %wide.vec460
>   %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %183 = and <8 x i64> %wide.vec465, %wide.vec460
>   %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %185 = and <8 x i64> %wide.vec465, %wide.vec460
>   %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
>   %lsr.iv.next55 = add i64 %lsr.iv54, -2
>   %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
>   %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
>   %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
>   %189 = icmp eq i64 %lsr.iv.next55, 0
>   br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Final vectorized loop:
>
> .LBB0_141:                              # %vector.body419
>                                         # =>This Inner Loop Header: Depth=1
>         vl      %v0, 48(%r8)
>         vl      %v1, 48(%r7)
>         vn      %v0, %v1, %v0
>         vl      %v1, 16(%r8)
>         vl      %v2, 16(%r7)
>         vn      %v1, %v2, %v1
>         vmrlg   %v2, %v1, %v0
>         vmrhg   %v0, %v1, %v0
>         vmrlg   %v1, %v0, %v2
>         vst     %v1, 48(%r9)
>         vl      %v1, 32(%r8)
>         vl      %v3, 32(%r7)
>         vn      %v1, %v3, %v1
>         vl      %v3, 0(%r8)
>         vl      %v4, 0(%r7)
>         vn      %v3, %v4, %v3
>         vmrlg   %v4, %v3, %v1
>         vmrhg   %v1, %v3, %v1
>         vmrlg   %v3, %v1, %v4
>         vst     %v3, 32(%r9)
>         vmrhg   %v0, %v0, %v2
>         vst     %v0, 16(%r9)
>         vmrhg   %v0, %v1, %v4
>         vst     %v0, 0(%r9)
>         la      %r9, 64(%r9)
>         la      %r8, 64(%r8)
>         la      %r7, 64(%r7)
>         aghi    %r13, -2
>         jne     .LBB0_141
>
> Final scalar loop:
> .LBB0_152:                              # %while.body320
>                                         # =>This Inner Loop Header: Depth=1
>         lg      %r13, 0(%r14)
>         ng      %r13, 0(%r5)
>         stg     %r13, 0(%r4)
>         lg      %r13, 8(%r14)
>         ng      %r13, 8(%r5)
>         stg     %r13, 8(%r4)
>         lg      %r13, 16(%r14)
>         ng      %r13, 16(%r5)
>         stg     %r13, 16(%r4)
>         lg      %r13, 24(%r14)
>         ng      %r13, 24(%r5)
>         stg     %r13, 24(%r4)
>         la      %r4, 32(%r4)
>         la      %r14, 32(%r14)
>         la      %r5, 32(%r5)
>         brct    %r0, .LBB0_152
>         j       .LBB0_155