[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence

Fri Oct 7 02:07:28 PDT 2016

Hi Matt,

ok - see https://llvm.org/bugs/show_bug.cgi?id=30630.

/Jonas

On 2016-10-06 20:40, Matthew Simpson wrote:
> Hi Jonas,
>
> It does look like we should be able to simplify this. Would you mind
> filing a bug? Looking at the code after InstCombine, the vector ands
> are trivially redundant (I think EarlyCSE should already be able to
> remove them). I think we could then teach InstructionSimplify to
> simplify the remaining shuffles, similar to the way it already handles
> extracts.
>
> -- Matt
>
> On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
>     Hi,
>
>     I have experimented with enabling the LoopVectorizer for SystemZ.
>     I have come across a loop which, when vectorized, seems to have
>     been poorly generated. In short, there seems to be a completely
>     unnecessary sequence of shufflevector instructions that doesn't
>     get optimized away anywhere. In other words, the shuffles
>     lead straight back to the original vector:
>
>            [0 1 2 3 4 5 6 7]
>
>      [0 4]   [1 5]   [2 6]   [3 7]
>
>        [0 4 1 5]       [2 6 3 7]
>
>            [0 1 2 3 4 5 6 7]
>
>     Is this something the instruction combiner, or perhaps the
>     InterleavedAccess pass, should handle? Even though I suspect that
>     there are currently many target hooks for SystemZ that return bad
>     values, this seems like something that the optimizers should
>     handle regardless. The result is unnecessary target
>     instructions, as can be seen at the bottom.
>
>     I would appreciate any input on this, and if needed I can supply a
>     test case.
>
>     /Jonas
>
>
>     Loop before vectorize pass:
>
>     while.body320:                                    ; preds = %while.body320.preheader, %while.body320
>       %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
>       %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
>       %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
>       %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
>       %dec = add nsw i32 %len.0288, -1
>       %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>       %176 = load i64, i64* %ll.0290, align 8
>       %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>       %177 = load i64, i64* %rl.0289, align 8
>       %and322 = and i64 %177, %176
>       %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>       store i64 %and322, i64* %dl.0291, align 8
>       %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>       %178 = load i64, i64* %incdec.ptr, align 8
>       %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>       %179 = load i64, i64* %incdec.ptr321, align 8
>       %and326 = and i64 %179, %178
>       %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>       store i64 %and326, i64* %incdec.ptr323, align 8
>       %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>       %180 = load i64, i64* %incdec.ptr324, align 8
>       %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>       %181 = load i64, i64* %incdec.ptr325, align 8
>       %and330 = and i64 %181, %180
>       %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>       store i64 %and330, i64* %incdec.ptr327, align 8
>       %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>       %182 = load i64, i64* %incdec.ptr328, align 8
>       %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>       %183 = load i64, i64* %incdec.ptr329, align 8
>       %and334 = and i64 %183, %182
>       %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>       store i64 %and334, i64* %incdec.ptr331, align 8
>       %tobool319 = icmp eq i32 %dec, 0
>       br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
>
>
>     Vectorizing:
>
>     LV: Checking a loop in "Perl_do_vop" from do_vop.bc
>     LV: Loop hints: force=? width=0 unroll=0
>     LV: Found a loop: while.body320
>     LV: Found an induction variable.
>     LV: Found an induction variable.
>     LV: Found an induction variable.
>     LV: Found an induction variable.
>     LV: Did not find one integer induction var.
>     LV: We can vectorize this loop (with a runtime bound check)!
>     LV: Analyzing interleaved accesses...
>     LV: Creating an interleave group with:  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Inserted:  store i64 %and330, i64* %incdec.ptr327, align 8
>         into the interleave group with  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Inserted:  store i64 %and326, i64* %incdec.ptr323, align 8
>         into the interleave group with  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Inserted:  store i64 %and322, i64* %dl.0291, align 8
>         into the interleave group with  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Creating an interleave group with:  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Inserted:  %181 = load i64, i64* %incdec.ptr325, align 8
>         into the interleave group with  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Inserted:  %179 = load i64, i64* %incdec.ptr321, align 8
>         into the interleave group with  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Inserted:  %177 = load i64, i64* %rl.0289, align 8
>         into the interleave group with  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Creating an interleave group with:  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Inserted:  %180 = load i64, i64* %incdec.ptr324, align 8
>         into the interleave group with  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Inserted:  %178 = load i64, i64* %incdec.ptr, align 8
>         into the interleave group with  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Inserted:  %176 = load i64, i64* %ll.0290, align 8
>         into the interleave group with  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Found uniform instruction:   %tobool319 = icmp eq i32 %dec, 0
>     LV: Found uniform instruction:   %incdec.ptr324 = getelementptr
>     inbounds i64, i64* %ll.0290, i64 2
>     LV: Found uniform instruction:   %incdec.ptr329 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 3
>     LV: Found uniform instruction:   %incdec.ptr323 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 1
>     LV: Found uniform instruction:   %incdec.ptr328 = getelementptr
>     inbounds i64, i64* %ll.0290, i64 3
>     LV: Found uniform instruction:   %incdec.ptr321 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 1
>     LV: Found uniform instruction:   %incdec.ptr327 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 2
>     LV: Found uniform instruction:   %incdec.ptr325 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 2
>     LV: Found uniform instruction:   %incdec.ptr331 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 3
>     LV: Found uniform instruction:   %incdec.ptr = getelementptr
>     inbounds i64, i64* %ll.0290, i64 1
>     LV: Found uniform instruction:   %dl.0291 = phi i64* [
>     %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
>     LV: Found uniform instruction:   %incdec.ptr335 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 4
>     LV: Found uniform instruction:   %ll.0290 = phi i64* [
>     %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
>     LV: Found uniform instruction:   %incdec.ptr332 = getelementptr
>     inbounds i64, i64* %ll.0290, i64 4
>     LV: Found uniform instruction:   %rl.0289 = phi i64* [
>     %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
>     LV: Found uniform instruction:   %incdec.ptr333 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 4
>     LV: Found uniform instruction:   %len.0288 = phi i32 [ %dec,
>     %while.body320 ], [ %conv316, %while.body320.preheader ]
>     LV: Found uniform instruction:   %dec = add nsw i32 %len.0288, -1
>     LV: Found trip count: 0
>     LV: The Smallest and Widest types: 64 / 64 bits.
>     LV: The Widest register is: 128 bits.
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %dec =
>     add nsw i32 %len.0288, -1
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %176 =
>     load i64, i64* %ll.0290, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %177 =
>     load i64, i64* %rl.0289, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and322 = and i64 %177, %176
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and322, i64* %dl.0291, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %178 =
>     load i64, i64* %incdec.ptr, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %179 =
>     load i64, i64* %incdec.ptr321, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and326 = and i64 %179, %178
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and326, i64* %incdec.ptr323, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %180 =
>     load i64, i64* %incdec.ptr324, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %181 =
>     load i64, i64* %incdec.ptr325, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and330 = and i64 %181, %180
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and330, i64* %incdec.ptr327, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %182 =
>     load i64, i64* %incdec.ptr328, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %183 =
>     load i64, i64* %incdec.ptr329, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and334 = and i64 %183, %182
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and334, i64* %incdec.ptr331, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %tobool319 = icmp eq i32 %dec, 0
>     LV: Found an estimated cost of 0 for VF 1 For instruction:  br i1
>     %tobool319, label %sw.epilog381.loopexit, label %while.body320
>     LV: Scalar loop costs: 18.
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 1 for VF 2 For instruction:  %dec =
>     add nsw i32 %len.0288, -1
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>     LV: Found an estimated cost of 4 for VF 2 For instruction:  %176 =
>     load i64, i64* %ll.0290, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>     LV: Found an estimated cost of 4 for VF 2 For instruction:  %177 =
>     load i64, i64* %rl.0289, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and322 = and i64 %177, %176
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  store
>     i64 %and322, i64* %dl.0291, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  %178 =
>     load i64, i64* %incdec.ptr, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  %179 =
>     load i64, i64* %incdec.ptr321, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and326 = and i64 %179, %178
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  store
>     i64 %and326, i64* %incdec.ptr323, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  %180 =
>     load i64, i64* %incdec.ptr324, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  %181 =
>     load i64, i64* %incdec.ptr325, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and330 = and i64 %181, %180
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  store
>     i64 %and330, i64* %incdec.ptr327, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  %182 =
>     load i64, i64* %incdec.ptr328, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  %183 =
>     load i64, i64* %incdec.ptr329, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and334 = and i64 %183, %182
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>     LV: Found an estimated cost of 4 for VF 2 For instruction:  store
>     i64 %and334, i64* %incdec.ptr331, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %tobool319 = icmp eq i32 %dec, 0
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  br i1
>     %tobool319, label %sw.epilog381.loopexit, label %while.body320
>     LV: Vector loop of width 2 costs: 9.
>     LV: Selecting VF: 2.
>     LV: The target has 32 registers
>     LV(REG): Calculating max register usage:
>     LV(REG): At #0 Interval # 0
>     LV(REG): At #1 Interval # 1
>     LV(REG): At #2 Interval # 2
>     LV(REG): At #3 Interval # 3
>     LV(REG): At #4 Interval # 4
>     LV(REG): At #5 Interval # 4
>     LV(REG): At #6 Interval # 5
>     LV(REG): At #7 Interval # 6
>     LV(REG): At #8 Interval # 7
>     LV(REG): At #9 Interval # 8
>     LV(REG): At #10 Interval # 7
>     LV(REG): At #12 Interval # 7
>     LV(REG): At #13 Interval # 8
>     LV(REG): At #14 Interval # 8
>     LV(REG): At #15 Interval # 9
>     LV(REG): At #16 Interval # 9
>     LV(REG): At #17 Interval # 8
>     LV(REG): At #19 Interval # 7
>     LV(REG): At #20 Interval # 8
>     LV(REG): At #21 Interval # 8
>     LV(REG): At #22 Interval # 9
>     LV(REG): At #23 Interval # 9
>     LV(REG): At #24 Interval # 8
>     LV(REG): At #26 Interval # 7
>     LV(REG): At #27 Interval # 7
>     LV(REG): At #28 Interval # 7
>     LV(REG): At #29 Interval # 7
>     LV(REG): At #30 Interval # 7
>     LV(REG): At #31 Interval # 6
>     LV(REG): At #33 Interval # 5
>     LV(REG): VF = 2
>     LV(REG): Found max usage: 2
>     LV(REG): Found invariant usage: 4
>     LV(REG): LoopSize: 35
>     LV: Loop cost is 18
>     LV: Interleaving to reduce branch cost.
>     LV: Interleaving is not beneficial.
>     LV: Found a vectorizable loop (2) in do_vop.bc
>     LV: Interleaving disabled by the pass manager
>     LV: Scalarizing:  %dec = add nsw i32 %len.0288, -1
>     LV: Scalarizing:  %incdec.ptr = getelementptr inbounds i64, i64*
>     %ll.0290, i64 1
>     LV: Scalarizing:  %incdec.ptr321 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 1
>     LV: Scalarizing:  %incdec.ptr323 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 1
>     LV: Scalarizing:  %incdec.ptr324 = getelementptr inbounds i64,
>     i64* %ll.0290, i64 2
>     LV: Scalarizing:  %incdec.ptr325 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 2
>     LV: Scalarizing:  %incdec.ptr327 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 2
>     LV: Scalarizing:  %incdec.ptr328 = getelementptr inbounds i64,
>     i64* %ll.0290, i64 3
>     LV: Scalarizing:  %incdec.ptr329 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 3
>     LV: Scalarizing:  %incdec.ptr331 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 3
>     LV: Scalarizing:  %incdec.ptr332 = getelementptr inbounds i64,
>     i64* %ll.0290, i64 4
>     LV: Scalarizing:  %incdec.ptr333 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 4
>     LV: Scalarizing:  %incdec.ptr335 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 4
>     LV: Scalarizing:  %tobool319 = icmp eq i32 %dec, 0
>
>     vectorized loop (vectorization width: 2, interleaved count: 1)
>
>     Loop after vectorize pass:
>
>     vector.body419:                                   ; preds = %vector.body419, %vector.ph440
>       %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
>       %184 = add i64 %index441, 0
>       %185 = shl i64 %184, 2
>       %next.gep453 = getelementptr i64, i64* %73, i64 %185
>       %186 = add i64 %index441, 0
>       %187 = shl i64 %186, 2
>       %next.gep454 = getelementptr i64, i64* %74, i64 %187
>       %188 = add i64 %index441, 0
>       %189 = shl i64 %188, 2
>       %next.gep455 = getelementptr i64, i64* %75, i64 %189
>       %190 = trunc i64 %index441 to i32
>       %offset.idx456 = sub i32 %conv316, %190
>       %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
>       %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
>       %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
>       %191 = add i32 %offset.idx456, 0
>       %192 = add nsw i32 %191, -1
>       %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
>       %194 = getelementptr i64, i64* %next.gep454, i32 0
>       %195 = bitcast i64* %194 to <8 x i64>*
>       %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
>       %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>       %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>       %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>       %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>       %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
>       %197 = getelementptr i64, i64* %next.gep455, i32 0
>       %198 = bitcast i64* %197 to <8 x i64>*
>       %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
>       %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>       %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>       %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>       %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>       %199 = and <2 x i64> %strided.vec466, %strided.vec461
>       %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
>       %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
>       %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
>       %203 = and <2 x i64> %strided.vec467, %strided.vec462
>       %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
>       %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
>       %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
>       %207 = and <2 x i64> %strided.vec468, %strided.vec463
>       %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
>       %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
>       %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
>       %211 = and <2 x i64> %strided.vec469, %strided.vec464
>       %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
>       %213 = getelementptr i64, i64* %208, i32 -3
>       %214 = bitcast i64* %213 to <8 x i64>*
>       %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>       %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>       %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>       %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>       store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
>       %218 = icmp eq i32 %192, 0
>       %index.next442 = add i64 %index441, 2
>       %219 = icmp eq i64 %index.next442, %n.vec425
>       br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
>     Loop after instruction combining:
>
>     vector.body419:                                   ; preds = %vector.body419, %vector.body419.preheader
>       %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
>       %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
>       %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
>       %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
>       %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
>       %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
>       %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
>       %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
>       %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
>       %179 = and <8 x i64> %wide.vec465, %wide.vec460
>       %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>       %181 = and <8 x i64> %wide.vec465, %wide.vec460
>       %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>       %183 = and <8 x i64> %wide.vec465, %wide.vec460
>       %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>       %185 = and <8 x i64> %wide.vec465, %wide.vec460
>       %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>       %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>       %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>       %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>       store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
>       %lsr.iv.next55 = add i64 %lsr.iv54, -2
>       %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
>       %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
>       %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
>       %189 = icmp eq i64 %lsr.iv.next55, 0
>       br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
>     Final vectorized loop:
>
>     .LBB0_141:                              # %vector.body419
>                                             # =>This Inner Loop Header: Depth=1
>             vl      %v0, 48(%r8)
>             vl      %v1, 48(%r7)
>             vn      %v0, %v1, %v0
>             vl      %v1, 16(%r8)
>             vl      %v2, 16(%r7)
>             vn      %v1, %v2, %v1
>             vmrlg   %v2, %v1, %v0
>             vmrhg   %v0, %v1, %v0
>             vmrlg   %v1, %v0, %v2
>             vst     %v1, 48(%r9)
>             vl      %v1, 32(%r8)
>             vl      %v3, 32(%r7)
>             vn      %v1, %v3, %v1
>             vl      %v3, 0(%r8)
>             vl      %v4, 0(%r7)
>             vn      %v3, %v4, %v3
>             vmrlg   %v4, %v3, %v1
>             vmrhg   %v1, %v3, %v1
>             vmrlg   %v3, %v1, %v4
>             vst     %v3, 32(%r9)
>             vmrhg   %v0, %v0, %v2
>             vst     %v0, 16(%r9)
>             vmrhg   %v0, %v1, %v4
>             vst     %v0, 0(%r9)
>             la      %r9, 64(%r9)
>             la      %r8, 64(%r8)
>             la      %r7, 64(%r7)
>             aghi    %r13, -2
>             jne     .LBB0_141
>
>     Final scalar loop:
>     .LBB0_152:                              # %while.body320
>                                             # =>This Inner Loop Header: Depth=1
>             lg      %r13, 0(%r14)
>             ng      %r13, 0(%r5)
>             stg     %r13, 0(%r4)
>             lg      %r13, 8(%r14)
>             ng      %r13, 8(%r5)
>             stg     %r13, 8(%r4)
>             lg      %r13, 16(%r14)
>             ng      %r13, 16(%r5)
>             stg     %r13, 16(%r4)
>             lg      %r13, 24(%r14)
>             ng      %r13, 24(%r5)
>             stg     %r13, 24(%r4)
>             la      %r4, 32(%r4)
>             la      %r14, 32(%r14)
>             la      %r5, 32(%r5)
>             brct    %r0, .LBB0_152
>             j       .LBB0_155
>
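As a sanity check on the shuffle sequence in the quoted report, here is a minimal Python sketch that models shufflevector on plain lists and replays the masks from the post-InstCombine IR. The helper names (`shufflevector`, `roundtrip`) are made up for illustration and are not LLVM APIs; the lane labels 0..7 simply stand in for the eight i64 elements. Under those assumptions, composing the masks should give back the input order, which is the property a shuffle simplification would need to prove.

```python
def shufflevector(a, b, mask):
    """Model LLVM's shufflevector: each mask index selects a lane from a + b."""
    lanes = list(a) + list(b)
    return [lanes[i] for i in mask]


def roundtrip(wide):
    """Replay the de-interleave / concatenate / re-interleave masks on 8 lanes."""
    undef = [None] * len(wide)  # stands in for the undef second operand
    # Four stride-4 extracts, mirroring %180, %182, %184 and %186.
    strided = [shufflevector(wide, undef, [k, k + 4]) for k in range(4)]
    # Pairwise concatenation, mirroring %187 and %188.
    lo = shufflevector(strided[0], strided[1], [0, 1, 2, 3])
    hi = shufflevector(strided[2], strided[3], [0, 1, 2, 3])
    # Final interleave, mirroring %interleaved.vec470.
    return lo, hi, shufflevector(lo, hi, [0, 2, 4, 6, 1, 3, 5, 7])


lo, hi, result = roundtrip(list(range(8)))
print(lo, hi, result)
```

The intermediate values `lo` and `hi` correspond to the [0 4 1 5] and [2 6 3 7] rows of the diagram in the report. Because the `and` is element-wise, it commutes with these lane permutations, so the model only needs to track lane positions.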

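The cost figures in the quoted vectorizer log can be re-added by hand. The sketch below is only arithmetic over the nonzero "estimated cost" lines; the table layout and the function name `cost_per_scalar_iteration` are hypothetical. It assumes the reported VF 2 figure is the summed cost divided by the vectorization factor, which is consistent with the totals in the log ("Scalar loop costs: 18" and "Vector loop of width 2 costs: 9").

```python
# Nonzero costs from the log, as instruction kind -> (number of entries, cost each).
# At VF 2 each interleave group is charged once (cost 4) and its other members cost 0.
VF1_COSTS = {"add": (1, 1), "load": (8, 1), "and": (4, 1), "store": (4, 1), "icmp": (1, 1)}
VF2_COSTS = {"add": (1, 1), "load": (2, 4), "and": (4, 1), "store": (1, 4), "icmp": (1, 1)}


def cost_per_scalar_iteration(costs, vf):
    """Sum the per-instruction costs and normalize by the vectorization factor."""
    total = sum(count * each for count, each in costs.values())
    return total / vf


print(cost_per_scalar_iteration(VF1_COSTS, 1))
print(cost_per_scalar_iteration(VF2_COSTS, 2))
```

On this reading, the interleaved loads and stores are what make VF 2 look cheap (three groups at cost 4 each), while the shuffles that end up in the final code are not separately visible in these numbers.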