[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence

Thu Oct 6 07:30:03 PDT 2016

Hi,

I have experimented with enabling the LoopVectorizer for SystemZ and have 
come across a loop which, when vectorized, seems to have been poorly 
generated. In short, there is a completely unnecessary sequence of 
shufflevector instructions that does not get optimized away anywhere. 
The shuffles de-interleave the vector and then re-interleave it, which 
leads straight back to the original vector:

        [0 1 2 3 4 5 6 7]

  [0 4]   [1 5]   [2 6]   [3 7]

    [0 4 1 5]       [2 6 3 7]

        [0 1 2 3 4 5 6 7]

Is this something the instruction combiner, or perhaps the 
InterleavedAccess pass, should handle? Even though I suspect that many 
of the SystemZ target hooks currently return bad values, this seems like 
something the optimizers should handle regardless. The result is 
unnecessary target instructions, as can be seen at the bottom.
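
The round trip in the diagram can be checked mechanically. The following is a minimal Python sketch, not LLVM code: it models shufflevector on plain lists and replays the masks from the vectorized loop further down. The helper name `shufflevector` and the list model are assumptions made for illustration, and undef operands are modelled as None. Since `and` is element-wise, it does not change which lane ends up where, so only the lane positions are tracked.

```python
def shufflevector(a, b, mask):
    """Toy model of LLVM's shufflevector: index into the concatenation a ++ b."""
    if b is None:                      # second operand is undef
        b = [None] * len(a)
    both = a + b
    return [both[i] for i in mask]

wide = list(range(8))                  # lane positions of the <8 x i64> value

# De-interleave into four <2 x i64> vectors (the strided.vec* values).
strided = [shufflevector(wide, None, [k, k + 4]) for k in range(4)]

# Concatenate pairwise, then once more (%215, %216 and %217 in the IR).
lo = shufflevector(strided[0], strided[1], [0, 1, 2, 3])
hi = shufflevector(strided[2], strided[3], [0, 1, 2, 3])
cat = shufflevector(lo, hi, list(range(8)))

# Final interleave, using the mask of %interleaved.vec470.
result = shufflevector(cat, None, [0, 2, 4, 6, 1, 3, 5, 7])
print(result)                          # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this model the stored vector has every lane back in its original position, so the whole chain is an identity permutation.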

I would appreciate any input on this, and if needed I can supply a test 
case.

/Jonas

Loop before vectorize pass:

while.body320:                                    ; preds = %while.body320.preheader, %while.body320
   %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, 
%while.body320.preheader ]
   %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, 
%while.body320.preheader ]
   %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, 
%while.body320.preheader ]
   %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, 
%while.body320.preheader ]
   %dec = add nsw i32 %len.0288, -1
   %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
   %176 = load i64, i64* %ll.0290, align 8
   %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
   %177 = load i64, i64* %rl.0289, align 8
   %and322 = and i64 %177, %176
   %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
   store i64 %and322, i64* %dl.0291, align 8
   %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
   %178 = load i64, i64* %incdec.ptr, align 8
   %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
   %179 = load i64, i64* %incdec.ptr321, align 8
   %and326 = and i64 %179, %178
   %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
   store i64 %and326, i64* %incdec.ptr323, align 8
   %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
   %180 = load i64, i64* %incdec.ptr324, align 8
   %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
   %181 = load i64, i64* %incdec.ptr325, align 8
   %and330 = and i64 %181, %180
   %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
   store i64 %and330, i64* %incdec.ptr327, align 8
   %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
   %182 = load i64, i64* %incdec.ptr328, align 8
   %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
   %183 = load i64, i64* %incdec.ptr329, align 8
   %and334 = and i64 %183, %182
   %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
   store i64 %and334, i64* %incdec.ptr331, align 8
   %tobool319 = icmp eq i32 %dec, 0
   br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320

Vectorizing:

LV: Checking a loop in "Perl_do_vop" from do_vop.bc
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: while.body320
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Did not find one integer induction var.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Analyzing interleaved accesses...
LV: Creating an interleave group with:  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Inserted:  store i64 %and330, i64* %incdec.ptr327, align 8
     into the interleave group with  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Inserted:  store i64 %and326, i64* %incdec.ptr323, align 8
     into the interleave group with  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Inserted:  store i64 %and322, i64* %dl.0291, align 8
     into the interleave group with  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Creating an interleave group with:  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Inserted:  %181 = load i64, i64* %incdec.ptr325, align 8
     into the interleave group with  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Inserted:  %179 = load i64, i64* %incdec.ptr321, align 8
     into the interleave group with  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Inserted:  %177 = load i64, i64* %rl.0289, align 8
     into the interleave group with  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Creating an interleave group with:  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Inserted:  %180 = load i64, i64* %incdec.ptr324, align 8
     into the interleave group with  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Inserted:  %178 = load i64, i64* %incdec.ptr, align 8
     into the interleave group with  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Inserted:  %176 = load i64, i64* %ll.0290, align 8
     into the interleave group with  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Found uniform instruction:   %tobool319 = icmp eq i32 %dec, 0
LV: Found uniform instruction:   %incdec.ptr324 = getelementptr inbounds 
i64, i64* %ll.0290, i64 2
LV: Found uniform instruction:   %incdec.ptr329 = getelementptr inbounds 
i64, i64* %rl.0289, i64 3
LV: Found uniform instruction:   %incdec.ptr323 = getelementptr inbounds 
i64, i64* %dl.0291, i64 1
LV: Found uniform instruction:   %incdec.ptr328 = getelementptr inbounds 
i64, i64* %ll.0290, i64 3
LV: Found uniform instruction:   %incdec.ptr321 = getelementptr inbounds 
i64, i64* %rl.0289, i64 1
LV: Found uniform instruction:   %incdec.ptr327 = getelementptr inbounds 
i64, i64* %dl.0291, i64 2
LV: Found uniform instruction:   %incdec.ptr325 = getelementptr inbounds 
i64, i64* %rl.0289, i64 2
LV: Found uniform instruction:   %incdec.ptr331 = getelementptr inbounds 
i64, i64* %dl.0291, i64 3
LV: Found uniform instruction:   %incdec.ptr = getelementptr inbounds 
i64, i64* %ll.0290, i64 1
LV: Found uniform instruction:   %dl.0291 = phi i64* [ %incdec.ptr335, 
%while.body320 ], [ %73, %while.body320.preheader ]
LV: Found uniform instruction:   %incdec.ptr335 = getelementptr inbounds 
i64, i64* %dl.0291, i64 4
LV: Found uniform instruction:   %ll.0290 = phi i64* [ %incdec.ptr332, 
%while.body320 ], [ %74, %while.body320.preheader ]
LV: Found uniform instruction:   %incdec.ptr332 = getelementptr inbounds 
i64, i64* %ll.0290, i64 4
LV: Found uniform instruction:   %rl.0289 = phi i64* [ %incdec.ptr333, 
%while.body320 ], [ %75, %while.body320.preheader ]
LV: Found uniform instruction:   %incdec.ptr333 = getelementptr inbounds 
i64, i64* %rl.0289, i64 4
LV: Found uniform instruction:   %len.0288 = phi i32 [ %dec, 
%while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found uniform instruction:   %dec = add nsw i32 %len.0288, -1
LV: Found trip count: 0
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %dl.0291 = 
phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %ll.0290 = 
phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %rl.0289 = 
phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %len.0288 = 
phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found an estimated cost of 1 for VF 1 For instruction:   %dec = add 
nsw i32 %len.0288, -1
LV: Found an estimated cost of 0 for VF 1 For instruction:   %incdec.ptr 
= getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %176 = load 
i64, i64* %ll.0290, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %177 = load 
i64, i64* %rl.0289, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and322 = 
and i64 %177, %176
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and322, i64* %dl.0291, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction:   %178 = load 
i64, i64* %incdec.ptr, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction:   %179 = load 
i64, i64* %incdec.ptr321, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and326 = 
and i64 %179, %178
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and326, i64* %incdec.ptr323, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %180 = load 
i64, i64* %incdec.ptr324, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %181 = load 
i64, i64* %incdec.ptr325, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and330 = 
and i64 %181, %180
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and330, i64* %incdec.ptr327, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   %182 = load 
i64, i64* %incdec.ptr328, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   %183 = load 
i64, i64* %incdec.ptr329, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and334 = 
and i64 %183, %182
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and334, i64* %incdec.ptr331, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %tobool319 
= icmp eq i32 %dec, 0
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 
%tobool319, label %sw.epilog381.loopexit, label %while.body320
LV: Scalar loop costs: 18.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %dl.0291 = 
phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %ll.0290 = 
phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %rl.0289 = 
phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %len.0288 = 
phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found an estimated cost of 1 for VF 2 For instruction:   %dec = add 
nsw i32 %len.0288, -1
LV: Found an estimated cost of 0 for VF 2 For instruction:   %incdec.ptr 
= getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found an estimated cost of 4 for VF 2 For instruction:   %176 = load 
i64, i64* %ll.0290, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found an estimated cost of 4 for VF 2 For instruction:   %177 = load 
i64, i64* %rl.0289, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and322 = 
and i64 %177, %176
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64 
%and322, i64* %dl.0291, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction:   %178 = load 
i64, i64* %incdec.ptr, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction:   %179 = load 
i64, i64* %incdec.ptr321, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and326 = 
and i64 %179, %178
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64 
%and326, i64* %incdec.ptr323, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   %180 = load 
i64, i64* %incdec.ptr324, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   %181 = load 
i64, i64* %incdec.ptr325, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and330 = 
and i64 %181, %180
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64 
%and330, i64* %incdec.ptr327, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found an estimated cost of 0 for VF 2 For instruction:   %182 = load 
i64, i64* %incdec.ptr328, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found an estimated cost of 0 for VF 2 For instruction:   %183 = load 
i64, i64* %incdec.ptr329, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and334 = 
and i64 %183, %182
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found an estimated cost of 4 for VF 2 For instruction:   store i64 
%and334, i64* %incdec.ptr331, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %tobool319 
= icmp eq i32 %dec, 0
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 
%tobool319, label %sw.epilog381.loopexit, label %while.body320
LV: Vector loop of width 2 costs: 9.
LV: Selecting VF: 2.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 4
LV(REG): At #5 Interval # 4
LV(REG): At #6 Interval # 5
LV(REG): At #7 Interval # 6
LV(REG): At #8 Interval # 7
LV(REG): At #9 Interval # 8
LV(REG): At #10 Interval # 7
LV(REG): At #12 Interval # 7
LV(REG): At #13 Interval # 8
LV(REG): At #14 Interval # 8
LV(REG): At #15 Interval # 9
LV(REG): At #16 Interval # 9
LV(REG): At #17 Interval # 8
LV(REG): At #19 Interval # 7
LV(REG): At #20 Interval # 8
LV(REG): At #21 Interval # 8
LV(REG): At #22 Interval # 9
LV(REG): At #23 Interval # 9
LV(REG): At #24 Interval # 8
LV(REG): At #26 Interval # 7
LV(REG): At #27 Interval # 7
LV(REG): At #28 Interval # 7
LV(REG): At #29 Interval # 7
LV(REG): At #30 Interval # 7
LV(REG): At #31 Interval # 6
LV(REG): At #33 Interval # 5
LV(REG): VF = 2
LV(REG): Found max usage: 2
LV(REG): Found invariant usage: 4
LV(REG): LoopSize: 35
LV: Loop cost is 18
LV: Interleaving to reduce branch cost.
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (2) in do_vop.bc
LV: Interleaving disabled by the pass manager
LV: Scalarizing:  %dec = add nsw i32 %len.0288, -1
LV: Scalarizing:  %incdec.ptr = getelementptr inbounds i64, i64* 
%ll.0290, i64 1
LV: Scalarizing:  %incdec.ptr321 = getelementptr inbounds i64, i64* 
%rl.0289, i64 1
LV: Scalarizing:  %incdec.ptr323 = getelementptr inbounds i64, i64* 
%dl.0291, i64 1
LV: Scalarizing:  %incdec.ptr324 = getelementptr inbounds i64, i64* 
%ll.0290, i64 2
LV: Scalarizing:  %incdec.ptr325 = getelementptr inbounds i64, i64* 
%rl.0289, i64 2
LV: Scalarizing:  %incdec.ptr327 = getelementptr inbounds i64, i64* 
%dl.0291, i64 2
LV: Scalarizing:  %incdec.ptr328 = getelementptr inbounds i64, i64* 
%ll.0290, i64 3
LV: Scalarizing:  %incdec.ptr329 = getelementptr inbounds i64, i64* 
%rl.0289, i64 3
LV: Scalarizing:  %incdec.ptr331 = getelementptr inbounds i64, i64* 
%dl.0291, i64 3
LV: Scalarizing:  %incdec.ptr332 = getelementptr inbounds i64, i64* 
%ll.0290, i64 4
LV: Scalarizing:  %incdec.ptr333 = getelementptr inbounds i64, i64* 
%rl.0289, i64 4
LV: Scalarizing:  %incdec.ptr335 = getelementptr inbounds i64, i64* 
%dl.0291, i64 4
LV: Scalarizing:  %tobool319 = icmp eq i32 %dec, 0

vectorized loop (vectorization width: 2, interleaved count: 1)
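
As a reading aid for the log above, here is a small Python sketch that re-adds the non-zero per-instruction costs. The grouping into dictionary keys and the function name are made up for illustration; the numbers are copied from the log. Dividing the sum by VF is an assumption about how the totals are derived, but it is consistent with the reported 18 and 9.

```python
# Non-zero per-instruction costs copied from the LV log above.
scalar_costs = {"add": 1, "loads": 8 * 1, "ands": 4 * 1, "stores": 4 * 1, "icmp": 1}

# For VF 2 each interleave group is charged once (cost 4); its other
# members show up with cost 0 in the log.
vf2_costs = {"add": 1, "load groups": 2 * 4, "ands": 4 * 1, "store group": 4, "icmp": 1}

def cost_per_scalar_iteration(costs, vf):
    """Sum the instruction costs and normalise by the vectorization factor."""
    return sum(costs.values()) / vf

scalar = cost_per_scalar_iteration(scalar_costs, 1)
vector = cost_per_scalar_iteration(vf2_costs, 2)
print(scalar, vector)                  # 18.0 9.0
```

Read this way, the VF 2 loop body is estimated at the same total cost as the scalar body while doing twice the work, which is why VF 2 is selected.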

Loop after vectorize pass:

vector.body419:                                   ; preds = %vector.body419, %vector.ph440
   %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, 
%vector.body419 ]
   %184 = add i64 %index441, 0
   %185 = shl i64 %184, 2
   %next.gep453 = getelementptr i64, i64* %73, i64 %185
   %186 = add i64 %index441, 0
   %187 = shl i64 %186, 2
   %next.gep454 = getelementptr i64, i64* %74, i64 %187
   %188 = add i64 %index441, 0
   %189 = shl i64 %188, 2
   %next.gep455 = getelementptr i64, i64* %75, i64 %189
   %190 = trunc i64 %index441 to i32
   %offset.idx456 = sub i32 %conv316, %190
   %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 
%offset.idx456, i32 0
   %broadcast.splat458 = shufflevector <2 x i32> 
%broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
   %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
   %191 = add i32 %offset.idx456, 0
   %192 = add nsw i32 %191, -1
   %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
   %194 = getelementptr i64, i64* %next.gep454, i32 0
   %195 = bitcast i64* %194 to <8 x i64>*
   %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
   %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> 
undef, <2 x i32> <i32 0, i32 4>
   %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> 
undef, <2 x i32> <i32 1, i32 5>
   %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> 
undef, <2 x i32> <i32 2, i32 6>
   %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> 
undef, <2 x i32> <i32 3, i32 7>
   %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
   %197 = getelementptr i64, i64* %next.gep455, i32 0
   %198 = bitcast i64* %197 to <8 x i64>*
   %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
   %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> 
undef, <2 x i32> <i32 0, i32 4>
   %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> 
undef, <2 x i32> <i32 1, i32 5>
   %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> 
undef, <2 x i32> <i32 2, i32 6>
   %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> 
undef, <2 x i32> <i32 3, i32 7>
   %199 = and <2 x i64> %strided.vec466, %strided.vec461
   %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
   %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
   %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
   %203 = and <2 x i64> %strided.vec467, %strided.vec462
   %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
   %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
   %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
   %207 = and <2 x i64> %strided.vec468, %strided.vec463
   %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
   %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
   %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
   %211 = and <2 x i64> %strided.vec469, %strided.vec464
   %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
   %213 = getelementptr i64, i64* %208, i32 -3
   %214 = bitcast i64* %213 to <8 x i64>*
   %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 
0, i32 1, i32 2, i32 3>
   %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 
0, i32 1, i32 2, i32 3>
   %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 
0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, 
<8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
   store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, 
!alias.scope !26, !noalias !28
   %218 = icmp eq i32 %192, 0
   %index.next442 = add i64 %index441, 2
   %219 = icmp eq i64 %index.next442, %n.vec425
   br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29

Loop after instruction combining (the lsr.iv/scevgep names show that loop 
strength reduction has also run):

vector.body419:                                   ; preds = %vector.body419, %vector.body419.preheader
   %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, 
%vector.body419.preheader ]
   %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, 
%vector.body419.preheader ]
   %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, 
%vector.body419.preheader ]
   %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, 
%vector.body419.preheader ]
   %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
   %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
   %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
   %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, 
!alias.scope !21
   %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, 
!alias.scope !24
   %179 = and <8 x i64> %wide.vec465, %wide.vec460
   %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 
0, i32 4>
   %181 = and <8 x i64> %wide.vec465, %wide.vec460
   %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 
1, i32 5>
   %183 = and <8 x i64> %wide.vec465, %wide.vec460
   %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 
2, i32 6>
   %185 = and <8 x i64> %wide.vec465, %wide.vec460
   %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 
3, i32 7>
   %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 
0, i32 1, i32 2, i32 3>
   %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 
0, i32 1, i32 2, i32 3>
   %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, 
<8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
   store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, 
!alias.scope !26, !noalias !28
   %lsr.iv.next55 = add i64 %lsr.iv54, -2
   %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
   %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
   %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
   %189 = icmp eq i64 %lsr.iv.next55, 0
   br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29

Final vectorized loop:

.LBB0_141:                              # %vector.body419
                                         # =>This Inner Loop Header: Depth=1
         vl      %v0, 48(%r8)
         vl      %v1, 48(%r7)
         vn      %v0, %v1, %v0
         vl      %v1, 16(%r8)
         vl      %v2, 16(%r7)
         vn      %v1, %v2, %v1
         vmrlg   %v2, %v1, %v0
         vmrhg   %v0, %v1, %v0
         vmrlg   %v1, %v0, %v2
         vst     %v1, 48(%r9)
         vl      %v1, 32(%r8)
         vl      %v3, 32(%r7)
         vn      %v1, %v3, %v1
         vl      %v3, 0(%r8)
         vl      %v4, 0(%r7)
         vn      %v3, %v4, %v3
         vmrlg   %v4, %v3, %v1
         vmrhg   %v1, %v3, %v1
         vmrlg   %v3, %v1, %v4
         vst     %v3, 32(%r9)
         vmrhg   %v0, %v0, %v2
         vst     %v0, 16(%r9)
         vmrhg   %v0, %v1, %v4
         vst     %v0, 0(%r9)
         la      %r9, 64(%r9)
         la      %r8, 64(%r8)
         la      %r7, 64(%r7)
         aghi    %r13, -2
         jne     .LBB0_141
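
To see that the eight merge instructions above only undo each other, the following Python sketch replays them on lane indices. It assumes that vmrhg produces [a[0], b[0]] and vmrlg produces [a[1], b[1]] for source operands a and b, with the destination as the first assembly operand. The helper `anded` is hypothetical and stands for the result of one vl/vl/vn triple.

```python
def vmrhg(a, b):
    """Assumed semantics: merge high doublewords -> [a[0], b[0]]."""
    return [a[0], b[0]]

def vmrlg(a, b):
    """Assumed semantics: merge low doublewords -> [a[1], b[1]]."""
    return [a[1], b[1]]

def anded(offset):
    """A 16-byte access at byte offset `offset` covers i64 lanes offset/8 and offset/8 + 1."""
    k = offset // 8
    return [k, k + 1]

stored = {}

v0 = anded(48)            # vl/vl/vn at 48
v1 = anded(16)            # vl/vl/vn at 16
v2 = vmrlg(v1, v0)
v0 = vmrhg(v1, v0)
v1 = vmrlg(v0, v2)
stored[48] = v1           # vst %v1, 48(%r9)
v1 = anded(32)            # vl/vl/vn at 32
v3 = anded(0)             # vl/vl/vn at 0
v4 = vmrlg(v3, v1)
v1 = vmrhg(v3, v1)
v3 = vmrlg(v1, v4)
stored[32] = v3           # vst %v3, 32(%r9)
v0 = vmrhg(v0, v2)
stored[16] = v0           # vst %v0, 16(%r9)
v0 = vmrhg(v1, v4)
stored[0] = v0            # vst %v0, 0(%r9)

for offset in sorted(stored):
    print(offset, stored[offset], stored[offset] == anded(offset))
```

Under these assumptions every store writes exactly the lanes that the vn at the same offset produced, so each vn result could have been stored directly without any merges.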

Final scalar loop:

.LBB0_152:                              # %while.body320
                                         # =>This Inner Loop Header: Depth=1
         lg      %r13, 0(%r14)
         ng      %r13, 0(%r5)
         stg     %r13, 0(%r4)
         lg      %r13, 8(%r14)
         ng      %r13, 8(%r5)
         stg     %r13, 8(%r4)
         lg      %r13, 16(%r14)
         ng      %r13, 16(%r5)
         stg     %r13, 16(%r4)
         lg      %r13, 24(%r14)
         ng      %r13, 24(%r5)
         stg     %r13, 24(%r4)
         la      %r4, 32(%r4)
         la      %r14, 32(%r14)
         la      %r5, 32(%r5)
         brct    %r0, .LBB0_152
         j       .LBB0_155