[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence
Jonas Paulsson via llvm-dev
llvm-dev at lists.llvm.org
Fri Oct 7 02:07:28 PDT 2016
Hi Matt,
ok - see https://llvm.org/bugs/show_bug.cgi?id=30630.
/Jonas
On 2016-10-06 20:40, Matthew Simpson wrote:
> Hi Jonas,
>
> It does look like we should be able to simplify this. Would you mind
> filing a bug? Looking at the code after InstCombine, the vector ands
> are trivially redundant (I think EarlyCSE should already be able to
> remove them). I think we could then teach InstructionSimplify to
> simplify the remaining shuffles, similar to the way it already handles
> extracts.
>
> -- Matt
>
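To make the suggested shuffle simplification concrete, here is a minimal Python sketch of the idea: trace every lane of the final vector back to the lane it came from, and if the lanes turn out to be lanes 0..7 of one vector in order, the whole chain can be replaced by that vector. This is only an illustration and not how InstructionSimplify is implemented; the helper names (`source`, `shuffle`, `simplify`) are hypothetical, and `%and` stands for the single `and <8 x i64>` that would remain once the redundant copies are removed.

```python
# Sketch: fold a chain of shuffles by tracing every lane to its source.
# A vector is modelled as a list of (source_name, lane_index) pairs.
# All helper names here are hypothetical, chosen for illustration.


def source(name, width):
    """A vector whose lanes are lanes 0..width-1 of the named value."""
    return [(name, i) for i in range(width)]


def shuffle(a, b, mask):
    """Pick lanes from the concatenation of a and b (None models undef)."""
    lanes = a + (b if b is not None else [None] * len(a))
    return [lanes[i] for i in mask]


def simplify(vec, known):
    """Return the name of a known vector that vec is identical to, else None."""
    for name, width in known.items():
        if vec == source(name, width):
            return name
    return None


# Same masks as in the IR after instruction combining.
v = source("%and", 8)
parts = [shuffle(v, None, [k, k + 4]) for k in range(4)]
lo = shuffle(parts[0], parts[1], [0, 1, 2, 3])
hi = shuffle(parts[2], parts[3], [0, 1, 2, 3])
result = shuffle(lo, hi, [0, 2, 4, 6, 1, 3, 5, 7])

print(simplify(result, {"%and": 8}))
```

Under these assumptions the final vector is recognised as `%and` itself, so the store could use the result of the `and` directly.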
> On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> Hi,
>
> I have experimented with enabling the LoopVectorizer for SystemZ
> and have come across a loop which, when vectorized, seems to have
> been poorly generated. In short, there is a completely unnecessary
> sequence of shufflevector instructions that doesn't get optimized
> away anywhere. In other words, the shuffles simply lead back to the
> original vector:
>
> [0 1 2 3 4 5 6 7]
>
> [0 4] [1 5] [2 6] [3 7]
>
> [0 4 1 5] [2 6 3 7]
>
> [0 1 2 3 4 5 6 7]
>
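The round trip sketched above can be checked with a small Python model of shufflevector. This is an illustration only: `shufflevector` here is a plain list function written for this example, `UNDEF` is a stand-in for an undef operand, and the masks are the ones that appear in the vectorized loop further down. The elementwise `and` between the two halves is left out, since it acts on each lane independently and does not change the lane order.

```python
# Minimal model of LLVM's shufflevector on Python lists (illustration only).
UNDEF = None  # stands in for an undef operand


def shufflevector(a, b, mask):
    """Select lanes from the concatenation of a and b, as shufflevector does."""
    if b is UNDEF:
        b = [UNDEF] * len(a)
    lanes = a + b
    return [lanes[i] for i in mask]


wide = list(range(8))  # [0 1 2 3 4 5 6 7]

# De-interleave with stride 4: [0 4] [1 5] [2 6] [3 7]
strided = [shufflevector(wide, UNDEF, [k, k + 4]) for k in range(4)]

# Concatenate pairwise: [0 4 1 5] [2 6 3 7]
lo = shufflevector(strided[0], strided[1], [0, 1, 2, 3])
hi = shufflevector(strided[2], strided[3], [0, 1, 2, 3])

# Concatenate again, then re-interleave with the mask used before the store.
cat = shufflevector(lo, hi, list(range(8)))
interleaved = shufflevector(cat, UNDEF, [0, 2, 4, 6, 1, 3, 5, 7])

print(interleaved)
```

The last vector comes out as lanes 0 through 7 in their original order, which is why the whole shuffle sequence is a no-op.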
> Is this something the instruction combiner, or perhaps the
> InterleavedAccess pass, should handle? Even though I suspect that
> many SystemZ target hooks currently return bad values, this seems
> like something the optimizers should handle regardless. The result
> is unnecessary target instructions, as can be seen at the bottom.
>
> I would appreciate any input on this, and if needed I can supply a
> test case.
>
> /Jonas
>
>
> Loop before vectorize pass:
>
> while.body320: ; preds = %while.body320.preheader, %while.body320
> %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
> %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
> %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
> %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> %dec = add nsw i32 %len.0288, -1
> %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
> %176 = load i64, i64* %ll.0290, align 8
> %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> %177 = load i64, i64* %rl.0289, align 8
> %and322 = and i64 %177, %176
> %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> store i64 %and322, i64* %dl.0291, align 8
> %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> %178 = load i64, i64* %incdec.ptr, align 8
> %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> %179 = load i64, i64* %incdec.ptr321, align 8
> %and326 = and i64 %179, %178
> %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> store i64 %and326, i64* %incdec.ptr323, align 8
> %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> %180 = load i64, i64* %incdec.ptr324, align 8
> %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> %181 = load i64, i64* %incdec.ptr325, align 8
> %and330 = and i64 %181, %180
> %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> store i64 %and330, i64* %incdec.ptr327, align 8
> %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> %182 = load i64, i64* %incdec.ptr328, align 8
> %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> %183 = load i64, i64* %incdec.ptr329, align 8
> %and334 = and i64 %183, %182
> %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> store i64 %and334, i64* %incdec.ptr331, align 8
> %tobool319 = icmp eq i32 %dec, 0
> br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
>
>
> Vectorizing:
>
> LV: Checking a loop in "Perl_do_vop" from do_vop.bc
> LV: Loop hints: force=? width=0 unroll=0
> LV: Found a loop: while.body320
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Did not find one integer induction var.
> LV: We can vectorize this loop (with a runtime bound check)!
> LV: Analyzing interleaved accesses...
> LV: Creating an interleave group with: store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted: store i64 %and330, i64* %incdec.ptr327, align 8
> into the interleave group with store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted: store i64 %and326, i64* %incdec.ptr323, align 8
> into the interleave group with store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted: store i64 %and322, i64* %dl.0291, align 8
> into the interleave group with store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Creating an interleave group with: %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Inserted: %181 = load i64, i64* %incdec.ptr325, align 8
> into the interleave group with %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Inserted: %179 = load i64, i64* %incdec.ptr321, align 8
> into the interleave group with %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Inserted: %177 = load i64, i64* %rl.0289, align 8
> into the interleave group with %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Creating an interleave group with: %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Inserted: %180 = load i64, i64* %incdec.ptr324, align 8
> into the interleave group with %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Inserted: %178 = load i64, i64* %incdec.ptr, align 8
> into the interleave group with %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Inserted: %176 = load i64, i64* %ll.0290, align 8
> into the interleave group with %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Found uniform instruction: %tobool319 = icmp eq i32 %dec, 0
> LV: Found uniform instruction: %incdec.ptr324 = getelementptr
> inbounds i64, i64* %ll.0290, i64 2
> LV: Found uniform instruction: %incdec.ptr329 = getelementptr
> inbounds i64, i64* %rl.0289, i64 3
> LV: Found uniform instruction: %incdec.ptr323 = getelementptr
> inbounds i64, i64* %dl.0291, i64 1
> LV: Found uniform instruction: %incdec.ptr328 = getelementptr
> inbounds i64, i64* %ll.0290, i64 3
> LV: Found uniform instruction: %incdec.ptr321 = getelementptr
> inbounds i64, i64* %rl.0289, i64 1
> LV: Found uniform instruction: %incdec.ptr327 = getelementptr
> inbounds i64, i64* %dl.0291, i64 2
> LV: Found uniform instruction: %incdec.ptr325 = getelementptr
> inbounds i64, i64* %rl.0289, i64 2
> LV: Found uniform instruction: %incdec.ptr331 = getelementptr
> inbounds i64, i64* %dl.0291, i64 3
> LV: Found uniform instruction: %incdec.ptr = getelementptr
> inbounds i64, i64* %ll.0290, i64 1
> LV: Found uniform instruction: %dl.0291 = phi i64* [
> %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
> LV: Found uniform instruction: %incdec.ptr335 = getelementptr
> inbounds i64, i64* %dl.0291, i64 4
> LV: Found uniform instruction: %ll.0290 = phi i64* [
> %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
> LV: Found uniform instruction: %incdec.ptr332 = getelementptr
> inbounds i64, i64* %ll.0290, i64 4
> LV: Found uniform instruction: %rl.0289 = phi i64* [
> %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
> LV: Found uniform instruction: %incdec.ptr333 = getelementptr
> inbounds i64, i64* %rl.0289, i64 4
> LV: Found uniform instruction: %len.0288 = phi i32 [ %dec,
> %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found uniform instruction: %dec = add nsw i32 %len.0288, -1
> LV: Found trip count: 0
> LV: The Smallest and Widest types: 64 / 64 bits.
> LV: The Widest register is: 128 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
> %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 1 For instruction: %dec =
> add nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: %176 =
> load i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: %177 =
> load i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %and322 = and i64 %177, %176
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: store
> i64 %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction: %178 =
> load i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction: %179 =
> load i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %and326 = and i64 %179, %178
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction: store
> i64 %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction: %180 =
> load i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction: %181 =
> load i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %and330 = and i64 %181, %180
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction: store
> i64 %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction: %182 =
> load i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction: %183 =
> load i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %and334 = and i64 %183, %182
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction: store
> i64 %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %tobool319 = icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 1 For instruction: br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Scalar loop costs: 18.
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
> %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 2 For instruction: %dec =
> add nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction: %176 =
> load i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction: %177 =
> load i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %and322 = and i64 %177, %176
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 0 for VF 2 For instruction: store
> i64 %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: %178 =
> load i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: %179 =
> load i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %and326 = and i64 %179, %178
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: store
> i64 %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: %180 =
> load i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: %181 =
> load i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %and330 = and i64 %181, %180
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: store
> i64 %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction: %182 =
> load i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction: %183 =
> load i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %and334 = and i64 %183, %182
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 4 for VF 2 For instruction: store
> i64 %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %tobool319 = icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 2 For instruction: br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Vector loop of width 2 costs: 9.
> LV: Selecting VF: 2.
> LV: The target has 32 registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #4 Interval # 4
> LV(REG): At #5 Interval # 4
> LV(REG): At #6 Interval # 5
> LV(REG): At #7 Interval # 6
> LV(REG): At #8 Interval # 7
> LV(REG): At #9 Interval # 8
> LV(REG): At #10 Interval # 7
> LV(REG): At #12 Interval # 7
> LV(REG): At #13 Interval # 8
> LV(REG): At #14 Interval # 8
> LV(REG): At #15 Interval # 9
> LV(REG): At #16 Interval # 9
> LV(REG): At #17 Interval # 8
> LV(REG): At #19 Interval # 7
> LV(REG): At #20 Interval # 8
> LV(REG): At #21 Interval # 8
> LV(REG): At #22 Interval # 9
> LV(REG): At #23 Interval # 9
> LV(REG): At #24 Interval # 8
> LV(REG): At #26 Interval # 7
> LV(REG): At #27 Interval # 7
> LV(REG): At #28 Interval # 7
> LV(REG): At #29 Interval # 7
> LV(REG): At #30 Interval # 7
> LV(REG): At #31 Interval # 6
> LV(REG): At #33 Interval # 5
> LV(REG): VF = 2
> LV(REG): Found max usage: 2
> LV(REG): Found invariant usage: 4
> LV(REG): LoopSize: 35
> LV: Loop cost is 18
> LV: Interleaving to reduce branch cost.
> LV: Interleaving is not beneficial.
> LV: Found a vectorizable loop (2) in do_vop.bc
> LV: Interleaving disabled by the pass manager
> LV: Scalarizing: %dec = add nsw i32 %len.0288, -1
> LV: Scalarizing: %incdec.ptr = getelementptr inbounds i64, i64*
> %ll.0290, i64 1
> LV: Scalarizing: %incdec.ptr321 = getelementptr inbounds i64,
> i64* %rl.0289, i64 1
> LV: Scalarizing: %incdec.ptr323 = getelementptr inbounds i64,
> i64* %dl.0291, i64 1
> LV: Scalarizing: %incdec.ptr324 = getelementptr inbounds i64,
> i64* %ll.0290, i64 2
> LV: Scalarizing: %incdec.ptr325 = getelementptr inbounds i64,
> i64* %rl.0289, i64 2
> LV: Scalarizing: %incdec.ptr327 = getelementptr inbounds i64,
> i64* %dl.0291, i64 2
> LV: Scalarizing: %incdec.ptr328 = getelementptr inbounds i64,
> i64* %ll.0290, i64 3
> LV: Scalarizing: %incdec.ptr329 = getelementptr inbounds i64,
> i64* %rl.0289, i64 3
> LV: Scalarizing: %incdec.ptr331 = getelementptr inbounds i64,
> i64* %dl.0291, i64 3
> LV: Scalarizing: %incdec.ptr332 = getelementptr inbounds i64,
> i64* %ll.0290, i64 4
> LV: Scalarizing: %incdec.ptr333 = getelementptr inbounds i64,
> i64* %rl.0289, i64 4
> LV: Scalarizing: %incdec.ptr335 = getelementptr inbounds i64,
> i64* %dl.0291, i64 4
> LV: Scalarizing: %tobool319 = icmp eq i32 %dec, 0
>
> vectorized loop (vectorization width: 2, interleaved count: 1)
>
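The two loop costs in the debug output above can be reproduced by hand. The sketch below sums the per-instruction costs copied from the log and divides by the vectorization factor; that division is an assumption about how the reported figure is derived, not something the log states, and the dictionary keys are just labels chosen for this example.

```python
# Per-instruction costs copied from the debug output; instructions with
# cost 0 (phis, getelementptrs, the branch) are omitted.
scalar_costs = {
    "add (%dec)": 1,
    "load x8": 8 * 1,
    "and x4": 4 * 1,
    "store x4": 4 * 1,
    "icmp": 1,
}

# At VF 2 each interleave group is charged once, on a single member;
# the other members of the group show up with cost 0 in the log.
vf2_costs = {
    "add (%dec)": 1,
    "load group (charged on %176)": 4,
    "load group (charged on %177)": 4,
    "and x4": 4 * 1,
    "store group (charged on the last store)": 4,
    "icmp": 1,
}


def loop_cost(costs, vf):
    # Assumption: the reported loop cost is the summed cost divided by VF.
    return sum(costs.values()) / vf


scalar = loop_cost(scalar_costs, 1)
vector = loop_cost(vf2_costs, 2)
print(scalar, vector)
```

With these numbers the scalar loop comes out at 18 and the VF 2 loop at 9, matching the "Scalar loop costs: 18" and "Vector loop of width 2 costs: 9" lines.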
> Loop after vectorize pass:
>
> vector.body419: ; preds = %vector.body419, %vector.ph440
> %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
> %184 = add i64 %index441, 0
> %185 = shl i64 %184, 2
> %next.gep453 = getelementptr i64, i64* %73, i64 %185
> %186 = add i64 %index441, 0
> %187 = shl i64 %186, 2
> %next.gep454 = getelementptr i64, i64* %74, i64 %187
> %188 = add i64 %index441, 0
> %189 = shl i64 %188, 2
> %next.gep455 = getelementptr i64, i64* %75, i64 %189
> %190 = trunc i64 %index441 to i32
> %offset.idx456 = sub i32 %conv316, %190
> %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
> %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
> %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
> %191 = add i32 %offset.idx456, 0
> %192 = add nsw i32 %191, -1
> %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
> %194 = getelementptr i64, i64* %next.gep454, i32 0
> %195 = bitcast i64* %194 to <8 x i64>*
> %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
> %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
> %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
> %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
> %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
> %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
> %197 = getelementptr i64, i64* %next.gep455, i32 0
> %198 = bitcast i64* %197 to <8 x i64>*
> %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
> %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
> %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
> %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
> %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
> %199 = and <2 x i64> %strided.vec466, %strided.vec461
> %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
> %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
> %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
> %203 = and <2 x i64> %strided.vec467, %strided.vec462
> %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
> %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
> %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
> %207 = and <2 x i64> %strided.vec468, %strided.vec463
> %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
> %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
> %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
> %211 = and <2 x i64> %strided.vec469, %strided.vec464
> %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
> %213 = getelementptr i64, i64* %208, i32 -3
> %214 = bitcast i64* %213 to <8 x i64>*
> %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
> %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
> store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
> %218 = icmp eq i32 %192, 0
> %index.next442 = add i64 %index441, 2
> %219 = icmp eq i64 %index.next442, %n.vec425
> br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Loop after instruction combining:
>
> vector.body419: ; preds = %vector.body419, %vector.body419.preheader
> %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
> %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
> %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
> %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
> %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
> %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
> %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
> %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
> %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
> %179 = and <8 x i64> %wide.vec465, %wide.vec460
> %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
> %181 = and <8 x i64> %wide.vec465, %wide.vec460
> %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
> %183 = and <8 x i64> %wide.vec465, %wide.vec460
> %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
> %185 = and <8 x i64> %wide.vec465, %wide.vec460
> %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
> %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
> %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
> store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
> %lsr.iv.next55 = add i64 %lsr.iv54, -2
> %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
> %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
> %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
> %189 = icmp eq i64 %lsr.iv.next55, 0
> br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
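The loop above computes the same `and <8 x i64>` four times. As a rough illustration of how common-subexpression elimination would collapse these, here is a small Python sketch over a simplified instruction list. It is not EarlyCSE's actual implementation; the tuple encoding and the `cse` function are assumptions made for this example, and the shuffle masks are kept as opaque strings.

```python
# Sketch of common-subexpression elimination on a tiny SSA-like list.
# Each instruction is (result, opcode, operands); names follow the IR above.
insts = [
    ("%179", "and", ("%wide.vec465", "%wide.vec460")),
    ("%180", "shuffle", ("%179", "<0, 4>")),
    ("%181", "and", ("%wide.vec465", "%wide.vec460")),
    ("%182", "shuffle", ("%181", "<1, 5>")),
    ("%183", "and", ("%wide.vec465", "%wide.vec460")),
    ("%184", "shuffle", ("%183", "<2, 6>")),
    ("%185", "and", ("%wide.vec465", "%wide.vec460")),
    ("%186", "shuffle", ("%185", "<3, 7>")),
]


def cse(insts):
    available = {}  # (opcode, operands) -> first result that computes it
    replaced = {}   # redundant result -> surviving result
    kept = []
    for result, opcode, operands in insts:
        # Rewrite operands that refer to an already eliminated instruction.
        operands = tuple(replaced.get(op, op) for op in operands)
        key = (opcode, operands)
        if key in available:
            replaced[result] = available[key]
        else:
            available[key] = result
            kept.append((result, opcode, operands))
    return kept


for inst in cse(insts):
    print(inst)
```

After this step a single `and` remains and all four shuffles read from `%179`, which leaves only the shuffle chain to be simplified.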
> Final vectorized loop:
>
> .LBB0_141: # %vector.body419
> # =>This Inner Loop Header: Depth=1
> vl %v0, 48(%r8)
> vl %v1, 48(%r7)
> vn %v0, %v1, %v0
> vl %v1, 16(%r8)
> vl %v2, 16(%r7)
> vn %v1, %v2, %v1
> vmrlg %v2, %v1, %v0
> vmrhg %v0, %v1, %v0
> vmrlg %v1, %v0, %v2
> vst %v1, 48(%r9)
> vl %v1, 32(%r8)
> vl %v3, 32(%r7)
> vn %v1, %v3, %v1
> vl %v3, 0(%r8)
> vl %v4, 0(%r7)
> vn %v3, %v4, %v3
> vmrlg %v4, %v3, %v1
> vmrhg %v1, %v3, %v1
> vmrlg %v3, %v1, %v4
> vst %v3, 32(%r9)
> vmrhg %v0, %v0, %v2
> vst %v0, 16(%r9)
> vmrhg %v0, %v1, %v4
> vst %v0, 0(%r9)
> la %r9, 64(%r9)
> la %r8, 64(%r8)
> la %r7, 64(%r7)
> aghi %r13, -2
> jne .LBB0_141
>
> Final scalar loop:
> .LBB0_152: # %while.body320
> # =>This Inner Loop Header: Depth=1
> lg %r13, 0(%r14)
> ng %r13, 0(%r5)
> stg %r13, 0(%r4)
> lg %r13, 8(%r14)
> ng %r13, 8(%r5)
> stg %r13, 8(%r4)
> lg %r13, 16(%r14)
> ng %r13, 16(%r5)
> stg %r13, 16(%r4)
> lg %r13, 24(%r14)
> ng %r13, 24(%r5)
> stg %r13, 24(%r4)
> la %r4, 32(%r4)
> la %r14, 32(%r14)
> la %r5, 32(%r5)
> brct %r0, .LBB0_152
> j .LBB0_155
>