[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence
Matthew Simpson via llvm-dev
llvm-dev at lists.llvm.org
Thu Oct 6 11:40:00 PDT 2016
Hi Jonas,
It does look like we should be able to simplify this. Would you mind filing
a bug? Looking at the code after InstCombine, the vector adds are trivially
redundant (I think EarlyCSE should already be able to remove them). I think
we could then teach InstructionSimplify to simplify the remaining shuffles
similar to the way it already handles extracts.
-- Matt
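
As a quick sanity check on the claim that the remaining shuffles cancel out, here is a minimal Python sketch (not LLVM code) that replays the masks from the post-InstCombine IR quoted below on lane numbers 0..7. The shufflevector() helper and the UNDEF placeholder are illustrative names, assumed only for this model; the masks themselves are copied from the quoted IR.

```python
def shufflevector(a, b, mask):
    """Model of LLVM's shufflevector: index into the concatenation of a and b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]

UNDEF = [None] * 8  # hypothetical stand-in for the undef second operand

# Lane numbers of the <8 x i64> value produced by the wide 'and'.
v = list(range(8))

# De-interleave into four <2 x i64> vectors (stride 4).
s0 = shufflevector(v, UNDEF, [0, 4])
s1 = shufflevector(v, UNDEF, [1, 5])
s2 = shufflevector(v, UNDEF, [2, 6])
s3 = shufflevector(v, UNDEF, [3, 7])

# Concatenate pairwise into two <4 x i64> vectors.
lo = shufflevector(s0, s1, [0, 1, 2, 3])
hi = shufflevector(s2, s3, [0, 1, 2, 3])

# Re-interleave into the <8 x i64> value that gets stored.
result = shufflevector(lo, hi, [0, 2, 4, 6, 1, 3, 5, 7])

print(lo, hi)   # the intermediate [0 4 1 5] [2 6 3 7] step
print(result)   # lanes come back in their original order
```

In this model the composed permutation is the identity, so the store could in principle use the result of the wide 'and' directly. That relies on the four 'and' instructions being the same value, which is what removing the redundant ones would establish.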
On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
> I have experimented with enabling the LoopVectorizer for SystemZ and have
> come across a loop that seems to be poorly generated when vectorized. In
> short, there seems to be a completely unnecessary sequence of shufflevector
> instructions that doesn't get optimized away anywhere. In other words, the
> shuffles simply lead back to the original vector:
>
> [0 1 2 3 4 5 6 7]
>
> [0 4] [1 5] [2 6] [3 7]
>
> [0 4 1 5] [2 6 3 7]
>
> [0 1 2 3 4 5 6 7]
>
> Is this something the instruction combiner, or perhaps the
> InterleavedAccess pass, should handle? Even though I suspect that many
> SystemZ target hooks currently return bad values, this seems like
> something that the optimizers should handle regardless. The result is
> unnecessary target instructions, as can be seen at the bottom.
>
> I would appreciate any input on this, and if needed I can supply a test
> case.
>
> /Jonas
>
>
> Loop before vectorize pass:
>
> while.body320: ; preds = %while.body320.preheader, %while.body320
> %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
> %while.body320.preheader ]
> %dec = add nsw i32 %len.0288, -1
> %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
> %176 = load i64, i64* %ll.0290, align 8
> %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> %177 = load i64, i64* %rl.0289, align 8
> %and322 = and i64 %177, %176
> %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> store i64 %and322, i64* %dl.0291, align 8
> %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> %178 = load i64, i64* %incdec.ptr, align 8
> %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> %179 = load i64, i64* %incdec.ptr321, align 8
> %and326 = and i64 %179, %178
> %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> store i64 %and326, i64* %incdec.ptr323, align 8
> %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> %180 = load i64, i64* %incdec.ptr324, align 8
> %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> %181 = load i64, i64* %incdec.ptr325, align 8
> %and330 = and i64 %181, %180
> %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> store i64 %and330, i64* %incdec.ptr327, align 8
> %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> %182 = load i64, i64* %incdec.ptr328, align 8
> %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> %183 = load i64, i64* %incdec.ptr329, align 8
> %and334 = and i64 %183, %182
> %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> store i64 %and334, i64* %incdec.ptr331, align 8
> %tobool319 = icmp eq i32 %dec, 0
> br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
>
>
> Vectorizing:
>
> LV: Checking a loop in "Perl_do_vop" from do_vop.bc
> LV: Loop hints: force=? width=0 unroll=0
> LV: Found a loop: while.body320
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Did not find one integer induction var.
> LV: We can vectorize this loop (with a runtime bound check)!
> LV: Analyzing interleaved accesses...
> LV: Creating an interleave group with: store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted: store i64 %and330, i64* %incdec.ptr327, align 8
> into the interleave group with store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted: store i64 %and326, i64* %incdec.ptr323, align 8
> into the interleave group with store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted: store i64 %and322, i64* %dl.0291, align 8
> into the interleave group with store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Creating an interleave group with: %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Inserted: %181 = load i64, i64* %incdec.ptr325, align 8
> into the interleave group with %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Inserted: %179 = load i64, i64* %incdec.ptr321, align 8
> into the interleave group with %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Inserted: %177 = load i64, i64* %rl.0289, align 8
> into the interleave group with %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Creating an interleave group with: %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Inserted: %180 = load i64, i64* %incdec.ptr324, align 8
> into the interleave group with %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Inserted: %178 = load i64, i64* %incdec.ptr, align 8
> into the interleave group with %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Inserted: %176 = load i64, i64* %ll.0290, align 8
> into the interleave group with %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Found uniform instruction: %tobool319 = icmp eq i32 %dec, 0
> LV: Found uniform instruction: %incdec.ptr324 = getelementptr inbounds
> i64, i64* %ll.0290, i64 2
> LV: Found uniform instruction: %incdec.ptr329 = getelementptr inbounds
> i64, i64* %rl.0289, i64 3
> LV: Found uniform instruction: %incdec.ptr323 = getelementptr inbounds
> i64, i64* %dl.0291, i64 1
> LV: Found uniform instruction: %incdec.ptr328 = getelementptr inbounds
> i64, i64* %ll.0290, i64 3
> LV: Found uniform instruction: %incdec.ptr321 = getelementptr inbounds
> i64, i64* %rl.0289, i64 1
> LV: Found uniform instruction: %incdec.ptr327 = getelementptr inbounds
> i64, i64* %dl.0291, i64 2
> LV: Found uniform instruction: %incdec.ptr325 = getelementptr inbounds
> i64, i64* %rl.0289, i64 2
> LV: Found uniform instruction: %incdec.ptr331 = getelementptr inbounds
> i64, i64* %dl.0291, i64 3
> LV: Found uniform instruction: %incdec.ptr = getelementptr inbounds i64,
> i64* %ll.0290, i64 1
> LV: Found uniform instruction: %dl.0291 = phi i64* [ %incdec.ptr335,
> %while.body320 ], [ %73, %while.body320.preheader ]
> LV: Found uniform instruction: %incdec.ptr335 = getelementptr inbounds
> i64, i64* %dl.0291, i64 4
> LV: Found uniform instruction: %ll.0290 = phi i64* [ %incdec.ptr332,
> %while.body320 ], [ %74, %while.body320.preheader ]
> LV: Found uniform instruction: %incdec.ptr332 = getelementptr inbounds
> i64, i64* %ll.0290, i64 4
> LV: Found uniform instruction: %rl.0289 = phi i64* [ %incdec.ptr333,
> %while.body320 ], [ %75, %while.body320.preheader ]
> LV: Found uniform instruction: %incdec.ptr333 = getelementptr inbounds
> i64, i64* %rl.0289, i64 4
> LV: Found uniform instruction: %len.0288 = phi i32 [ %dec,
> %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found uniform instruction: %dec = add nsw i32 %len.0288, -1
> LV: Found trip count: 0
> LV: The Smallest and Widest types: 64 / 64 bits.
> LV: The Widest register is: 128 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction: %dl.0291 =
> phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction: %ll.0290 =
> phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction: %rl.0289 =
> phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction: %len.0288 =
> phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 1 For instruction: %dec = add
> nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr =
> getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: %176 = load
> i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: %177 = load
> i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction: %and322 = and
> i64 %177, %176
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction: store i64
> %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction: %178 = load
> i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction: %179 = load
> i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction: %and326 = and
> i64 %179, %178
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction: store i64
> %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction: %180 = load
> i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction: %181 = load
> i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction: %and330 = and
> i64 %181, %180
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction: store i64
> %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction: %182 = load
> i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction: %183 = load
> i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction: %and334 = and
> i64 %183, %182
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction: store i64
> %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction: %tobool319 =
> icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 1 For instruction: br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Scalar loop costs: 18.
> LV: Found an estimated cost of 0 for VF 2 For instruction: %dl.0291 =
> phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction: %ll.0290 =
> phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction: %rl.0289 =
> phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction: %len.0288 =
> phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 2 For instruction: %dec = add
> nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr =
> getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction: %176 = load
> i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction: %177 = load
> i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and322 = and
> i64 %177, %176
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 0 for VF 2 For instruction: store i64
> %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: %178 = load
> i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: %179 = load
> i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and326 = and
> i64 %179, %178
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: store i64
> %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: %180 = load
> i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: %181 = load
> i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and330 = and
> i64 %181, %180
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: store i64
> %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction: %182 = load
> i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction: %183 = load
> i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and334 = and
> i64 %183, %182
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 4 for VF 2 For instruction: store i64
> %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %tobool319 =
> icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 2 For instruction: br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Vector loop of width 2 costs: 9.
> LV: Selecting VF: 2.
> LV: The target has 32 registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #4 Interval # 4
> LV(REG): At #5 Interval # 4
> LV(REG): At #6 Interval # 5
> LV(REG): At #7 Interval # 6
> LV(REG): At #8 Interval # 7
> LV(REG): At #9 Interval # 8
> LV(REG): At #10 Interval # 7
> LV(REG): At #12 Interval # 7
> LV(REG): At #13 Interval # 8
> LV(REG): At #14 Interval # 8
> LV(REG): At #15 Interval # 9
> LV(REG): At #16 Interval # 9
> LV(REG): At #17 Interval # 8
> LV(REG): At #19 Interval # 7
> LV(REG): At #20 Interval # 8
> LV(REG): At #21 Interval # 8
> LV(REG): At #22 Interval # 9
> LV(REG): At #23 Interval # 9
> LV(REG): At #24 Interval # 8
> LV(REG): At #26 Interval # 7
> LV(REG): At #27 Interval # 7
> LV(REG): At #28 Interval # 7
> LV(REG): At #29 Interval # 7
> LV(REG): At #30 Interval # 7
> LV(REG): At #31 Interval # 6
> LV(REG): At #33 Interval # 5
> LV(REG): VF = 2
> LV(REG): Found max usage: 2
> LV(REG): Found invariant usage: 4
> LV(REG): LoopSize: 35
> LV: Loop cost is 18
> LV: Interleaving to reduce branch cost.
> LV: Interleaving is not beneficial.
> LV: Found a vectorizable loop (2) in do_vop.bc
> LV: Interleaving disabled by the pass manager
> LV: Scalarizing: %dec = add nsw i32 %len.0288, -1
> LV: Scalarizing: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290,
> i64 1
> LV: Scalarizing: %incdec.ptr321 = getelementptr inbounds i64, i64*
> %rl.0289, i64 1
> LV: Scalarizing: %incdec.ptr323 = getelementptr inbounds i64, i64*
> %dl.0291, i64 1
> LV: Scalarizing: %incdec.ptr324 = getelementptr inbounds i64, i64*
> %ll.0290, i64 2
> LV: Scalarizing: %incdec.ptr325 = getelementptr inbounds i64, i64*
> %rl.0289, i64 2
> LV: Scalarizing: %incdec.ptr327 = getelementptr inbounds i64, i64*
> %dl.0291, i64 2
> LV: Scalarizing: %incdec.ptr328 = getelementptr inbounds i64, i64*
> %ll.0290, i64 3
> LV: Scalarizing: %incdec.ptr329 = getelementptr inbounds i64, i64*
> %rl.0289, i64 3
> LV: Scalarizing: %incdec.ptr331 = getelementptr inbounds i64, i64*
> %dl.0291, i64 3
> LV: Scalarizing: %incdec.ptr332 = getelementptr inbounds i64, i64*
> %ll.0290, i64 4
> LV: Scalarizing: %incdec.ptr333 = getelementptr inbounds i64, i64*
> %rl.0289, i64 4
> LV: Scalarizing: %incdec.ptr335 = getelementptr inbounds i64, i64*
> %dl.0291, i64 4
> LV: Scalarizing: %tobool319 = icmp eq i32 %dec, 0
>
> vectorized loop (vectorization width: 2, interleaved count: 1)
>
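
The two cost totals in the debug output above can be cross-checked by hand. The following Python sketch sums the nonzero per-instruction costs transcribed from the log and divides by the vectorization factor. Dividing the total by VF is an assumption about how the printed figure is derived; it is consistent with the "Scalar loop costs: 18" and "Vector loop of width 2 costs: 9" lines, but the sketch is not taken from the vectorizer's source. All names are illustrative.

```python
# Nonzero costs transcribed from the
# "LV: Found an estimated cost of N for VF M" lines in the quoted log.

# VF 1: %dec, 8 loads, 4 ands, 4 stores, %tobool319
costs_vf1 = [1] + [1] * 8 + [1] * 4 + [1] * 4 + [1]

# VF 2: %dec, one load per interleave group (cost 4 each, the other
# group members are listed with cost 0), 4 ands, one store carrying
# the store group's cost of 4, %tobool319
costs_vf2 = [1] + [4] * 2 + [1] * 4 + [4] + [1]

def cost_per_scalar_iteration(costs, vf):
    # Assumption: total cost divided by VF, truncated for printing.
    return int(sum(costs) / vf)

scalar_cost = cost_per_scalar_iteration(costs_vf1, 1)
vector_cost = cost_per_scalar_iteration(costs_vf2, 2)
print(scalar_cost, vector_cost)
```

Both loop bodies sum to the same raw total in this reading, so VF 2 comes out ahead only because that total is spread over two scalar iterations. Nothing in these figures appears to account for the extra shuffles that end up in the final code.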
> Loop after vectorize pass:
>
> vector.body419: ; preds = %vector.body419, %vector.ph440
> %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442,
> %vector.body419 ]
> %184 = add i64 %index441, 0
> %185 = shl i64 %184, 2
> %next.gep453 = getelementptr i64, i64* %73, i64 %185
> %186 = add i64 %index441, 0
> %187 = shl i64 %186, 2
> %next.gep454 = getelementptr i64, i64* %74, i64 %187
> %188 = add i64 %index441, 0
> %189 = shl i64 %188, 2
> %next.gep455 = getelementptr i64, i64* %75, i64 %189
> %190 = trunc i64 %index441 to i32
> %offset.idx456 = sub i32 %conv316, %190
> %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32
> %offset.idx456, i32 0
> %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457,
> <2 x i32> undef, <2 x i32> zeroinitializer
> %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
> %191 = add i32 %offset.idx456, 0
> %192 = add nsw i32 %191, -1
> %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
> %194 = getelementptr i64, i64* %next.gep454, i32 0
> %195 = bitcast i64* %194 to <8 x i64>*
> %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
> %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 0, i32 4>
> %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 1, i32 5>
> %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 2, i32 6>
> %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 3, i32 7>
> %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
> %197 = getelementptr i64, i64* %next.gep455, i32 0
> %198 = bitcast i64* %197 to <8 x i64>*
> %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
> %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 0, i32 4>
> %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 1, i32 5>
> %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 2, i32 6>
> %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 3, i32 7>
> %199 = and <2 x i64> %strided.vec466, %strided.vec461
> %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
> %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
> %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
> %203 = and <2 x i64> %strided.vec467, %strided.vec462
> %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
> %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
> %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
> %207 = and <2 x i64> %strided.vec468, %strided.vec463
> %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
> %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
> %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
> %211 = and <2 x i64> %strided.vec469, %strided.vec464
> %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
> %213 = getelementptr i64, i64* %208, i32 -3
> %214 = bitcast i64* %213 to <8 x i64>*
> %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
> %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
> %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0,
> i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
> %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8
> x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
> store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8,
> !alias.scope !26, !noalias !28
> %218 = icmp eq i32 %192, 0
> %index.next442 = add i64 %index441, 2
> %219 = icmp eq i64 %index.next442, %n.vec425
> br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Loop after instruction combining:
>
> vector.body419: ; preds = %vector.body419, %vector.body419.preheader
> %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2,
> %vector.body419.preheader ]
> %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond,
> %vector.body419.preheader ]
> %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57,
> %vector.body419.preheader ]
> %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425,
> %vector.body419.preheader ]
> %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
> %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
> %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
> %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8,
> !alias.scope !21
> %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8,
> !alias.scope !24
> %179 = and <8 x i64> %wide.vec465, %wide.vec460
> %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0,
> i32 4>
> %181 = and <8 x i64> %wide.vec465, %wide.vec460
> %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1,
> i32 5>
> %183 = and <8 x i64> %wide.vec465, %wide.vec460
> %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2,
> i32 6>
> %185 = and <8 x i64> %wide.vec465, %wide.vec460
> %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3,
> i32 7>
> %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
> %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
> %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x
> i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
> store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8,
> !alias.scope !26, !noalias !28
> %lsr.iv.next55 = add i64 %lsr.iv54, -2
> %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
> %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
> %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
> %189 = icmp eq i64 %lsr.iv.next55, 0
> br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Final vectorized loop:
>
> .LBB0_141: # %vector.body419
> # =>This Inner Loop Header: Depth=1
> vl %v0, 48(%r8)
> vl %v1, 48(%r7)
> vn %v0, %v1, %v0
> vl %v1, 16(%r8)
> vl %v2, 16(%r7)
> vn %v1, %v2, %v1
> vmrlg %v2, %v1, %v0
> vmrhg %v0, %v1, %v0
> vmrlg %v1, %v0, %v2
> vst %v1, 48(%r9)
> vl %v1, 32(%r8)
> vl %v3, 32(%r7)
> vn %v1, %v3, %v1
> vl %v3, 0(%r8)
> vl %v4, 0(%r7)
> vn %v3, %v4, %v3
> vmrlg %v4, %v3, %v1
> vmrhg %v1, %v3, %v1
> vmrlg %v3, %v1, %v4
> vst %v3, 32(%r9)
> vmrhg %v0, %v0, %v2
> vst %v0, 16(%r9)
> vmrhg %v0, %v1, %v4
> vst %v0, 0(%r9)
> la %r9, 64(%r9)
> la %r8, 64(%r8)
> la %r7, 64(%r7)
> aghi %r13, -2
> jne .LBB0_141
>
> Final scalar loop:
> .LBB0_152: # %while.body320
> # =>This Inner Loop Header: Depth=1
> lg %r13, 0(%r14)
> ng %r13, 0(%r5)
> stg %r13, 0(%r4)
> lg %r13, 8(%r14)
> ng %r13, 8(%r5)
> stg %r13, 8(%r4)
> lg %r13, 16(%r14)
> ng %r13, 16(%r5)
> stg %r13, 16(%r4)
> lg %r13, 24(%r14)
> ng %r13, 24(%r5)
> stg %r13, 24(%r4)
> la %r4, 32(%r4)
> la %r14, 32(%r14)
> la %r5, 32(%r5)
> brct %r0, .LBB0_152
> j .LBB0_155
>