[PATCH] D20315: [LV] For some induction variables, use vector phis instead of widening the scalar in the loop body

Mon May 16 17:56:57 PDT 2016

mkuper created this revision.
mkuper added reviewers: delena, jmolloy, danielcdh.
mkuper added subscribers: llvm-commits, wmi, Ayal, davidxl.
Herald added a subscriber: mzolotukhin.

This changes the way we treat widening of induction variables.

In the existing code, whenever we need a widened IV, we widen the scalar IV on the fly by splatting it and adding the step vector.
Instead, we can create a real vector IV (a vector phi that is updated with a vector add on each iteration), which tends to save a couple of instructions per iteration. This patch only changes the behavior in the most basic case: integer primary IVs with a constant step. If this looks sensible, I'll try to follow up with the other cases.
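
To make the difference concrete, here is a rough scalar C model of the two strategies. This is only a sketch and not the code LV emits: the function names and the `VF` macro are made up for illustration, each 4-lane "vector" is modeled as a small array, and only the vector body is modeled (any remainder elements are left untouched).

```c
#include <stddef.h>
#include <stdint.h>

#define VF 4 /* hypothetical vectorization factor, as in vectorize_width(4) */

/* Existing scheme: rebuild the widened IV from the scalar IV on every
 * iteration, i.e. splat(i) + <0,1,2,3>. */
void fill_splat_plus_step(uint32_t *a, size_t n) {
    static const uint32_t step_vec[VF] = {0, 1, 2, 3};
    for (size_t i = 0; i + VF <= n; i += VF) {
        uint32_t wide[VF];
        for (int l = 0; l < VF; ++l)
            wide[l] = (uint32_t)i + step_vec[l]; /* splat, then add step vector */
        for (int l = 0; l < VF; ++l)
            a[i + l] = wide[l];                  /* vector store */
    }
}

/* Proposed scheme: keep a vector IV alive across iterations (the vector
 * phi) and bump every lane by VF once per iteration. */
void fill_vector_phi(uint32_t *a, size_t n) {
    uint32_t vec_iv[VF] = {0, 1, 2, 3};          /* initial value of the phi */
    for (size_t i = 0; i + VF <= n; i += VF) {
        for (int l = 0; l < VF; ++l)
            a[i + l] = vec_iv[l];                /* vector store */
        for (int l = 0; l < VF; ++l)
            vec_iv[l] += VF;                     /* vec_iv += <4,4,4,4> */
    }
}
```

Both functions write the same values; the second just trades the per-iteration splat and add for a single vector add on a loop-carried value.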

It seems to be more or less performance neutral, but for basic cases the code looks better, so I have the feeling this is a step in the right direction.
To take the most trivial example:

```
void vec(unsigned int *a, unsigned int k) {
#pragma clang loop vectorize_width(4) interleave_count(1)
#pragma nounroll
  for(unsigned int i = 0; i < k; ++i)
    a[i] = i;
}
```
For AVX, without this patch, we get:

```
# BB#5:
	xorl	%ecx, %ecx
	vmovdqa	.LCPI0_0(%rip), %xmm0   # xmm0 = [0,1,2,3]
	.p2align	4, 0x90
.LBB0_6:                                # =>This Inner Loop Header: Depth=1
	vmovd	%ecx, %xmm1
	vpshufd	$0, %xmm1, %xmm1        # xmm1 = xmm1[0,0,0,0]
	vpaddd	%xmm0, %xmm1, %xmm1
	vmovdqu	%xmm1, (%rdi,%rcx,4)
	addq	$4, %rcx
	cmpq	%rcx, %rdx
	jne	.LBB0_6
```

And with this patch:

```
# BB#5:                                 # %vector.body.preheader
	vmovdqa	.LCPI0_0(%rip), %xmm1   # xmm1 = [0,1,2,3]
	vmovdqa	.LCPI0_1(%rip), %xmm0   # xmm0 = [4,4,4,4]
	movq	%rdi, %rcx
	movq	%r8, %rdx
	.p2align	4, 0x90
.LBB0_6:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vmovdqu	%xmm1, (%rcx)
	vpaddd	%xmm0, %xmm1, %xmm1
	addq	$16, %rcx
	addq	$-4, %rdx
	jne	.LBB0_6
```

As this example shows, when we actually need the scalar IV, e.g. for a scalar GEP, InstCombine seems to clean things up nicely, so it doesn't look like LV needs to handle that case specially.
Other views (especially on when this may be a bad thing) are welcome.

http://reviews.llvm.org/D20315

Files:
  lib/Transforms/Vectorize/LoopVectorize.cpp
  test/Transforms/LoopVectorize/PowerPC/vsx-tsvc-s173.ll
  test/Transforms/LoopVectorize/X86/gather_scatter.ll
  test/Transforms/LoopVectorize/cast-induction.ll
  test/Transforms/LoopVectorize/gcc-examples.ll
  test/Transforms/LoopVectorize/gep_with_bitcast.ll
  test/Transforms/LoopVectorize/global_alias.ll
  test/Transforms/LoopVectorize/induction_plus.ll
