[PATCH] D48193: [LoopVectorizer] Use an interleave count of 1 when using a vector library call

Thu Jun 14 13:14:50 PDT 2018

rob.lougher created this revision.
rob.lougher added reviewers: mkuper, hfinkel, mssimpso.
Herald added a subscriber: dmgreen.

Given the following test program:

  #include <math.h>

  void test(float *a, float *b, int n) {
    for(int i = 0; i < n; i++)
      b[i] = sinf(a[i]);
  }

If we tell the compiler that a vector library is available and compile the program as follows:

$ clang -O2 --target=x86_64-unknown-linux -march=btver2 -mllvm -vector-library=SVML -S test.c

The loop will be vectorized with a vectorization factor of 8, and the call to sinf will be widened to a vector library call (__svml_sinf8):

  .LBB0_6:                                # %vector.body
                                          # =>This Inner Loop Header: Depth=1
  	vmovups	(%r12,%r13,4), %ymm0
  	vmovups	32(%r12,%r13,4), %ymm1
  	vmovups	64(%r12,%r13,4), %ymm3
  	vmovups	96(%r12,%r13,4), %ymm2
  	vmovups	%ymm1, (%rsp)           # 32-byte Spill
  	vmovups	%ymm3, 32(%rsp)         # 32-byte Spill
  	vmovups	%ymm2, 96(%rsp)         # 32-byte Spill
  	callq	__svml_sinf8
  	vmovups	%ymm0, 64(%rsp)         # 32-byte Spill
  	vmovups	(%rsp), %ymm0           # 32-byte Reload
  	callq	__svml_sinf8
  	vmovups	%ymm0, (%rsp)           # 32-byte Spill
  	vmovups	32(%rsp), %ymm0         # 32-byte Reload
  	callq	__svml_sinf8
  	vmovups	%ymm0, 32(%rsp)         # 32-byte Spill
  	vmovups	96(%rsp), %ymm0         # 32-byte Reload
  	callq	__svml_sinf8
  	vmovups	64(%rsp), %ymm1         # 32-byte Reload
  	vmovups	(%rsp), %ymm3           # 32-byte Reload
  	vmovups	32(%rsp), %ymm2         # 32-byte Reload
  	vmovups	%ymm1, (%r14,%r13,4)
  	vmovups	%ymm3, 32(%r14,%r13,4)
  	vmovups	%ymm2, 64(%r14,%r13,4)
  	vmovups	%ymm0, 96(%r14,%r13,4)
  	addq	$32, %r13
  	cmpq	%r13, %rbx
  	jne	.LBB0_6

However, as can be seen, the generated code is poor: it contains a large number of spills and reloads.  The reason is that the loop vectorizer has chosen an interleave count (aka unroll factor) of 4.

In general, the heuristic tries to create parallel instances of the loop to expose ILP without causing spilling.  It bases this on the number of registers used in the loop and the number of registers available.  However, because of the way the instructions are interleaved, each vector call causes the registers holding the other instances' values to be spilled (thus defeating the heuristic).
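
The spill traffic in the listing above can be reproduced with a small counting model. This is only a sketch, not LLVM code: it assumes that every vector register is clobbered by a call (as is the case for the ymm registers under the x86-64 System V calling convention) and that the interleaved body is laid out as all loads, then all calls, then all stores. The names SpillCost and spillCostPerIteration are made up for illustration.

```cpp
#include <cassert>

// Toy model, not LLVM code. Assumes every vector register is call-clobbered
// and that the loop body is emitted as: IC loads, IC calls, IC stores.
struct SpillCost {
  unsigned spills;   // vector stores to the stack per loop iteration
  unsigned reloads;  // vector loads from the stack per loop iteration
};

SpillCost spillCostPerIteration(unsigned interleaveCount) {
  assert(interleaveCount >= 1 && "interleave count must be at least 1");
  // The first input is passed straight to the first call and the last result
  // is stored straight from the return register, so only the "other"
  // IC - 1 inputs and IC - 1 results have to live across a call.
  unsigned others = interleaveCount - 1;
  // Inputs of calls 2..IC are spilled before the first call; results of
  // calls 1..IC-1 are spilled while the later calls run. Each spilled value
  // is reloaded exactly once.
  return SpillCost{2 * others, 2 * others};
}
```

For an interleave count of 4 the model gives 6 spills and 6 reloads per iteration, which matches the six "32-byte Spill" and six "32-byte Reload" instructions in the loop body above. For an interleave count of 1 it gives none.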

This patch changes the heuristic to use an interleave count of 1 when a call will be vectorized to a library call.  The test above now generates:

  .LBB0_6:                                # %vector.body
                                          # =>This Inner Loop Header: Depth=1
  	vmovups	(%r12,%r13,4), %ymm0
  	callq	__svml_sinf8
  	vmovups	%ymm0, (%r14,%r13,4)
  	addq	$8, %r13
  	cmpq	%r13, %rbx
  	jne	.LBB0_6
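
The shape of the change can be sketched as follows. This is a simplified, self-contained illustration and not the code in the attached patch: LoopInfoSketch, powerOf2Floor and selectInterleaveCount are hypothetical names, and the register-based estimate is only a rough stand-in for the real heuristic in LoopVectorize.cpp. The numbers used in the usage note are illustrative and are not claimed to be the values for btver2.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the loop facts the heuristic consults.
struct LoopInfoSketch {
  unsigned maxLocalUsers;      // peak number of simultaneously live vector values
  unsigned loopInvariantRegs;  // registers tied up by loop-invariant values
  bool hasVectorLibCall;       // true if a call will be widened to a library call
};

// Largest power of two that is <= x (returns 1 for x == 0 or x == 1).
unsigned powerOf2Floor(unsigned x) {
  unsigned p = 1;
  while (p <= x / 2)
    p *= 2;
  return p;
}

unsigned selectInterleaveCount(const LoopInfoSketch &loop,
                               unsigned numVectorRegs,
                               unsigned maxInterleave) {
  assert(maxInterleave >= 1 && "maximum interleave count must be at least 1");
  // The idea of the patch: values of the other interleaved instances would
  // be live across the library call and get spilled, so do not interleave.
  if (loop.hasVectorLibCall)
    return 1;

  // Otherwise estimate how many copies of the loop body fit in registers.
  assert(numVectorRegs > loop.loopInvariantRegs);
  unsigned users = std::max(1u, loop.maxLocalUsers);
  unsigned ic = powerOf2Floor((numVectorRegs - loop.loopInvariantRegs) / users);
  return std::min(std::max(ic, 1u), maxInterleave);
}
```

With 16 vector registers, 4 live values per instance and a maximum interleave count of 4, the sketch returns 4 for a loop without a vector library call and 1 for the same loop with one.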

Repository:
  rL LLVM

https://reviews.llvm.org/D48193

Files:
  include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
  lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
  lib/Transforms/Vectorize/LoopVectorize.cpp
  test/Transforms/LoopVectorize/X86/interleaving-veclib-call.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D48193.151397.patch
Type: text/x-patch
Size: 10909 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20180614/beb5a78c/attachment-0001.bin>