[PATCH] Optimize scattered vector insert/extract pattern

Fri May 15 16:51:06 PDT 2015

We noticed a performance issue on AArch64 in our internal testcase after the following SROA fix was brought in by DUI.

commit d4748bbd497b550a4e5db246c6708fcd6de542da
Author: Owen Anderson <resistor at mac.com>
Date:   Thu Aug 7 21:07:35 2014 +0000

    Fix a case in SROA where lifetime intrinsics could inhibit alloca promotion.  In
    this case, the code path dealing with vector promotion was missing the explicit
    checks for lifetime intrinsics that were present on the corresponding integer
    promotion path.

Here are more details on the investigation:

1) What SROA is doing.
SROA will replace large allocas with either integer SSA values or vector SSA values.

The alloca "short pix [32]" is rewritten as 4 vectors of type <8 x i16> (IR code supports vectors) to avoid the load/stores to the stack-allocated variable.
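
As a rough illustration of the rewrite described above, the sketch below models how an index into "short pix [32]" could correspond to one of four 8-lane vectors. The helper names and the contiguous slice layout are assumptions made for illustration only; they are not taken from SROA's implementation.

```python
# Illustrative model only: assumes the 32 x i16 alloca is split into four
# contiguous <8 x i16> slices. All names here are hypothetical, not SROA's.
LANES_PER_VECTOR = 8
NUM_VECTORS = 4

def slot_for_index(i):
    """Map an element index of pix[32] to (vector number, lane number)."""
    if not 0 <= i < LANES_PER_VECTOR * NUM_VECTORS:
        raise IndexError("pix index out of range")
    return divmod(i, LANES_PER_VECTOR)

def store_element(vectors, i, value):
    """Model `pix[i] = value` as an insertelement into the matching vector."""
    vec, lane = slot_for_index(i)
    updated = list(vectors[vec])
    updated[lane] = value & 0xFFFF  # lanes are 16 bits wide
    return vectors[:vec] + [updated] + vectors[vec + 1:]
```

In this model a scalar store to pix[13] becomes an insert into lane 5 of the second vector, which is the kind of scattered insert the rest of this description is concerned with.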

2) Our problem.
The backend is unable to combine scattered loads and stores with the insert and extract instructions into scalar/lane-based loads and stores when extension instructions are present. For example, without an extension:
x = ld
y = insert x v1, 1
Generates:
ld1 { v0.b }[1], [x0]

When there is no extension/truncation of the loaded values we are fine: the backend generates the optimized code. That is not the case for the following:

x = ld
y = ext x
z = insert y v1, 1
Generates:
ldrb	 w8, [x0]
ins	v0.h[1], w8
However, this is better code:
ld1	{ v0.b }[1], [x0]
ushll	v0.8h, v0.8b, #0

You can clearly see the advantage of the latter code when you have more than one insert instruction:

	ldrb	 w8, [x0]
	ldrb	 w9, [x1]
	ins	v0.h[1], w8
	ins	v0.h[5], w9
Better code would be:
	ld1	{ v0.b }[1], [x0]
	ld1	{ v0.b }[5], [x1]
	ushll	v0.8h, v0.8b, #0
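
The reordering above relies on zero-extension commuting with lane inserts. The following toy model (plain Python lists standing in for <8 x i8> and <8 x i16> values, not LLVM code) is a sketch of that equivalence. It assumes the 16-bit base vector is itself the zero-extension of an 8-bit vector; the function names are hypothetical.

```python
# Toy model of <8 x i8> / <8 x i16> values as Python lists; not LLVM code.
def zext_lanes(v8):
    """Zero-extend every 8-bit lane to 16 bits (the role of ushll #0)."""
    return [lane & 0xFF for lane in v8]

def insert_lane(vec, value, lane, bits):
    """Return a copy of vec with `value` (truncated to `bits`) in `lane`."""
    out = list(vec)
    out[lane] = value & ((1 << bits) - 1)
    return out

def extend_then_insert(v16, loads):
    """Current shape: each loaded byte is widened, then inserted as i16."""
    for lane, byte in loads.items():
        v16 = insert_lane(v16, byte & 0xFF, lane, 16)
    return v16

def insert_then_extend(v8, loads):
    """Proposed shape: insert bytes into the i8 vector, widen once at the end."""
    for lane, byte in loads.items():
        v8 = insert_lane(v8, byte, lane, 8)
    return zext_lanes(v8)
```

Both functions return the same lanes for any set of loaded bytes, but the second shape needs only one widening step regardless of how many lanes are inserted.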

The same is true for extract instructions:
  umov  w8, v0.b[1]
  umov  w9, v0.b[5]
  strh   w8, [x0]
  strh   w9, [x1]
Better code would be:
  ushll v0.8h, v0.8b, #0
  st1 { v0.h }[1], [x0]
  st1 { v0.h }[5], [x1]
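
The extract case can be modelled the same way. In this sketch an <8 x i8> vector is a Python list and memory is a dict from address to a 16-bit value; both are simplifications, and the function names are hypothetical.

```python
# Toy model only: an <8 x i8> vector is a list of ints and memory is a dict
# from address to a 16-bit value. All names are hypothetical.
def extract_then_extend(v8, stores, memory):
    """Current shape: move each byte lane to a scalar, then store 16 bits."""
    out = dict(memory)
    for lane, addr in stores.items():
        scalar = v8[lane] & 0xFF     # umov: byte lane zero-extended into a scalar
        out[addr] = scalar & 0xFFFF  # strh: 16-bit scalar store
    return out

def extend_then_extract(v8, stores, memory):
    """Proposed shape: widen the whole vector once, then store lanes directly."""
    v16 = [lane & 0xFF for lane in v8]  # ushll #0: one vector-wide zero-extend
    out = dict(memory)
    for lane, addr in stores.items():
        out[addr] = v16[lane]           # st1 { v.h }[lane]: 16-bit lane store
    return out
```

Both shapes leave memory in the same state; the second avoids one scalar move per stored lane.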

See the testcases for examples of the bad code and of the IR transformation needed to generate the optimized code.

Therefore after SROA we need to detect these patterns in the IR and fix the IR code so the backend can generate the optimized instructions.

This should be done in a target-independent way. It could be done in InstCombine or in the SLP vectorizer; we chose the latter.

Even though it is SROA that generates the insert/extract instructions, I do not think we should fix it there.

REPOSITORY
  rL LLVM

http://reviews.llvm.org/D9804

Files:
  lib/Transforms/IPO/PassManagerBuilder.cpp
  lib/Transforms/Vectorize/SLPVectorizer.cpp
  test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
  test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D9804.25904.patch
Type: text/x-patch
Size: 32048 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150515/5d3c33e8/attachment.bin>