[PATCH] D9804: Optimize scattered vector insert/extract pattern

Fri Sep 25 16:13:04 PDT 2015

apazos added a subscriber: apazos.
apazos added a comment.

Hi folks,

Just want to clarify where this issue comes from:

1. SROA will replace large allocas with vector SSA values.

E.g., the alloca for "short a[32]" is rewritten as 4 vectors of type <8 x i16> to avoid the loads/stores to the stack-allocated variable.
This results in insert/extract instructions being generated in the IR code.
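
To make the mapping concrete, here is a minimal C sketch (a model, not LLVM code) of how element accesses on the array turn into lane inserts/extracts. The names v8i16, insert_lane, extract_lane, store_elem and load_elem are made up for illustration, and the contiguous split (element i goes to vector i / 8, lane i % 8) is my assumption about how the 32 shorts land in the 4 vectors.

```c
#include <assert.h>
#include <stdint.h>

enum { LANES = 8, VECS = 4 };   /* 32 shorts = 4 x <8 x i16> */

/* Hypothetical stand-in for one <8 x i16> SSA value. */
typedef struct { int16_t lane[LANES]; } v8i16;

/* Models an insert: returns a copy of v with lane idx replaced by x. */
static v8i16 insert_lane(v8i16 v, int idx, int16_t x) {
    assert(idx >= 0 && idx < LANES);
    v.lane[idx] = x;
    return v;
}

/* Models an extract: reads lane idx of v. */
static int16_t extract_lane(v8i16 v, int idx) {
    assert(idx >= 0 && idx < LANES);
    return v.lane[idx];
}

/* "a[i] = x" once the array lives in four vectors (assumed contiguous split). */
static void store_elem(v8i16 vecs[VECS], int i, int16_t x) {
    assert(i >= 0 && i < LANES * VECS);
    vecs[i / LANES] = insert_lane(vecs[i / LANES], i % LANES, x);
}

/* "x = a[i]" on the same representation. */
static int16_t load_elem(const v8i16 vecs[VECS], int i) {
    assert(i >= 0 && i < LANES * VECS);
    return extract_lane(vecs[i / LANES], i % LANES);
}
```

With this model, a store to a[13] becomes an insert into lane 5 of the second vector, which is the kind of scattered insert the examples below start from.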

2. The AArch64 backend is not able to combine scattered loads and stores with the insert/extract instructions to generate scalar/lane-based loads/stores in the presence of extension instructions.

Example 1: When there is no extension/truncation of the loaded values we are fine; the backend generates optimized code.
x = ld
y = insert x v1, 1
Generates:
ld1 { v0.b }[1], [x0]

Example 2: But when extension instructions are present:
x = ld
y = ext x
z = insert y v1, 1
Generates:
ldrb	 w8, [x0]
ins	v0.h[1], w8

However, this would be better code:
ld1	{ v0.b }[1], [x0]
ushll	v0.8h, v0.8b, #0

The benefit is easier to see when there is more than one insert instruction:
	ldrb	 w8, [x0]
	ldrb	 w9, [x1]
	ins	v0.h[1], w8
	ins	v0.h[5], w9
Better code would be:
	ld1	{ v0.b }[1], [x0]
	ld1	{ v0.b }[5], [x1]
	ushll	v0.8h, v0.8b, #0
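
As a sanity check on why the two sequences compute the same thing, here is a small C model (a sketch, not backend code). The helper names are hypothetical. It assumes zero extension, as ushll implies, and it assumes the remaining lanes of the wide vector are themselves zero-extended bytes; if they are not, widening once at the end is not a valid replacement.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t  lane[8]; } v8u8;   /* models v0.8b */
typedef struct { uint16_t lane[8]; } v8u16;  /* models v0.8h */

/* Models "ushll v0.8h, v0.8b, #0": zero-extend each byte lane to 16 bits. */
static v8u16 widen(v8u8 v) {
    v8u16 r;
    for (int i = 0; i < 8; ++i)
        r.lane[i] = v.lane[i];
    return r;
}

/* Models "ldrb w8, [p]" + "ins v0.h[idx], w8" on an already widened vector. */
static v8u16 insert_wide(v8u16 v, int idx, const uint8_t *p) {
    assert(idx >= 0 && idx < 8);
    v.lane[idx] = (uint16_t)*p;
    return v;
}

/* Models "ld1 { v0.b }[idx], [p]" on the narrow vector. */
static v8u8 load_lane(v8u8 v, int idx, const uint8_t *p) {
    assert(idx >= 0 && idx < 8);
    v.lane[idx] = *p;
    return v;
}

/* Returns 1 when both vectors hold the same value in every lane. */
static int same_lanes(v8u16 a, v8u16 b) {
    for (int i = 0; i < 8; ++i)
        if (a.lane[i] != b.lane[i])
            return 0;
    return 1;
}
```

In the model, insert_wide applied twice to widen(v) gives the same lanes as widen applied once after two load_lane calls. The second form needs one widening step no matter how many lanes are loaded, while the first needs a scalar load plus an insert per lane.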

The same is true for extract instructions:

  umov  w8, v0.b[1]
  umov  w9, v0.b[5]
  strh   w8, [x0]
  strh   w9, [x1]

Better code would be:

  ushll v0.8h, v0.8b, #0
  st1 { v0.h }[1], [x0]
  st1 { v0.h }[5], [x1]
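
The store side can be modelled the same way. This is again only a C sketch with hypothetical helper names, assuming zero extension: extracting a byte lane through a scalar register and storing 16 bits writes the same value as widening the whole vector once and storing the halfword lane directly.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t  lane[8]; } v8u8;   /* models v0.8b */
typedef struct { uint16_t lane[8]; } v8u16;  /* models v0.8h */

/* Models "umov w8, v0.b[idx]" + "strh w8, [dst]". */
static void store_via_scalar(v8u8 v, int idx, uint16_t *dst) {
    assert(idx >= 0 && idx < 8);
    uint32_t w = v.lane[idx];   /* umov zero-extends the byte into w8 */
    *dst = (uint16_t)w;         /* strh stores the low 16 bits */
}

/* Models "ushll v0.8h, v0.8b, #0": zero-extend each byte lane to 16 bits. */
static v8u16 widen(v8u8 v) {
    v8u16 r;
    for (int i = 0; i < 8; ++i)
        r.lane[i] = v.lane[i];
    return r;
}

/* Models "st1 { v0.h }[idx], [dst]" on the widened vector. */
static void store_lane(v8u16 v, int idx, uint16_t *dst) {
    assert(idx >= 0 && idx < 8);
    *dst = v.lane[idx];
}
```

Calling store_via_scalar(v, i, &a) and store_lane(widen(v), i, &b) leaves the same value in a and b for any lane i, so the vector can be widened once and each lane stored without going through a general-purpose register.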

Therefore, after SROA, we need to detect these patterns in the IR and rewrite them so the backend can generate the optimized instructions.

This should be done in a target-independent way. Maybe it can be done in InstCombine, or in the SLP vectorizer (as in this patch).

Even though it is SROA that generates the insert/extract instructions, I do not think we should fix it there.

This is the problem Lawrence is trying to solve. Any other suggestions?

Repository:
  rL LLVM

http://reviews.llvm.org/D9804