[PATCH] Fix SROA for intrinsics

Ana Pazos apazos at codeaurora.org
Thu Mar 12 14:12:47 PDT 2015

I took a closer look at the degradation caused by Owen's patch on AArch64.

With Owen's patch, SROA promotes more allocas to vector values and generates many scattered vector insert element instructions. The backend cannot optimize those scattered inserts when there are extension operations between the load and the insert instruction, so it ends up executing many more vector insert instructions, which degrades performance.

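Here is a simple example. Without an extension between the load and the insert:

x = ld
y = ld
v2 = insert x, v1, 1
v3 = insert y, v2, 5

the backend folds each load into a lane load:

ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]

With an extension between the load and the insert:

x = ld
ex = ext x
y = ld
ey = ext y
v2 = insert ex, v1, 1
v3 = insert ey, v2, 5

it generates separate scalar loads and inserts:

  ldrb     w8, [x0]
  ldrb     w9, [x1]
  ins    v0.h[1], w8
  ins    v0.h[5], w9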

Better code would be:

  ld1    { v0.b }[1], [x0]
  ld1    { v0.b }[5], [x1]
  ushll    v0.8h, v0.8b, #0
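
To illustrate why the two sequences above are interchangeable, here is a minimal C++ model of the lanes. It is only a sketch: the type aliases and function names are hypothetical, plain arrays stand in for the vector registers, and lanes that are undef in the real code are modeled as zero, which is an assumption made to keep the comparison simple.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for an 8 x i8 and an 8 x i16 vector register.
using V8x8 = std::array<std::uint8_t, 8>;
using V8x16 = std::array<std::uint16_t, 8>;

// Current form: zero-extend each loaded byte, then insert into 16-bit lanes
// (models the ldrb + ins v0.h[n] sequence).
V8x16 extend_then_insert(std::uint8_t a, std::uint8_t b) {
    V8x16 v{};  // untouched lanes are modeled as zero (assumption)
    v[1] = static_cast<std::uint16_t>(a);
    v[5] = static_cast<std::uint16_t>(b);
    return v;
}

// Proposed form: insert the bytes into 8-bit lanes, then widen the whole
// vector once (models the ld1 { v0.b }[n] loads followed by ushll #0).
V8x16 insert_then_extend(std::uint8_t a, std::uint8_t b) {
    V8x8 narrow{};  // untouched lanes are modeled as zero (assumption)
    narrow[1] = a;
    narrow[5] = b;
    V8x16 wide{};
    for (std::size_t i = 0; i < narrow.size(); ++i)
        wide[i] = static_cast<std::uint16_t>(narrow[i]);  // per-lane zero extension
    return wide;
}
```

Under these assumptions both functions return the same eight lanes for any pair of input bytes. The second form needs a single widening step regardless of how many lanes are inserted.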

Even though it is SROA that generates the vector insert instructions (by the way, the same issue exists with vector extract instructions), I do not think we should fix it there.

Chandler, what do you think? Should we try to generate better code from SROA?

In my opinion we should do an IR optimization (in InstCombine or even the SLP vectorizer?) to allow the backend to generate better machine code. Here is what the transformation would look like:

; Problem: The difference in element size prevents optimized code from being generated
define <8 x i16>  @test_ins4(i8* %arrayidx1, i8* %arrayidx2)  {

  %1 = load i8* %arrayidx1
  %conv1 = zext i8 %1 to i16
  %2 = load i8* %arrayidx2
  %conv2 = zext i8 %2 to i16
  %x = insertelement <8 x i16> undef, i16 %conv1, i32 1
  %y = insertelement <8 x i16> undef, i16 %conv2, i32 5
  %z = add <8 x i16> %x, %y
  ret <8 x i16> %z
}

; Solution: Transforming the IR to eliminate the difference in element size allowing us to generate optimized code
define <8 x i16>  @test_ins5(i8* %arrayidx1, i8* %arrayidx2)  {

  %1 = load i8* %arrayidx1
  %conv1 = zext i8 %1 to i16
  %2 = load i8* %arrayidx2
  %conv2 = zext i8 %2 to i16
  %x = insertelement <8 x i8> undef, i8 %1, i32 1
  %y = insertelement <8 x i8> %x, i8 %2, i32 5
  %z = zext <8 x i8> %y to <8 x i16>
  ret <8 x i16> %z
}

With all of the above, I think we can close this revision. We do not need to change Owen's patch (though the logic in that function is quite confusing).


