[llvm-commits] Please Review: AVX code optimization

Sun Jul 15 23:49:45 PDT 2012

I compared the output of my codegen-level optimization against the code generated after -O2 optimization. Here is the code for comparison:

IR before -O2:
  %c = extractelement <8 x i32> %a, i32 1
  %d = insertelement <8 x i32> %b, i32 %c, i32 7
The code:
        vunpcklps          %ymm0, %ymm0, %ymm0                 ## ymm0 = ymm0[0,0,1,1,4,4,5,5]
        vperm2f128      $0, %ymm0, %ymm0, %ymm0          ## ymm0 = ymm0[0,1,0,1]
        vblendps           $128, %ymm0, %ymm1, %ymm0

IR after -O2:
  %d = shufflevector <8 x i32> %b, <8 x i32> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 9>

The code:
        vextractf128    $1, %ymm1, %xmm2
        vshufps $33, %xmm2, %xmm0, %xmm0        ## xmm0 = xmm0[1,0],xmm2[2,0]
        vshufps $36, %xmm0, %xmm2, %xmm0        ## xmm0 = xmm2[0,1],xmm0[2,0]
        vinsertf128     $1, %xmm0, %ymm1, %ymm0
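
The equivalence between the extractelement/insertelement pair and the single shufflevector can be checked with a small model. This is a minimal Python sketch, not LLVM code. The helper names (`extract_insert`, `shufflevector`, `mask_for`) are hypothetical and exist only for illustration. It assumes the usual shufflevector convention: mask indices 0..N-1 select from the first operand and N..2N-1 from the second.

```python
def extract_insert(a, b, src_idx, dst_idx):
    """Model of: %c = extractelement %a, src_idx; %d = insertelement %b, %c, dst_idx."""
    d = list(b)
    d[dst_idx] = a[src_idx]
    return d

def shufflevector(v1, v2, mask):
    """Model of shufflevector: indices < len(v1) pick from v1, the rest from v2."""
    both = list(v1) + list(v2)
    return [both[i] for i in mask]

def mask_for(n, src_idx, dst_idx):
    """Identity mask on the first operand, except lane dst_idx,
    which takes lane src_idx of the second operand."""
    mask = list(range(n))
    mask[dst_idx] = n + src_idx
    return mask

# Symbolic lanes so the origin of every element stays visible.
a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]

mask = mask_for(8, src_idx=1, dst_idx=7)
print(mask)  # [0, 1, 2, 3, 4, 5, 6, 9]
print(shufflevector(b, a, mask) == extract_insert(a, b, 1, 7))  # True
```

The same helper gives `[0, 1, 7, 3]` for the <4 x i64> case (extract lane 3, insert at lane 2), again with %b as the first shuffle operand.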

--------------------------
IR before -O2:

  %c = extractelement <4 x i64> %a, i32 3
  %d = insertelement <4 x i64> %b, i64 %c, i32 2

vunpckhpd       %ymm0, %ymm0, %ymm0 ## ymm0 = ymm0[1,1,3,3]
vblendpd        $4, %ymm0, %ymm1, %ymm0

IR after -O2:
  %d = shufflevector <4 x i64> %b, <4 x i64> %a, <4 x i32> <i32 0, i32 1, i32 7, i32 3>

vextractf128    $1, %ymm1, %xmm2
vextractf128    $1, %ymm0, %xmm0
vpunpckhqdq     %xmm2, %xmm0, %xmm0 ## xmm0 = xmm0[1],xmm2[1]
vinsertf128     $1, %xmm0, %ymm1, %ymm0
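
To check that the two <4 x i64> sequences above compute the same value, the instructions can be modeled lane by lane. This is a rough Python sketch with several assumptions: each function is a simplified model covering only the 64-bit-lane behavior needed here, %ymm0 holds %a and %ymm1 holds %b on entry, and operands are passed in Intel source order (src1, src2), which is the reverse of the AT&T listings above.

```python
def vunpckhpd_ymm(src1, src2):
    # Per 128-bit half: take the high 64-bit element of src1, then of src2.
    return [src1[1], src2[1], src1[3], src2[3]]

def vblendpd_ymm(imm, src1, src2):
    # Bit i of imm set: element i comes from src2, otherwise from src1.
    return [src2[i] if (imm >> i) & 1 else src1[i] for i in range(4)]

def vextractf128(imm, ymm):
    # Return the 128-bit half selected by imm (0 = low, 1 = high).
    return ymm[2 * imm: 2 * imm + 2]

def vpunpckhqdq(src1, src2):
    # 128-bit form: high quadword of src1, then high quadword of src2.
    return [src1[1], src2[1]]

def vinsertf128(imm, src1, xmm):
    # Copy src1 and overwrite the 128-bit half selected by imm with xmm.
    out = list(src1)
    out[2 * imm: 2 * imm + 2] = xmm
    return out

a = ["a0", "a1", "a2", "a3"]  # assumed to live in %ymm0
b = ["b0", "b1", "b2", "b3"]  # assumed to live in %ymm1

# Sequence generated from the extract/insert pair (unpack + blend).
t = vunpckhpd_ymm(a, a)              # [a1, a1, a3, a3]
blend_result = vblendpd_ymm(4, b, t)

# Sequence generated from the shufflevector produced by -O2.
x2 = vextractf128(1, b)
x0 = vextractf128(1, a)
x0 = vpunpckhqdq(x0, x2)             # [a3, b3]
o2_result = vinsertf128(1, b, x0)

print(blend_result)               # ['b0', 'b1', 'a3', 'b3']
print(blend_result == o2_result)  # True
```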

I should add that the code shown above is what my own branch generates now, and it requires further changes in X86ISelLowering. I plan to send the patches one by one.
To answer your question:

> Or is this a pattern that parts of the backend will produce internally where the IR optimizers couldn't see it?

Right. The IR optimizer does not see this pattern; our backend generates it later.

- Elena
From: Nick Lewycky [mailto:nlewycky at google.com]
Sent: Friday, July 13, 2012 21:43
To: Demikhovsky, Elena
Cc: Nick Lewycky; Commit Messages and Patches for LLVM
Subject: Re: [llvm-commits] Please Review: AVX code optimization

On 11 July 2012 03:34, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
I'm not sure that all architectures will see a performance gain.
When building the shuffles, I know that each shuffle will be replaced with one machine instruction.
I also know that a shuffle is cheaper (1 cycle) than an extract (3 cycles) or an insert (2 cycles).
I know that a blend is better than the other shuffles. This information is specific to X86 and is documented in the IA optimization guide.
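
As a back-of-the-envelope illustration of the cost argument above, the quoted cycle counts can be summed over the two <4 x i64> instruction sequences from this thread. This is only a sketch in Python. The per-operation costs are the ones quoted in the message. The classification of each instruction as shuffle, extract or insert is an assumption made here for illustration. A plain sum also ignores throughput and out-of-order execution, so it is a rough upper bound on latency, not a measurement.

```python
# Per-operation costs in cycles, as quoted in the message above.
COST = {"shuffle": 1, "extract": 3, "insert": 2}

# ASSUMPTION: this instruction-to-category mapping is illustrative only.
blend_sequence = [
    ("vunpckhpd", "shuffle"),
    ("vblendpd", "shuffle"),
]
o2_sequence = [
    ("vextractf128", "extract"),
    ("vextractf128", "extract"),
    ("vpunpckhqdq", "shuffle"),
    ("vinsertf128", "insert"),
]

def total_cost(sequence):
    """Naive sum of the per-instruction costs."""
    return sum(COST[kind] for _, kind in sequence)

print(total_cost(blend_sequence))  # 2
print(total_cost(o2_sequence))     # 9
```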

The IR-level optimizers already do transform your testcases into shufflevector instructions. Here's the result after opt -O2:

define <8 x i32> @test20(<8 x i32> %a, <8 x i32> %b) nounwind readnone {
  %d = shufflevector <8 x i32> %b, <8 x i32> %a, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 9>
  ret <8 x i32> %d
}

define <8 x i32> @test21(<8 x i32> %a, <8 x i32> %b) nounwind readnone {
  %d = shufflevector <8 x i32> %b, <8 x i32> %a, <8 x i32> <i32 0, i32 1, i32 9, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i32> %d
}

define <4 x i64> @test22(<4 x i64> %a, <4 x i64> %b) nounwind readnone {
  %d = shufflevector <4 x i64> %b, <4 x i64> %a, <4 x i32> <i32 0, i32 1, i32 7, i32 3>
  ret <4 x i64> %d
}

define <4 x i64> @test23(<4 x i64> %a, <4 x i64> %b) nounwind readnone {
  %d = shufflevector <4 x i64> %b, <4 x i64> %a, <4 x i32> <i32 0, i32 1, i32 7, i32 3>
  ret <4 x i64> %d
}

In what case does the patch you sent in improve generated code? Running the optimizing code generator on unoptimized IR? Or is this a pattern that parts of the backend will produce internally where the IR optimizers couldn't see it?

Nick

- Elena
-----Original Message-----
From: Nick Lewycky [mailto:nicholas at mxc.ca]
Sent: Wednesday, July 11, 2012 11:47
To: Demikhovsky, Elena
Cc: Commit Messages and Patches for LLVM
Subject: Re: [llvm-commits] Please Review: AVX code optimization

Demikhovsky, Elena wrote:
> I wrote an optimization for extractelement - insertelement sequences.
> Please review.

It looks like this is a dagcombine to turn insertelement+extractelement pairs into vector shuffles. Perhaps I'm missing a good reason, but why not do this as an IR optimization?

Nick
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
