[PATCH] D48725: [SLP] Vectorize bit-parallel operations with SWAR.

Clement Courbet via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jun 29 08:32:02 PDT 2018


courbet added a comment.

In https://reviews.llvm.org/D48725#1147883, @RKSimon wrote:

> If we're only ever going to be using load/store + and/or/xor ops I wonder if we'd be better off doing this in the DAG alongside the LoadCombine handling? SLP is going to struggle with more general cases where the sizes of bundle elements differ.




There are other advantages to reusing the SLP vectorizer's infrastructure: besides loads/stores and logical operations, we also get shuffles for free. Consider this code:

  #include <cstdint>

  struct S {
    int32_t a;
    int32_t b;
    int64_t c;
    int32_t d;
  };
  
  S copy_2xi32(const S& s) {
    S result;
    result.a = s.b;
    result.b = s.a;
    return result;
  }

Without the change, this lowers to:

  copy_2xi32(S): # @copy_2xi32(S)
    mov eax, dword ptr [rsp + 12]
    mov dword ptr [rdi], eax
    mov eax, dword ptr [rsp + 8]
    mov dword ptr [rdi + 4], eax
    mov rax, rdi
    ret

With the change, this lowers to:

  0000000000000000 <_Z10copy_2xi32RK1S>:
     0:	f3 0f 7e 06          	movq   (%rsi),%xmm0
     4:	66 0f 70 c0 e1       	pshufd $0xe1,%xmm0,%xmm0
     9:	66 0f d6 07          	movq   %xmm0,(%rdi)
     d:	48 89 f8             	mov    %rdi,%rax
    10:	c3                   	retq   
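
For reference, here is a minimal C++ sketch of the bit-parallel (SWAR) idea in the patch title. The names (P, xor_scalar, xor_swar) are hypothetical and not from the patch; the point is only that two adjacent 32-bit XORs can be folded into a single 64-bit XOR, because XOR (like AND and OR) never moves bits across lane boundaries.

  #include <cstdint>
  #include <cstring>

  struct P {
    uint32_t a;
    uint32_t b;
  };

  // Scalar form: two independent 32-bit XORs on adjacent fields.
  P xor_scalar(const P& x, const P& y) {
    P r;
    r.a = x.a ^ y.a;
    r.b = x.b ^ y.b;
    return r;
  }

  // SWAR form: the same work done as one 64-bit XOR over the combined
  // storage of both fields; legal because XOR is bit-parallel (each result
  // bit depends only on the corresponding input bits).
  P xor_swar(const P& x, const P& y) {
    uint64_t xb, yb;
    std::memcpy(&xb, &x, sizeof xb);
    std::memcpy(&yb, &y, sizeof yb);
    uint64_t rb = xb ^ yb;
    P r;
    std::memcpy(&r, &rb, sizeof r);
    return r;
  }

This only works for operations where no bits cross lane boundaries, which is why the quote above mentions only load/store plus and/or/xor ops.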


Repository:
  rL LLVM

https://reviews.llvm.org/D48725




