[llvm-bugs] [Bug 38788] New: sse4.1 all/any reductions on <4 x i32> inconsistent and possibly suboptimal

Fri Aug 31 02:39:39 PDT 2018

https://bugs.llvm.org/show_bug.cgi?id=38788

            Bug ID: 38788
           Summary: sse4.1 all/any reductions on <4 x i32> inconsistent
                    and possibly suboptimal
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: gonzalobg88 at gmail.com
                CC: llvm-bugs at lists.llvm.org

This was originally reported as
https://github.com/rust-lang-nursery/packed_simd/issues/103

When the following functions:

define zeroext i1 @all_sse2(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <16 x i8>*
  %1 = load <16 x i8>, <16 x i8>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %1) #2
  %3 = icmp eq i32 %2, 65535
  ret i1 %3
}

define zeroext i1 @any_sse2(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <16 x i8>*
  %1 = load <16 x i8>, <16 x i8>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %1) #2
  %3 = icmp ne i32 %2, 0
  ret i1 %3
}

define zeroext i1 @all_sse41(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <2 x i64>*
  %1 = load <2 x i64>, <2 x i64>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse41.ptestc(<2 x i64> %1, <2 x i64> <i64 -1, i64 -1>) #2
  %3 = icmp eq i32 %2, 1
  ret i1 %3
}

define zeroext i1 @any_sse41(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <2 x i64>*
  %1 = load <2 x i64>, <2 x i64>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse41.ptestz(<2 x i64> %1, <2 x i64> %1) #2
  %3 = icmp eq i32 %2, 0
  ret i1 %3
}

are compiled and optimized with the SSE4.1 target-feature enabled, the SSE2
and SSE4.1 variants of all_ and any_ produce different machine code even
though, for mask vectors (every lane all-ones or all-zeros), they compute the
same result:

all_sse2:
  movdqa xmm0, xmmword ptr [rdi]
  pmovmskb eax, xmm0
  cmp eax, 65535
  sete al
  ret

any_sse2:
  movdqa xmm0, xmmword ptr [rdi]
  pmovmskb eax, xmm0
  test eax, eax
  setne al
  ret

all_sse41:
  movdqa xmm0, xmmword ptr [rdi]
  pcmpeqd xmm1, xmm1
  ptest xmm0, xmm1
  setb al
  ret

any_sse41:
  movdqa xmm0, xmmword ptr [rdi]
  ptest xmm0, xmm0
  setne al
  ret
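
To make the "same result" claim concrete, here is a small scalar model of the
four functions. It is only a sketch: the helper names (pack_lanes,
variants_agree_on_masks) are made up for illustration, and the pmovmskb /
ptestz / ptestc functions are plain-integer models of the instructions'
documented flag semantics, not calls into any real intrinsic. It assumes the
input is held as a 128-bit Python integer with lane 0 in the low bits.

```python
from itertools import product

MASK32 = 0xFFFFFFFF
ALL_ONES_128 = (1 << 128) - 1


def pack_lanes(lanes):
    """Pack four 32-bit lanes (lane 0 in the low bits) into one 128-bit int."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & MASK32) << (32 * i)
    return value


def pmovmskb(x):
    """Model of PMOVMSKB: gather the top bit of each of the 16 bytes."""
    return sum(((x >> (8 * i + 7)) & 1) << i for i in range(16))


def ptestz(a, b):
    """Model of PTEST's ZF result: 1 if (a AND b) is all zeros."""
    return int((a & b) == 0)


def ptestc(a, b):
    """Model of PTEST's CF result: 1 if ((NOT a) AND b) is all zeros."""
    return int((~a & b & ALL_ONES_128) == 0)


# The four functions from the report, expressed on the scalar model.
def all_sse2(x):
    return pmovmskb(x) == 0xFFFF


def any_sse2(x):
    return pmovmskb(x) != 0


def all_sse41(x):
    return ptestc(x, ALL_ONES_128) == 1


def any_sse41(x):
    return ptestz(x, x) == 0


def variants_agree_on_masks():
    """Compare the variants on all 16 mask vectors (lanes 0 or 0xFFFFFFFF)."""
    for lanes in product((0, MASK32), repeat=4):
        x = pack_lanes(lanes)
        if all_sse2(x) != all_sse41(x) or any_sse2(x) != any_sse41(x):
            return False
    return True
```

In this model the variants agree on every mask vector, but not on arbitrary
inputs: with all lanes set to 0x80808080, all_sse2 is true while all_sse41 is
false, and with lanes (1, 0, 0, 0), any_sse41 is true while any_sse2 is false.
So any rewrite between the two forms would need to know that the input really
is a mask.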

IACA reports (https://gist.github.com/gnzlbg/80d3139393615c18495b1dd7855fc787): 

all SSE2 (movmsk) -> Uops: 4, Throughput 1.00 Cycles
all SSE4.1 (ptest) -> Uops: 5, Throughput 1.24 Cycles
any SSE2 (movmsk) -> Uops: 4, Throughput 1.00 Cycles
any SSE4.1 (ptest) -> Uops: 4, Throughput 1.00 Cycles

Instruction-wise, every function has 5 instructions except any_sse41, which
has 4.

So _maybe_ (I am not sure):

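- the all_sse41 function should be optimized / lowered to use
@llvm.x86.sse2.pmovmskb.128 instead of @llvm.x86.sse41.ptestc, since IACA
reports 4 uops at 1.00 cycles for the movmsk version versus 5 uops at 1.24
cycles for the ptest version

- the any_sse2 function should be optimized to use @llvm.x86.sse41.ptestz
instead of @llvm.x86.sse2.pmovmskb.128 when SSE4.1 is enabled, to improve
code size (4 instructions instead of 5)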
