[PATCH] D10964: [Codegen] Add intrinsics 'hsum*' and corresponding SDNodes for horizontal sum operation.

Tue Oct 27 20:56:36 PDT 2015

davidxl added a subscriber: davidxl.

================
Comment at: docs/LangRef.rst:10979
@@ +10978,3 @@
+
+      declare <integer> @llvm.hsum.i32.v4i32(<4 x integer> %a)
+      declare <float> @llvm.hsum.f32.v4f32(<4 x float> %a)
----------------
For the integer case, a scalar result type with the same width as the vector element makes this intrinsic less useful because of overflow: the vectorizer will often be unable to prove that overflow cannot happen, so in many cases it won't be able to generate the intrinsic.

As Bruno commented, a vector result type may be the way to go. For instance, for an input type of v4i8, if the result type is v2i16, the hsum is split into two horizontal adds, each producing a 16-bit result. If the result type is v1i32, the hsum adds all four i8 integers and produces a 32-bit result. Limiting this to a power-of-2 number of elements seems reasonable.

================
Comment at: test/CodeGen/X86/vec-hadd-float-128.ll:10
@@ +9,3 @@
+; UNSAFE-NEXT:    movapd %xmm0, %xmm1
+; UNSAFE-NEXT:    shufpd {{.*#+}} xmm1 = xmm1[1,0]
+; UNSAFE-NEXT:    addps %xmm0, %xmm1
----------------
Should this be shufps .... xmm1 = xmm1[1,?,?,?]?

================
Comment at: test/CodeGen/X86/vec-hadd-float-128.ll:13
@@ +12,3 @@
+; UNSAFE-NEXT:    movaps %xmm1, %xmm0
+; UNSAFE-NEXT:    shufps {{.*#+}} xmm0 = xmm0[1,3,2,3]
+; UNSAFE-NEXT:    addps %xmm1, %xmm0
----------------
This shufps and addps should not be expected.

================
Comment at: test/CodeGen/X86/vec-hadd-int-128.ll:8
@@ +7,3 @@
+; CHECK:       # BB#0:
+; CHECK-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
+; CHECK-NEXT:    paddd %xmm0, %xmm1
----------------
The result does not look right. Should pshufb be generated instead?

================
Comment at: test/CodeGen/X86/vec-hadd-int-128.ll:24
@@ +23,3 @@
+; CHECK:       # BB#0:
+; CHECK-NEXT:    pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
+; CHECK-NEXT:    paddd %xmm0, %xmm1
----------------
Should pshufw be generated?

Or would it be more efficient with phaddw?

http://reviews.llvm.org/D10964