[PATCH] D32416: [x86, SSE] AVX1 PR28129

Tue Apr 25 09:10:58 PDT 2017

spatel added inline comments.

================
Comment at: test/CodeGen/X86/vector-pcmp.ll:156-158
+; AVX1-NEXT:    vxorps %ymm1, %ymm1, %ymm1
+; AVX1-NEXT:    vcmptrueps %ymm1, %ymm1, %ymm1
 ; AVX1-NEXT:    vxorps %ymm1, %ymm0, %ymm0
----------------
RKSimon wrote:
> spatel wrote:
> > RKSimon wrote:
> > > spatel wrote:
> > > > That's an interesting case...that we probably can't answer at the DAG level. Would it be better to use two 128-bit vpxor instructions instead of incurring a potential domain-crossing penalty with the one 256-bit vxorps?
> > > Do you mean this? 
> > > ```
> > > vextractf128 $1, %ymm0, %xmm1
> > > vpxor %xmm2, %xmm2, %xmm2
> > > vpcmpgtb %xmm1, %xmm2, %xmm1
> > > vpcmpgtb %xmm0, %xmm2, %xmm0
> > > vpcmpeqd %xmm2, %xmm2, %xmm2
> > > vpxor %xmm2, %xmm1, %xmm1
> > > vpxor %xmm2, %xmm0, %xmm0
> > > vinsertf128 $1, %xmm1, %ymm0, %ymm0
> > > ```
> > Yes - I remember reading somewhere (and not sure how widely this applies) that the 'insertX128' insts may not actually have domain-crossing penalties. The other variable in this mix (thinking about Jaguar here) is that the 256-bit ops may be cracked and double-pumped anyway, so if we have that + domain-crossing penalty, then the two 128-bit insts should be faster?
> A quick hot loop test suggests that the old vpcmpeqd+vinsertf128+xor approach takes 8cy, the 256-bit xor+vcmptrueps+xor approach takes 7cy and the 128-bit vpcmpeqd+2*xor takes 6cy on Jaguar.
> 
> It might be worth looking at splitting some 256-bit bitwise operations that take concatenated 128-bit operations, but I don't think it should get in the way of this patch.
Agreed - the splitting problem is separate:
https://bugs.llvm.org/show_bug.cgi?id=32790
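
For reference, a rough sketch of why the split is safe to consider at all: the 256-bit xor against all-ones and the two 128-bit xors with extract/insert compute the same bits. This models only the bitwise result with Python integers. The helper names are made up for illustration, and it says nothing about cycle counts or domain-crossing penalties.

```python
# Hypothetical model: a ymm register is a 256-bit integer, an xmm is 128 bits.
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


def not_256(x):
    """Single 256-bit xor against an all-ones value (the vxorps %ymm form)."""
    return (x ^ MASK256) & MASK256


def not_2x128(x):
    """Split form: extract the halves, xor each with all-ones, reinsert."""
    lo = x & MASK128             # low 128 bits, already addressable as xmm0
    hi = (x >> 128) & MASK128    # models vextractf128 $1
    lo ^= MASK128                # models vpxor with an all-ones xmm
    hi ^= MASK128
    return (hi << 128) | lo      # models vinsertf128 $1


if __name__ == "__main__":
    sample = 0x0123456789ABCDEF << 100
    print(not_256(sample) == not_2x128(sample))
```

Both forms give the same value for any 256-bit input, so the choice between them is purely a performance question for the target.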

https://reviews.llvm.org/D32416