[PATCH] D32416: [x86, SSE] AVX1 PR28129

Mon Apr 24 07:23:19 PDT 2017

spatel added reviewers: craig.topper, zvi.
spatel added inline comments.

================
Comment at: lib/Target/X86/X86InstrSSE.td:7754-7755

 // Without AVX2 we need to concat two v4i32 V_SETALLONES to create a 256-bit
 // all ones value.
 let Predicates = [HasAVX1Only] in
----------------
This comment should be updated to match the new code.

Is it correct that this pattern won't apply to most integer code for an AVX1 target, because 256-bit integer operations would already have been legalized to v4i32/v2i64 by then? If so, I think that's also worth mentioning here.

I'm imagining cases like this:
  define <8 x i32> @cmpeq_v8i32(<8 x i32> %a) nounwind {
    %cmp = icmp eq <8 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
    %res = sext <8 x i1> %cmp to <8 x i32>
    ret <8 x i32> %res
  }

  define <8 x i32> @cmpne_v8i32(<8 x i32> %a) nounwind {
    %cmp = icmp ne <8 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
    %res = sext <8 x i1> %cmp to <8 x i32>
    ret <8 x i32> %res
  }

  define <8 x i32> @sub1_v8i32(<8 x i32> %a) nounwind {
    %add = add <8 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
    ret <8 x i32> %add
  }

================
Comment at: lib/Target/X86/X86InstrSSE.td:7758
 def : Pat<(v8i32 immAllOnesV),
-          (VINSERTF128rr
-           (INSERT_SUBREG (v8i32 (IMPLICIT_DEF)), (V_SETALLONES), sub_xmm),
-           (V_SETALLONES), 1)>;
+          (VCMPPSYrri (AVX_SET0), (AVX_SET0), 15)>;

----------------
It's not clear why we require a zero operand. Would a dummy (undef) register also work? Should we allow that when optimizing for size so the vxorps is not needed?
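
To make the question concrete: predicate 15 of VCMPPS is the "true" comparison (it prints as vcmptrueps), so each result lane is all ones whatever the operands contain. Below is a minimal C++ sketch of that behaviour using the `_mm256_cmp_ps` intrinsic. It assumes an x86-64 GCC or Clang toolchain, and the helper names `avx_available` and `cmp_true_is_all_ones` are made up for illustration. It says nothing about register dependencies, which is the part an undef operand might affect.

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// The immediate in the new pattern is 15, which is _CMP_TRUE_UQ.
static_assert(_CMP_TRUE_UQ == 15, "predicate 15 is TRUE_UQ");

// Hypothetical helper: true if the running CPU reports AVX support.
bool avx_available() { return __builtin_cpu_supports("avx") != 0; }

// Hypothetical helper: fill both operands with the same 32-bit lane pattern,
// compare them with predicate 15, and report whether all 256 result bits
// are set. The target attribute lets this build without -mavx (assumption:
// GCC or Clang).
__attribute__((target("avx")))
bool cmp_true_is_all_ones(std::uint32_t lane_bits) {
    float lane;
    std::memcpy(&lane, &lane_bits, sizeof lane);   // reinterpret the bits as a float
    __m256 v = _mm256_set1_ps(lane);
    __m256 r = _mm256_cmp_ps(v, v, _CMP_TRUE_UQ);  // VCMPPS ymm, ymm, ymm, 15
    float out[8];
    _mm256_storeu_ps(out, r);
    std::uint32_t bits[8];
    std::memcpy(bits, out, sizeof bits);           // read the lanes back as integers
    for (std::uint32_t b : bits) {
        if (b != 0xFFFFFFFFu) return false;
    }
    return true;
}
```

Calling `cmp_true_is_all_ones` with a zero pattern, a quiet-NaN pattern (0x7fc00000), or an arbitrary pattern should return true each time on an AVX-capable machine. Guard the call with `avx_available()` first.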

================
Comment at: test/CodeGen/X86/vector-pcmp.ll:156-158
+; AVX1-NEXT:    vxorps %ymm1, %ymm1, %ymm1
+; AVX1-NEXT:    vcmptrueps %ymm1, %ymm1, %ymm1
 ; AVX1-NEXT:    vxorps %ymm1, %ymm0, %ymm0
----------------
That's an interesting case, and one that we probably can't answer at the DAG level. Would it be better to use two 128-bit vpxor instructions instead of incurring a potential domain-crossing penalty with the single 256-bit vxorps?

https://reviews.llvm.org/D32416