[PATCH] D37446: [x86] eliminate unnecessary vector compare for AVX masked store

Ayman Musa via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 5 07:01:10 PDT 2017


aymanmus added inline comments.


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:33185
+    SDValue Mask = Mst->getMask();
+    if (Mask.getOpcode() == X86ISD::PCMPGT &&
+        ISD::isBuildVectorAllZeros(Mask.getOperand(0).getNode())) {
----------------
spatel wrote:
> RKSimon wrote:
> > aymanmus wrote:
> > > Is there any canonical form of compare-with-all-zeros that can be guaranteed here? Or should the pattern with (pcmplt X, 0) be added also?
> > Add X86ISD::PCMPGTM support?
> Waiting until this is PCMPGT is a kind of canonicalization (compared to the general setcc node) because SSE/AVX don't have any other compare predicates. Ie, there's no other simple way to encode this; there is no PCMPLT node.
100%, my fault.
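(Not part of the patch, but to spell out why the compare is redundant here: the AVX masked-store instructions consult only the sign bit of each mask element, and pcmpgt(all-zeros, X) produces an all-ones lane exactly when X's sign bit is already set, so the compare never changes which lanes get stored. A rough single-lane model in Python; the helper names are made up for illustration:)

```python
def sign_bit(x, bits=32):
    """Return the sign bit of a `bits`-wide two's-complement lane."""
    return (x >> (bits - 1)) & 1

def pcmpgt_zero(x, bits=32):
    """Model one lane of pcmpgt(all-zeros, X): all-ones iff 0 > X (signed)."""
    signed = x - (1 << bits) if sign_bit(x, bits) else x
    return (1 << bits) - 1 if 0 > signed else 0

# For every lane value, the compare result has the same sign bit as the
# original lane, so a store masked on sign bits behaves identically.
for lane in [0, 1, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]:
    assert sign_bit(pcmpgt_zero(lane)) == sign_bit(lane)
```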


================
Comment at: test/CodeGen/X86/masked_memop.ll:1158
 ; SKX-LABEL: trunc_mask:
 ; SKX:       ## BB#0:
 ; SKX-NEXT:    vpxor %xmm1, %xmm1, %xmm1
----------------
spatel wrote:
> aymanmus wrote:
> > I think the optimal code for SKX is:
> > vpmovd2m %xmm2, %k1
> > vmovups %xmm0, (%rdi) {%k1}
> > 
> Ok - let me try to shake that out of here. To be clear, we're saying this is the optimal sequence for any CPU with avx512vl/avx512bw. SKX is just an implementation of those ISAs.
- The IACA tool shows the same throughput for both sequences, but the one I suggested has one fewer uop and uses one fewer register.
- Actually, the required features for vpmovb2m/vpmovw2m are avx512vl+avx512bw, and for vpmovd2m/vpmovq2m they are avx512vl+avx512dq (which SKX also includes).
- The test's %y operand is not used.
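(For context on the suggested sequence: vpmovd2m packs the sign bit of each dword lane directly into a k mask register, which is what lets it replace the vpxor + compare pair. A rough Python model; the function name mirrors the instruction but is otherwise made up:)

```python
def vpmovd2m(vec, bits=32):
    """Model vpmovd2m: collect each dword lane's sign bit into a bitmask,
    lane 0 in the least significant mask bit."""
    mask = 0
    for i, lane in enumerate(vec):
        mask |= ((lane >> (bits - 1)) & 1) << i
    return mask

# Lanes 0 and 2 have their sign bits set, so mask bits 0 and 2 are set.
assert vpmovd2m([0x80000000, 0, 0xFFFFFFFF, 1]) == 0b0101
```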


https://reviews.llvm.org/D37446




