[PATCH] D46957: [x86] Lower some trunc + shuffle patterns to vpmov[q|d][b|w]

Mon Jun 11 06:21:33 PDT 2018

mike.dvoretsky added inline comments.

================
Comment at: llvm/lib/Target/X86/X86ISelLowering.cpp:9483
+  if (OriginalVector.getSimpleValueType().getVectorElementType() == MVT::i16 &&
+      VT == MVT::v16i8 && !Subtarget.hasBWI())
+    return SDValue();
----------------
GBuella wrote:
> RKSimon wrote:
> > GBuella wrote:
> > > RKSimon wrote:
> > > > Why is it just 16i8 and not 32i8 as well for _mm512_cvtepi16_epi8 ?
> > > This part is only about truncations, where the result must be filled with extra zeros, due to the (narrower tan 128bits) result being in an xmm register.
> > > The  _mm512_cvtepi16_epi8 one truncates from a 512bit vector into a 256bit vector, that is already recognized without this patch.
> > > 
> > > The check here is about _mm_cvtepi16_epi8  (which requires avx512vl & avx512bw). It truncates from v8i16 -> v8i8, but the vpmovwb instruction actually sets a whole xmm register, so the actual result is going to be v16i8, with other 8 bytes set to zero.
> > > 
> > > Ok, perhaps these details should be explained in comments around here.
> > Shouldn't it handle this case? https://godbolt.org/g/Yxw7nE
> Probably that could also be implemented here, we just didn't think about it so far.
> There is/was a patch for those using __builtin_convertvector 
> https://reviews.llvm.org/D46742
> This patch was originally intended to handle these cases, which can't be don with __builtin_convertvector.
> 
> But if it is not a lot of extra work, the shufflevector equivalents of those convertvector ones could be detected here.
The example doesn't contain explicit truncations, so it isn't handled by this particular function anyway and there's no need to check for features for it here.

Correct me if I'm wrong, but I don't think that the aim of this patch is to catch every possible VPMOV pattern, just as many as convenient for subsequent intrinsic lowering, and in the cases of 128-bit-or-wider outputs, where there's no need to zero out the upper parts of xmm registers, a pattern more complex than (in this case)

```
%res = trunc <32 x i16> %v to <32 x i8>
```
is just a needless complication.

https://reviews.llvm.org/D46957