[PATCH] D35700: DAGCombiner: Extend reduceBuildVecToTrunc to handle non-zero offset

Mon Jul 24 14:58:18 PDT 2017

zvi added inline comments.

================
Comment at: include/llvm/Target/TargetLowering.h:2779
+  //  -->
+  // v4i32 truncate (bitcast (shuffle<1,u,3,u,4,u,5,u,6,u,7,u> V, u) to v4i64)
+  virtual bool isDesirableToCombineStridedeBuildVectorToShuffleTruncate(
----------------
guyblank wrote:
> shuffle mask should be <1,u,3,u,5,u,7,u>
right

================
Comment at: include/llvm/Target/TargetLowering.h:2782
+      int Stride, int Offset, EVT SrcVT, EVT BuildVecVT) const {
+    return false;
+  }
----------------
RKSimon wrote:
> It would possibly make sense to change this to
> ```
> isDesirableToCombineBuildVectorToShuffleTruncate(ArrayRef<int> ShuffleMask, EVT SrcVT, EVT TruncVT)
> ```
> checking with a general shuffle mask - I'd expect any non-lane crossing shuffle mask to be valid on AVX512.
Thanks. Will look into this suggestion.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:14433

   // The first BUILD_VECTOR operand must be an an extract from index zero
   // (assuming no undef and little-endian).
----------------
guyblank wrote:
> update comment
ok

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:14442
+  // Check for profitability before proceeding with more expensive checks.
+  if (!TLI.isDesirableToCombineStridedeBuildVectorToShuffleTruncate(
+          Stride, Offset, ExtractedFromVec.getValueType(), VT))
----------------
guyblank wrote:
> I think you should add 
> Offset != 0
> to the condition
Makes sense. To remain consistent with current flow.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:14473
+    }
+    if (!DAG.getTargetLoweringInfo().isShuffleMaskLegal(Mask, ExtractedVT))
+      return SDValue();
----------------
guyblank wrote:
> you can use TLI
ok

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:35835
+      Offset == 0 ||
+      // For 32-bit elements VPERMD is better than shuffle+truncate
+      (SrcVT.getScalarSizeInBits() != 32 && Subtarget.hasAVX2());
----------------
guyblank wrote:
> what about VPERMW for 16-bit elements?
We need to improve lowerBuildVector to make use of VPERMW. Until then we should go with this combine. Will add a TODO.

================
Comment at: test/CodeGen/X86/shuffle-strided-with-offset-256.ll:39
+; AVX512F-NEXT:    vpmovsxwd %ymm0, %zmm0
+; AVX512F-NEXT:    vpmovdb %zmm0, %xmm0
 ; AVX512F-NEXT:    vmovdqa %xmm0, (%rsi)
----------------
guyblank wrote:
> should this be 
> 
> ```
> vpmovdb %zmm0, (%rsi)
> ```
> 
> same for AVX512VL
Look like a follow-up work needed here.

https://reviews.llvm.org/D35700