[PATCH] D95217: [LoopVectorize] Fix VPRecipeBuilder::createEdgeMask to correctly generate the mask

Wed Feb 10 12:17:29 PST 2021

fhahn added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8249
+  if (SrcMask) { // Otherwise block in-mask is all-one, no need to AND.
+    // The condition is 'SrcMask && EdgeMask', which is
+    // 'select i1 SrcMask, i1 EdgeMask, i1 false'.
----------------
nit: which is equivalent to.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8251
+    // 'select i1 SrcMask, i1 EdgeMask, i1 false'.
+    // It cannot be 'and i1 SrcMask, i1 EdgeMask' because if EdgeMask was
+    // poison it may introduce undefined behavior.
----------------
nit: perhaps something like `The select version does not introduce new UB if EdgeMask is poison.` I guess the key here is that if `SrcMask` is true and `EdgeMask` is poison, the original program already had UB.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8256
+    EdgeMask =
+        Builder.createNaryOp(Instruction::Select, {SrcMask, EdgeMask, False});
+  }
----------------
nit: Just use `Builder.CreateSelect`?

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/masked_load_store.ll:2462
 ; AVX1-NEXT:    [[TMP51:%.*]] = xor <4 x i1> [[TMP43]], <i1 true, i1 true, i1 true, i1 true>
-; AVX1-NEXT:    [[TMP52:%.*]] = and <4 x i1> [[TMP48]], [[TMP28]]
-; AVX1-NEXT:    [[TMP53:%.*]] = and <4 x i1> [[TMP49]], [[TMP29]]
-; AVX1-NEXT:    [[TMP54:%.*]] = and <4 x i1> [[TMP50]], [[TMP30]]
-; AVX1-NEXT:    [[TMP55:%.*]] = and <4 x i1> [[TMP51]], [[TMP31]]
+; AVX1-NEXT:    [[TMP52:%.*]] = select <4 x i1> [[TMP28]], <4 x i1> [[TMP48]], <4 x i1> zeroinitializer
+; AVX1-NEXT:    [[TMP53:%.*]] = select <4 x i1> [[TMP29]], <4 x i1> [[TMP49]], <4 x i1> zeroinitializer
----------------
It's a shame that this version makes the IR a bit less readable :(

================
Comment at: llvm/test/Transforms/LoopVectorize/pr48832.ll:1
+; RUN: opt -mtriple=s390x-unknown-linux -mcpu=z13 -loop-vectorize -S -o - < %s | FileCheck %s
+%arrayt = type [64 x i32]
----------------
If the test needs the systemZ triple, it needs to go into the SystemZ subdirectory I think, otherwise it will fail on systems that do not build the SystemZ backend.

But It should not be needed, can the same behavior be reproduced by using `-force-vector-width=4`?

================
Comment at: llvm/test/Transforms/LoopVectorize/pr48832.ll:7
+; Since the program has well defined behavior, it should not introduce store poison
+; CHECK-NOT: store <4 x i32> poison,
+
----------------
bjope wrote:
> Maybe safer to use a regexp instead of `<4 x i32>` or explicitly check that we get `store <4 x i32> zeroinitializer`. Or you need to force the vector factor.
> Otherwise the test case is fragile to changes in cost model etc (it would for example pass for a different vector factor and then it wouldn't be useful as a regression test any longer).
I think we should explicitly check for the branch conditions that are generated.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D95217/new/

https://reviews.llvm.org/D95217