[PATCH] Masked Vector Load/Store Intrinsics

Elena Demikhovsky elena.demikhovsky at intel.com
Wed Nov 12 02:07:38 PST 2014


Hi Hal, Adam,

Thank you for the review. Please see my comments.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:273-278
@@ -272,1 +272,8 @@
 
+  /// \brief Return true if the target works with masked instruction
+  /// AVX2 allows masks for consecutive load and store for i32 and i64 elements.
+  /// AVX-512 architecture will also allow masks for non-consecutive memory
+  /// accesses.
+  virtual bool isLegalPredicatedStore(Type *DataType, int Consecutive) const;
+  virtual bool isLegalPredicatedLoad (Type *DataType, int Consecutive) const;
+
----------------
anemet wrote:
> Should Consecutive be bool?
I call isConsecutivePtr(), which returns an "int": (-1) means a reverse access, so a bool would lose that information.

  bool canPredicateStore(Type *DataType, Value *Ptr) {
    return TTI->isLegalPredicatedStore(DataType, isConsecutivePtr(Ptr));
  }
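For illustration, the X86 side of the hook could look roughly like this (a sketch under the assumption that only consecutive accesses of 32- and 64-bit elements are reported as legal; the hook and subtarget-feature names follow the patch, the body is mine):

  bool X86TTI::isLegalPredicatedLoad(Type *DataType, int Consecutive) const {
    // Consecutive == 0 means the access is neither unit-stride forward (1)
    // nor reverse (-1); such accesses are not handled yet.
    if (Consecutive == 0)
      return false;
    // AVX2/AVX-512 masked moves exist only for 32- and 64-bit elements.
    unsigned ScalarWidth = DataType->getScalarSizeInBits();
    if (ScalarWidth != 32 && ScalarWidth != 64)
      return false;
    return ST->hasAVX2() || ST->hasAVX512();
  }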

================
Comment at: include/llvm/CodeGen/SelectionDAGNodes.h:1974
@@ +1973,3 @@
+  const SDValue &getBasePtr() const { return getOperand(1); }
+  const SDValue &getMask() const { return getOperand(2); }
+  const SDValue &getSrc0() const { return getOperand(3); }
----------------
hfinkel wrote:
> All subclasses of MaskedLoadStoreSDNode will have a getMask -- move that into the base class too. Add a comment about the data type of the mask.
All done in SelectionDAGNodes.h. Thanks.
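For reference, the shared part now looks roughly like this (a sketch only; constructors and classof() are omitted):

  // In both nodes the address is operand 1 and the mask is operand 2:
  //   MaskedLoadSDNode  (Chain, ptr, mask, src0), src0 is the pass-through value
  //   MaskedStoreSDNode (Chain, ptr, mask, data)
  // The mask is a vector of i1 elements, one boolean per data element.
  class MaskedLoadStoreSDNode : public MemSDNode {
  public:
    const SDValue &getBasePtr() const { return getOperand(1); }
    const SDValue &getMask() const    { return getOperand(2); }
  };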

================
Comment at: include/llvm/IR/IRBuilder.h:434
@@ +433,3 @@
+  /// Masked intrinsic has only one overloaded type - data type.
+  CallInst *CreateMaskedIntrinsic(unsigned Id, ArrayRef<Value *> Ops,
+                                  Type *DataTy);
----------------
hfinkel wrote:
> The pattern of this class interface is to have a function per intrinsic, so that the intrinsic id and operand structure can be hidden within the implementation. Please split this into CreateMaskedLoad and CreateMaskedStore, each with appropriate parameters.
Sounds reasonable, done.
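Something along these lines, with CreateMaskedIntrinsic becoming a private helper (the exact parameter order and defaults below are my paraphrase, not necessarily the committed signatures):

  /// \brief Create a call to the llvm.masked.load intrinsic.
  /// \p Ptr      - base pointer of the load
  /// \p Align    - alignment of the source location
  /// \p Mask     - vector of i1 booleans, one per loaded element
  /// \p PassThru - value taken for the masked-off lanes
  CallInst *CreateMaskedLoad(Value *Ptr, unsigned Align, Value *Mask,
                             Value *PassThru = nullptr,
                             const Twine &Name = "");

  /// \brief Create a call to the llvm.masked.store intrinsic.
  CallInst *CreateMaskedStore(Value *Val, Value *Ptr, unsigned Align,
                              Value *Mask);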

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:4787
@@ +4786,3 @@
+    // Check if any splitting is required.
+    if (TLI.getTypeAction(*DAG.getContext(), Data.getValueType()) !=
+        TargetLowering::TypeSplitVector)
----------------
hfinkel wrote:
> Vector splitting is a legalization action, why is this being handled here in DAGCombine (as opposed to in LegalizeVectorTypes) -- shouldn't this be a part of DAGTypeLegalizer::SplitVecRes_MSTORE?
It has to be done here for the same reason as VSELECT; this is the comment from the VSELECT handling:
  // If the VSELECT result requires splitting and the mask is provided by a
  // SETCC, then split both nodes and its operands before legalization. This
  // prevents the type legalizer from unrolling SETCC into scalar comparisons
  // and enables future optimizations (e.g. min/max pattern matching on X86).
I'll put the same comment here as well.

I put the splitting code in DAGTypeLegalizer as well, for completeness, but it is normally not executed; if the split is left to the legalizer, the generated assembly is much worse.
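If it helps, this is roughly the shape I have in mind in visitMSTORE (a sketch only; Mask and Data come from the MaskedStoreSDNode, and the actual Lo/Hi splitting is elided):

  // If the MSTORE data type requires splitting and the mask is provided by a
  // SETCC, then split both nodes and their operands before legalization. This
  // prevents the type legalizer from unrolling SETCC into scalar comparisons
  // and enables future optimizations (e.g. min/max pattern matching on X86).
  if (Mask.getOpcode() == ISD::SETCC) {
    // Check if any splitting is required.
    if (TLI.getTypeAction(*DAG.getContext(), Data.getValueType()) !=
        TargetLowering::TypeSplitVector)
      return SDValue();
    // ... otherwise split Data, Mask and the memory operand into Lo/Hi
    // halves here and emit two narrower masked stores joined by a
    // TokenFactor, analogous to what the VSELECT combine does ...
  }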

================
Comment at: lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:1021
@@ +1020,3 @@
+                         MachineMemOperand::MOLoad,  HiMemVT.getStoreSize(),
+                         Alignment, MLD->getAAInfo(), MLD->getRanges());
+
----------------
hfinkel wrote:
> This alignment is not right -- if the full load has alignment Alignment, then the upper half can have only half that.
Usually the alignment is not bigger than the element size in this case.
We started, for example, from
%res = load i32* %ptr, align 4
and created a call to the intrinsic
%res.v = call <16 x i32> @llvm.masked.load(%ptr, %mask, undef, 4) - the alignment is still 4.
Now, if we decide to split this in two, the alignment should remain the same.
So the vectorizer can't create a masked load for
%res = load i32* %ptr, align 64. Or am I wrong?

Another question: what happens if the user writes the IR by hand and specifies a different alignment?
%res.v = call <16 x i32> @llvm.masked.load(%ptr, %mask, undef, 64)
Theoretically, if the alignment is bigger than the size of each half, keeping it unchanged after the split is wrong.
I can split the alignment when it is equal to the original vector size.

Look at SplitVecRes_LOAD() - nobody splits alignment there.
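If we do want to stay safe when the stated alignment exceeds the half size, the usual idiom is to clamp the upper half with MinAlign, e.g. (a sketch following the variable names from the snippet above, not what the patch currently does):

  unsigned HiAlignment = MinAlign(Alignment, LoMemVT.getStoreSize());
  MachineMemOperand *HiMMO = DAG.getMachineFunction().getMachineMemOperand(
      MLD->getPointerInfo().getWithOffset(LoMemVT.getStoreSize()),
      MachineMemOperand::MOLoad, HiMemVT.getStoreSize(), HiAlignment,
      MLD->getAAInfo(), MLD->getRanges());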


================
Comment at: lib/IR/IRBuilder.cpp:176
@@ -175,2 +175,3 @@
 
+
 CallInst *IRBuilderBase::CreateAssumption(Value *Cond) {
----------------
hfinkel wrote:
> Don't add blank line.
Removed.

================
Comment at: lib/Target/X86/X86InstrAVX512.td:2102-2136
@@ -2101,2 +2101,37 @@
 
+def: Pat<(masked_store addr:$ptr, VK8WM:$mask, (v8f32 VR256:$src)),
+         (VMOVUPSZmrk addr:$ptr,
+         (v16i1 (COPY_TO_REGCLASS VK8WM:$mask, VK16WM)),
+         (INSERT_SUBREG (v16f32 (IMPLICIT_DEF)), VR256:$src, sub_ymm))>;
+
+def: Pat<(v8f32 (masked_load addr:$ptr, VK8WM:$mask, undef)),
+         (v8f32 (EXTRACT_SUBREG (v16f32 (VMOVUPSZrmkz 
+          (v16i1 (COPY_TO_REGCLASS VK8WM:$mask, VK16WM)), addr:$ptr)), sub_ymm))>;
+
+def: Pat<(masked_store addr:$ptr, VK16WM:$mask, (v16f32 VR512:$src)),
+         (VMOVUPSZmrk addr:$ptr, VK16WM:$mask, VR512:$src)>;
+
+def: Pat<(masked_store addr:$ptr, VK8WM:$mask, (v8f64 VR512:$src)),
+         (VMOVUPDZmrk addr:$ptr, VK8WM:$mask, VR512:$src)>;
+
+def: Pat<(v16f32 (masked_load addr:$ptr, VK16WM:$mask, undef)),
+         (VMOVUPSZrmkz VK16WM:$mask, addr:$ptr)>;
+
+def: Pat<(v16f32 (masked_load addr:$ptr, VK16WM:$mask,
+                              (bc_v16f32 (v16i32 immAllZerosV)))),
+         (VMOVUPSZrmkz VK16WM:$mask, addr:$ptr)>;
+
+def: Pat<(v16f32 (masked_load addr:$ptr, VK16WM:$mask, (v16f32 VR512:$src0))),
+         (VMOVUPSZrmk VR512:$src0, VK16WM:$mask, addr:$ptr)>;
+
+def: Pat<(v8f64 (masked_load addr:$ptr, VK8WM:$mask, undef)),
+         (VMOVUPDZrmkz VK8WM:$mask, addr:$ptr)>;
+
+def: Pat<(v8f64 (masked_load addr:$ptr, VK8WM:$mask,
+                             (bc_v8f64 (v16i32 immAllZerosV)))),
+         (VMOVUPDZrmkz VK8WM:$mask, addr:$ptr)>;
+
+def: Pat<(v8f64 (masked_load addr:$ptr, VK8WM:$mask, (v8f64 VR512:$src0))),
+         (VMOVUPDZrmk VR512:$src0, VK8WM:$mask, addr:$ptr)>;
+
 defm VMOVDQA32 : avx512_load_vl<0x6F, "vmovdqa32", "alignedload", "i", "32",
----------------
anemet wrote:
> There's got to be a better way to write this and the store later.
> 
> My preference would be to only add a few (one?) for now in order to test the functionality.  Then we figure out a way to have this be part of a new AVX512_maskable class (e.g. AVX512_maskable_trapping)
I implemented load and store together; I have to generate code and check its correctness, and that's why I put the whole set of patterns in.
I know that I still have a lot of work to do here.
Let's optimize the .td file later, in one of the next patches.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:623
@@ -620,1 +622,3 @@
 
+    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i1,  2 },
+    { ISD::SINT_TO_FP,  MVT::v16f32, MVT::v16i8,  2 },
----------------
hfinkel wrote:
> This is a separate change, than can be committed and tested separately.
Ok, no problem. I'll also add a test.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:886
@@ -873,2 +885,3 @@
   SmallPtrSet<Value *, 8> StrideSet;
+  std::set<const Instruction*> MaskedOp;
 };
----------------
anemet wrote:
> Please comment what this is for.
I collect here the memory operations that have to be masked during vectorization. Even if the block is predicated, that does not mean that all of its loads and stores require masks: sometimes the pointer is known to be safe, sometimes a configuration flag allows speculation. All of those options are checked first.
The last check is whether the target is able to handle the masked operation; only in that case do I put the instruction into MaskedOp.
I'll add a comment in the code.
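For example, something like this (the comment wording is just a suggestion):

  /// While vectorizing these instructions we have to generate a call to the
  /// appropriate masked intrinsic (llvm.masked.load / llvm.masked.store).
  /// An instruction lands here only when its block is predicated, its pointer
  /// is not provably safe to access unconditionally, and the target reports
  /// the predicated form as legal.
  std::set<const Instruction*> MaskedOp;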

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5341
@@ -5340,3 +5406,2 @@
   }
-
   return true;
----------------
hfinkel wrote:
> Don't remove this line.
Ok.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5352
@@ -5303,2 +5351,3 @@
     // We might be able to hoist the load.
+    
     if (it->mayReadFromMemory()) {
----------------
anemet wrote:
> Don't add a new line.
removed.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:5355-5364
@@ -5305,5 +5354,12 @@
       LoadInst *LI = dyn_cast<LoadInst>(it);
-      if (!LI || !SafePtrs.count(LI->getPointerOperand()))
+      if (!LI)
         return false;
+      if (!SafePtrs.count(LI->getPointerOperand())) {
+        if (canPredicateLoad(LI->getType(), LI->getPointerOperand())) {
+          MaskedOp.insert(LI);
+          continue;
+        }
+        return false;
+      }
     }
 
----------------
anemet wrote:
> I read this far and I still don't understand the setMaskedOp business.  Can you please explain.
I explained this above.
Again, during vectorization I have to know for which instructions I need to emit a call to the masked intrinsics instead of a regular wide load or store.
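To make it concrete, the intent on the code-generation side of the vectorizer is roughly the following (a sketch with simplified and partly hypothetical names: isMaskRequired() stands for a lookup in MaskedOp, Mask is the vectorized block predicate, VecPtr the widened pointer, and CreateMaskedLoad uses the paraphrased signature from above):

  // When widening a load, consult MaskedOp before emitting the wide load.
  Value *NewLI;
  if (Legal->isMaskRequired(LI))
    // Predicated: emit llvm.masked.load with an undef pass-through value.
    NewLI = Builder.CreateMaskedLoad(VecPtr, Alignment, Mask,
                                     UndefValue::get(VecTy));
  else
    // Unpredicated: a regular wide load is still emitted.
    NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load");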

================
Comment at: test/CodeGen/X86/masked_memop.ll:66
@@ +65,3 @@
+}
+
+
----------------
hfinkel wrote:
> Are there tests in this file that demonstrate complete scalarization? If not, please add some. If so, please note them.
I have not implemented complete scalarization yet; it will come in a follow-up patch, and I'll add tests for it there.

http://reviews.llvm.org/D6191





