[PATCH] D99723: [ARM] Transforming memcpy to Tail predicated Loop

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 8 03:07:14 PDT 2021


dmgreen added a comment.

I'm a little worried that WLSTP is going to cause problems, given that it isn't used anywhere else. Let's at least add an option for disabling it if needed.
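
For example, a flag along these lines would be enough as an escape hatch (the flag name here is just an illustration, nothing with this name exists today):

  // Sketch only: the flag name is hypothetical; default keeps the new lowering enabled.
  static cl::opt<bool> AllowWLSTPMemcpy(
      "arm-allow-wlstp-memcpy", cl::Hidden, cl::init(true),
      cl::desc("Allow lowering memcpy to a WLSTP tail-predicated loop"));

Checking that before emitting the MEMCPYLOOP node would let us turn the whole thing off if WLSTP turns out to be a problem.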



================
Comment at: llvm/lib/Target/ARM/ARMBaseInstrInfo.h:370
+           MI->getOpcode() == ARM::t2DoLoopStartTP ||
+           MI->getOpcode() == ARM::t2WhileLoopStartLR;
   }
----------------
Please split this out into a separate review. I think it makes sense (I'm pretty sure I remember writing it; it must have been lost in a refactoring).


================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.cpp:11282
+    TpExit = BB->splitAt(MI, false);
+    if (TpExit == BB) {
+      assert(BB->canFallThrough() &&
----------------
When will this happen?


================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.cpp:11301
+
+    // Reorder for better readability of generated MIR
+    TpLoopBody->moveAfter(TpEntry);
----------------
-> "for a more natural layout"?

I think there may be benefits to getting the order roughly correct at this stage if we are relying on WLS branches. They can be fixed up later, but getting them closer to correct at this point can only help.


================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.h:55
     // ARM Specific DAG Nodes
-    enum NodeType : unsigned {
-      // Start the numbering where the builtin ops and target ops leave off.
-      FIRST_NUMBER = ISD::BUILTIN_OP_END,
-
-      Wrapper,      // Wrapper - A wrapper node for TargetConstantPool,
-                    // TargetExternalSymbol, and TargetGlobalAddress.
-      WrapperPIC,   // WrapperPIC - A wrapper node for TargetGlobalAddress in
-                    // PIC mode.
-      WrapperJT,    // WrapperJT - A wrapper node for TargetJumpTable
-
-      // Add pseudo op to model memcpy for struct byval.
-      COPY_STRUCT_BYVAL,
-
-      CALL,         // Function call.
-      CALL_PRED,    // Function call that's predicable.
-      CALL_NOLINK,  // Function call with branch not branch-and-link.
-      tSECALL,      // CMSE non-secure function call.
-      BRCOND,       // Conditional branch.
-      BR_JT,        // Jumptable branch.
-      BR2_JT,       // Jumptable branch (2 level - jumptable entry is a jump).
-      RET_FLAG,     // Return with a flag operand.
-      SERET_FLAG,   // CMSE Entry function return with a flag operand.
-      INTRET_FLAG,  // Interrupt return with an LR-offset and a flag operand.
-
-      PIC_ADD,      // Add with a PC operand and a PIC label.
-
-      ASRL,         // MVE long arithmetic shift right.
-      LSRL,         // MVE long shift right.
-      LSLL,         // MVE long shift left.
-
-      CMP,          // ARM compare instructions.
-      CMN,          // ARM CMN instructions.
-      CMPZ,         // ARM compare that sets only Z flag.
-      CMPFP,        // ARM VFP compare instruction, sets FPSCR.
-      CMPFPE,       // ARM VFP signalling compare instruction, sets FPSCR.
-      CMPFPw0,      // ARM VFP compare against zero instruction, sets FPSCR.
-      CMPFPEw0,     // ARM VFP signalling compare against zero instruction, sets FPSCR.
-      FMSTAT,       // ARM fmstat instruction.
-
-      CMOV,         // ARM conditional move instructions.
-      SUBS,         // Flag-setting subtraction.
-
-      SSAT,         // Signed saturation
-      USAT,         // Unsigned saturation
-
-      BCC_i64,
-
-      SRL_FLAG,     // V,Flag = srl_flag X -> srl X, 1 + save carry out.
-      SRA_FLAG,     // V,Flag = sra_flag X -> sra X, 1 + save carry out.
-      RRX,          // V = RRX X, Flag     -> srl X, 1 + shift in carry flag.
-
-      ADDC,         // Add with carry
-      ADDE,         // Add using carry
-      SUBC,         // Sub with carry
-      SUBE,         // Sub using carry
-      LSLS,         // Shift left producing carry
-
-      VMOVRRD,      // double to two gprs.
-      VMOVDRR,      // Two gprs to double.
-      VMOVSR,       // move gpr to single, used for f32 literal constructed in a gpr
-
-      EH_SJLJ_SETJMP,         // SjLj exception handling setjmp.
-      EH_SJLJ_LONGJMP,        // SjLj exception handling longjmp.
-      EH_SJLJ_SETUP_DISPATCH, // SjLj exception handling setup_dispatch.
-
-      TC_RETURN,    // Tail call return pseudo.
-
-      THREAD_POINTER,
-
-      DYN_ALLOC,    // Dynamic allocation on the stack.
-
-      MEMBARRIER_MCR, // Memory barrier (MCR)
-
-      PRELOAD,      // Preload
-
-      WIN__CHKSTK,  // Windows' __chkstk call to do stack probing.
-      WIN__DBZCHK,  // Windows' divide by zero check
-
-      WLS,          // Low-overhead loops, While Loop Start branch. See t2WhileLoopStart
-      WLSSETUP,     // Setup for the iteration count of a WLS. See t2WhileLoopSetup.
-      LOOP_DEC,     // Really a part of LE, performs the sub
-      LE,           // Low-overhead loops, Loop End
-
-      PREDICATE_CAST, // Predicate cast for MVE i1 types
-      VECTOR_REG_CAST, // Reinterpret the current contents of a vector register
-
-      VCMP,         // Vector compare.
-      VCMPZ,        // Vector compare to zero.
-      VTST,         // Vector test bits.
-
-      // Vector shift by vector
-      VSHLs,        // ...left/right by signed
-      VSHLu,        // ...left/right by unsigned
-
-      // Vector shift by immediate:
-      VSHLIMM,      // ...left
-      VSHRsIMM,     // ...right (signed)
-      VSHRuIMM,     // ...right (unsigned)
-
-      // Vector rounding shift by immediate:
-      VRSHRsIMM,    // ...right (signed)
-      VRSHRuIMM,    // ...right (unsigned)
-      VRSHRNIMM,    // ...right narrow
-
-      // Vector saturating shift by immediate:
-      VQSHLsIMM,    // ...left (signed)
-      VQSHLuIMM,    // ...left (unsigned)
-      VQSHLsuIMM,   // ...left (signed to unsigned)
-      VQSHRNsIMM,   // ...right narrow (signed)
-      VQSHRNuIMM,   // ...right narrow (unsigned)
-      VQSHRNsuIMM,  // ...right narrow (signed to unsigned)
-
-      // Vector saturating rounding shift by immediate:
-      VQRSHRNsIMM,  // ...right narrow (signed)
-      VQRSHRNuIMM,  // ...right narrow (unsigned)
-      VQRSHRNsuIMM, // ...right narrow (signed to unsigned)
-
-      // Vector shift and insert:
-      VSLIIMM,      // ...left
-      VSRIIMM,      // ...right
-
-      // Vector get lane (VMOV scalar to ARM core register)
-      // (These are used for 8- and 16-bit element types only.)
-      VGETLANEu,    // zero-extend vector extract element
-      VGETLANEs,    // sign-extend vector extract element
-
-      // Vector move immediate and move negated immediate:
-      VMOVIMM,
-      VMVNIMM,
-
-      // Vector move f32 immediate:
-      VMOVFPIMM,
-
-      // Move H <-> R, clearing top 16 bits
-      VMOVrh,
-      VMOVhr,
-
-      // Vector duplicate:
-      VDUP,
-      VDUPLANE,
-
-      // Vector shuffles:
-      VEXT,         // extract
-      VREV64,       // reverse elements within 64-bit doublewords
-      VREV32,       // reverse elements within 32-bit words
-      VREV16,       // reverse elements within 16-bit halfwords
-      VZIP,         // zip (interleave)
-      VUZP,         // unzip (deinterleave)
-      VTRN,         // transpose
-      VTBL1,        // 1-register shuffle with mask
-      VTBL2,        // 2-register shuffle with mask
-      VMOVN,        // MVE vmovn
-
-      // MVE Saturating truncates
-      VQMOVNs,      // Vector (V) Saturating (Q) Move and Narrow (N), signed (s)
-      VQMOVNu,      // Vector (V) Saturating (Q) Move and Narrow (N), unsigned (u)
-
-      // MVE float <> half converts
-      VCVTN,        // MVE vcvt f32 -> f16, truncating into either the bottom or top lanes
-      VCVTL,        // MVE vcvt f16 -> f32, extending from either the bottom or top lanes
-
-      // Vector multiply long:
-      VMULLs,       // ...signed
-      VMULLu,       // ...unsigned
-
-      VQDMULH,      // MVE vqdmulh instruction
-
-      // MVE reductions
-      VADDVs,       // sign- or zero-extend the elements of a vector to i32,
-      VADDVu,       //   add them all together, and return an i32 of their sum
-      VADDVps,      // Same as VADDV[su] but with a v4i1 predicate mask
-      VADDVpu,
-      VADDLVs,      // sign- or zero-extend elements to i64 and sum, returning
-      VADDLVu,      //   the low and high 32-bit halves of the sum
-      VADDLVAs,     // Same as VADDLV[su] but also add an input accumulator
-      VADDLVAu,     //   provided as low and high halves
-      VADDLVps,     // Same as VADDLV[su] but with a v4i1 predicate mask
-      VADDLVpu,
-      VADDLVAps,    // Same as VADDLVp[su] but with a v4i1 predicate mask
-      VADDLVApu,
-      VMLAVs,       // sign- or zero-extend the elements of two vectors to i32, multiply them
-      VMLAVu,       //   and add the results together, returning an i32 of their sum
-      VMLAVps,      // Same as VMLAV[su] with a v4i1 predicate mask
-      VMLAVpu,
-      VMLALVs,      // Same as VMLAV but with i64, returning the low and
-      VMLALVu,      //   high 32-bit halves of the sum
-      VMLALVps,     // Same as VMLALV[su] with a v4i1 predicate mask
-      VMLALVpu,
-      VMLALVAs,     // Same as VMLALV but also add an input accumulator
-      VMLALVAu,     //   provided as low and high halves
-      VMLALVAps,    // Same as VMLALVA[su] with a v4i1 predicate mask
-      VMLALVApu,
-      VMINVu,        // Find minimum unsigned value of a vector and register
-      VMINVs,        // Find minimum signed value of a vector and register
-      VMAXVu,        // Find maximum unsigned value of a vector and register
-      VMAXVs,        // Find maximum signed value of a vector and register
-
-      SMULWB,       // Signed multiply word by half word, bottom
-      SMULWT,       // Signed multiply word by half word, top
-      UMLAL,        // 64bit Unsigned Accumulate Multiply
-      SMLAL,        // 64bit Signed Accumulate Multiply
-      UMAAL,        // 64-bit Unsigned Accumulate Accumulate Multiply
-      SMLALBB,      // 64-bit signed accumulate multiply bottom, bottom 16
-      SMLALBT,      // 64-bit signed accumulate multiply bottom, top 16
-      SMLALTB,      // 64-bit signed accumulate multiply top, bottom 16
-      SMLALTT,      // 64-bit signed accumulate multiply top, top 16
-      SMLALD,       // Signed multiply accumulate long dual
-      SMLALDX,      // Signed multiply accumulate long dual exchange
-      SMLSLD,       // Signed multiply subtract long dual
-      SMLSLDX,      // Signed multiply subtract long dual exchange
-      SMMLAR,       // Signed multiply long, round and add
-      SMMLSR,       // Signed multiply long, subtract and round
-
-      // Single Lane QADD8 and QADD16. Only the bottom lane. That's what the b stands for.
-      QADD8b,
-      QSUB8b,
-      QADD16b,
-      QSUB16b,
-
-      // Operands of the standard BUILD_VECTOR node are not legalized, which
-      // is fine if BUILD_VECTORs are always lowered to shuffles or other
-      // operations, but for ARM some BUILD_VECTORs are legal as-is and their
-      // operands need to be legalized.  Define an ARM-specific version of
-      // BUILD_VECTOR for this purpose.
-      BUILD_VECTOR,
-
-      // Bit-field insert
-      BFI,
-
-      // Vector OR with immediate
-      VORRIMM,
-      // Vector AND with NOT of immediate
-      VBICIMM,
-
-      // Pseudo vector bitwise select
-      VBSP,
-
-      // Pseudo-instruction representing a memory copy using ldm/stm
-      // instructions.
-      MEMCPY,
-
-      // V8.1MMainline condition select
-      CSINV, // Conditional select invert.
-      CSNEG, // Conditional select negate.
-      CSINC, // Conditional select increment.
-
-      // Vector load N-element structure to all lanes:
-      VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
-      VLD2DUP,
-      VLD3DUP,
-      VLD4DUP,
-
-      // NEON loads with post-increment base updates:
-      VLD1_UPD,
-      VLD2_UPD,
-      VLD3_UPD,
-      VLD4_UPD,
-      VLD2LN_UPD,
-      VLD3LN_UPD,
-      VLD4LN_UPD,
-      VLD1DUP_UPD,
-      VLD2DUP_UPD,
-      VLD3DUP_UPD,
-      VLD4DUP_UPD,
-
-      // NEON stores with post-increment base updates:
-      VST1_UPD,
-      VST2_UPD,
-      VST3_UPD,
-      VST4_UPD,
-      VST2LN_UPD,
-      VST3LN_UPD,
-      VST4LN_UPD,
-
-      // Load/Store of dual registers
-      LDRD,
-      STRD
-    };
+  enum NodeType : unsigned {
+    // Start the numbering where the builtin ops and target ops leave off.
----------------
Please don't reformat any of this - it's unrelated to the patch.


================
Comment at: llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp:136
+      !DAG.getMachineFunction().getFunction().hasOptNone()) {
+    if ((!ConstantSize && (Alignment >= Align(4))) ||
+        (ConstantSize &&
----------------
Can we add an option that turns this inline memcpy lowering on/off? If the option is true, we always use MEMCPYLOOP; if it's false, we never do; and if it's unset, we use this default logic.

Also consider pulling the if logic into a lambda for readability.
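
Something along these lines is what I have in mind - treat it as a sketch only; the flag name, helper name and Threshold parameter are made up for illustration:

  // Sketch: a tri-state option, with the default heuristic pulled into one place.
  static cl::opt<cl::boolOrDefault> EnableMemcpyTPLoop(
      "arm-memcpy-tploop", cl::Hidden, cl::init(cl::BOU_UNSET),
      cl::desc("Always (true) or never (false) lower memcpy to a tail "
               "predicated loop; unset uses the default heuristic"));

  static bool shouldGenerateTPLoop(bool HasConstantSize, uint64_t Size,
                                   Align Alignment, uint64_t Threshold) {
    if (EnableMemcpyTPLoop != cl::BOU_UNSET)
      return EnableMemcpyTPLoop == cl::BOU_TRUE;
    // Default logic from the patch: unknown size with at least word alignment,
    // or a known size above the threshold for the existing inline expansion.
    return (!HasConstantSize && Alignment >= Align(4)) ||
           (HasConstantSize && Size > Threshold);
  }

A lambda inside EmitTargetCodeForMemcpy capturing the local values would do just as well; the point is to keep the condition in one named place.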


================
Comment at: llvm/test/CodeGen/Thumb2/mve_tp_loop.ll:9
+
+declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg) #1
+
----------------
Can you make sure there are tests where the i32 length is a different type, including one that is not legal, like an i64?
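
Something along these lines would do it - just a sketch, the function and value names are made up:

  declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg)

  define void @test_memcpy_i64(i8* noalias %dst, i8* noalias %src, i64 %n) {
  entry:
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %dst, i8* align 4 %src, i64 %n, i1 false)
    ret void
  }

The i64 size is not a legal type on this target, so it would check that the new lowering copes with a length that needs legalizing.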


================
Comment at: llvm/test/CodeGen/Thumb2/mve_tp_loop.mir:136
+  - { reg: '$r2', virtual-reg: '%2' }
+frameInfo:
+  isFrameAddressTaken: false
----------------
Some of this can be removed, to help keep the test smaller.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99723/new/

https://reviews.llvm.org/D99723


