[PATCH] D99723: [ARM] Transforming memcpy to Tail predicated Loop

Malhar via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 9 09:24:53 PDT 2021


malharJ added a comment.

> I'm a little worried that WLSTP is going to cause problems ...

Would it be better to use DLSTP in that case? Or perhaps add a command-line option
for choosing between the DLSTP and WLSTP implementations?
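
For reference, a minimal sketch of what such an option could look like (the flag name, default and wording are made up purely for illustration; nothing like this exists in the patch as posted):

```
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical flag, illustration only: pick a WLSTP loop over a DLSTP loop
// when lowering the MVE memcpy pseudo into a tail-predicated loop.
static cl::opt<bool> UseWLSTPForMemcpy(
    "arm-memcpy-wlstp", cl::Hidden, cl::init(false),
    cl::desc("Lower the MVE memcpy pseudo with a WLSTP loop instead of a "
             "DLSTP loop"));
```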



================
Comment at: llvm/lib/Target/ARM/ARMBaseInstrInfo.h:370
+           MI->getOpcode() == ARM::t2DoLoopStartTP ||
+           MI->getOpcode() == ARM::t2WhileLoopStartLR;
   }
----------------
dmgreen wrote:
> Please split this out into a separate review. I think it makes sense (I'm pretty sure I remember writing it, it must have been lost in a refactoring).
Do you mean splitting just this single-line update out into a separate review?




================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.cpp:11282
+    TpExit = BB->splitAt(MI, false);
+    if (TpExit == BB) {
+      assert(BB->canFallThrough() &&
----------------
dmgreen wrote:
> When will this happen?
This happens if MVE_MEMCPYLOOPINST is the only instruction in the block:
splitAt() returns the original block when there is nothing after the instruction at which the split is done.
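
To make the scenario concrete, here is a minimal sketch (the recovery through getFallThrough() is my assumption about how the case is handled; only the splitAt()/assert lines are visible in the quoted context above):

```
// BB ends with the MVE_MEMCPYLOOPINST pseudo (MI), so splitAt() has nothing
// to move into a new block and returns BB itself.
MachineBasicBlock *TpExit = BB->splitAt(MI, /*UpdateLiveIns=*/false);
if (TpExit == BB) {
  // Assumed handling: use the block BB falls through to as the exit block.
  assert(BB->canFallThrough() &&
         "Exit block must be the fall-through successor of the memcpy block");
  TpExit = BB->getFallThrough();
}
```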

This comes up when a for loop is implicitly converted to a memcpy; the memcpy call
then ends up being the only instruction in the loop preheader.
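
For example (my own illustration, not taken from the tests), LoopIdiomRecognize rewrites a copy loop like the one below as a call to memcpy inserted into the loop preheader; if the preheader contained nothing else, the memcpy is the only instruction there:

```
// A plain byte-copy loop; the loop-idiom pass replaces the whole loop with a
// single memcpy of n bytes.
void copy(char *__restrict dst, const char *__restrict src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i];
}
```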

There is already a test case covering this: test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir.




================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.h:55
     // ARM Specific DAG Nodes
-    enum NodeType : unsigned {
-      // Start the numbering where the builtin ops and target ops leave off.
-      FIRST_NUMBER = ISD::BUILTIN_OP_END,
-
-      Wrapper,      // Wrapper - A wrapper node for TargetConstantPool,
-                    // TargetExternalSymbol, and TargetGlobalAddress.
-      WrapperPIC,   // WrapperPIC - A wrapper node for TargetGlobalAddress in
-                    // PIC mode.
-      WrapperJT,    // WrapperJT - A wrapper node for TargetJumpTable
-
-      // Add pseudo op to model memcpy for struct byval.
-      COPY_STRUCT_BYVAL,
-
-      CALL,         // Function call.
-      CALL_PRED,    // Function call that's predicable.
-      CALL_NOLINK,  // Function call with branch not branch-and-link.
-      tSECALL,      // CMSE non-secure function call.
-      BRCOND,       // Conditional branch.
-      BR_JT,        // Jumptable branch.
-      BR2_JT,       // Jumptable branch (2 level - jumptable entry is a jump).
-      RET_FLAG,     // Return with a flag operand.
-      SERET_FLAG,   // CMSE Entry function return with a flag operand.
-      INTRET_FLAG,  // Interrupt return with an LR-offset and a flag operand.
-
-      PIC_ADD,      // Add with a PC operand and a PIC label.
-
-      ASRL,         // MVE long arithmetic shift right.
-      LSRL,         // MVE long shift right.
-      LSLL,         // MVE long shift left.
-
-      CMP,          // ARM compare instructions.
-      CMN,          // ARM CMN instructions.
-      CMPZ,         // ARM compare that sets only Z flag.
-      CMPFP,        // ARM VFP compare instruction, sets FPSCR.
-      CMPFPE,       // ARM VFP signalling compare instruction, sets FPSCR.
-      CMPFPw0,      // ARM VFP compare against zero instruction, sets FPSCR.
-      CMPFPEw0,     // ARM VFP signalling compare against zero instruction, sets FPSCR.
-      FMSTAT,       // ARM fmstat instruction.
-
-      CMOV,         // ARM conditional move instructions.
-      SUBS,         // Flag-setting subtraction.
-
-      SSAT,         // Signed saturation
-      USAT,         // Unsigned saturation
-
-      BCC_i64,
-
-      SRL_FLAG,     // V,Flag = srl_flag X -> srl X, 1 + save carry out.
-      SRA_FLAG,     // V,Flag = sra_flag X -> sra X, 1 + save carry out.
-      RRX,          // V = RRX X, Flag     -> srl X, 1 + shift in carry flag.
-
-      ADDC,         // Add with carry
-      ADDE,         // Add using carry
-      SUBC,         // Sub with carry
-      SUBE,         // Sub using carry
-      LSLS,         // Shift left producing carry
-
-      VMOVRRD,      // double to two gprs.
-      VMOVDRR,      // Two gprs to double.
-      VMOVSR,       // move gpr to single, used for f32 literal constructed in a gpr
-
-      EH_SJLJ_SETJMP,         // SjLj exception handling setjmp.
-      EH_SJLJ_LONGJMP,        // SjLj exception handling longjmp.
-      EH_SJLJ_SETUP_DISPATCH, // SjLj exception handling setup_dispatch.
-
-      TC_RETURN,    // Tail call return pseudo.
-
-      THREAD_POINTER,
-
-      DYN_ALLOC,    // Dynamic allocation on the stack.
-
-      MEMBARRIER_MCR, // Memory barrier (MCR)
-
-      PRELOAD,      // Preload
-
-      WIN__CHKSTK,  // Windows' __chkstk call to do stack probing.
-      WIN__DBZCHK,  // Windows' divide by zero check
-
-      WLS,          // Low-overhead loops, While Loop Start branch. See t2WhileLoopStart
-      WLSSETUP,     // Setup for the iteration count of a WLS. See t2WhileLoopSetup.
-      LOOP_DEC,     // Really a part of LE, performs the sub
-      LE,           // Low-overhead loops, Loop End
-
-      PREDICATE_CAST, // Predicate cast for MVE i1 types
-      VECTOR_REG_CAST, // Reinterpret the current contents of a vector register
-
-      VCMP,         // Vector compare.
-      VCMPZ,        // Vector compare to zero.
-      VTST,         // Vector test bits.
-
-      // Vector shift by vector
-      VSHLs,        // ...left/right by signed
-      VSHLu,        // ...left/right by unsigned
-
-      // Vector shift by immediate:
-      VSHLIMM,      // ...left
-      VSHRsIMM,     // ...right (signed)
-      VSHRuIMM,     // ...right (unsigned)
-
-      // Vector rounding shift by immediate:
-      VRSHRsIMM,    // ...right (signed)
-      VRSHRuIMM,    // ...right (unsigned)
-      VRSHRNIMM,    // ...right narrow
-
-      // Vector saturating shift by immediate:
-      VQSHLsIMM,    // ...left (signed)
-      VQSHLuIMM,    // ...left (unsigned)
-      VQSHLsuIMM,   // ...left (signed to unsigned)
-      VQSHRNsIMM,   // ...right narrow (signed)
-      VQSHRNuIMM,   // ...right narrow (unsigned)
-      VQSHRNsuIMM,  // ...right narrow (signed to unsigned)
-
-      // Vector saturating rounding shift by immediate:
-      VQRSHRNsIMM,  // ...right narrow (signed)
-      VQRSHRNuIMM,  // ...right narrow (unsigned)
-      VQRSHRNsuIMM, // ...right narrow (signed to unsigned)
-
-      // Vector shift and insert:
-      VSLIIMM,      // ...left
-      VSRIIMM,      // ...right
-
-      // Vector get lane (VMOV scalar to ARM core register)
-      // (These are used for 8- and 16-bit element types only.)
-      VGETLANEu,    // zero-extend vector extract element
-      VGETLANEs,    // sign-extend vector extract element
-
-      // Vector move immediate and move negated immediate:
-      VMOVIMM,
-      VMVNIMM,
-
-      // Vector move f32 immediate:
-      VMOVFPIMM,
-
-      // Move H <-> R, clearing top 16 bits
-      VMOVrh,
-      VMOVhr,
-
-      // Vector duplicate:
-      VDUP,
-      VDUPLANE,
-
-      // Vector shuffles:
-      VEXT,         // extract
-      VREV64,       // reverse elements within 64-bit doublewords
-      VREV32,       // reverse elements within 32-bit words
-      VREV16,       // reverse elements within 16-bit halfwords
-      VZIP,         // zip (interleave)
-      VUZP,         // unzip (deinterleave)
-      VTRN,         // transpose
-      VTBL1,        // 1-register shuffle with mask
-      VTBL2,        // 2-register shuffle with mask
-      VMOVN,        // MVE vmovn
-
-      // MVE Saturating truncates
-      VQMOVNs,      // Vector (V) Saturating (Q) Move and Narrow (N), signed (s)
-      VQMOVNu,      // Vector (V) Saturating (Q) Move and Narrow (N), unsigned (u)
-
-      // MVE float <> half converts
-      VCVTN,        // MVE vcvt f32 -> f16, truncating into either the bottom or top lanes
-      VCVTL,        // MVE vcvt f16 -> f32, extending from either the bottom or top lanes
-
-      // Vector multiply long:
-      VMULLs,       // ...signed
-      VMULLu,       // ...unsigned
-
-      VQDMULH,      // MVE vqdmulh instruction
-
-      // MVE reductions
-      VADDVs,       // sign- or zero-extend the elements of a vector to i32,
-      VADDVu,       //   add them all together, and return an i32 of their sum
-      VADDVps,      // Same as VADDV[su] but with a v4i1 predicate mask
-      VADDVpu,
-      VADDLVs,      // sign- or zero-extend elements to i64 and sum, returning
-      VADDLVu,      //   the low and high 32-bit halves of the sum
-      VADDLVAs,     // Same as VADDLV[su] but also add an input accumulator
-      VADDLVAu,     //   provided as low and high halves
-      VADDLVps,     // Same as VADDLV[su] but with a v4i1 predicate mask
-      VADDLVpu,
-      VADDLVAps,    // Same as VADDLVp[su] but with a v4i1 predicate mask
-      VADDLVApu,
-      VMLAVs,       // sign- or zero-extend the elements of two vectors to i32, multiply them
-      VMLAVu,       //   and add the results together, returning an i32 of their sum
-      VMLAVps,      // Same as VMLAV[su] with a v4i1 predicate mask
-      VMLAVpu,
-      VMLALVs,      // Same as VMLAV but with i64, returning the low and
-      VMLALVu,      //   high 32-bit halves of the sum
-      VMLALVps,     // Same as VMLALV[su] with a v4i1 predicate mask
-      VMLALVpu,
-      VMLALVAs,     // Same as VMLALV but also add an input accumulator
-      VMLALVAu,     //   provided as low and high halves
-      VMLALVAps,    // Same as VMLALVA[su] with a v4i1 predicate mask
-      VMLALVApu,
-      VMINVu,        // Find minimum unsigned value of a vector and register
-      VMINVs,        // Find minimum signed value of a vector and register
-      VMAXVu,        // Find maximum unsigned value of a vector and register
-      VMAXVs,        // Find maximum signed value of a vector and register
-
-      SMULWB,       // Signed multiply word by half word, bottom
-      SMULWT,       // Signed multiply word by half word, top
-      UMLAL,        // 64bit Unsigned Accumulate Multiply
-      SMLAL,        // 64bit Signed Accumulate Multiply
-      UMAAL,        // 64-bit Unsigned Accumulate Accumulate Multiply
-      SMLALBB,      // 64-bit signed accumulate multiply bottom, bottom 16
-      SMLALBT,      // 64-bit signed accumulate multiply bottom, top 16
-      SMLALTB,      // 64-bit signed accumulate multiply top, bottom 16
-      SMLALTT,      // 64-bit signed accumulate multiply top, top 16
-      SMLALD,       // Signed multiply accumulate long dual
-      SMLALDX,      // Signed multiply accumulate long dual exchange
-      SMLSLD,       // Signed multiply subtract long dual
-      SMLSLDX,      // Signed multiply subtract long dual exchange
-      SMMLAR,       // Signed multiply long, round and add
-      SMMLSR,       // Signed multiply long, subtract and round
-
-      // Single Lane QADD8 and QADD16. Only the bottom lane. That's what the b stands for.
-      QADD8b,
-      QSUB8b,
-      QADD16b,
-      QSUB16b,
-
-      // Operands of the standard BUILD_VECTOR node are not legalized, which
-      // is fine if BUILD_VECTORs are always lowered to shuffles or other
-      // operations, but for ARM some BUILD_VECTORs are legal as-is and their
-      // operands need to be legalized.  Define an ARM-specific version of
-      // BUILD_VECTOR for this purpose.
-      BUILD_VECTOR,
-
-      // Bit-field insert
-      BFI,
-
-      // Vector OR with immediate
-      VORRIMM,
-      // Vector AND with NOT of immediate
-      VBICIMM,
-
-      // Pseudo vector bitwise select
-      VBSP,
-
-      // Pseudo-instruction representing a memory copy using ldm/stm
-      // instructions.
-      MEMCPY,
-
-      // V8.1MMainline condition select
-      CSINV, // Conditional select invert.
-      CSNEG, // Conditional select negate.
-      CSINC, // Conditional select increment.
-
-      // Vector load N-element structure to all lanes:
-      VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
-      VLD2DUP,
-      VLD3DUP,
-      VLD4DUP,
-
-      // NEON loads with post-increment base updates:
-      VLD1_UPD,
-      VLD2_UPD,
-      VLD3_UPD,
-      VLD4_UPD,
-      VLD2LN_UPD,
-      VLD3LN_UPD,
-      VLD4LN_UPD,
-      VLD1DUP_UPD,
-      VLD2DUP_UPD,
-      VLD3DUP_UPD,
-      VLD4DUP_UPD,
-
-      // NEON stores with post-increment base updates:
-      VST1_UPD,
-      VST2_UPD,
-      VST3_UPD,
-      VST4_UPD,
-      VST2LN_UPD,
-      VST3LN_UPD,
-      VST4LN_UPD,
-
-      // Load/Store of dual registers
-      LDRD,
-      STRD
-    };
+  enum NodeType : unsigned {
+    // Start the numbering where the builtin ops and target ops leave off.
----------------
dmgreen wrote:
> Don't format any of this - it's unrelated.
I had to reformat it because the patch was failing the clang-format check.


================
Comment at: llvm/test/CodeGen/Thumb2/mve_tp_loop.ll:9
+
+declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg) #1
+
----------------
dmgreen wrote:
> Can you make sure there are tests where the i32 is a different type, one that is not legal like an i64.
Do you mean something like this:

```
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg)

define void @test(i8* noalias %X, i8* noalias %Y, i64 %n) {
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
    ret void
}
```

I get an error when I try to generate assembly for this. Since i64 is not a legal type here,
what is the expectation?

As a side note, if I generate the IR from C code, the memcpy size variable is always truncated
to an i32 before the call to llvm.memcpy().
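
For example (my own illustration): on a 32-bit ARM target size_t is 32 bits, so even when the size lives in a 64-bit variable the frontend emits an i64-to-i32 truncation and calls the i32 variant of the intrinsic:

```
#include <string.h>

// memcpy takes a size_t, which is 32 bits on a 32-bit ARM target, so clang
// truncates n from i64 to i32 and emits llvm.memcpy.p0i8.p0i8.i32 here.
void copy64(char *dst, const char *src, unsigned long long n) {
  memcpy(dst, src, n); // n is implicitly converted (truncated) to size_t
}
```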


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99723/new/

https://reviews.llvm.org/D99723


