[PATCH] D99723: [ARM] Transforming memcpy to Tail predicated Loop
Dave Green via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 12 06:28:30 PDT 2021
dmgreen added inline comments.
================
Comment at: llvm/lib/Target/ARM/ARMBaseInstrInfo.h:370
+ MI->getOpcode() == ARM::t2DoLoopStartTP ||
+ MI->getOpcode() == ARM::t2WhileLoopStartLR;
}
----------------
malharJ wrote:
> dmgreen wrote:
> > Please split this out into a separate review. I think it makes sense (I'm pretty sure I remember writing it, it must have been lost in a refactoring).
> Do you mean just this single-line update as a separate review?
>
>
Yeah... it should preferably be a separate patch. Do you have a test case? Is there a reason why you changed it?
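(For context, the check under discussion amounts to a small opcode-classification helper along the lines of the sketch below; the helper name and the exact set of opcodes are assumptions based on the diff context, not a quote of the patch.)

```
// Sketch only: recognise the low-overhead loop start pseudos, including the
// tail-predicated variant (t2DoLoopStartTP). The helper name is assumed.
static inline bool isLoopStart(const MachineInstr *MI) {
  return MI->getOpcode() == ARM::t2DoLoopStart ||
         MI->getOpcode() == ARM::t2DoLoopStartTP ||
         MI->getOpcode() == ARM::t2WhileLoopStartLR;
}
```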
================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.cpp:11282
+ TpExit = BB->splitAt(MI, false);
+ if (TpExit == BB) {
+ assert(BB->canFallThrough() &&
----------------
malharJ wrote:
> dmgreen wrote:
> > When will this happen?
> This happens if MVE_MEMCPYLOOPINST is the only instruction in the block.
> splitAt() returns the same block if there is nothing after the instruction at which the split is done.
>
> This happens when for loops are implicitly converted to memcpys; the memcpy call
> ends up being the only instruction in the preheader.
>
> There is already a test case for this as test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
>
>
OK. I thought it was more eager about putting branches on the end of blocks, even if they are fallthroughs.
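(To make the case concrete: when the memcpy pseudo is the last instruction in its block, splitAt() cannot create a trailing block and returns BB itself, so the exit block has to be taken from the fall-through successor instead. A minimal sketch, reusing the names from the diff context rather than quoting the patch:)

```
// Sketch only, assuming the names from the diff context (BB, MI, TpExit).
// splitAt() hands back the same block when MI is its last instruction, so
// the exit of the tail-predicated loop is the existing fall-through block.
MachineBasicBlock *TpExit = BB->splitAt(MI, /*UpdateLiveIns=*/false);
if (TpExit == BB) {
  assert(BB->canFallThrough() &&
         "Expected a block that falls through to the memcpy exit");
  TpExit = BB->getFallThrough();
}
```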
================
Comment at: llvm/lib/Target/ARM/ARMISelLowering.h:55
// ARM Specific DAG Nodes
- enum NodeType : unsigned {
- // Start the numbering where the builtin ops and target ops leave off.
- FIRST_NUMBER = ISD::BUILTIN_OP_END,
-
- Wrapper, // Wrapper - A wrapper node for TargetConstantPool,
- // TargetExternalSymbol, and TargetGlobalAddress.
- WrapperPIC, // WrapperPIC - A wrapper node for TargetGlobalAddress in
- // PIC mode.
- WrapperJT, // WrapperJT - A wrapper node for TargetJumpTable
-
- // Add pseudo op to model memcpy for struct byval.
- COPY_STRUCT_BYVAL,
-
- CALL, // Function call.
- CALL_PRED, // Function call that's predicable.
- CALL_NOLINK, // Function call with branch not branch-and-link.
- tSECALL, // CMSE non-secure function call.
- BRCOND, // Conditional branch.
- BR_JT, // Jumptable branch.
- BR2_JT, // Jumptable branch (2 level - jumptable entry is a jump).
- RET_FLAG, // Return with a flag operand.
- SERET_FLAG, // CMSE Entry function return with a flag operand.
- INTRET_FLAG, // Interrupt return with an LR-offset and a flag operand.
-
- PIC_ADD, // Add with a PC operand and a PIC label.
-
- ASRL, // MVE long arithmetic shift right.
- LSRL, // MVE long shift right.
- LSLL, // MVE long shift left.
-
- CMP, // ARM compare instructions.
- CMN, // ARM CMN instructions.
- CMPZ, // ARM compare that sets only Z flag.
- CMPFP, // ARM VFP compare instruction, sets FPSCR.
- CMPFPE, // ARM VFP signalling compare instruction, sets FPSCR.
- CMPFPw0, // ARM VFP compare against zero instruction, sets FPSCR.
- CMPFPEw0, // ARM VFP signalling compare against zero instruction, sets FPSCR.
- FMSTAT, // ARM fmstat instruction.
-
- CMOV, // ARM conditional move instructions.
- SUBS, // Flag-setting subtraction.
-
- SSAT, // Signed saturation
- USAT, // Unsigned saturation
-
- BCC_i64,
-
- SRL_FLAG, // V,Flag = srl_flag X -> srl X, 1 + save carry out.
- SRA_FLAG, // V,Flag = sra_flag X -> sra X, 1 + save carry out.
- RRX, // V = RRX X, Flag -> srl X, 1 + shift in carry flag.
-
- ADDC, // Add with carry
- ADDE, // Add using carry
- SUBC, // Sub with carry
- SUBE, // Sub using carry
- LSLS, // Shift left producing carry
-
- VMOVRRD, // double to two gprs.
- VMOVDRR, // Two gprs to double.
- VMOVSR, // move gpr to single, used for f32 literal constructed in a gpr
-
- EH_SJLJ_SETJMP, // SjLj exception handling setjmp.
- EH_SJLJ_LONGJMP, // SjLj exception handling longjmp.
- EH_SJLJ_SETUP_DISPATCH, // SjLj exception handling setup_dispatch.
-
- TC_RETURN, // Tail call return pseudo.
-
- THREAD_POINTER,
-
- DYN_ALLOC, // Dynamic allocation on the stack.
-
- MEMBARRIER_MCR, // Memory barrier (MCR)
-
- PRELOAD, // Preload
-
- WIN__CHKSTK, // Windows' __chkstk call to do stack probing.
- WIN__DBZCHK, // Windows' divide by zero check
-
- WLS, // Low-overhead loops, While Loop Start branch. See t2WhileLoopStart
- WLSSETUP, // Setup for the iteration count of a WLS. See t2WhileLoopSetup.
- LOOP_DEC, // Really a part of LE, performs the sub
- LE, // Low-overhead loops, Loop End
-
- PREDICATE_CAST, // Predicate cast for MVE i1 types
- VECTOR_REG_CAST, // Reinterpret the current contents of a vector register
-
- VCMP, // Vector compare.
- VCMPZ, // Vector compare to zero.
- VTST, // Vector test bits.
-
- // Vector shift by vector
- VSHLs, // ...left/right by signed
- VSHLu, // ...left/right by unsigned
-
- // Vector shift by immediate:
- VSHLIMM, // ...left
- VSHRsIMM, // ...right (signed)
- VSHRuIMM, // ...right (unsigned)
-
- // Vector rounding shift by immediate:
- VRSHRsIMM, // ...right (signed)
- VRSHRuIMM, // ...right (unsigned)
- VRSHRNIMM, // ...right narrow
-
- // Vector saturating shift by immediate:
- VQSHLsIMM, // ...left (signed)
- VQSHLuIMM, // ...left (unsigned)
- VQSHLsuIMM, // ...left (signed to unsigned)
- VQSHRNsIMM, // ...right narrow (signed)
- VQSHRNuIMM, // ...right narrow (unsigned)
- VQSHRNsuIMM, // ...right narrow (signed to unsigned)
-
- // Vector saturating rounding shift by immediate:
- VQRSHRNsIMM, // ...right narrow (signed)
- VQRSHRNuIMM, // ...right narrow (unsigned)
- VQRSHRNsuIMM, // ...right narrow (signed to unsigned)
-
- // Vector shift and insert:
- VSLIIMM, // ...left
- VSRIIMM, // ...right
-
- // Vector get lane (VMOV scalar to ARM core register)
- // (These are used for 8- and 16-bit element types only.)
- VGETLANEu, // zero-extend vector extract element
- VGETLANEs, // sign-extend vector extract element
-
- // Vector move immediate and move negated immediate:
- VMOVIMM,
- VMVNIMM,
-
- // Vector move f32 immediate:
- VMOVFPIMM,
-
- // Move H <-> R, clearing top 16 bits
- VMOVrh,
- VMOVhr,
-
- // Vector duplicate:
- VDUP,
- VDUPLANE,
-
- // Vector shuffles:
- VEXT, // extract
- VREV64, // reverse elements within 64-bit doublewords
- VREV32, // reverse elements within 32-bit words
- VREV16, // reverse elements within 16-bit halfwords
- VZIP, // zip (interleave)
- VUZP, // unzip (deinterleave)
- VTRN, // transpose
- VTBL1, // 1-register shuffle with mask
- VTBL2, // 2-register shuffle with mask
- VMOVN, // MVE vmovn
-
- // MVE Saturating truncates
- VQMOVNs, // Vector (V) Saturating (Q) Move and Narrow (N), signed (s)
- VQMOVNu, // Vector (V) Saturating (Q) Move and Narrow (N), unsigned (u)
-
- // MVE float <> half converts
- VCVTN, // MVE vcvt f32 -> f16, truncating into either the bottom or top lanes
- VCVTL, // MVE vcvt f16 -> f32, extending from either the bottom or top lanes
-
- // Vector multiply long:
- VMULLs, // ...signed
- VMULLu, // ...unsigned
-
- VQDMULH, // MVE vqdmulh instruction
-
- // MVE reductions
- VADDVs, // sign- or zero-extend the elements of a vector to i32,
- VADDVu, // add them all together, and return an i32 of their sum
- VADDVps, // Same as VADDV[su] but with a v4i1 predicate mask
- VADDVpu,
- VADDLVs, // sign- or zero-extend elements to i64 and sum, returning
- VADDLVu, // the low and high 32-bit halves of the sum
- VADDLVAs, // Same as VADDLV[su] but also add an input accumulator
- VADDLVAu, // provided as low and high halves
- VADDLVps, // Same as VADDLV[su] but with a v4i1 predicate mask
- VADDLVpu,
- VADDLVAps, // Same as VADDLVp[su] but with a v4i1 predicate mask
- VADDLVApu,
- VMLAVs, // sign- or zero-extend the elements of two vectors to i32, multiply them
- VMLAVu, // and add the results together, returning an i32 of their sum
- VMLAVps, // Same as VMLAV[su] with a v4i1 predicate mask
- VMLAVpu,
- VMLALVs, // Same as VMLAV but with i64, returning the low and
- VMLALVu, // high 32-bit halves of the sum
- VMLALVps, // Same as VMLALV[su] with a v4i1 predicate mask
- VMLALVpu,
- VMLALVAs, // Same as VMLALV but also add an input accumulator
- VMLALVAu, // provided as low and high halves
- VMLALVAps, // Same as VMLALVA[su] with a v4i1 predicate mask
- VMLALVApu,
- VMINVu, // Find minimum unsigned value of a vector and register
- VMINVs, // Find minimum signed value of a vector and register
- VMAXVu, // Find maximum unsigned value of a vector and register
- VMAXVs, // Find maximum signed value of a vector and register
-
- SMULWB, // Signed multiply word by half word, bottom
- SMULWT, // Signed multiply word by half word, top
- UMLAL, // 64bit Unsigned Accumulate Multiply
- SMLAL, // 64bit Signed Accumulate Multiply
- UMAAL, // 64-bit Unsigned Accumulate Accumulate Multiply
- SMLALBB, // 64-bit signed accumulate multiply bottom, bottom 16
- SMLALBT, // 64-bit signed accumulate multiply bottom, top 16
- SMLALTB, // 64-bit signed accumulate multiply top, bottom 16
- SMLALTT, // 64-bit signed accumulate multiply top, top 16
- SMLALD, // Signed multiply accumulate long dual
- SMLALDX, // Signed multiply accumulate long dual exchange
- SMLSLD, // Signed multiply subtract long dual
- SMLSLDX, // Signed multiply subtract long dual exchange
- SMMLAR, // Signed multiply long, round and add
- SMMLSR, // Signed multiply long, subtract and round
-
- // Single Lane QADD8 and QADD16. Only the bottom lane. That's what the b stands for.
- QADD8b,
- QSUB8b,
- QADD16b,
- QSUB16b,
-
- // Operands of the standard BUILD_VECTOR node are not legalized, which
- // is fine if BUILD_VECTORs are always lowered to shuffles or other
- // operations, but for ARM some BUILD_VECTORs are legal as-is and their
- // operands need to be legalized. Define an ARM-specific version of
- // BUILD_VECTOR for this purpose.
- BUILD_VECTOR,
-
- // Bit-field insert
- BFI,
-
- // Vector OR with immediate
- VORRIMM,
- // Vector AND with NOT of immediate
- VBICIMM,
-
- // Pseudo vector bitwise select
- VBSP,
-
- // Pseudo-instruction representing a memory copy using ldm/stm
- // instructions.
- MEMCPY,
-
- // V8.1MMainline condition select
- CSINV, // Conditional select invert.
- CSNEG, // Conditional select negate.
- CSINC, // Conditional select increment.
-
- // Vector load N-element structure to all lanes:
- VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
- VLD2DUP,
- VLD3DUP,
- VLD4DUP,
-
- // NEON loads with post-increment base updates:
- VLD1_UPD,
- VLD2_UPD,
- VLD3_UPD,
- VLD4_UPD,
- VLD2LN_UPD,
- VLD3LN_UPD,
- VLD4LN_UPD,
- VLD1DUP_UPD,
- VLD2DUP_UPD,
- VLD3DUP_UPD,
- VLD4DUP_UPD,
-
- // NEON stores with post-increment base updates:
- VST1_UPD,
- VST2_UPD,
- VST3_UPD,
- VST4_UPD,
- VST2LN_UPD,
- VST3LN_UPD,
- VST4LN_UPD,
-
- // Load/Store of dual registers
- LDRD,
- STRD
- };
+ enum NodeType : unsigned {
+ // Start the numbering where the builtin ops and target ops leave off.
----------------
malharJ wrote:
> dmgreen wrote:
> > Don't format any of this - it's unrelated.
> I had to fix it because the patch was failing the clang-format check.
That's fine. We can ignore the precommit bot where it's more noisy than helpful.
================
Comment at: llvm/test/CodeGen/Thumb2/mve_tp_loop.ll:9
+
+declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg) #1
+
----------------
malharJ wrote:
> dmgreen wrote:
> > Can you make sure there are tests where the i32 is a different type, one that is not legal like an i64.
> Do you mean something like:
>
> ```
> define void @test(i8* noalias %X, i8* noalias %Y, i64 %n){
> call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
> ret void
> }
> ```
>
> I get an error when I try to generate the assembly. Since i64 is illegal,
> what is the expectation here?
>
> As a side note, if I generate the IR from C code, the IR always truncates the memcpy size variable to an i32
> before calling llvm.memcpy().
Yeah sure. I think they can still come up from other places that create memcpy calls.
You can probably use DAG.getZExtOrTrunc(Size, MVT::i32) instead of using the size directly.
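(For illustration only: the suggestion amounts to normalizing the size operand to i32 before the tail-predicated loop expansion consumes it. The `DL` and `Size` names below are assumptions about the surrounding lowering code, not a quote of the patch.)

```
// Sketch only: SelectionDAG::getZExtOrTrunc also takes a debug location,
// so the informal "DAG.getZExtOrTrunc(Size, MVT::i32)" above becomes:
SDValue SizeI32 = DAG.getZExtOrTrunc(Size, DL, MVT::i32);
// SizeI32 can then be used as the byte-count operand of the memcpy pseudo
// regardless of whether the original size operand was i32 or i64.
```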
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D99723/new/
https://reviews.llvm.org/D99723