[llvm-commits] [PATCH 2/2] Convert VLDR/VSTR on demand for base update optimization
Jim Grosbach
grosbach at apple.com
Mon Sep 17 10:42:59 PDT 2012
Hi David,
Thanks for the update! Unfortunately, there are still some significant problems. Comments below.
> From 7a15dd0b727ddc7806624bf5d77f7178aa3dbb8d Mon Sep 17 00:00:00 2001
> From: David Peixotto <dpeixott at codeaurora.org>
> Date: Fri, 24 Aug 2012 18:10:36 -0700
> Subject: [PATCH 2/2] Convert VLDR/VSTR on demand for base update optimization
>
> This commit allows the base-address update optimization to work for VLDR and
> VSTR. There is no writeback version of VLDR/VSTR, so we cannot directly fold
> the update of the base address into the instruction itself. We must first
> transform the VLDR/VSTR to VLDM/VSTM and then use the writeback version of
> those instructions. For example we perform the following transformation,
>
> VLDR D0, [R0]
> ADD R0, R0, #8
> ==>
> VLDM R0!, {D0}
>
> The VLDR/VSTR instructions will only be changed to VLDM/VSTM if an update of
> the base address register can be folded into the instruction. Other VLDR/VSTR
> instructions are left undisturbed.
> ---
> lib/Target/ARM/ARMLoadStoreOptimizer.cpp | 163 +++++++++++++++++++++++++++++-
> test/CodeGen/ARM/vlsm_upd.ll | 59 +++++++++++
> 2 files changed, 219 insertions(+), 3 deletions(-)
>
> diff --git a/lib/Target/ARM/ARMLoadStoreOptimizer.cpp b/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
> index d364b57..439e2c1 100644
> --- a/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
> +++ b/lib/Target/ARM/ARMLoadStoreOptimizer.cpp
> @@ -74,6 +74,10 @@ EnableSPIncrementFolding("enable-lsm-sp-inc", cl::Hidden,
> "Enable sp increment folding in the lsm-with-auto-increment optimization"),
> cl::init(false));
>
> +static cl::opt<bool>
> +EnableLSMWriteBackConvert("enable-lsm-writeback-convert", cl::Hidden,
> + cl::desc("Enable conversion of vldr/vstr to vldm/vstm for writeback opt"),
Can you rephrase this a bit? I'm not sure I have a good suggestion, but as is, it reads somewhat awkwardly.
> + cl::init(true));
>
> /// ARMAllocLoadStoreOpt - Post- register allocation pass the combine
> /// load / store instructions to form ldm / stm instructions.
> @@ -1601,13 +1605,146 @@ static bool memOpHasWriteBackOption(const MachineInstr& MI) {
> return false;
> }
>
> +// Check to see if we can convert the instruction to a version that
> +// supports writeback to the base register.
> +// We can convert a VLDR/VSTR to VLDM/VSTM as long as the VLDR/VSTR do
> +// not have any pre-increment in the instruction (i.e. the imm in the
> +// instruction should be zero).
> +static bool canConvertToWriteBackInstr(const MachineInstr& MI) {
> + unsigned Opcode = MI.getOpcode();
> + switch (Opcode) {
> + default: return false;
> + case ARM::VLDRS:
> + case ARM::VSTRS:
> + case ARM::VLDRD:
> + case ARM::VSTRD:{
> + // Check to make sure we have an instruction of the form
> + // VLDR <Dd>, [<Rn> {, #+-<imm>]
> + if (MI.getNumOperands() < 3)
> + return false;
> + if (!MI.getOperand(1).isReg()) // Base address
> + return false;
What conditions are these statements trying to check for? If they ever fire, something has gone seriously wrong, as that would indicate a malformed instruction.
> + // Make sure the imm is zero
Period on the end.
> + const MachineOperand& PreIncOp = MI.getOperand(2);
'&' goes on the identifier, not the type.
> + return PreIncOp.isImm() && (PreIncOp.getImm() == 0);
Similar to the above, why are you checking isImm() here? If that operand isn't an immediate, something is very, very wrong.
> + }
> + }
A comment here that the switch statement won't fall through may help readability a bit.
> +}
> +
> +// Convert a VLDR/VSTR to VLDMIA/VSTMIA
Period at the end of the sentence.
> +// useful for converting an memory instruction to a variant that can
> +// update the base address as a side-effect of the instruction.
Please rephrase this so it's a complete sentence.
> +static MachineInstr* convertVLSRtoVLSMIA(MachineBasicBlock &MBB,
> + MachineInstr *MI,
> + const TargetInstrInfo *TII) {
> + unsigned Opcode = MI->getOpcode();
> + unsigned NewOpc;
> + bool isLoad = false;
> + switch (Opcode) {
> + default: llvm_unreachable("Unhandled opcode");
> + case ARM::VLDRS:
> + isLoad = true;
> + NewOpc = ARM::VLDMSIA;
> + break;
> + case ARM::VSTRS:
> + NewOpc = ARM::VSTMSIA;
> + break;
> + case ARM::VLDRD:
> + isLoad = true;
> + NewOpc = ARM::VLDMDIA;
> + break;
> + case ARM::VSTRD:
> + NewOpc = ARM::VSTMDIA;
> + break;
> + }
You want the updating forms of these instructions, right? These are the non-updating forms. You want the _UPD suffixed instructions, almost certainly.
For example,

enkidu: ~/tmp $ cat x.s
_foo:
        vldmia  r1!, {d2}
enkidu: ~/tmp $ llvm-mc -triple=armv7-apple-darwin < x.s -show-inst
        .section        __TEXT,__text,regular,pure_instructions
_foo:
        vldmia  r1!, {d2}            @ <MCInst #1201 VLDMDIA_UPD
                                     @  <MCOperand Reg:61>
                                     @  <MCOperand Reg:61>
                                     @  <MCOperand Imm:14>
                                     @  <MCOperand Reg:0>
                                     @  <MCOperand Reg:5>>
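For what it's worth, I'd expect the switch to end up mapping to the _UPD opcodes, along these lines. (This is a standalone sketch with stand-in enum values, not the actual ARM:: opcodes from the generated ARMGenInstrInfo tables; it's just to show the shape.)

```cpp
#include <cassert>

// Stand-in opcode values purely for illustration; the real code would use
// ARM::VLDRS, ARM::VLDMSIA_UPD, etc. from the generated instruction tables.
enum Opcode {
  VLDRS, VSTRS, VLDRD, VSTRD,
  VLDMSIA_UPD, VSTMSIA_UPD, VLDMDIA_UPD, VSTMDIA_UPD,
  INVALID
};

// Map a VLDR/VSTR opcode to its updating (writeback) load/store-multiple
// form: the same switch as in convertVLSRtoVLSMIA, but targeting the
// _UPD-suffixed opcodes instead of the non-updating ones.
Opcode getWriteBackOpcode(Opcode Opc, bool &isLoad) {
  isLoad = false;
  switch (Opc) {
  case VLDRS: isLoad = true; return VLDMSIA_UPD;
  case VSTRS:                return VSTMSIA_UPD;
  case VLDRD: isLoad = true; return VLDMDIA_UPD;
  case VSTRD:                return VSTMDIA_UPD;
  default:                   return INVALID;
  }
}
```

The key point is that the _UPD forms carry an extra def of the base register (the writeback result), which the non-updating forms don't.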
> + MachineBasicBlock::iterator MBBI(MI);
> + MachineInstrBuilder MIB = BuildMI(MBB, MBBI, MI->getDebugLoc(),
> + TII->get(NewOpc));
> +
> + assert(MI->getOperand(2).isImm() && MI->getOperand(2).getImm() == 0 &&
> + "VLDR has non-zero pre-increment value");
As above, no need to check isImm() here. Also, "VLDR/VSTR".
> + MachineOperand& VPRReg = MI->getOperand(0);
> + if (isLoad)
> + VPRReg.setIsDef(true);
> + MIB.addOperand(MI->getOperand(1)); // Base address
> + MIB.addOperand(MI->getOperand(3)); // Predicate Imm
> + MIB.addOperand(MI->getOperand(4)); // Predicate Register
> + MIB.addOperand(VPRReg); // VPR register
This isn't right. See above.
Have a very close look in the .td files at how these instructions and the VLDR/VSTR operands are defined. You can also look at the generated code in the .inc files (InstrInfo and AsmMatcher are useful, in my experience). I also find it handy to verify my understanding by writing the instructions in a .s file and using llvm-mc to print out the MCOperands for them (which mirror the MachineOperands in type and order).
> +
> + // Transfer any extra operands (e.g. implicit defs/kills)
> + for (unsigned int i = 5; i < MI->getNumOperands(); ++i)
> + MIB.addOperand(MI->getOperand(i));
> +
What cases have you encountered where a VLDR/VSTR instruction has any additional operands that need to be copied?
> + // Transfer memoperands.
> + MIB->setMemRefs(MI->memoperands_begin(), MI->memoperands_end());
> +
> + // Remove the VLDR instruction
> + MBB.erase(MBBI);
> + return MIB;
> +}
> +
> +// Convert a VLDMIA/VSTMIA to VLDR/VSTR
> +// useful for undoing the conversion we performed in the opposite direction so
> +// that we could potentially merge the update of the base address into the
> +// instruction.
Ditto all of the above for this whole function.
> +static MachineInstr* convertVLSMIAtoVLSR(MachineBasicBlock &MBB,
> + MachineInstr *MI,
> + const TargetInstrInfo *TII) {
> + unsigned Opcode = MI->getOpcode();
> + unsigned NewOpc;
> + bool isLoad = false;
> + switch (Opcode) {
> + default: llvm_unreachable("Unhandled opcode");
> + case ARM::VLDMSIA:
> + isLoad = true;
> + NewOpc = ARM::VLDRS;
> + break;
> + case ARM::VSTMSIA:
> + NewOpc = ARM::VSTRS;
> + break;
> + case ARM::VLDMDIA:
> + isLoad = true;
> + NewOpc = ARM::VLDRD;
> + break;
> + case ARM::VSTMDIA:
> + NewOpc = ARM::VSTRD;
> + break;
> + }
> + MachineBasicBlock::iterator MBBI(MI);
> + MachineInstrBuilder MIB = BuildMI(MBB, MBBI, MI->getDebugLoc(),
> + TII->get(NewOpc));
> +
> + // There should be only one VPR register since this instruction was
> + // originally converted from a VLDR/VSTR
> + MachineOperand& VPRReg = MI->getOperand(3);
> + if (isLoad)
> + VPRReg.setIsDef(true);
> + MIB.addOperand(VPRReg); // VPR register
> + MIB.addOperand(MI->getOperand(0)); // Base address
> + MIB.addImm(0); // Offset
> + MIB.addOperand(MI->getOperand(1)); // Predicate Imm
> + MIB.addOperand(MI->getOperand(2)); // Predicate Register
> +
> + // Transfer any extra operands (e.g. implicit defs/kills)
> + for (unsigned int i = 4; i < MI->getNumOperands(); ++i)
> + MIB.addOperand(MI->getOperand(i));
> +
> + // Transfer memoperands.
> + MIB->setMemRefs(MI->memoperands_begin(), MI->memoperands_end());
> +
> + // Remove the VLDM instruction
> + MBB.erase(MBBI);
> + return MIB;
> +}
> +
> /// LSMWriteBackOptimzation- An optimization pass to turn LDM / STM
> /// ops followed/preceeded by an increment/decrement of the base address to the
> /// writeback version of the opcode.
This isn't a pass. It's part of the bigger load-store-optimization pass. The comment should be reworded to be clearer and to not be a sentence fragment.
> bool ARMLoadStoreOpt::LSMWriteBackOptimization(MachineBasicBlock &MBB,
> const MISet &NewLSMInstrs) {
> - typedef SmallVector<MachineInstr*,8> Worklist;
> + typedef SmallVector<std::pair<MachineInstr*,bool>,8> Worklist;
> int NumMerges = 0;
> + const bool NeedsConversion = true; // flag for when we need to convert
What's the purpose of this variable?
>
> // Find all the load/store multiple instructions where we can potentially
> // merge an update of the base address into the instruction
Here and everywhere following, please update to use complete sentences including punctuation.
> @@ -1627,17 +1764,37 @@ bool ARMLoadStoreOpt::LSMWriteBackOptimization(MachineBasicBlock &MBB,
> continue;
>
> if(memOpHasWriteBackOption(*MBBI)) {
> - MemOps.push_back(MBBI);
> + MemOps.push_back(std::make_pair(MBBI, !NeedsConversion));
> + }
> + else if (EnableLSMWriteBackConvert && canConvertToWriteBackInstr(*MBBI)) {
> + MemOps.push_back(std::make_pair(MBBI, NeedsConversion));
There is no memOpHasWriteBackOption() function in top of tree, nor has there ever been, as far as I can find. Is this patch complete? Using the memory operands for that check is not the right way to go. Reference the instructions themselves.
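That is, key the check off the opcode itself, roughly like the sketch below. (Again, stand-in opcode values rather than the real ARM:: opcodes; this is only meant to show the shape of the check.)

```cpp
#include <cassert>

// Stand-in opcodes for illustration; the real code would switch over
// ARM::LDMIA, ARM::STMIA, ARM::VLDMDIA, etc. from ARMGenInstrInfo.
enum Opcode { LDMIA, STMIA, VLDMDIA, VSTMDIA, VLDRD, VSTRD, ADDri };

// An instruction can take the writeback (_UPD) form directly only if it is
// already a load/store-multiple; VLDR/VSTR would need to be converted to
// VLDM/VSTM first, and anything else doesn't qualify at all.
bool hasWriteBackOption(Opcode Opc) {
  switch (Opc) {
  case LDMIA:
  case STMIA:
  case VLDMDIA:
  case VSTMDIA:
    return true;
  default:
    return false;
  }
}
```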
> }
> }
>
> // Work the list to try and merge base update into the load/store operation
> for (Worklist::iterator WI = MemOps.begin(), WE = MemOps.end();
> WI != WE; ++WI) {
> - MachineInstr *MI = *WI;
> + MachineInstr *MI = WI->first;
> + bool NeedsConversion = WI->second;
> DEBUG(dbgs() << "Trying to merge increment for " << *MI);
> +
> + // Change vldr/vstr to vldmia/vstmia to try and merge increment
> + if (NeedsConversion) {
> + DEBUG(dbgs() << " first converting to op with writeback mode\n");
> + MI = convertVLSRtoVLSMIA(MBB, MI, TII);
> + DEBUG(dbgs() << " converted to: " << *MI);
> + }
> +
> + // Try and merge base update into the load/store multiple instruction
> if(MergeBaseUpdateLSMultiple(MBB, MI))
> ++NumMerges;
> +
> + // Undo change of vldr/vstr to vldmia/vstmia if merge failed
> + else if (NeedsConversion) {
> + DEBUG(dbgs() << " merge failed, unconverting to vldr/vstr\n");
> + MI = convertVLSMIAtoVLSR(MBB, MI, TII);
> + DEBUG(dbgs() << " unconverted to: " << *MI);
> + }
> }
>
> NumIncMerges += NumMerges; // update the statistic
> diff --git a/test/CodeGen/ARM/vlsm_upd.ll b/test/CodeGen/ARM/vlsm_upd.ll
> index 4e28dce..c24d439 100644
> --- a/test/CodeGen/ARM/vlsm_upd.ll
> +++ b/test/CodeGen/ARM/vlsm_upd.ll
> @@ -1,5 +1,7 @@
> ; RUN: llc < %s -march=arm -mattr=+neon\
> ; RUN: -enable-lsm-with-auto-increment=true\
> +; RUN: -enable-lsm-writeback-convert=true\
> +; RUN: -verify-machineinstrs\
> ; RUN: | FileCheck %s
> ; Test cases for the enable-lsm-with-auto-increment optimization that folds the
> ; increment of the base address into the load/store multiple instruction
> @@ -24,5 +26,62 @@ define void @addVec_4x4() nounwind {
> ret void
> }
>
> + at B2x4 = common global [100 x <2 x float>] zeroinitializer, align 16
> + at A2x4 = common global [100 x <2 x float>] zeroinitializer, align 16
> +declare void @v2x4(<2 x float>*, <2 x float>*)
> +
> +;CHECK: addVec_2x4:
> +define void @addVec_2x4() nounwind {
> + %bi = getelementptr inbounds [100 x <2 x float>]* @B2x4, i32 0, i32 0
> +;CHECK: vldmia r1!
> + %v1 = load <2 x float>* %bi, align 16, !tbaa !0
> + %ai = getelementptr inbounds [100 x <2 x float>]* @A2x4, i32 0, i32 0
> + %v2 = load <2 x float>* %ai, align 16, !tbaa !0
> + %v3 = fadd <2 x float> %v1, %v2
> +;CHECK: vstmia r0!
> + store <2 x float> %v3, <2 x float>* %ai, align 16, !tbaa !0
> + %bi_1 = getelementptr [100 x <2 x float>]* @B2x4, i32 0, i32 1
> + %ai_1 = getelementptr [100 x <2 x float>]* @A2x4, i32 0, i32 1
> + call void @v2x4(<2 x float>* %ai_1, <2 x float>* %bi_1);
> + ret void
> +}
> +
> + at B1x4 = common global [100 x <1 x float>] zeroinitializer, align 16
> + at A1x4 = common global [100 x <1 x float>] zeroinitializer, align 16
> +declare void @v1x4(<1 x float>*, <1 x float>*)
> +
> +;CHECK: addVec_1x4:
> +define void @addVec_1x4() nounwind {
> + %bi = getelementptr inbounds [100 x <1 x float>]* @B1x4, i32 0, i32 0
> +;CHECK: vldmia r1!
> + %v1 = load <1 x float>* %bi, align 16, !tbaa !0
> + %ai = getelementptr inbounds [100 x <1 x float>]* @A1x4, i32 0, i32 0
> + %v2 = load <1 x float>* %ai, align 16, !tbaa !0
> + %v3 = fadd <1 x float> %v1, %v2
> +;CHECK: vstmia r0!
> + store <1 x float> %v3, <1 x float>* %ai, align 16, !tbaa !0
> + %bi_1 = getelementptr [100 x <1 x float>]* @B1x4, i32 0, i32 1
> + %ai_1 = getelementptr [100 x <1 x float>]* @A1x4, i32 0, i32 1
> + call void @v1x4(<1 x float>* %ai_1, <1 x float>* %bi_1);
> + ret void
> +}
> +
> +; Make sure we correctly output the vldr/vstr instructions if we
> +; do not convert them to vldm/vstm for autoincrement optimization
> +;CHECK: addVec_1x4_roundtrip:
> +define void @addVec_1x4_roundtrip() nounwind {
> + %bi = getelementptr inbounds [100 x <1 x float>]* @B1x4, i32 0, i32 0
> +;CHECK: vldr
> +;CHECK-NOT: vldmia r1!
> + %v1 = load <1 x float>* %bi, align 16, !tbaa !0
> + %ai = getelementptr inbounds [100 x <1 x float>]* @A1x4, i32 0, i32 0
> + %v2 = load <1 x float>* %ai, align 16, !tbaa !0
> + %v3 = fadd <1 x float> %v1, %v2
> +;CHECK: vstr
> +;CHECK-NOT: vstmia r0!
> + store <1 x float> %v3, <1 x float>* %ai, align 16, !tbaa !0
> + ret void
> +}
> +
> !0 = metadata !{metadata !"omnipotent char", metadata !1}
> !1 = metadata !{metadata !"Simple C/C++ TBAA"}
> --
> 1.7.8.3
>
On Sep 8, 2012, at 2:19 PM, David Peixotto <dpeixott at codeaurora.org> wrote:
> Ok, I've addressed the comments from the review. I've attached a new patch
> to this email.
>
> I also fixed all the issues found by -verify-machineinstrs. Those fixes
> belonged in the other patch in this patch set (see [PATCH 1/2] Improve
> optimization to fold base update into LDM/STM), and the patch in this email
> already depends on that other patch.
>
> I've also uploaded a new version of the other patch. Once it is reviewed and
> approved both patches are ready to be applied.
>
> Thanks,
> Dave
>
> -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
> by The Linux Foundation
>
>> -----Original Message-----
>> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
>> bounces at cs.uiuc.edu] On Behalf Of David Peixotto
>> Sent: Thursday, September 06, 2012 6:44 PM
>> To: 'Jim Grosbach'
>> Cc: llvm-commits at cs.uiuc.edu
>> Subject: Re: [llvm-commits] [PATCH 2/2] Convert VLDR/VSTR on demand for
>> base update optimization
>>
>> Thanks for the feedback Jim!
>>
>> Thanks for the tip about -verify-machineinstrs. It found a few things to
>> complain about and I cleaned those up. I'll also apply your suggestions
>> below and submit a new patch. Thanks!
>>
>> -Dave
>>
>> -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>> hosted by The Linux Foundation
>>
>> -----Original Message-----
>> From: Jim Grosbach [mailto:grosbach at apple.com]
>> Sent: Wednesday, September 05, 2012 5:21 PM
>> To: David Peixotto
>> Cc: llvm-commits at cs.uiuc.edu
>> Subject: Re: [llvm-commits] [PATCH 2/2] Convert VLDR/VSTR on demand for
>> base update optimization
>>
>> Nice!
>>
>> Style note:
>> +static MachineInstr* convertVLSRtoVLSMIA(MachineBasicBlock &MBB,
>> + MachineInstr *MI,
>> + const TargetInstrInfo *TII) {
>>
>> The '{' should go on the preceding line, not on a new line. Similarly for
>> the other functions.
>>
>>
>> In convertVLSRtoVLSMIA:
>>
>> + MachineBasicBlock::iterator MBBI(MI);
>> + MachineInstrBuilder MIB = BuildMI(MBB, MBBI, MI->getDebugLoc(),
>> +                                   TII->get(NewOpc));
>> +
>> + assert(MI->getOperand(2).isImm() && MI->getOperand(2).getImm() == 0 &&
>> +        "VLDR has non-zero pre-increment value");
>> + MIB.addOperand(MI->getOperand(1)); // Base address
>> + MIB.addOperand(MI->getOperand(3)); // Predicate Imm
>> + MIB.addOperand(MI->getOperand(4)); // Predicate Register
>> + MIB.addOperand(MI->getOperand(0)); // VPR register
>>
>> The VPR register should probably be marked as a def for loads.
>>
>> In convertVLSMIAtoVLSR:
>>
>> + MachineBasicBlock::iterator MBBI(MI);
>> + MachineInstrBuilder MIB = BuildMI(MBB, MBBI, MI->getDebugLoc(),
>> +                                   TII->get(NewOpc));
>> +
>> + // There should be only one VPR register since this instruction was
>> + // originally converted from a VLDR/VSTR
>> + MIB.addOperand(MI->getOperand(3)); // VPR register
>> + MIB.addOperand(MI->getOperand(0)); // Base address
>> + MIB.addImm(0); // Offset
>> + MIB.addOperand(MI->getOperand(1)); // Predicate Imm
>> + MIB.addOperand(MI->getOperand(2)); // Predicate Register
>> +
>>
>> The VPR register should definitely be marked as a def for loads.
>>
>> If you haven't already, make sure to run tests with the machine verifier
>> enabled after the pass runs. It'll help catch lots of things.
>>
>>
>> LGTM with those updates.
>>
>> -Jim
>>
>>
>> On Aug 31, 2012, at 11:02 AM, David Peixotto <dpeixott at codeaurora.org>
>> wrote:
>>
>>> Please review the attached patch. Thanks!
>>>
>>> This commit allows the base-address update optimization to work for
>>> VLDR and VSTR. There is no writeback version of VLDR/VSTR, so we
>>> cannot directly fold the update of the base address into the
>>> instruction itself. We must first transform the VLDR/VSTR to VLDM/VSTM
>>> and then use the writeback version of those instructions. For example
>>> we perform the following transformation,
>>>
>>> VLDR D0, [R0]
>>> ADD R0, R0, #8
>>> ==>
>>> VLDM R0!, {D0}
>>>
>>> The VLDR/VSTR instructions will only be changed to VLDM/VSTM if an
>>> update of the base address register can be folded into the
>>> instruction. Other VLDR/VSTR instructions are left undisturbed.
>>>
>>> -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>> hosted by The Linux Foundation
>>>
>>>
>>>
>>> <0002-Convert-VLDR-VSTR-on-demand-for-base-update-optimiza.patch>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> <0002-Convert-VLDR-VSTR-on-demand-for-base-update-optimiza.patch>