[PATCH] Add Cortex-A9 scheduling classes for vldm/vstm instructions that access more than 32 bytes

Silviu Baranga Silviu.Baranga at arm.com
Wed Sep 4 10:08:56 PDT 2013


Great! Thanks for your help. Committed as r189958.

- Silviu

From: Andrew Trick [mailto:atrick at apple.com]
Sent: 04 September 2013 17:22
To: Silviu Baranga
Cc: Arnold Schwaighofer; llvm commits; Renato Golin
Subject: Re: [PATCH] Add Cortex-A9 scheduling classes for vldm/vstm instructions that access more than 32 bytes


On Sep 4, 2013, at 6:46 AM, Silviu Baranga <Silviu.Baranga at arm.com> wrote:


Hi Andy,

Thanks for that great description! The explanation makes sense and I would like to take a stab at this (patch attached).

That looks great. Please commit. And thanks a lot!
-Andy



If it's completely wrong, I'll hand it over to you.

Thanks,
Silviu


From: Andrew Trick [mailto:atrick at apple.com]
Sent: 04 September 2013 02:04
To: Silviu Baranga
Cc: Arnold Schwaighofer; llvm commits; Renato Golin
Subject: Re: [PATCH] Add Cortex-A9 scheduling classes for vldm/vstm instructions that access more than 32 bytes


On Sep 3, 2013, at 2:02 AM, Silviu Baranga <Silviu.Baranga at arm.com> wrote:



That was a dumb mistake. Should be fixed now. Is it OK to commit?

@Renato: thanks for your help progressing this!

Hi Silviu,

I apologize for missing this when you first sent it. Thanks, Renato, for getting on my case.

I really appreciate that you provided a test case and improved the tablegen syntax using A9WriteLMOpsListType and sequence ranges. I know this particular problem has been spotted before, but we must have forgotten to fix it. When I wrote this part of the A9 model, I expected getNumLDMAddresses() to return 8 for a 64-byte VLDM (each address loads 64 bits)--hence the assert that you see. Now it actually returns 16 (for 32-bit loads). Unfortunately, your patch propagates the bug to the extent that we no longer notice it. Changing getNumLDMAddresses() again is probably not worth doing, given that it's properly implemented for Swift, but it shouldn't be hard to fix the A9 machine model.

The LDM machine model is a nightmare for a few reasons. Mainly because the PostRA form of the instruction was never ported to the new register tuple framework. Partly because we have no good way of determining the size of the load. Partly because the machine model simultaneously handles LDM and VLDM, S and D register form, and resource and latency. Let me explain how it works:

Looking at the PostRA form, which is what you're concerned with:

          VLDMDIA %SP, pred:14, pred:%noreg, %D16<def>, %D17<def>, %D18<def>, %D19<def>, %D20<def>, %D21<def>, %D22<def>, %D23<def>, %Q8_Q9_Q10_Q11<imp-def>; mem:LD64

The machine model dictates that we define a list of SchedWrite types at least as long as the list of explicit def operands (>= 8). Ideally we have:

def A9WriteLMfpPostRA : SchedWriteVariant<[
...
  SchedVar<A9LMAdr16Pred, [A9WriteLMfp1,
                           A9WriteLMfp2,
                           A9WriteLMfp3,
                           A9WriteLMfp4,
                           A9WriteLMfp5,
                           A9WriteLMfp6,
                           A9WriteLMfp7,
                           A9WriteLMfp8]>,
...

Where the latency of each A9WriteLMfp#N is N cycles, and the total instruction resources are the sum of each def's resources: 8 LoadStore units and 8 FP units.
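The bookkeeping described here can be sketched outside TableGen. The C++ below is only an illustration, assuming that each A9WriteLMfp#N entry consumes one LoadStore unit and one FP unit and carries a latency of N cycles; the WriteModel and Totals types and both function names are hypothetical and are not part of LLVM.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one SchedWrite entry; not an LLVM type.
struct WriteModel {
  int latency;   // cycles until the def at this operand position is ready
  int loadStore; // LoadStore units consumed by this entry (assumed 1)
  int fp;        // FP units consumed by this entry (assumed 1)
};

struct Totals {
  int loadStore = 0;
  int fp = 0;
};

// Build the list A9WriteLMfp1..A9WriteLMfpN under the stated assumption.
std::vector<WriteModel> makeLMfpList(int numDefs) {
  assert(numDefs >= 1);
  std::vector<WriteModel> writes;
  for (int n = 1; n <= numDefs; ++n)
    writes.push_back({n, 1, 1});
  return writes;
}

// The instruction's resources are the sum over every entry in the list.
Totals sumResources(const std::vector<WriteModel> &writes) {
  Totals t;
  for (const WriteModel &w : writes) {
    t.loadStore += w.loadStore;
    t.fp += w.fp;
  }
  return t;
}
```

For the eight D-register defs in the VLDMDIA above, sumResources(makeLMfpList(8)) yields 8 LoadStore units and 8 FP units, and the last def has a latency of 8 cycles.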

Now, a 64-byte VLDM could in theory contain 16 S-register defs. We want to allow the machine model to handle this case, but don't really care about the accuracy. We can just reuse the same 8 D-register scheduling class as such:

  SchedVar<A9LMAdr16Pred, [A9WriteLMfp1,
                           A9WriteLMfp2,
                           A9WriteLMfp3,
                           A9WriteLMfp4,
                           A9WriteLMfp5,
                           A9WriteLMfp6,
                           A9WriteLMfp7,
                           A9WriteLMfp8,
                           A9WriteLMfp5Hi,
                           A9WriteLMfp5Hi,
                           A9WriteLMfp6Hi,
                           A9WriteLMfp6Hi,
                           A9WriteLMfp7Hi,
                           A9WriteLMfp7Hi,
                           A9WriteLMfp8Hi,
                           A9WriteLMfp8Hi]>,

The SchedWrite types with "Hi" suffix do not take any processor resources. They only convey the latency to the register def operand at that position--in the D register case they won't have any effect. The first four odd-number S registers have an extra cycle of latency at the cost of reusing the model. You could express a perfect model, but would need even more predicates.
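To make the role of the "Hi" entries concrete, here is a minimal C++ sketch of the 16-entry list. It assumes only what is stated above: "Hi" entries carry a latency but consume no resources. SchedEntry, makeSixteenEntryList, resourceEntries, and latencyAt are hypothetical names used for illustration, not LLVM APIs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one list entry; "Hi" entries carry latency only.
struct SchedEntry {
  int latency;
  bool usesResources; // false for the "Hi" variants
};

// The 16-entry list from the example: A9WriteLMfp1..8, followed by
// 5Hi, 5Hi, 6Hi, 6Hi, 7Hi, 7Hi, 8Hi, 8Hi.
std::vector<SchedEntry> makeSixteenEntryList() {
  std::vector<SchedEntry> entries;
  for (int n = 1; n <= 8; ++n)
    entries.push_back({n, true});
  for (int n = 5; n <= 8; ++n) {
    entries.push_back({n, false});
    entries.push_back({n, false});
  }
  return entries;
}

// Number of entries that contribute to the instruction's resource total.
int resourceEntries(const std::vector<SchedEntry> &entries) {
  int count = 0;
  for (const SchedEntry &e : entries)
    if (e.usesResources)
      ++count;
  return count;
}

// Latency conveyed to the def operand at a given position in the list.
int latencyAt(const std::vector<SchedEntry> &entries, int pos) {
  assert(pos >= 0 && pos < static_cast<int>(entries.size()));
  return entries[pos].latency;
}
```

With eight D-register defs only positions 0 to 7 are consulted, so the "Hi" entries have no effect. With sixteen S-register defs, positions 8 to 15 report 5, 5, 6, 6, 7, 7, 8, 8 cycles, while the resource total still comes from the first eight entries.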

Another way to cut the number of scheduling classes in half is to notice that the number of resources used is the same for even/odd LDM/VLDM pairs.

def A9LMAdr#NumAddr#Pred :
  SchedPredicate<"(TII->getNumLDMAddresses(MI)+1)/2 == "#NumAddr>;
...
  SchedVar<A9LMAdr1Pred, [A9WriteLMfp1,
                          A9WriteLMfp1Hi]>,
  SchedVar<A9LMAdr2Pred, [A9WriteLMfp1,
                          A9WriteLMfp2,
                          A9WriteLMfp2Hi,
                          A9WriteLMfp2Hi]>,
  SchedVar<A9LMAdr3Pred, [A9WriteLMfp1,
                          A9WriteLMfp2,
                          A9WriteLMfp3,
                          A9WriteLMfp2Hi,
                          A9WriteLMfp3Hi,
                          A9WriteLMfp3Hi]>,
...

Then you're back to only 8 predicates, with A9LMAdr8Pred covering your 64-byte case (which is what I originally intended). The only differences from the current model are the fix to the NumLDMAddresses predicate and optimizing for the D-register case rather than the S-register case (plus the test case and tablegen niceness that you added).
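The halving done by the predicate can be checked with a few lines of C++. This is a sketch of the arithmetic only: numAddresses is a plain integer standing in for the value of TII->getNumLDMAddresses(MI), and lmAdrBucket and matchesLMAdrPred are hypothetical helper names, not LLVM functions.

```cpp
#include <cassert>

// Mirrors the arithmetic in the SchedPredicate string:
//   (TII->getNumLDMAddresses(MI) + 1) / 2 == NumAddr
// Integer division means an odd count rounds up to share a bucket with the
// next even count.
int lmAdrBucket(int numAddresses) {
  assert(numAddresses >= 1);
  return (numAddresses + 1) / 2;
}

// True when an instruction with numAddresses would select A9LMAdr<numAddr>Pred.
bool matchesLMAdrPred(int numAddresses, int numAddr) {
  return lmAdrBucket(numAddresses) == numAddr;
}
```

Counts 1 and 2 map to bucket 1, 3 and 4 to bucket 2, and so on up to 15 and 16, which map to bucket 8. A 64-byte VLDM, for which getNumLDMAddresses() returns 16 as described earlier, is therefore matched by A9LMAdr8Pred.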

If my explanation actually makes sense to you, please take a crack at fixing it this way. Otherwise, feel free to hand it off to me.

-Andy

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782
<A9SchedLDM.diff>

