[PATCH] Add Cortex-A9 scheduling classes for vldm/vstm instructions that access more than 32 bytes

Tue Sep 3 18:04:04 PDT 2013

On Sep 3, 2013, at 2:02 AM, Silviu Baranga <Silviu.Baranga at arm.com> wrote:

> That was a dumb mistake.. Should be fixed now. Is it OK to commit?
> 
> @Renato: thanks for your help progressing this!

Hi Silviu,

I apologize for missing this when you first sent it. Thanks, Renato, for getting on my case.

I really appreciate that you provided a test case and improved the tablegen syntax using A9WriteLMOpsListType and sequence ranges. I know this particular problem has been spotted before, but we must have forgotten to fix it. When I wrote this part of the A9 model, I expected getNumLDMAddresses() to return 8 for a 64-byte VLDM (each address loads 64 bits)--hence the assert that you see. Now it actually returns 16 (for 32-bit loads). Unfortunately, your patch propagates the bug to the extent that we no longer notice it. Changing getNumLDMAddresses() again is probably not worth doing, given that it's properly implemented for Swift, but it shouldn't be hard to fix the A9 machine model.

The LDM machine model is a nightmare for a few reasons: mainly because the PostRA form of the instruction was never ported to the new register tuple framework, partly because we have no good way of determining the size of the load, and partly because the machine model simultaneously handles LDM and VLDM, S and D register forms, and resource and latency. Let me explain how it works:

Looking at the PostRA form, which is what you're concerned with:

	VLDMDIA %SP, pred:14, pred:%noreg, %D16<def>, %D17<def>, %D18<def>, %D19<def>, %D20<def>, %D21<def>, %D22<def>, %D23<def>, %Q8_Q9_Q10_Q11<imp-def>; mem:LD64

The machine model dictates that we define a list of SchedWrite types at least as long as the list of explicit def operands (>= 8). Ideally we have:

def A9WriteLMfpPostRA : SchedWriteVariant<[
...
  SchedVar<A9LMAdr16Pred, [A9WriteLMfp1,
                           A9WriteLMfp2,
                           A9WriteLMfp3,
                           A9WriteLMfp4,
                           A9WriteLMfp5,
                           A9WriteLMfp6,
                           A9WriteLMfp7,
                           A9WriteLMfp8]>,
...

Where the latency of each A9WriteLMfp#N is N cycles, and the total instruction resources are the sum of each def's resources: 8 LoadStore units and 8 FP units.
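
As a rough check of that arithmetic, here is a small Python sketch (a toy model, not TableGen and not part of the patch). It assumes each A9WriteLMfp#N entry has a latency of N cycles and consumes one LoadStore unit and one FP unit; the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteType:
    """Toy stand-in for a SchedWrite type (hypothetical, for illustration)."""
    name: str
    latency: int           # cycles until the def at this position is ready
    load_store_units: int  # LoadStore units consumed (assumed 1 per entry)
    fp_units: int          # FP units consumed (assumed 1 per entry)


def lm_fp_write(n):
    """Model A9WriteLMfp<n>: latency n, one LoadStore unit, one FP unit."""
    return WriteType(f"A9WriteLMfp{n}", n, 1, 1)


def total_resources(writes):
    """Sum the per-def resources into (LoadStore units, FP units)."""
    return (sum(w.load_store_units for w in writes),
            sum(w.fp_units for w in writes))


# The eight D-register defs of a 64-byte VLDM.
d_reg_writes = [lm_fp_write(n) for n in range(1, 9)]
latencies = [w.latency for w in d_reg_writes]

print(latencies)
print(total_resources(d_reg_writes))
```

Under these assumptions the eight entries give latencies 1 through 8 and a total of 8 LoadStore units and 8 FP units, matching the figures above.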

Now, a 64-byte VLDM could in theory contain 16 S-register defs. We want to allow the machine model to handle this case, but don't really care about the accuracy. We can just reuse the same 8 D-register scheduling class as such:

  SchedVar<A9LMAdr16Pred, [A9WriteLMfp1,
                           A9WriteLMfp2,
                           A9WriteLMfp3,
                           A9WriteLMfp4,
                           A9WriteLMfp5,
                           A9WriteLMfp6,
                           A9WriteLMfp7,
                           A9WriteLMfp8,
                           A9WriteLMfp5Hi,
                           A9WriteLMfp5Hi,
                           A9WriteLMfp6Hi,
                           A9WriteLMfp6Hi,
                           A9WriteLMfp7Hi,
                           A9WriteLMfp7Hi,
                           A9WriteLMfp8Hi,
                           A9WriteLMfp8Hi]>,

The SchedWrite types with the "Hi" suffix do not take any processor resources. They only convey the latency to the register def operand at that position--in the D register case they won't have any effect. The cost of reusing the model is that the first four odd-numbered S registers have an extra cycle of latency. You could express a perfect model, but it would need even more predicates.
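
To illustrate why the extra "Hi" entries are harmless for the D-register form, here is a toy Python sketch (not TableGen). It assumes A9WriteLMfp#N uses one LoadStore unit and one FP unit, and that A9WriteLMfp#NHi has latency N and uses no resources; the tuple layout and helper names are hypothetical.

```python
def write(n, hi=False):
    """Return (name, latency, load_store_units, fp_units) for a toy write type.

    Assumption: "Hi" entries carry latency n but consume no resources.
    """
    if hi:
        return (f"A9WriteLMfp{n}Hi", n, 0, 0)
    return (f"A9WriteLMfp{n}", n, 1, 1)


def total_resources(writes):
    """Sum (LoadStore units, FP units) over every entry in the list."""
    return (sum(w[2] for w in writes), sum(w[3] for w in writes))


# The 16-entry list: eight full entries, then eight latency-only entries.
sixteen_entry_list = (
    [write(n) for n in range(1, 9)]
    + [write(n, hi=True) for n in (5, 5, 6, 6, 7, 7, 8, 8)]
)

# An 8 x D-register VLDM has only eight explicit defs, so only the first
# eight entries ever map onto an operand.
d_reg_latencies = [w[1] for w in sixteen_entry_list[:8]]

print(total_resources(sixteen_entry_list))
print(d_reg_latencies)
```

In this sketch the resource total stays at 8 LoadStore units and 8 FP units even though the list has 16 entries, because the trailing entries contribute latency only.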

Another way to cut the number of scheduling classes in half is to notice that the number of resources used is the same for even/odd LDM/VLDM pairs.

def A9LMAdr#NumAddr#Pred :
  SchedPredicate<"(TII->getNumLDMAddresses(MI)+1)/2 == "#NumAddr>;
...
  SchedVar<A9LMAdr1Pred, [A9WriteLMfp1,
                          A9WriteLMfp1Hi]>,
  SchedVar<A9LMAdr2Pred, [A9WriteLMfp1,
                          A9WriteLMfp2,
                          A9WriteLMfp2Hi,
                          A9WriteLMfp2Hi]>,
  SchedVar<A9LMAdr3Pred, [A9WriteLMfp1,
                          A9WriteLMfp2,
                          A9WriteLMfp3,
                          A9WriteLMfp2Hi,
                          A9WriteLMfp3Hi,
                          A9WriteLMfp3Hi]>,
...

Then you're back to only 8 predicates, with A9LMAdr8Pred covering your 64-byte case (which is what I originally intended). The only differences from the current model are the fix to the NumLDMAddresses predicate and optimizing for the D-register case rather than the S-register case (plus the test case and tablegen niceness that you added).
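
A quick Python sketch of the halving predicate and the resulting lists may help when writing the tablegen. It mirrors the expression (TII->getNumLDMAddresses(MI)+1)/2 for positive counts. The rule used for the "Hi" names (position // 2 + 1) is only inferred from the lists shown above, so treat it as an assumption; the function names are hypothetical.

```python
def predicate_index(num_ldm_addresses):
    """Mirror (getNumLDMAddresses + 1) / 2 using integer division."""
    return (num_ldm_addresses + 1) // 2


def write_list(num_addr):
    """Build the SchedWrite names for A9LMAdr<num_addr>Pred.

    The first num_addr entries are the full write types. The remaining
    num_addr entries are latency-only "Hi" types, named by the inferred
    rule: the entry at 0-based position pos gets latency pos // 2 + 1.
    """
    full = [f"A9WriteLMfp{n}" for n in range(1, num_addr + 1)]
    hi = [f"A9WriteLMfp{pos // 2 + 1}Hi"
          for pos in range(num_addr, 2 * num_addr)]
    return full + hi


# Address counts 1..16 collapse onto predicates 1..8.
for count in range(1, 17):
    print(count, "-> A9LMAdr%dPred" % predicate_index(count))

print(write_list(3))
```

With this mapping, both 15 and 16 reported addresses select A9LMAdr8Pred, so a 64-byte VLDM is covered whether the defs are 8 D registers or 16 S registers.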

If my explanation actually makes sense to you, please take a crack at fixing it this way. Otherwise, feel free to hand it off to me.

-Andy