[llvm-dev] Incorrect Cortex-R4/R4F/R5 ProcessorModel in ARM.td

Fri Oct 22 10:20:57 PDT 2021

Hello Benson,

I have to give a quick reply as I'm about to go on vacation for a couple of weeks. Again, I'm hoping my Arm colleagues can correct me if I'm wrong, as I'm interpreting what is in the TRM, which usually abstracts somewhat from the HW.

If we take out dual issue, then my understanding is that the sequence below would take 4 cycles: 3 for the VDIV and 1 for the ADD. If there were a data dependency, for example if the second instruction used d1, then the second instruction would stall for 63 (result latency) - 3 (cycles) = 60 cycles.
VDIV.F64 d1, d2, d3 // I'm assuming double precision registers here for VDIV.F64 as in table.
ADD r4, r5, r6
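
A minimal sketch of the arithmetic above, assuming a strictly single-issue, in-order reading of the TRM numbers. The helper names are hypothetical; only the values 3, 63 and 1 come from this thread.

```python
def single_issue_cycles(issue_cycles):
    """Total cycles when instructions issue strictly one after another."""
    return sum(issue_cycles)


def dependency_stall(result_latency, issue_cycles):
    """Extra cycles a dependent instruction waits once the producer has issued.

    Assumption: the stall is the result latency minus the producer's own
    issue cycles, never negative.
    """
    return max(0, result_latency - issue_cycles)


# Values quoted in this thread for VDIV.F64 and ADD.
VDIV_F64_CYCLES = 3
VDIV_F64_RESULT_LATENCY = 63
ADD_CYCLES = 1

total = single_issue_cycles([VDIV_F64_CYCLES, ADD_CYCLES])
stall = dependency_stall(VDIV_F64_RESULT_LATENCY, VDIV_F64_CYCLES)
print(total, stall)
```

Here `total` covers the independent case and `stall` the case where the second instruction reads d1.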

With dual issue, we can look at the permitted combinations, and I don't think any of them matches 64-bit CDP instructions like VDIV.F64. The 32-bit equivalent
VDIV.F32 s1, s2, s3
ADD r4, r5, r6
would dual issue under case F1 b,m:
| Any single precision CDP (exceptions...) | As for Case C (any data processing instruction) |

In this case the instruction with the fewest cycles is considered to take 0 cycles, so the total cycle count for
VDIV.F32 s1, s2, s3 // 2 cycles for VDIV.F32
ADD r4, r5, r6      // 1 cycle for ADD
I'd expect to be 2 cycles.
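
A minimal sketch of that rule, assuming that "the instruction with the fewest cycles is considered to take 0 cycles" means a dual-issued pair costs as much as its slower member. The function name is hypothetical, and checking that a pair is a permitted combination is not modelled here.

```python
def dual_issue_pair_cycles(cycles_a, cycles_b):
    """Cycles for a dual-issued pair: the cheaper instruction counts as 0."""
    return max(cycles_a, cycles_b)


# Values quoted in this thread for VDIV.F32 and ADD.
VDIV_F32_CYCLES = 2
ADD_CYCLES = 1

pair_total = dual_issue_pair_cycles(VDIV_F32_CYCLES, ADD_CYCLES)
print(pair_total)
```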

Hope that helps, and apologies for the hasty response.

Peter

> -----Original Message-----
> From: Chu, Benson <b-chu1 at ti.com>
> Sent: 22 October 2021 17:18
> To: Peter Smith <Peter.Smith at arm.com>; Phipps, Alan <a-phipps at ti.com>;
> llvm-dev at lists.llvm.org
> Subject: RE: Incorrect Cortex-R4/R4F/R5 ProcessorModel in ARM.td
> 
> Hey Peter,
> 
> Thanks for the reply, I was able to flesh out most of the R5 model with the
> information you had provided.
> 
> However, I had a question about the R5 TRM regarding the meaning of "Issue
> Cycles". The description of Issue Cycles says "the minimum number of cycles
> required to issue an instruction". Do issue cycles indicate that no other
> instructions will be issued for that amount of time?
> 
> For example, here's an entry from the timings chapter:
> 
> | Instruction               | Cycles | Early Regs | Result Latency |
> | VDIV.F64 <Dd>, <Dn>, <Dm> | 3      | <Dn>, <Dm> | 63             |
> 
> And let's say I have the following sequence:
> 
> VDIV r1, r2, r3
> ADD r4, r5, r6
> 
> Since there's no data dependence, these instructions should be issued right
> after one another. However, since VDIV has 3 "issue cycles", if VDIV is issued
> on cycle 0, does that mean ADD is issued on cycle 3? Or, are they both issued
> respectively on cycle 1 and 2, and "issue cycles" indicate something else?
> 
> (I am assuming that some aspects of the superscalar behavior come into play
> here, but I'm not sure how)
> 
> Thanks again!
> Benson
> 
> -----Original Message-----
> From: Peter Smith <Peter.Smith at arm.com>
> Sent: Thursday, October 14, 2021 11:24 AM
> To: Chu, Benson <b-chu1 at ti.com>; Phipps, Alan <a-phipps at ti.com>;
> llvm-dev at lists.llvm.org
> Subject: [EXTERNAL] RE: Incorrect Cortex-R4/R4F/R5 ProcessorModel in
> ARM.td
> 
> Hello,
> 
> I know just about enough to find the right file to describe the scheduling
> model, but I don't know much about the details myself. I'm hoping that one
> of my colleagues or someone knowing about scheduling in general can
> help/correct what I'm writing below.
> 
> From what I glean from:
> https://llvm.org/devmtg/2016-09/slides/Absar-SchedulingInOrder.pdf
> https://llvm.org/devmtg/2014-10/Slides/Estes-MISchedulerTutorial.pdf
> 
> The basics of superscalar modelling are captured by the IssueWidth:
> 
> def CortexR52Model : SchedMachineModel {
>   let MicroOpBufferSize = 0;  // R52 is in-order processor
>   let IssueWidth = 2;         // 2 micro-ops dispatched per cycle
>   let LoadLatency = 1;        // Optimistic, assuming no misses
>   let MispredictPenalty = 8;  // A branch direction mispredict, including PFU
>   let CompleteModel = 0;      // Covers instructions applicable to cortex-r52.
> }
> 
> I would expect the forwarding information to be useful too, since to dual
> issue certain pairs their dependencies would need to be available.
> 
> // Forwarding information - based on when an operand is read
> def : ReadAdvance<R52Read_ISS, 0>;
> def : ReadAdvance<R52Read_EX1, 1>;
> def : ReadAdvance<R52Read_EX2, 2>;
> def : ReadAdvance<R52Read_F0, 0>;
> def : ReadAdvance<R52Read_F1, 1>;
> def : ReadAdvance<R52Read_F2, 2>;
> 
> From https://llvm.org/devmtg/2016-09/slides/Absar-SchedulingInOrder.pdf,
> assuming it still holds (it is from 5 years ago), "LLVM Scheduler - What's
> missing?":
> * Instructions with slot constraints
> ** Cannot issue in second slot - specification and pickNode changes
> ** Cannot issue with any other - micro-ops
> ** Cannot issue with specific another - reliance on resource constraint (not
> adequate)
> * Inter-lock constraint modelling
> ** Cannot slow down previous instruction
> * First-half, second-half and in-stage forwarding
> ** Further divide pipeline stages
> * Variadic instructions
> ** SchedPredicate, SchedVariant - an alternate compact representation
> necessary
> 
> It may be that more complex superscalar constraints cannot be modelled.
> 
> Hope that helps
> 
> Peter
> 
> > -----Original Message-----
> > From: Chu, Benson <b-chu1 at ti.com>
> > Sent: 14 October 2021 16:17
> > To: Peter Smith <Peter.Smith at arm.com>; Phipps, Alan <a-phipps at ti.com>;
> > llvm-dev at lists.llvm.org
> > Subject: RE: Incorrect Cortex-R4/R4F/R5 ProcessorModel in ARM.td
> >
> > Hey Peter,
> >
> > I've begun looking into adapting the model for the R52 into a model
> > for the R5.
> >
> > Tweaking the instruction timings and removing V8-r specific stuff has
> > been mostly straightforward, and I'm seeing about a 3% improvement in
> > benchmarks like coremark.
> >
> > However, the R5 rules on which instructions can be dual issued are
> > different from the R52, and I don't see how the superscalar behavior
> > is modeled in the existing R52 schedule.
> >
> > Would you happen to know what part of the R52 tablegen file is for
> > modeling the superscalar behavior?
> >
> > Thanks,
> > Benson
> >
> > -----Original Message-----
> > From: Peter Smith <Peter.Smith at arm.com>
> > Sent: Wednesday, September 23, 2020 11:55 AM
> > To: Phipps, Alan <a-phipps at ti.com>; llvm-dev at lists.llvm.org
> > Subject: [EXTERNAL] Re: Incorrect Cortex-R4/R4F/R5 ProcessorModel in
> > ARM.td
> >
> > Hello Alan,
> >
> > Looking at the public information for Cortex-R5
> > (https://developer.arm.com/ip-products/processors/cortex-r/cortex-r5)
> > and Cortex-R52
> > (https://developer.arm.com/ip-products/processors/cortex-r/cortex-r52)
> > shows that both are in-order with similar length pipelines. It is
> > possible that the Cortex-R52 scheduling model may match the Cortex-R5
> > more closely than the choices available at the time that Cortex-R5 was
> > upstreamed.
> >
> > I haven't written a schedule model myself. My understanding of the
> > process is that the technical reference manual or any other publicly
> > available information about the micro-architecture is used to provide
> > initial values for the model. Then it is a matter of refinement
> > against as many benchmarks as you can run.
> >
> > I think if empirically the Cortex-R52 model is producing better
> > results than the Cortex-A8 then it could be possible to adapt the
> > model for the Cortex-R5 by removing the parts specific to V8-R and
> > tweaking parameters based on cycle times from the technical reference
> > manual (TRM). I'm sure we could find someone to review a patch if
> > there is a good enough set of benchmarks showing that a model is better
> > than the Cortex-A8.
> >
> > The technical reference manual for the Cortex-R5:
> > https://developer.arm.com/documentation/ddi0460/c/
> >
> > Peter
> >
> > ________________________________________
> > From: Phipps, Alan <a-phipps at ti.com>
> > Sent: 23 September 2020 17:24
> > To: Peter Smith; llvm-dev at lists.llvm.org
> > Subject: RE: Incorrect Cortex-R4/R4F/R5 ProcessorModel in ARM.td
> >
> > Thanks, Peter, for your response.  Right -- certainly not incorrect in
> > the sense of generating an incorrect schedule, but definitely seems
> > suboptimal.
> >
> > I've also noticed that if I experimentally base the v7-r model on the
> > Cortex-R52 ProcessorModel (or even build for Cortex-R52), I achieve a
> > better schedule than if it were based on cortex-a8, and I see a 2%-3%
> > performance improvement on benchmarks like Coremark running on
> > cortex-r5 hardware.
> > Do you know why that might be the case?  Can you suggest other, more
> > straightforward ways one might improve performance scheduling for
> > cortex-r5 if there aren't any plans to develop a custom model for v7-r?
> >
> > Thanks for your help,
> >
> > -Alan
> >
> > -----Original Message-----
> > From: Peter Smith [mailto:Peter.Smith at arm.com]
> > Sent: Wednesday, September 23, 2020 11:06 AM
> > To: llvm-dev at lists.llvm.org; Phipps, Alan
> > Subject: [EXTERNAL] Re: Incorrect Cortex-R4/R4F/R5 ProcessorModel in
> > ARM.td
> >
> > Hello Alan,
> >
> > Using a cortex-a8 scheduling model for v7-r CPUs may not be optimal,
> > but I wouldn't go as far as to call it incorrect. The cortex-r4,
> > cortex-r4f and cortex-r5 are in-order cores, and cortex-a8 (another
> > in-order core) is the closest match. We don't have any current plans
> > to develop a custom scheduling model for r4, r4f or r5.
> >
> > Peter
> >
> > ________________________________________
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Phipps,
> > Alan via llvm-dev <llvm-dev at lists.llvm.org>
> > Sent: 23 September 2020 15:27
> > To: llvm-dev at lists.llvm.org
> > Subject: [llvm-dev] Incorrect Cortex-R4/R4F/R5 ProcessorModel in
> > ARM.td
> >
> > In ARM.td, I see that the ProcessorModel for cortex-r4, cortex-r4f,
> > and
> > cortex-r5 (as well as r7 and r8) is based on "CortexA8Model", which
> > seems incorrect.  When this was added in 2015, there were also
> > comments associated with this configuration, such as "// FIXME: R5 has
> > currently the same ProcessorModel as A8" (later removed).  The
> > processor model for
> > Cortex-r52 appears to be correct and corresponds to an associated
> > "CortexR52Model".
> >
> > Does anyone know why r4/r4f/r5 were set up based on "CortexA8Model"?
> >
> > Is there a plan to upstream a fix to correct this?
> >
> > Thanks!
> >
> > Alan Phipps