[PATCH] D103955: [MCA] Use LSU for the in-order pipeline

Wed Jul 7 04:25:29 PDT 2021

andreadb added inline comments.

================
Comment at: llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s:95-96
+
+# CHECK:      [0,0]     DeeeE.    ..   str	x1, [x10]
+# CHECK-NEXT: [0,1]     .DeeeE    ..   str	x1, [x10]
+# CHECK-NEXT: [0,2]     .DeeE.    ..   ldr	x2, [x10]
----------------
dmgreen wrote:
> andreadb wrote:
> > andreadb wrote:
> > > dmgreen wrote:
> > > > I think I would expect most CPU's to work like this, whether the addresses alias or not :)
> > > You mean the store sequence. Of course.
> > > 
> > > My concern was related to instructions that appear to commit out of order like the load and the nop after it.
> > > We have flag RetireOOO for cases where we want to allow it.
> > If instead you are concerned about whether this patch might end up delaying the second store, then don't worry. That's not how flag -noalias should work: it only affects interactions between loads and stores. It is about whether a younger load is allowed to pass an older store. It should not affect pairs of adjacent stores.
> Sorry, I was hoping to look into the schedule over the weekend to see what is going on, but didn't get the chance to look into the correct bit yet.
> 
> I believe there are 2 different optimizations that can happen here:
>  - Do two stores to the same address have some penalty.
>  - Do loads from the same address as a load have a penalty.
> The first sounds to me like it should almost always be no, and the second requires store->load forwarding which I believe is very common in most cpus of sufficient complexity.
> 
> It comes down to what does the latency of a store mean. I was under the impression that it didn't mean anything in normal llvm scheduling, but it appears that it does have some effect on the latency of an store to the end of the block (I think). In llvm-mca it means the latency of the write into L1 cache?
> The Cortex-A55 optimization guide specifies the latency of stores as 1, and that would probably be a better value to use in the A55 schedule model. I've put together a patch to do that in D105541.
Just to be clear: the noalias flag does NOT affect store pairs.

Regarding your point 2.
I guess you meant: "Do loads from the same address as a STORE have a penalty?"

STLF assumes the presence of a store buffer, and that memory stores are not immediately propagated to the underlying caches. I don't know how common this is for in-order processors. However, I take from your comment that modern in-order cores may also do a lot of out-of-order commit for store operations.

That being said, llvm-mca doesn't know whether the simulated target implements a store buffer, nor does it know how to predict whether a younger load would alias an older store. Without that knowledge, it is not possible to correctly predict which loads are valid STLF candidates.

STLF also assumes the presence of a store buffer, and that values are not immediately committed to cache (I honestly don't know how common that is for in-order processors).
When "noalias=true", we assume that there is no aliasing at all for loads and stores. There is no need to model STLF for this case, because - under that assumption - younger loads will never alias older stores.

When "noalias=false", we conservatively assume that younger loads may alias older stores. However, we don't know if they would partially overlap, or if operations are for misaligned addresses.
So we cannot always optimistically assume that STLF will eventually occur. STLF is subject to a number of constraints in hardware, and different subtargets might impose different restrictions.
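
To illustrate why "may alias" is not enough information to decide on forwarding, here is a minimal sketch that classifies how a load overlaps an older store. The function name `classify_overlap` and the three labels are hypothetical and are not part of llvm-mca. Whether a "contained" access can actually be forwarded is subtarget specific, so the sketch only reports the overlap and makes no forwarding decision.

```python
def classify_overlap(store_addr, store_size, load_addr, load_size):
    """Classify how a load's byte range overlaps an older store's byte range.

    Hypothetical helper for illustration only; llvm-mca has no such query
    because it does not know the runtime addresses.
    """
    store_end = store_addr + store_size  # one past the last stored byte
    load_end = load_addr + load_size     # one past the last loaded byte

    # No byte in common: the load does not alias the store at all.
    if load_end <= store_addr or store_end <= load_addr:
        return "disjoint"

    # Every loaded byte was written by the store: a plausible STLF candidate,
    # assuming the subtarget imposes no further restrictions.
    if store_addr <= load_addr and load_end <= store_end:
        return "contained"

    # Only some loaded bytes come from the store: typically a hard case.
    return "partial"
```

For example, an 8-byte load at the same address as an 8-byte store is "contained", while an 8-byte load starting 4 bytes into that store is "partial".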

In the future, we could introduce code annotations/metadata to pass "hints" to llvm-mca, something like "assume no-alias", "assume perfect-alias", "assume aligned", etc. We could then extend the scheduling model to provide extra information about store buffers. That would allow us to model STLF.

For now, `noalias=false` is just a "worst-case scenario" where aliasing always occurs between loads and stores, and STLF is not simulated (so, it implicitly fails for "reasons" that we don't provide).
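
As a rough illustration of the two settings, the toy model below lists which older stores a load has to wait for. It is a sketch under the assumptions stated in this comment, not the actual LSUnit code; `MemOp` and `blocking_stores` are hypothetical names. It only covers the load/store interaction, since the flag does not affect store pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    index: int  # position in program order
    kind: str   # "load" or "store"


def blocking_stores(ops, load_index, noalias):
    """Return the indices of older stores that the load must wait for.

    Toy model: with noalias=True the load never waits on a store; with
    noalias=False every older store is conservatively treated as aliasing.
    """
    if ops[load_index].kind != "load":
        raise ValueError("load_index must refer to a load")
    if noalias:
        return []
    return [op.index for op in ops[:load_index] if op.kind == "store"]


# Same shape as the sequence in the test: str, str, ldr.
ops = [MemOp(0, "store"), MemOp(1, "store"), MemOp(2, "load")]
```

With `noalias=True` the load depends on no store; with `noalias=False` it depends on both older stores, and no forwarding shortcut is applied.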

About the store latency:

In the presence of a store buffer, I'd expect the latency of a store to be 1.
It is literally the cost of placing the value in the store buffer (which I expect to be 1 for most targets).

Strictly speaking, llvm-mca doesn't specially handle latency of loads and stores.
llvm-mca literally ONLY uses whatever latency value is declared by each write.

In upstream scheduling models, the latency of loads is typically defined according to the "load-to-use latency" given by the vendor. But that's it. There is no special handling in llvm-mca. In the future (at least for addressing modes that allow folded loads), it would be nice to distinguish the load contribution (i.e. load-to-use latency) from the total latency.
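
A minimal sketch of what that split could look like, assuming the total latency of a folded-load instruction is simply the load-to-use latency plus the latency of the remaining operation. The helper `split_folded_latency` is hypothetical, and so are the numbers in the usage note; nothing like this exists in the scheduling models today.

```python
def split_folded_latency(total_latency, load_to_use):
    """Split a folded-load write latency into (load part, operation part).

    Assumes total_latency == load_to_use + operation latency, which is an
    illustrative simplification and not how any model is currently defined.
    """
    if load_to_use < 0 or load_to_use > total_latency:
        raise ValueError("load_to_use must be between 0 and total_latency")
    return load_to_use, total_latency - load_to_use
```

For instance, with a hypothetical total latency of 7 cycles and a load-to-use latency of 4, the operation itself would account for the remaining 3 cycles.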

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103955/new/

https://reviews.llvm.org/D103955