[PATCH] D117926: [SLP] Optionally preserve MemorySSA

Nikita Popov via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jan 21 14:58:00 PST 2022


nikic added a comment.

In D117926#3262503 <https://reviews.llvm.org/D117926#3262503>, @reames wrote:

> In D117926#3262469 <https://reviews.llvm.org/D117926#3262469>, @nikic wrote:
>
>> Can you please explain what the larger context here is? What cases are you trying to solve with MemorySSA?
>
> Sure, though I'm a bit limited in what I can say.  The original example is not public.
>
> Essentially, I have a case where we are spending a large fraction of total O2 time inside SLP - specifically, inside the code which is figuring out which memory dependencies exist while trying to schedule.  (To prevent confusion, note that SLP scheduling subsumes several legality tests.)
>
> Specifically, the case which is hurting this example - which is machine-generated code - is a very long basic block with a vectorizable pair of loads at the beginning, and a vectorizable pair of stores (consuming the loaded values) at the end.  There are multiple pairs, but the core detail is that the required scheduling window is basically the entire size of the huge basic block.

Is it possible to construct an artificial test case that can be shared? Just the loads/stores at the beginning and end and dummy instructions in between?

> The time is spent figuring out dependencies for *scalar* instructions - not even the ones we're trying to vectorize.  Since this is such a huge block, the current mssa-like memory chain ends up being very expensive.
>
> I'd explored options for limiting the scheduling window, but mssa felt like a more general answer, so I started there.

It sounds to me like a cutoff is what this mainly needs. We always run into degenerate cases when there is an unbounded instruction walk. (MSSA itself also limits instruction walks.)
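
For illustration, roughly the kind of cutoff I mean (not the existing SLP scheduler code; the helper name, the scan direction, and the budget constant are all made up):

  #include "llvm/Analysis/AliasAnalysis.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Walk backwards from the store towards the load (assumed to be earlier in
  // the same basic block) looking for aliasing memory instructions, and give
  // up once a fixed budget is exhausted, so a huge block degrades to "don't
  // vectorize" instead of a quadratic dependency walk.
  static bool hasNoAliasingDepsBetween(Instruction *Load, StoreInst *Store,
                                       AAResults &AA, unsigned Budget = 1000) {
    MemoryLocation StoreLoc = MemoryLocation::get(Store);
    for (Instruction *I = Store->getPrevNode(); I && I != Load;
         I = I->getPrevNode()) {
      if (Budget-- == 0)
        return false; // Budget exhausted: conservatively assume a dependence.
      if (!I->mayReadOrWriteMemory())
        continue;
      if (isModOrRefSet(AA.getModRefInfo(I, StoreLoc)))
        return false;
    }
    return true;
  }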

Something you might want to try is using BatchAAResults. Assuming that all the alias checks happen without IR modifications in between, it would be safe to cache them.
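
A rough sketch of the BatchAAResults usage I have in mind (the surrounding pair loop is illustrative only; the point is that repeated queries go through one batch and hit its cache):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/Analysis/AliasAnalysis.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Query all candidate load/store pairs through a single BatchAAResults,
  // which memoizes the underlying alias queries. This is only valid as long
  // as the IR is not modified between the queries.
  static unsigned countDependencies(ArrayRef<LoadInst *> Loads,
                                    ArrayRef<StoreInst *> Stores,
                                    AAResults &AA) {
    BatchAAResults BatchAA(AA);
    unsigned NumDeps = 0;
    for (StoreInst *SI : Stores)
      for (LoadInst *LI : Loads)
        if (BatchAA.alias(MemoryLocation::get(SI), MemoryLocation::get(LI)) !=
            AliasResult::NoAlias)
          ++NumDeps; // Repeated queries hit the batch cache.
    return NumDeps;
  }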

>> I'm not sure it will be the right tool for the job, so I think we should discuss this before making any changes. We don't have MSSA available at SLP's pipeline position, and computing it just for SLP will make this pass much more expensive.
>
> I'm really surprised to hear you say that.  My understanding was that memory ssa was rather cheap to construct if you don't need an optimized form, and that optimization was done lazily.
>
> However, I see my memory of prior discussion on this topic is clearly wrong.  The constructor for memoryssa does appear to eagerly optimize.

Yes. We have discussed adding a mode that does not eagerly optimize, but didn't do so (yet) for lack of a use case.
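
For context, the new-PM plumbing would look roughly like this (the function name is made up; this is only a sketch of requesting and preserving the analysis, not of what SLP would actually do with it):

  #include "llvm/Analysis/MemorySSA.h"
  #include "llvm/Analysis/MemorySSAUpdater.h"
  #include "llvm/IR/PassManager.h"
  using namespace llvm;

  // Requesting the analysis triggers MemorySSA construction, which today
  // eagerly optimizes all uses.
  static PreservedAnalyses runWithMSSA(Function &F,
                                       FunctionAnalysisManager &AM) {
    MemorySSA &MSSA = AM.getResult<MemorySSAAnalysis>(F).getMSSA();
    MemorySSAUpdater MSSAU(&MSSA);
    // ... vectorize, keeping MSSA up to date through MSSAU ...
    PreservedAnalyses PA;
    PA.preserve<MemorySSAAnalysis>();
    return PA;
  }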

> Despite this, I don't see memoryssaanalysis showing up as expensive in the -time-passes-per-run output even with this change.  I see SLP itself slow down a lot, but I had put that down to the generic renaming instead of using specialized knowledge from the callsite.
>
> Edit: I confirmed the pass profiling result by nulling out MSSA immediately after the getResult call.  The runtime drops to basically nothing over the non-MSSA version (i.e. measurement noise).  So despite the optimization at construction, it really is the updates which are expensive in this case.  It's possible my example is highly unrepresentative, but that seems questionable.  Any theories?

MemorySSA updates can be quite expensive -- some update operations are unexpectedly O(n). I believe insertUse() and removeMemoryAccess() should be cheap, but insertDef() with RenameUses=true can be expensive due to the renaming. In some cases you can avoid the renaming; see for example D107702 <https://reviews.llvm.org/D107702>.
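
Concretely, the pattern in question looks something like this (illustrative, mirroring how other passes drive the updater; LastDef stands for the MemoryDef that defines memory right before the new store):

  #include "llvm/Analysis/MemorySSA.h"
  #include "llvm/Analysis/MemorySSAUpdater.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Create a MemoryDef for a newly inserted store and wire it into MemorySSA.
  static void addStoreAccess(MemorySSAUpdater &MSSAU, StoreInst *NewSI,
                             MemoryDef *LastDef) {
    MemoryAccess *NewMA =
        MSSAU.createMemoryAccessAfter(NewSI, LastDef, LastDef);
    // RenameUses=false skips the potentially O(n) renaming walk, but is only
    // correct if no existing downstream uses need to be retargeted to the new
    // def; D107702 is an example of making that argument.
    MSSAU.insertDef(cast<MemoryDef>(NewMA), /*RenameUses=*/false);
  }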

In D117926#3262537 <https://reviews.llvm.org/D117926#3262537>, @reames wrote:

> Ok, was able to spot the additional construction time.  It took about 15 ms.
>
> For context, the original example spends about 3.23 seconds in SLP w/o MSSA, and the (horribly unoptimized) preservation currently takes an additional 5.25 seconds on top of that.
>
> Quite literally, different orders of magnitude.  If we can get MSSA preservation down to something reasonable - again, incrementalism please - I'd argue using it here is entirely reasonable.

Sure, if you're looking at a degenerate case, MSSA construction will not be a dominating factor. What I have in mind here is the average case, where SLP will be practically free (and usually just not do anything), while MSSA construction still needs to happen.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117926/new/

https://reviews.llvm.org/D117926


