[llvm-commits] [PATCH] Make dependency DAG construction use AA

Andrew Trick atrick at apple.com
Tue May 8 15:04:56 PDT 2012


Hi Sergei, I finally put together a real review for this... plenty of comments follow, but overall it looks pretty good.

On Apr 30, 2012, at 11:34 AM, Sergei Larin <slarin at codeaurora.org> wrote:
>   Presented is a non-heroic optimization intended to introduce alias
> analysis usage into the dependency graph construction used for MI
> scheduling (pre-RA, post-RA, bundling, etc.).
> Currently, all memory objects are conservatively considered dependent on
> each other (i.e. aliasing) and as such are serialized via chain
> dependencies.
>  The attached patch is an attempt to introduce alias analysis into the
> process to eliminate chain dependencies known to be false.

+static inline bool isGlobalMemoryObject(AliasAnalysis *AA, MachineInstr *MI) {
+  if (MI->isCall() || MI->hasUnmodeledSideEffects() ||
+      (MI->hasVolatileMemoryRef() &&
+       (!MI->mayLoad() || !MI->isInvariantLoad(AA))))

I don't understand why !MI->mayLoad() makes a volatile reference safe to move when MI->mayStore() could still be true. But you're correct to leave the existing logic intact. Maybe someone else cares to comment.

+  if (!getUnderlyingObjectForInstr(MIa, MFI, MayAlias) ||
+      !getUnderlyingObjectForInstr(MIb, MFI, MayAlias))
+    return true;

I'm not sure why you're doing this. What are the actual preconditions that we need before calling AliasAnalysis?

+  if (MIa->mayLoad() && MIb->mayLoad())

As Hal mentioned, wouldn't it be safer to check !mayStore()?

+  MachineMemOperand *MMOa = *MIa->memoperands_begin();

Please either check MI->hasOneMemOperand() or, better yet, loop through all memory operands. I know getUnderlyingObjectForInstr currently covers this, but it's bad to have such implicit assumptions.
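For example (untested sketch; combining the per-operand results is left to the surrounding logic):

  for (MachineInstr::mmo_iterator I = MIa->memoperands_begin(),
       E = MIa->memoperands_end(); I != E; ++I) {
    MachineMemOperand *MMOa = *I;
    // Run the same offset/size/AA checks against each of MIb's memory
    // operands; report "may alias" unless every pair is provably disjoint.
  }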

+  int64_t MinOffset = std::min(MMOa->getOffset(), MMOb->getOffset());
+  int64_t Overlapa = MMOa->getSize() + MMOa->getOffset() - MinOffset;
+  int64_t Overlapb = MMOb->getSize() + MMOb->getOffset() - MinOffset;

It took me a while to figure out what you're doing and why. We can hide this behind an API eventually (I can do it post-checkin), but for now can you add comments covering any cases you could think of? Please also comment that it's taken from DAGCombiner. Apparently DAGCombiner thinks it's a good idea to normalize offsets relative to each other by subtracting MinOffset. I don't understand why this is a good idea and can't prove to myself that it is safe. For example, what if we have an IR-level offset from the same base pointer that is smaller than the MachineOperand offset? We would seem to miss the overlap. Maybe Dan can comment on this.
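To make the normalization concrete, a worked example (the numbers are mine, not from the patch):

  MMOa: offset 8, size 4  ->  bytes [8, 12) from its base
  MMOb: offset 0, size 4  ->  bytes [0, 4)  from its base
  MinOffset = min(8, 0) = 0
  Overlapa  = 4 + 8 - 0  = 12
  Overlapb  = 4 + 0 - 0  = 4

So AA gets asked about Location(BaseA, 12) vs. Location(BaseB, 4); if I'm reading it right, each access is effectively widened to start at the common MinOffset.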

Hal also mentioned negative offsets, but it looks to me like those would work, assuming positive offsets work!

Before calling IR-level AliasAnalysis, we have an opportunity to detect that memory accesses at different MachineMemOperand offsets with non-overlapping offset+size ranges do not alias. DAGCombiner::isAlias does this. If you don't want to implement that, please add a FIXME.
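For reference, the core of that check is just an interval intersection once the two operands are known to share a base; roughly (mayOverlap is a made-up helper, not from the patch):

  // Only valid when MMOa and MMOb are known to use the same base value.
  static bool mayOverlap(const MachineMemOperand *MMOa,
                         const MachineMemOperand *MMOb) {
    int64_t BeginA = MMOa->getOffset();
    int64_t EndA   = BeginA + MMOa->getSize();  // half-open [BeginA, EndA)
    int64_t BeginB = MMOb->getOffset();
    int64_t EndB   = BeginB + MMOb->getSize();
    return BeginA < EndB && BeginB < EndA;      // intervals intersect
  }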

+  // Remember visited nodes.
+  for (SmallPtrSet<const SUnit*, 16>::const_iterator I = Visited.begin(),
+       E = Visited.end(); I != E; ++I)
+    if (SUb == *I)
+      return *Depth;
+  Visited.insert(SUb);

Please remove the for loop. Just check the return value of Visited.insert().
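That is, something like (assuming insert() returns true only when the element was actually added):

  // insert() tells us whether SUb was newly visited.
  if (!Visited.insert(SUb))
    return *Depth;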

+ /// This function assumes that "downward" from SUb there exist

Stale comment, as Hal said.

+          addChainDependency(AA, MFI, SU, I->second, RejectMemNodes);

There are a few cases where we used to create a chain dependency with STORE_LOAD_LATENCY but now create a zero-latency edge.

A couple of style/comment issues that I could easily fix myself post-checkin...

Why does the unit test need to be disabled?

+    cl::desc("Enable use of AA during MI GAD construction"));
                                          ^^^

+  AliasAnalysis::AliasResult AAResult = AA->alias(
+  AliasAnalysis::Location(MMOa->getValue(), Overlapa,
+                          MMOa->getTBAAInfo()),
+  AliasAnalysis::Location(MMOb->getValue(), Overlapb,
+                          MMOb->getTBAAInfo()));

Needs a bit of indentation.
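Something like:

  AliasAnalysis::AliasResult AAResult = AA->alias(
      AliasAnalysis::Location(MMOa->getValue(), Overlapa,
                              MMOa->getTBAAInfo()),
      AliasAnalysis::Location(MMOb->getValue(), Overlapb,
                              MMOb->getTBAAInfo()));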

> -  Theoretically, the whole task of chain edge detection could become
> quadratic in the number of memory objects in a basic block with this
> patch, but in practice it never does. It is further kept in check by a
> max-depth search budget that rules out even the theoretical possibility
> of pathological corner cases.

As we've already discussed, I'm concerned about the worst-case cost of this approach. But to be fair, building the memory dependence DAG was already quadratic, so this isn't terrible. Since you haven't found any issues in your test suite, I think it's fine to go in under a flag. When we decide to make it the default, we'll have to consider whether we can really handle the worst situations (aggressively unrolled loops) without blowing up. An alternative approach would be to precompute alias sets and have the scheduler maintain a chain for each set, but that's conceptually more complicated.
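Roughly what I have in mind (a hypothetical sketch; getAliasSetID and the map are made up for illustration):

  // One chain tail per precomputed alias set: each new memory node needs
  // at most one chain edge per set it may touch, instead of one per prior
  // node.
  DenseMap<unsigned, SUnit*> ChainTail;     // alias-set ID -> last chain node

  unsigned SetID = getAliasSetID(MI);       // made-up helper
  SUnit *&Tail = ChainTail[SetID];
  if (Tail)
    SU->addPred(SDep(Tail, SDep::Order));   // chain edge within the set
  Tail = SU;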

>  During preliminary review it was suggested that a worklist could
> perform better, but I was trying to do all processing in place and with
> minimal reiteration, so I will leave this point up to collective
> reviewer opinion.

+  // Iterate over chain dependencies only.
+  for (SUnit::const_succ_iterator I = SUb->Succs.begin(), E = SUb->Succs.end();
+       I != E; ++I)
+    if (I->isCtrl())
+      iterateChainSucc (AA, MFI, SUa, I->getSUnit(), ExitSU, Depth, Visited);

We're currently trying to remove recursive DAG-walking code to ensure that LLVM does not trigger stack overflows. You've added the Depth check, which should be sufficient. But for the record, a pre-order DAG walk like this is trivial to implement with an iterative worklist.
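Roughly (untested sketch reusing the patch's names; the Depth accounting and the actual SUa/SU pair check are elided):

  SmallVector<const SUnit *, 16> Worklist;
  Worklist.push_back(SUb);
  while (!Worklist.empty()) {
    const SUnit *SU = Worklist.pop_back_val();
    if (!Visited.insert(SU))
      continue;                              // already handled
    // ... check the SUa/SU pair here, honoring the depth budget ...
    for (SUnit::const_succ_iterator I = SU->Succs.begin(),
         E = SU->Succs.end(); I != E; ++I)
      if (I->isCtrl())
        Worklist.push_back(I->getSUnit());
  }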

>  The feature is currently under a general flag, off by default. It is up
> to individual back-end maintainers whether it is a desirable feature for
> them.
> One side effect of this patch is a massive release of ILP in some cases,
> which on occasion can clog current schedulers and can increase register
> pressure on some targets. My quick test of test-suite on x86 showed a
> fractional _decrease_ of average compile time (0.5%). There is also a
> spectrum of performance differences. Different targets might see
> different results.
> 
>  Test cases for instruction scheduling are notoriously tricky, so I took
> a somewhat direct shortcut. I am not sure how it will be received by the
> community, but as I see it, this is the most reliable way to verify that
> the check actually takes place. The test is written for the Hexagon back
> end, and since Hexagon is sort of in flux right now, I have disabled it
> for the time being. Obviously, regular check-all passes (with and
> without optimization enabled) on X86 and Hexagon.

It would be good to have at least one unit test that is enabled for most developers running make check on their builds. How about a simple example compiled for X86? You just need a few memory accesses that cannot be reordered, and CHECK lines verifying that they aren't reordered. It's not a great test, but better than nothing.
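For instance (an untested sketch, not from the patch; the flag name is illustrative):

  ; RUN: llc < %s -march=x86-64 -enable-aa-sched-mi | FileCheck %s
  ; %p and %q may alias, so even with AA the scheduler must not swap these.
  define i32 @no_reorder(i32* %p, i32* %q) nounwind {
  entry:
    store i32 1, i32* %p
    %v = load i32* %q
    ret i32 %v
  }
  ; CHECK: movl $1, ({{.*}})
  ; CHECK: movl ({{.*}}), %eax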

-Andy

>  As always, comments are very welcome, and thank you for your time and
> effort.
> 
> Sergei Larin
> 
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> 
> <patch_aa_in_dag.patch>



