[llvm-commits] [PATCH] Make dependency DAG construction to use AA

Sergei Larin slarin at codeaurora.org
Wed May 9 10:37:40 PDT 2012


This is a great example of community at work. Judging by the rest of the
thread, most of the concerns are about the AA invocation itself, not about
the actual dependency filtering (besides the worst-case complexity concern).

Below is my quick reply to Andy's original review.

Sergei



--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.


> -----Original Message-----
> From: Andrew Trick [mailto:atrick at apple.com]
> Sent: Tuesday, May 08, 2012 5:05 PM
> To: Sergei Larin
> Cc: 'Hal Finkel'; 'Evan Cheng'; llvm-commits at cs.uiuc.edu
> Subject: Re: [llvm-commits] [PATCH] Make dependency DAG construction to
> use AA
> 
> Hi Sergei, I finally put together a real review for this... plenty of
> comments follow, but overall it looks pretty good.
> 
> On Apr 30, 2012, at 11:34 AM, Sergei Larin <slarin at codeaurora.org>
> wrote:
> >   Presented is a non-heroic optimization intended to introduce alias
> > analysis usage into the dependency graph construction used for MI
> > scheduling (pre-/post-RA, bundling, etc.).
> > Currently all memory objects are conservatively considered dependent
> > on each other (aliasing) and as such are serialized via chain
> > dependencies.
> >  The attached patch is an attempt to introduce alias analysis into
> > the process to eliminate chain dependencies known to be false.
> 
> +static inline bool isGlobalMemoryObject(AliasAnalysis *AA,
> +                                        MachineInstr *MI) {
> +  if (MI->isCall() || MI->hasUnmodeledSideEffects() ||
> +      (MI->hasVolatileMemoryRef() &&
> +       (!MI->mayLoad() || !MI->isInvariantLoad(AA))))
> 
> I don't understand why !MI->mayLoad() makes a volatile reference safe
> to move, when MI->mayStore() could still be true. But you're correct to
> leave the existing logic intact. Maybe someone else cares to comment.

[Larin, Sergei] Indeed, this is the original logic I have not touched; it is
currently used to guard insertion of chain edges. I separated it into a
function to avoid code duplication. As for the logic itself, the way I
interpret it, this is an early check for whether a memory object should not
be reasoned about at all. At the same time, it is _not_ the only check, and
(I am guessing) it allows some volatile objects through to the next level in
the hope that some could be proven independent. I wonder if the original
intent had anything to do with this passage
(http://llvm.org/docs/LangRef.html#volatile):

"... The optimizers may change the order of volatile operations relative to
non-volatile operations. This is not Java's "volatile" and has no
cross-thread synchronization behavior."

  The next check for volatility is in getUnderlyingObjectForInstr():

  if (!MI->hasOneMemOperand() ||
      !(*MI->memoperands_begin())->getValue() ||
      (*MI->memoperands_begin())->isVolatile())
    return 0;

  This actually causes conservative treatment of _all_ volatile objects, but
not before some of them have been reasoned about elsewhere.

As an action item, I intend to keep this logic intact for now unless someone
has a strong opinion about it.


> 
> +  if (!getUnderlyingObjectForInstr(MIa, MFI, MayAlias) ||
> +      !getUnderlyingObjectForInstr(MIb, MFI, MayAlias))
> +    return true;
> 
> I'm not sure why you're doing this. What are the actual preconditions
> that we need before calling AliasAnalysis?


[Larin, Sergei] I was trying to be as unadventurous as possible here, and
yet again err on the side of conservatism. Previously there was a complex
set of checks here: for volatility, for the presence of exactly one memory
operand, and for the analyzability of that operand. (I convinced myself to
start with the simplest case of hasOneMemOperand() to facilitate review and
comprehension of the patch, and to extend it later if needed. Also, my back
end never produces multiple memory operands, so I do not yet feel
comfortable proceeding without testing; I'll put a FIXME here.) Once I
compared all these checks against the functionality of
getUnderlyingObjectForInstr(), it was a 99% match, so I reduced code
duplication by reusing the existing logic.

If you feel this is unjustified here, I can reinstate the original checks.


> 
> +  if (MIa->mayLoad() && MIb->mayLoad())
> 
> As Hal mentioned, wouldn't it be safer to check !mayStore?

[Larin, Sergei] Totally agree. Will change in the next patch.

> 
> +  MachineMemOperand *MMOa = *MIa->memoperands_begin();
> 
> Please either check MI->hasOneMemOperand() or, better yet, loop through
> all memory operands. I know getUnderlyingObjectForInstr currently
> covers this but it's bad to have such implicit assumptions.


[Larin, Sergei] As I alluded to above, I already check hasOneMemOperand()
via getUnderlyingObjectForInstr(), and will extend this in the (near?)
future. I just need to get a test case for it somehow. I will also add
comments to explicitly state this assumption.

> 
> +  int64_t MinOffset = std::min(MMOa->getOffset(), MMOb->getOffset());
> +  int64_t Overlapa = MMOa->getSize() + MMOa->getOffset() - MinOffset;
> +  int64_t Overlapb = MMOb->getSize() + MMOb->getOffset() - MinOffset;
> 
> It took me a while to figure out what you're doing and why. We can hide
> this behind an API eventually (I can do it post checkin), but for now
> can you add comments with any cases you could think of? Please also
> comment that it's taken from DAGCombiner. Apparently DAGCombiner thinks
> it's a good idea to normalize offsets relative to each other by
> subtracting MinOffset. I don't understand why this is a good idea and
> can't prove to myself that it is safe. For example, what if we have an
> IR-level offset from the same base pointer smaller than the
> MachineOperand offset? We would seem to miss the overlap. Maybe Dan can
> comment on this.

[Larin, Sergei] This received the most comments so far, but interestingly
enough it still looks like a rather acceptable solution, so beyond
commenting extensively on it, I am going to keep the core logic the same. A
FIXME is due for the API change and for the preliminary (local) checks.

> 
> Hal also mentioned negative offsets, but it looks to me like that would
> work assuming positive offsets work!
> 

[Larin, Sergei] Again, according to Dan it should be totally fine... 

> Before calling IR-level AliasAnalysis, we have an opportunity to detect
> that memory accesses at different MachineMemOperand offsets with
> non-overlapping offset+size do not alias. DAGCombiner::isAlias does this.
> If you don't want to implement that, please add a FIXME.

[Larin, Sergei] Agree. Will add a FIXME and will look at this logic more in
the near future. I have a strong suspicion it will only save some compile
time and is unlikely to catch any "missing" cases, but let me try it first.


> 
> +  // Remember visited nodes.
> +  for (SmallPtrSet<const SUnit*, 16>::const_iterator I =
> Visited.begin(),
> +       E = Visited.end(); I != E; ++I)
> +    if (SUb == *I)
> +      return *Depth;
> +  Visited.insert(SUb);
> 
> Please remove the for loop. Just check the return value of
> Visited.insert().
> 
> + /// This function assumes that "downward" from SUb there exist
> 
> Stale comment, as Hal said.

[Larin, Sergei] Done.

> 
> +          addChainDependency(AA, MFI, SU, I->second, RejectMemNodes);
> 
> There are a few cases where we used to create a chain dependency with
> STORE_LOAD_LATENCY, where now you create a zero latency edge.
> 
> A couple of style/comment issues that I could easily fix myself post-
> checkin...
> 
> Why does the unit test need to be disabled?

[Larin, Sergei] When I submitted the patch, most of the Hexagon code had
been reverted, but I think it is now safe to reenable. I'll turn it on.

> 
> +    cl::desc("Enable use of AA during MI GAD construction"));
>                                           ^^^
> 
> +  AliasAnalysis::AliasResult AAResult = AA->alias(
> + AliasAnalysis::Location(MMOa->getValue(), Overlapa,
> +                          MMOa->getTBAAInfo()),
> + AliasAnalysis::Location(MMOb->getValue(), Overlapb,
> +                          MMOb->getTBAAInfo()));
> 
> Needs a bit of indentation.
[Larin, Sergei] Sure.
> 
> > - Theoretically, the whole task of chain edge detection could become
> > quadratic in the number of memory objects in a basic block with this
> > patch, but in practice it never does. It is further kept in check by
> > a max-depth search budget that prevents even the theoretical
> > possibility of any corner cases.
> 
> As we've already discussed, I'm concerned about the worst case cost of
> this approach. But to be fair, building the memory dependence DAG was
> already quadratic, so this isn't terrible. Since you haven't found any
> issues in your test suite, I think it's fine to go in under a flag.
> When we decide to make it default, we'll have to consider whether we
> can really handle the worst situations (aggressively unrolled loops)
> without blowing up. An alternative approach would be to precompute
> alias sets and have the scheduler maintain a chain for each set, but
> that's conceptually more complicated.

[Larin, Sergei] Totally agree, and it should be done. What is the best way
to ensure that it will not be forgotten? 

> 
> >  During preliminary review it was suggested that a worklist could
> > perform better, but I was trying to do all processing in place and
> > with minimal reiteration, so I will leave this point up to collective
> > reviewer opinion.
> 
> +  // Iterate over chain dependencies only.
> +  for (SUnit::const_succ_iterator I = SUb->Succs.begin(),
> +       E = SUb->Succs.end(); I != E; ++I)
> +    if (I->isCtrl())
> +      iterateChainSucc(AA, MFI, SUa, I->getSUnit(), ExitSU, Depth,
> +                       Visited);
> 
> We're currently trying to remove recursive DAG walking code to ensure
> that LLVM does not trigger stack overflow. You've added the Depth
> check, which should be sufficient. But for the record, a pre-order DAG
> walk like this is trivial to implement with an iterative worklist.


[Larin, Sergei] I thought a lot about it, and it still escapes me how to
actually implement it. I am sure this is entirely my failing, but the way I
see it the problem might be different. The issue is that the topology walk
is _different_ every time you do it, depending on what you are currently
comparing against, so precalculating a set becomes a superset construction
with a limited degree of reuse... Recursive descent seems to be "in place"
and optimal without any revisits... and a worklist by itself does not
(again, in my limited understanding) solve the worst-case scenario: you can
limit the number of nodes to check on the list, but what makes latency
explode is the _path_ through them, not their number, so a depth check will
still be needed... Once again, I am sure I am missing something important
here, and would love to discuss it more.

> 
> >  The feature is currently under a general flag, off by default. It is
> > up to individual back-end maintainers whether it is a desirable
> > feature for them.
> > One side effect of this patch is a massive release of ILP in some
> > cases, which on occasion can clog the current schedulers, and can
> > increase reg pressure on some targets. My quick test of test-suite on
> > x86 showed a fractional _decrease_ of average compile time (0.5%).
> > There is also a spectrum of performance differences. Different
> > targets might see different results.
> >
> >  Test cases for instruction scheduling are notoriously tricky, so I
> > took a somewhat direct shortcut. I am not sure how it would be
> > accepted by the community, but as I see it, this is the most reliable
> > way to verify that the check actually takes place. The test is
> > written for the Hexagon back end, and since Hexagon is somewhat in
> > flux right now, I have disabled it for the time being. Obviously the
> > regular check-all passes (with and without optimization enabled) on
> > X86 and Hexagon.
> 
> It would be good to have at least one unit test that is enabled for
> most developers running make check on their builds. How about a simple
> example compiled for X86? You just need a few memory accesses that
> cannot be reordered, and CHECK that they aren't reordered. It's not a
> great test, but better than nothing.

[Larin, Sergei] I will think about it. Originally I did not feel authorized
to suggest tests outside my back-end domain. I think it would be easier for
me to suggest an ARM-based test, but let me try first.

> 
> -Andy

[Larin, Sergei] Thank you all for the review. It will take me a little time
to revise and test the new patch, so please stay tuned.

> 
> >  As always, comments are very welcome, and thank you for your time
> > and effort.
> >
> > Sergei Larin
> >
> > --
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum.
> >
> > <patch_aa_in_dag.patch>





More information about the llvm-commits mailing list