[llvm-commits] cxa_guard elimination

Chris Lattner clattner at apple.com
Sun Feb 12 02:44:04 PST 2012


On Feb 12, 2012, at 1:38 AM, Nick Lewycky wrote:
> Ok.  Some thoughts:
>> 
>> 
>>      virtual void getAnalysisUsage(AnalysisUsage&AU) const {
>> +      AU.addRequired<DominatorTree>();
>> 
>> Is this *really really* required?  This will be the first module pass that requires a function pass in mainline IIRC.
> 
> We really just need to show that the load/store to the global can't precede the top of the guard. Replacing it with a unique-pred walk will work for cases where it isn't in or preceded by a loop. That's nice, but it isn't good enough.
> 
> What I don't need out of DomTree is building out the tree for the entirety of the function, particularly the parts dominated by the guard-top. I'd be happy with a subset. SSAUpdater does this nifty trick where it creates a subset of the CFG then builds a DomTree on that alone. I think that idea would work here, but that's too complicated for me to implement. Refactor-o?

I'm going to continue to worry about this for a bit.  To avoid blocking forward progress, could you go with "uniq dominator chain" for now, and we can talk about switching to dominators as a follow-on patch?

When the time comes, I'd be curious to get numbers about how often it matters. Given that the patterns we're looking for are simple and this is happening before inlining, my guess is that there is a negligible win here.
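
For concreteness, here is a minimal sketch of what that unique-predecessor walk could look like. The function name and shape are mine, not the patch's, the header paths are as of the current tree, and it assumes the guard block is reachable from the entry block:

  #include "llvm/BasicBlock.h"
  #include "llvm/Function.h"
  #include "llvm/GlobalVariable.h"
  #include "llvm/Instructions.h"
  using namespace llvm;

  // Return true if we can prove no load/store of GV executes before Guard.
  // We only ever step to unique predecessors, so every block we scan runs
  // exactly once on any path that reaches the guard.
  static bool NoAccessBeforeGuard(Instruction *Guard, GlobalVariable *GV) {
    BasicBlock *BB = Guard->getParent();
    const Instruction *StopAt = Guard;   // in the guard's block, stop at the guard
    while (true) {
      for (BasicBlock::iterator I = BB->begin(), E = BB->end();
           I != E && &*I != StopAt; ++I) {
        if (LoadInst *LI = dyn_cast<LoadInst>(&*I))
          if (LI->getPointerOperand()->stripPointerCasts() == GV)
            return false;
        if (StoreInst *SI = dyn_cast<StoreInst>(&*I))
          if (SI->getPointerOperand()->stripPointerCasts() == GV)
            return false;
      }
      if (BB == &BB->getParent()->getEntryBlock())
        return true;                     // reached entry without touching GV
      BB = BB->getSinglePredecessor();
      if (!BB)
        return false;                    // merge point (or loop header): give up
      StopAt = 0;                        // scan whole blocks from here up
    }
  }

This gives up as soon as the chain hits a block with more than one predecessor, which is exactly the loop case you mention; dominators would recover some of that.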

> 
>> Does the passmanager only compute DominatorTree for functions lazily when asking for it?
> 
> Yep! This actually works!

Ok, that's good at least :)

>> Are you sure that GlobalOpt isn't changing CFGs in a way that would invalidate DomTree?
> 
> I've done more than a cursory glance to see that it doesn't play with TerminatorInsts, but I haven't taken the effort to seriously prove it to myself.
> 
> I was assuming that calling getAnalysis<DominatorTree>(F) twice would actually run the domtree analysis on F twice, especially since we don't claim to preserve DominatorTree. I haven't checked though.

I'm pretty sure that is not the case.  getAnalysis<> is supposed to compute the result and then cache it.  The passmgr assumes that if passA uses passB, then passA preserves passB across its own execution, or at least across all of the queries passA makes of it.
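
To spell out that contract, roughly (illustrative skeleton only, the pass name is made up; this is the legacy pass manager behavior as described above):

  #include "llvm/Analysis/Dominators.h"
  #include "llvm/Module.h"
  #include "llvm/Pass.h"
  using namespace llvm;

  namespace {
  struct GuardElim : public ModulePass {
    static char ID;
    GuardElim() : ModulePass(ID) {}

    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
      // Requiring DominatorTree means the passmgr computes it lazily, per
      // function, on the first getAnalysis<DominatorTree>(F) call, and then
      // hands back the cached tree for every later query from this pass.
      AU.addRequired<DominatorTree>();
      // There is no addPreserved<DominatorTree>(): passes after us will get a
      // fresh tree, but within our own run we must not invalidate the cached
      // one (e.g. by changing the CFG between queries).
    }

    virtual bool runOnModule(Module &M) {
      for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
        if (!F->isDeclaration())
          (void)getAnalysis<DominatorTree>(*F);  // computed on demand, cached
      return false;
    }
  };
  }
  char GuardElim::ID = 0;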

>> It's not your bug, but that loop really really should use 'continue' to make the control flow more obvious.
> 
> I'm not sure about this. Are you asking for the tests to be flipped around and call continue in the case everything is okay, then have an unconditional "return true" as the fall-through? I'm not sure that would be better?

I'm saying the control flow should be something like this:

  for () {
    if (something ok) {
      ...
      continue;
    }

    if (something else ok) {
      ...
      continue;
    }

    // can't handle this.
    return false;
  }


> 
>> +static Instruction *UseToInst(Value *V) {
>> 
>> Perhaps this is prevented by how it is called, but you can have an infinite constant loop through a global variable and its initializer.
> 
> static char *x = (char*)&x; ?!  :-(

Yeah, GlobalValue isa<Constant> after all.

> Good catch, I haven't fixed this yet locally but I'll get to it. I think I want to do away with UseToInst entirely.

That would be nice :)

> Digression. I'd recently noticed that we have a number of different little IR interpreters all over, and started making a list. There's globalopt, there's the interpreter ExecutionEngine, there's another in jump-threading, and SCEV has one (getSCEV) or two (EvaluateExpression) depending on how you count. I suspect there's more, but I'm not actively looking for them, just writing them down as I see them.

I would rather just remove the EE interpreter than worry about it.  It is completely broken in many different ways.

> Would it help to refactor them? We could do so: the user provides an InstVisitor implementation, and we provide the ability to ask for the result of a given instruction (lazily computing the things it depends on), or forward computation, or a mix of both. I think it would help. I just didn't want to do all this before committing cxa_guard elimination.

I'm not sure that this is a good idea.  To me this sounds like a constant folding problem :)

>> The EvaluateBlock/EvaluateFunction interfaces really need to be split out into methods on a helper class that maintain the state, instead of passing the state through a dozen arguments into each recursive call.  Can you split this out into its own patch?
>> 
>> 
>> There are various blocks of logic in OptimizeGuardedInitialization that really need some block comments explaining what is going on :)
>> 
>> 
>>       InstResult = ConstantExpr::get(BO->getOpcode(),
>> -                                     getVal(Values, BO->getOperand(0)),
>> -                                     getVal(Values, BO->getOperand(1)));
>> +      Constant *Op0 = getVal(Values, BO->getOperand(0));
>> +      Constant *Op1 = getVal(Values, BO->getOperand(1));
>> +      if (!Op0 || !Op1) return false;
>> 
>> It's hard for me to understand what is going on here.  Why did getVal suddenly start being able to fail?  Can you split this out into its own patch?
> 
> This is the ugly part; the constant evaluator for guard elimination shouldn't be the same as the constant evaluator for running global constructors.
> 
> When running the global constructors, you start at the beginning of every function, and you run every instruction. You know all function arguments, because the outer-most function call takes void, and everything else is either read from a global or it's an Instruction that dominates it. Furthermore, because this is before-main, you're allowed to assume that any initializer in a global is the value it has at this point, even if the global isn't constant.

I'm not sure that this last part is true.  Do we care about other ctors known to run earlier with init_priority?  What about static ctors in .dll files that are loaded post-main (dlopen)?
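
For example (contrived, and spelled with the GNU attribute), a higher-priority ctor in some other object file that the optimizer never sees would already invalidate the "the initializer is the value at this point" assumption for this module's ctors:

  // In a translation unit the optimizer doesn't see:
  extern int g;
  struct Early { Early() { g = 7; } };
  Early early __attribute__((init_priority(101)));  // runs before default-priority ctors

  // In the module being optimized:
  int g = 42;                // non-constant global; its static initializer says 42
  static int consumer = g;   // dynamic init at default priority observes 7, not 42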

> The way that the cxa_guard elimination works is by finding the first block that only runs on the first execution of the guard, and then simulating from there. Because we haven't simulated the blocks leading up to that point, none of the local variables are present. Furthermore, we can't really rely on ComputeLoadResult because it will look inside non-constant global variables and trust that the initializer is correct, but we're simulating code past-main so that isn't true any more. The attached patch handles this by making ComputeLoadResult record which GVs it looked at when computing its result, then verifying that they were all constant or set inside the once-initialized code first.
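
If I'm reading that right, the check is shaped something like this (my names, not the patch's):

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/GlobalVariable.h"
  using namespace llvm;

  // Every global the load-folder peeked at must either be a real constant or
  // have been stored to inside the once-initialized region being simulated;
  // otherwise its static initializer may not match its value past main().
  static bool TouchedGlobalsAreSafe(const SmallPtrSet<GlobalVariable*, 8> &Touched,
                                    const SmallPtrSet<GlobalVariable*, 8> &WrittenInOnce) {
    for (SmallPtrSet<GlobalVariable*, 8>::const_iterator I = Touched.begin(),
         E = Touched.end(); I != E; ++I) {
      GlobalVariable *GV = *I;
      if (!GV->isConstant() && !WrittenInOnce.count(GV))
        return false;
    }
    return true;
  }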


> An alternative way to do cxa_guard elimination using the evaluator in globalopt as it exists today would be to start at the top of the function and make decisions that would lead you towards the cxa_guard_acquire. This would let us support cases like Bullet returning an identity matrix by alloca'ing the elements, then, in the once-initialized code, storing to them and running the copy-constructor. Currently that fails because we haven't executed the allocas. How you actually pick the right values is tricky (especially when both paths in the CFG work), but still possible.

This is definitely interesting, but it is prone to its own sorts of issues.  The code leading up to the static constructor might be completely unrelated to the constructor and may cause evaluation to fail, e.g.:

static int foo(int y) { return y+4; }

void bar() {
  int a = 4;
  printf("hello\n");  // can't be evaluated
  static int x = foo(a);  // should be evaluatable!
  ...
}

This is starting to vaguely remind me of "code slicing" problems.  Would it be possible to split this into two steps: 1) identifying the IR in the outermost function that needs to be simulated, and 2) simulating it?  Right now #1 is implicitly the linear sequence of code between the guards.  However, to handle the Bullet case (and the contrived case above), you could start with these instructions and build up a set of other instructions in the function that are referenced by that initial seed set.  In the case of bar above, that would tag the alloca, and then the store of 4 to the alloca.

Once all of the instructions are identified, you have sort of a "masked evaluate" operation that would take a SmallPtrSet of Instructions and start simulating from the top of the function.  One interesting/tricky aspect here is getting the mod/ref sets right for function calls.  Here we have to know that printf can't mod/ref 'a', so the first pass doesn't need to add it to the simulation set.
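
To make step 1 concrete, it could be a little worklist closure like this (names are made up, and the iterator APIs are as of the current tree, so treat it as a sketch):

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Instructions.h"
  using namespace llvm;

  // Start from the instructions between the guards (the seed) and pull in
  // every other instruction in the function they depend on.  Step 2, the
  // "masked evaluate", would then walk from the top of the function,
  // simulating only what is in Slice, and would need mod/ref information to
  // show that skipped calls (the printf above) can't touch anything the
  // slice reads.
  static void BuildSimulationSlice(const SmallVectorImpl<Instruction*> &Seed,
                                   SmallPtrSet<Instruction*, 32> &Slice) {
    SmallVector<Instruction*, 32> Worklist(Seed.begin(), Seed.end());
    while (!Worklist.empty()) {
      Instruction *I = Worklist.pop_back_val();
      if (Slice.count(I))
        continue;                    // already picked up
      Slice.insert(I);

      // SSA operands the seed computes with (this tags the alloca in bar()).
      for (User::op_iterator OI = I->op_begin(), OE = I->op_end(); OI != OE; ++OI)
        if (Instruction *Op = dyn_cast<Instruction>(*OI))
          Worklist.push_back(Op);

      // Crude memory dependence: once an alloca is in the slice, also pull in
      // the stores to it (the "store of 4").  A real version needs something
      // smarter here, and this is where the call mod/ref question shows up.
      if (AllocaInst *AI = dyn_cast<AllocaInst>(I))
        for (Value::use_iterator UI = AI->use_begin(), UE = AI->use_end();
             UI != UE; ++UI)
          if (StoreInst *SI = dyn_cast<StoreInst>(*UI))
            if (SI->getPointerOperand() == AI)
              Worklist.push_back(SI);
    }
  }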


> Anywho... yes I can split this into its own patch, assuming that's still what you want me to do instead of not abusing globalopt like this. :)

The right first step in this direction (regardless of which approach you take above) is to refactor the existing EvaluateFunction/Block stuff into a class that contains the state.  That will make it a lot easier to reason about, and then these guys won't have to take 57 arguments :)
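
The rough shape it could take — the names and exact members are a sketch, not a promise, but the state below is what EvaluateFunction/EvaluateBlock currently thread through their argument lists:

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/BasicBlock.h"
  #include "llvm/Constant.h"
  #include "llvm/Function.h"
  #include "llvm/GlobalVariable.h"
  using namespace llvm;

  class Evaluator {
  public:
    bool EvaluateFunction(Function *F, Constant *&RetVal,
                          const SmallVectorImpl<Constant*> &ActualArgs);
    bool EvaluateBlock(BasicBlock::iterator CurInst, BasicBlock *&NextBB);

  private:
    Constant *getVal(Value *V);                  // SSA values in the current frame
    Constant *ComputeLoadResult(Constant *Ptr);  // consults MutatedMemory first

    // One value map per active call frame, so evaluating into callees works.
    SmallVector<DenseMap<Value*, Constant*>, 4> ValueStack;
    // Simulated memory contents, keyed by the (constant) address stored to.
    DenseMap<Constant*, Constant*> MutatedMemory;
    // Temporary globals standing in for allocas; must be cleaned up on failure.
    SmallVector<GlobalVariable*, 8> AllocaTmps;
    // Functions currently being evaluated, to catch runaway recursion.
    SmallVector<Function*, 4> CallStack;
  };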


>> Overall, I'm really glad you've tackled this.  It would help a lot if you could do some of the other patches to make it more clear and straight-forward.  It's hard to understand the actual transformation as is :)
> 
> Thanks! One of the other cleanups I'd like is to make this code use ConstantFoldLoadFromConstPtr somehow. Reusing the constant folder is nice because it knows lots of tricks for loading constants that ComputeLoadResult doesn't, but the gotcha is that it bails on globals that aren't marked constant. GlobalOpt wants to use the definitive initializers, even if they aren't constant.

I haven't looked at this code for some time, but isn't this just like constant folding an extractvalue from the global's initializer?
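
i.e. something shaped like this (an assumed helper, not the existing ComputeLoadResult), if we decide trusting the definitive initializer is ok:

  #include "llvm/Constants.h"
  #include "llvm/GlobalVariable.h"
  using namespace llvm;

  // Loading field N of the global is just operand N of the initializer
  // aggregate, whether or not the global is marked constant.
  static Constant *LoadFieldFromInitializer(GlobalVariable *GV, unsigned FieldNo) {
    if (!GV->hasDefinitiveInitializer())
      return 0;                              // the initializer might not be the real one
    if (ConstantStruct *CS = dyn_cast<ConstantStruct>(GV->getInitializer()))
      if (FieldNo < CS->getNumOperands())
        return CS->getOperand(FieldNo);
    return 0;                                // zeroinitializer, arrays, etc. not handled here
  }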

> I could go through the ConstantExpr I'm given and temporarily mark all globals constant, call into constant folding, then mark them non-const afterwards? Eww?

Definite eww.  :)

-Chris


