[llvm-commits] cxa_guard elimination

Nick Lewycky nicholas at mxc.ca
Mon Feb 20 04:06:04 PST 2012


Chris Lattner wrote:
> On Feb 12, 2012, at 1:38 AM, Nick Lewycky wrote:
>> Ok.  Some thoughts:
>>>
>>>
>>>       virtual void getAnalysisUsage(AnalysisUsage&AU) const {
>>> +      AU.addRequired<DominatorTree>();
>>>
>>> Is this *really really* required?  This will be the first module pass that requires a function pass in mainline IIRC.
>>
>> We really just need to show that the load/store to the global can't precede the top of the guard. Replacing it with a unique-pred walk will work for cases where it isn't in or preceded by a loop. That's nice, but it isn't good enough.
>>
>> What I don't need out of DomTree is building out the tree for the entirety of the function, particularly the parts dominated by the guard-top. I'd be happy with a subset. SSAUpdater does this nifty trick where it creates a subset of the CFG then builds a DomTree on that alone. I think that idea would work here, but that's too complicated for me to implement. Refactor-o?
>
> I'm going to continue to worry about this for a bit.  To avoid blocking forward progress, could you go with "unique dominator chain" for now, and we can talk about switching to dominators as a follow-on patch?

I don't think that works. I need to prove that the acquire happens 
before any use of the global. If I only walk up the unique predecessors 
of the acquire block, I would either see a use of the global (certainly 
unsafe!) or not. If not, I still can't tell whether a use elsewhere in 
the function runs before or after the acquire block.

What I'd need is the set of all blocks reachable from entry without 
going through the acquire (or the inverse, which is what I'm using 
DomTree for now).

> When the time comes, I'd be curious to get numbers about how often it matters. Given that the patterns we're looking for are simple and this is happening before inlining, my guess is that there is a negligible win here.

Okay. If we want to avoid the DomTree requirement, the way to do it is 
to require that the only uses of the global be the ones we simulate. 
Obviously that's very limiting (we only work on dead variables that are 
dead before any other optimizer has run...).

>>> Does the passmanager only compute DominatorTree for functions lazily when asking for it?
>>
>> Yep! This actually works!
>
> Ok, that's good at least :)
>
>>> Are you sure that GlobalOpt isn't changing CFGs in a way that would invalidate DomTree?
>>
>> I've done more than a cursory glance to see that it doesn't play with TerminatorInsts, but I haven't taken the effort to seriously prove it to myself.
>>
>> I was assuming that calling getAnalysis<DominatorTree>(F) twice would actually run the domtree analysis on F twice, especially since we don't claim to preserve DominatorTree. I haven't checked though.
>
> I'm pretty sure that is not the case.  getAnalysis<>  is supposed to compute and then cache it.  The passmgr assumes that if passA uses passB that passA preserves passB across its own execution or across all queries it will do to it.
>
>>> It's not your bug, but that loop really really should use 'continue' to make the control flow more obvious.
>>
>> I'm not sure about this. Are you asking for the tests to be flipped around and call continue in the case everything is okay, then have an unconditional "return true" as the fall-through? I'm not sure that would be better?
>
> I'm saying the control flow should be something like this:
>
>    for () {
>
>       if (something ok) {
>          ...
>          continue;
>       }
>
>       if (something else ok) {
>         ...
>          continue;
>       }
>
>       // can't handle this.
>       return false;
> }

Gotcha. That's on my todo list now.

>>> +static Instruction *UseToInst(Value *V) {
>>>
>>> Perhaps this is prevented by how it is called, but you can have an infinite constant loop through a global variable and its initializer.
>>
>> static char *x = (char*)&x; ?!  :-(
>
> Yeah, GlobalValue isa<Constant>  after all.
>
>> Good catch, I haven't fixed this yet locally but I'll get to it. I think I want to do away with UseToInst entirely.
>
> That would be nice :)
>
>> Digression. I'd recently noticed that we have a number of different little IR interpreters all over, and started making a list. There's globalopt, there's the interpreter ExecutionEngine, there's another in jump threading, and SCEV has one (getSCEV) or two (EvaluateExpression) depending on how you count. I suspect there's more, but I'm not actively looking for them, just writing them down as I see them.
>
> I would rather just remove the EE interpreter than worry about it.  It is completely broken in many different ways.
>
>> Would it help to refactor them? We could do so, where the user provides an InstVisitor implementation and we provide the ability to ask for the result of a given instruction (lazily computing the things it depends on) or forwards computation, or a mix of both. I think it would help. I just didn't want to do all this before committing cxa_guard elimination.
>
> I'm not sure that this is a good idea.  To me this sounds like a constant folding problem :)
>
>>> The EvaluateBlock/EvaluateFunction interfaces really need to be split out into methods on a helper class that maintain the state, instead of passing the state through a dozen arguments into each recursive call.  Can you split this out into its own patch?
>>>
>>>
>>> There are various blocks of logic in OptimizeGuardedInitialization that really need some block comments explaining what is going on :)
>>>
>>>
>>>        InstResult = ConstantExpr::get(BO->getOpcode(),
>>> -                                     getVal(Values, BO->getOperand(0)),
>>> -                                     getVal(Values, BO->getOperand(1)));
>>> +      Constant *Op0 = getVal(Values, BO->getOperand(0));
>>> +      Constant *Op1 = getVal(Values, BO->getOperand(1));
>>> +      if (!Op0 || !Op1) return false;
>>>
>>> It's hard for me to understand what is going on here.  Why did getVal suddenly start being able to fail?  Can you split this out into its own patch?
>>
>> This is the ugly part; the constant evaluator for guard elimination shouldn't be the same as the constant evaluator for running global constructors.
>>
>> When running the global constructors, you start at the beginning of every function, and you run every instruction. You know all function arguments, because the outer-most function call takes void, and everything else is either read from a global or it's an Instruction that dominates it. Furthermore, because this is before-main, you're allowed to assume that any initializer in a global is the value it has at this point,
>
>
>> even if the global isn't constant.
>
> I'm not sure that this last part is true.  Do we care about other ctors known to run earlier with init_priority?  What about static ctors in .dll files that are loaded post-main (dlopen)?

We don't care. We don't support ctor priority at all. Within a TU the 
constructors run in source order, but the order between TUs is undefined.

>> The way that the cxa_guard elimination works is by finding the first block that only runs on the first execution of the guard, and then simulating from there. Because we haven't simulated the blocks leading up to that point, none of the local variables are present. Furthermore, we can't really rely on ComputeLoadResult because it will look inside non-Constant global variables and trust that the initializer is correct, but we're simulating code past-main so that isn't true any more. The attached patch handles this by making ComputeLoadResult record what GVs it looked at when computing its result, then verifying that they were all constant or set inside the once-initialized code first.
>
>
>> An alternative way to do cxa_guard elimination using the evaluator in globalopt as it exists today would be to start at the top of the function and make decisions that would lead you towards the cxa_guard_acquire. This would let us support cases like Bullet returning an identity matrix by alloca'ing the elements, then in the once-initialized code storing to them and running the copy constructor. Currently that fails because we haven't executed the allocas. How you actually pick the right values is tricky (especially when both paths in the CFG work), but still possible.
>
> This is definitely interesting, but it is prone to its own sorts of issues.  The code leading up to the static constructor might be completely unrelated to the constructor and may cause evaluation to fail, .e.g:
>
> static int foo(int y) { return y+4; }
>
> void bar() {
>    int a = 4;
>    printf();  // can't be evaluated
>    static int x = foo(a);  // should be evaluatable!
>    ...
> }
>
> This is starting to vaguely remind me of "code slicing" problems.  Would it be possible to split this into two steps: 1) identify the IR in the outermost function that needs to be simulated, and 2) simulating it?  Right now #1 is implicitly the linear sequence of code between the guards.

Yes, but this is exactly what we do with static constructors now. A 
printf will break the simulation. We just think that would be rarer in 
a static constructor than in the code between the start of a function 
and the guard acquire.

   However, to handle the Bullet case (and the contrived case above), 
you could start with those instructions, and build up a set of other 
instructions in the function that are referenced by that initial seed 
set.  In the case of bar above, that would tag the alloca, and then the 
store of 4 to the alloca.

Another way to say this is a lazy evaluation plus a scan to eagerly 
execute the relevant memory operations.

> Once all of the instructions are identified, you have sort of a "masked evaluate" operation that would take a SmallPtrSet of Instructions and start simulating from the top of the function.  One interesting/tricky aspect here is getting the mod/ref sets right for function calls.  Here we have to know that printf can't mod/ref 'a', so the first pass doesn't need to add it to the simulation set.
>
>
>> Anywho... yes I can split this into its own patch, assuming that's still what you want me to do instead of not abusing globalopt like this. :)
>
> The first right step in this direction (regardless of what approach you take above) is to refactor the existing EvaluateFunction/Block stuff into a class that contains the state.  That will make it a lot easier to reason about, and then these guys won't have to take 57 arguments :)

Done! :)

>>> Overall, I'm really glad you've tackled this.  It would help a lot if you could do some of the other patches to make it more clear and straight-forward.  It's hard to understand the actual transformation as is :)
>>
>> Thanks! One of the other cleanups I'd like is to make this code use ConstantFoldLoadFromConstPtr somehow. Reusing the constant folder is nice because it knows lots of tricks for loading constants that ComputeLoadResult doesn't, but the gotcha is that it bails on globals that aren't marked constant. GlobalOpt wants to use the definitive initializers, even if they aren't constant.
>
> I haven't looked at this code for some time, but isn't that just like constant folding an extractvalue from the global's initializer?

Yes, they both look at the value of the initializer, but GlobalOpt only 
checks hasDefiniteInitializer() while constant folding checks 
hasDefiniteInitializer() && isConstant().

For each TU, the FE emits a single function which calls all the dynamic 
initializers in order, and then adds that function to llvm.global.ctors. 
GlobalOpt then either simulates all of a TU's initializers, or none of 
them because the simulation failed. In C++ there is no ordering of the 
initializers between multiple TUs, so even if we skip the first function 
in llvm.global.ctors, we can go ahead and simulate the second one thanks 
to the rule that any order is okay.

It's safe for GlobalOpt to read the initializers of non-constant 
globals because the order of the TUs is arbitrary, because we bail on 
entire TUs instead of initializing some variables but not others, and 
because the value of a dynamically initialized global is defined to be 
zero before its initializer runs.

A TU like this:

   static int a = f();
   static int b;
   int f() { return b + 1; }

is required to set a to 1 (b is still zero when f runs), and globalopt 
will get that right. If we only looked at constant globals, we wouldn't 
be able to optimize this example.

As for running part of a TU, we need to make sure that we don't leave 
things in an inconsistent state. If I understand correctly, something 
like this:

   extern void doit();
   int f() { doit(); return 1; }
   int x = f();

allows another translation unit to check whether doit() has been called 
by looking at the value of 'x'. We can't just statically initialize 
x = 1 and leave doit() to be called when the dynamic initializer really 
happens.

Nick


