[llvm-commits] Enable early dup for small bb, take2

Wed Jun 15 08:39:58 PDT 2011

On Jun 13, 2011, at 11:16 AM, Rafael Avila de Espindola wrote:

> On 11-06-13 01:29 PM, Bob Wilson wrote:
>> It seems to me that this isn't a clear win.  It helps some cases but
>> hurts others.
> 
> In the cases I have looked at, the regressions are either down to luck (different registers, same code) or happen because other passes do something silly when given a reduced IL. Take a look at the two logs:
> 
> http://people.mozilla.com/~respindola/patch.log.bz2
> http://people.mozilla.com/~respindola/trunk.log.bz2
> 
> It is actually funny :-)

I looked but I'm not sure what to look for.  There are huge differences but most of them are just block renumbering.

> 
>> As you've seen, updating PHIs for tail duplication is tricky.  I'd
>> really prefer to avoid that.  If we only run the taildup pass after
>> regalloc, we can remove all that complexity.  Something similar would
>> still be needed in the separate indirect branch duplication pass
>> (that I'm still working on), but at least we wouldn't have to do it
>> in taildup as well.
>> 
>> How important do you think it is to do this?  Am I misreading your
>> data?
>> 
> 
> I do think it is important. The way I read the data is that there is useful cleanup that duplicating small blocks can do. Some passes that run afterwards can currently make bad decisions on the new input, but that is a problem that should be fixed in those passes.
> 
> Ideally, the blocks the early pass is duplicating are the same ones the late one would duplicate. So this is really just cleaning it up.

Well, ideally, if the early and late passes are duplicating the same code, then we should get the same results.  Now we know that isn't true for register allocation, at least with linear scan, but it is a nice goal.  I wonder if there are other things besides that going on.

> 
> One thing that was surprising even to me was that clang became a tiny bit faster. I guess because it is passing fewer blocks down the pipeline.
> 
> I started looking at this because my old patch (duplicating indirectbr in clang) shows that having more cleanup happen between the duplication and the register allocator can help firefox.
> 
> Note that the speed improvement in firefox was measured in a full js benchmark. I can run instruments on it if you are curious about what the impact was on the JS interpreter only.
> 
> As for correctness, I would argue that it is safer to have code that is executed (and therefore tested) more often. The issues I fixed were found by increasing the dup size limit to 8 and bootstrapping clang. The bugs were there and are real; it is just hard to trigger them with an indirectbr only pass (as early dup is right now). When someone does hit them, they would have been far harder to debug.

So, again, my preference is to work toward eliminating all the tricky phi-updating code in taildup (assuming that we end up with a separate and more general version of that code in an indirect branch duplicating pass).  If we're not going to do that, maybe the best thing would be to generalize the phi-updating code into a separate "duplicate code region" utility that could be used for both tail dup and indirect branch dup.
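
To make concrete what "updating PHIs" involves, here is a minimal sketch on a toy SSA CFG. It is not the taildup code: the Block class, the tail_duplicate helper, and the block and value names are all hypothetical, and it assumes the predecessor ends in an unconditional branch to the tail. It shows the three steps that have to stay consistent: rewriting the copied instructions to use the PHI's incoming value, dropping the predecessor's entry from the tail's PHIs, and adding an entry for the predecessor to the successors' PHIs.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    # Toy SSA block (hypothetical model, not an LLVM data structure).
    phis: dict = field(default_factory=dict)   # dest -> {pred name: incoming value}
    insts: list = field(default_factory=list)  # (dest, opcode, operands)
    succs: list = field(default_factory=list)  # successor block names


def tail_duplicate(cfg, tail, pred):
    """Copy the body of `tail` into `pred` and fix up the affected PHIs."""
    t, p = cfg[tail], cfg[pred]
    # Assumption: pred falls through or branches unconditionally to tail.
    assert p.succs == [tail]

    # Inside the copy, each PHI result is just the value flowing in from pred.
    rename = {dest: incoming[pred] for dest, incoming in t.phis.items()}

    # Clone the instructions, giving every definition a fresh name.
    for dest, op, operands in t.insts:
        new_dest = f"{dest}.{pred}"
        new_ops = tuple(rename.get(o, o) for o in operands)
        p.insts.append((new_dest, op, new_ops))
        rename[dest] = new_dest

    # pred now branches wherever tail did, and no longer reaches tail.
    p.succs = list(t.succs)
    for incoming in t.phis.values():
        del incoming[pred]

    # Successor PHIs need a new entry for the edge coming from pred.
    for s in t.succs:
        for incoming in cfg[s].phis.values():
            if tail in incoming:
                value = incoming[tail]
                incoming[pred] = rename.get(value, value)
    return rename


# Two predecessors A and B merge into tail T, which flows into S.
cfg = {
    "A": Block(succs=["T"]),
    "B": Block(succs=["T"]),
    "T": Block(phis={"x": {"A": "a", "B": "b"}},
               insts=[("y", "add", ("x", "1"))],
               succs=["S"]),
    "S": Block(phis={"z": {"T": "y"}}, insts=[("r", "ret", ("z",))]),
}
tail_duplicate(cfg, "T", "A")
```

The sketch deliberately leaves out the harder cases: a value defined in the tail and used by a non-PHI instruction in a later block would need a new PHI inserted (full SSA updating), and a tail that is its own successor is not handled. Those are the kinds of cases a shared "duplicate code region" utility would have to cover.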

It sounds like either way I need to get back to working on my indirect branch dup pass.....