[llvm-commits] [PATCH] X86: Turn cmovs into branches when profitable.

Sat May 5 05:55:15 PDT 2012

On 04.05.2012, at 01:35, Evan Cheng wrote:

> 
> On Apr 29, 2012, at 10:00 AM, Benjamin Kramer wrote:
> 
>> 
>> On 27.04.2012, at 07:48, Evan Cheng wrote:
>> 
>>> 
>>> 
>>> On Apr 26, 2012, at 12:30 PM, Benjamin Kramer <benny.kra at googlemail.com> wrote:
>>> 
>>>> 
>>>> On 26.04.2012, at 08:04, Evan Cheng wrote:
>>>> 
>>>>> Hi Benjamin,
>>>>> 
>>>>> You are right. LLVM likes to canonicalize to select instructions and that can really hurt us on some modern CPUs. I've wanted a way to undo LLVM selects for a while. 
>>>>> 
>>>>> That said, I'm not sure this is the right approach. I think it's better to turn selects back into control flow at the LLVM IR level. I envision something that's done around codegen prep time. IMHO, there are a few potential benefits. 
>>>>> 
>>>>> 1. It will be target independent. I don't want to duplicate this kind of pattern for different targets. 
>>>>> 2. It will be possible to use better and more sophisticated heuristics (look at how the MI-level if-converter computes profitability). It should be able to take advantage of profile info if it's available. 
>>>>> 
>>>>> I also think the isel approach feels wrong. Isel really shouldn't use these complex predicates to drive isel decisions. It hurts compile time and it just generally goes against the design. Expanding pseudo instructions into control flow is also kinda yucky. We should only use it when there isn't a better design. 
>>>>> 
>>>>> Can I interest you in writing an LLVM IR de-select pass? :) We'd be more than happy to help you with performance benchmarking and analysis. 
>>>> 
>>>> My first idea was to do it during selection DAG formation, but that was even uglier. It makes sense to do it at the IR level, though.
>>>> 
>>>> Attached is a basic pass that is run by codegen the same way CodeGenPrepare runs. It uses the (dumb) cmp-with-load heuristic I came up with for the x86 backend patch. I didn't run the test-suite this time (no non-noisy builder at my hands) but the improvement on richards_benchmark is still measurable. Testing the patch is easy, just pass -enable-select2branch to llc.
>>> 
>>> Thanks. We should run some tests. 
>>> 
>>>> 
>>>> Some questions remain open. Should it be merged into CodeGenPrepare? We need some kind of target hook to avoid doing the optimization on CPUs that are unlikely to benefit, like Atom and pre-A9 ARM cores. Of course benchmarking is key, it's just hard to dig up the hardware ;)
>>> 
>>> I think it should be merged into codegenprepare. Given the primitive heuristics, we should add a target hook so it's only run when it's explicitly opted in.
>> 
>> Attached is a version in CodeGenPrepare; the option is now "-enable-cgp-select2branch". Some tests fail because the assembly they check for expects more cmovs. I've yet to figure out how to wire up a target hook into CodeGenPrepare.
>> <cmov-into-branch-2.patch>
> 
> 
> I tried the patch and saw a very small speedup on a SandyBridge Mac. I think this can go in, but it requires a target hook (add it to TargetLowering). It's probably best to leave it off for now; we should only consider turning it on once simplifycfg propagates probability info onto selects when it forms them.

Committed with a target hook that matches !Atom X86 and A9 ARM in r156234. It's currently disabled by default but the current dumb heuristic is probably conservative enough to just turn it on by default. I'll see what's needed to get simplifycfg to preserve probabilities when forming selects.
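
As a rough illustration of the transformation discussed in this thread (a sketch, not the code committed in r156234), the two shapes look like this at the C level. The function names pick_select and pick_branch are made up for the example, and the reading of the cmp-with-load heuristic, that a branch can help when the compared value comes straight from memory, is an assumption based on the description earlier in the thread rather than on the patch itself.

```c
/* Illustrative only: the C-level shape of a select versus a branch.
 * pick_select and pick_branch are hypothetical names, not from the patch. */

/* Select form: usually canonicalized to an IR select, which x86 tends to
 * lower to cmp + cmov. The result then has a data dependency on the load
 * of *p feeding the compare. */
unsigned pick_select(const unsigned *p, unsigned limit, unsigned a, unsigned b) {
    return (*p > limit) ? a : b;
}

/* Branch form: the same computation written as control flow. With a
 * well-predicted branch, later code does not have to wait for the load. */
unsigned pick_branch(const unsigned *p, unsigned limit, unsigned a, unsigned b) {
    unsigned r = b;
    if (*p > limit)
        r = a;
    return r;
}
```

Both functions compute the same value, and a compiler is free to turn either source form into the other; the code only shows the two shapes being traded off.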

>>>> 
>>>> Coming up with good heuristics is hard. I'd love to use BranchProbabilityInfo here, but it doesn't understand selects. We also need some clever way to break up long select chains.
>>> 
>>> That seems like a BranchProbabilityInfo deficiency. Andy, what do you think? Benjamin, can you point us to examples of long select chains? Are they in real world benchmarks?
>> 
>> One of the examples that I found in the test-suite (though not in a hot path) was something like the following loop:
>> 
>> unsigned foo(unsigned *x) {
>> unsigned i, max = 0;
>> for (i = 0; i != 7; ++i) {
>>   if (x[i] > max) max = x[i];
>> }
>> return max;
>> }
>> 
>> We unroll it and form a 7-instruction-long chain of cmovs, which is obviously bad. I haven't found a better example yet but I'm sure there is one.
> 
> ICC generates the same code though.

It's not a good example, but I've seen ICC generate really bad code on many occasions ;)

- Ben
> 
> Evan
> 
>> 
>> - Ben
>
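
For reference, the select chain mentioned for the quoted 7-element max loop can be sketched by unrolling the loop by hand. This is only a C-level picture of the dependency chain, assuming each conditional becomes one select/cmov; max7_loop and max7_chain are hypothetical names and this is not actual compiler output.

```c
/* The loop from the thread, renamed so both versions can coexist. */
unsigned max7_loop(const unsigned *x) {
    unsigned i, max = 0;
    for (i = 0; i != 7; ++i) {
        if (x[i] > max) max = x[i];
    }
    return max;
}

/* Hand-unrolled select form: every step reads the previous value of max,
 * so the seven conditionals form one serial dependency chain. */
unsigned max7_chain(const unsigned *x) {
    unsigned max = 0;
    max = x[0] > max ? x[0] : max;
    max = x[1] > max ? x[1] : max;
    max = x[2] > max ? x[2] : max;
    max = x[3] > max ? x[3] : max;
    max = x[4] > max ? x[4] : max;
    max = x[5] > max ? x[5] : max;
    max = x[6] > max ? x[6] : max;
    return max;
}
```

Each assignment in max7_chain depends on the one before it, which is why a long chain of cmovs serializes execution, whereas correctly predicted branches would let the comparisons overlap.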