[PATCH] D109368: [LV] Don't vectorize if we can prove RT + vector cost >= scalar cost.

Wed Oct 27 13:19:08 PDT 2021

lebedev.ri added a comment.

In D109368#3089809 <https://reviews.llvm.org/D109368#3089809>, @dmgreen wrote:

>> Can you point me at the test where that happens?
>
> Hmm I don't know if there is a test. This should hopefully show it: https://godbolt.org/z/6Th4o1s5K
>
> If you print the costs for the runtime checks, you can see they are unsimplified, with the umul being the largest part of the cost:
>
>   Cost of 0 for RTCheck   %4 = trunc i64 %0 to i32                                              
>   Cost of 10 for RTCheck   %mul31 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 1, i32 %4)
>   Cost of 0 for RTCheck   %mul.result = extractvalue { i32, i1 } %mul31, 0                      
>   Cost of 0 for RTCheck   %mul.overflow = extractvalue { i32, i1 } %mul31, 1                    
>   Cost of 1 for RTCheck   %5 = add i32 %2, %mul.result                                          
>   Cost of 1 for RTCheck   %6 = sub i32 %2, %mul.result                                          
>   Cost of 1 for RTCheck   %7 = icmp ugt i32 %6, %2                                              
>   Cost of 1 for RTCheck   %8 = icmp ult i32 %5, %2                                              
>   Cost of 1 for RTCheck   %9 = select i1 false, i1 %7, i1 %8                                    
>   Cost of 1 for RTCheck   %10 = icmp ugt i64 %0, 4294967295                                     
>   Cost of 1 for RTCheck   %11 = or i1 %9, %10                                                   
>   Cost of 1 for RTCheck   %12 = or i1 %11, %mul.overflow                                        
>   Cost of 1 for RTCheck   %13 = or i1 false, %12                                                
>   LV: Minimum required TC for runtime checks to be profitable:28                                
>
> I'm not sure if they should be simplified by the builder during construction, simplified prior to costing or the code to create them needs to be more precise.

So, good news and bad news. While the `@llvm.umul.with.overflow` case
was straightforward (done in 156f10c840a0 <https://reviews.llvm.org/rG156f10c840a07034a6bd638b5912054100365741>), there are still a significant number
of inefficiencies in the IR for these checks. I wasn't particularly looking forward to
arriving at this answer, but it is pretty obvious: if we really want to minimize
the estimated cost of these checks, we have to run instsimplify (or even instcombine)
on them first. The caveat here is that we first need to defuse `SCEVExpanderCleaner`,
because simplification will leave dead instructions behind, and if they are not removed
they will again inflate the cost artificially. I feel like that improvement is best done after
this change itself, even though I'm not quite sure yet how to approach it.
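
To illustrate what simplification could buy, here is a hand-simplified sketch of
the check sequence quoted above. This is not actual instsimplify output: `%n` and
`%base` are stand-in names for `%0` and `%2`, and the function wrapper exists only
to make the fragment self-contained. It assumes the usual folds apply: a
multiply-with-overflow by 1 yields its other operand and a `false` overflow bit,
`select i1 false, a, b` is `b`, and an `or` with `false` is its other operand.

```llvm
; Hand-simplified sketch under the assumptions stated above; names are illustrative.
define i1 @rtcheck_simplified(i64 %n, i32 %base) {
  ; umul.with.overflow(1, trunc(%n)) folds to { trunc(%n), false }
  %n.trunc = trunc i64 %n to i32
  %end = add i32 %base, %n.trunc
  ; select i1 false, %7, %8 folds to %8, so only the 'ult' compare survives
  %wraps = icmp ult i32 %end, %base
  %too.big = icmp ugt i64 %n, 4294967295
  ; the 'or' with %mul.overflow (false) and the 'or i1 false, ...' both fold away
  %fail = or i1 %wraps, %too.big
  ret i1 %fail
}
```

Going by the per-instruction costs in the quoted listing, the check itself would
cost roughly 4 instead of 19. The `sub` and the `icmp ugt` that fed the unselected
arm of the `select` (`%6` and `%7`) become dead, which is exactly the kind of
leftover that has to be deleted before costing.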

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109368/new/

https://reviews.llvm.org/D109368