[PATCH] D45748: [RISCV] Add peepholes for Global Address lowering patterns

Thu May 24 00:46:27 PDT 2018

sabuasal added a subscriber: efriedma.
sabuasal added a comment.

In https://reviews.llvm.org/D45748#1110644, @asb wrote:

> The point I was trying to make is that even with these peepholes, there are trivial cases that aren't caught, such as the example I shared:
>
>   @foo = global [6 x i16] [i16 1, i16 2, i16 3, i16 4, i16 5, i16 0], align 2                                     
>                                                                                                                  
>   define i32 @main() nounwind {                                                                                   
>   entry:                                                                                                          
>     %0 = load i16, i16* getelementptr inbounds ([6 x i16], [6 x i16]* @foo, i32 0, i32 4), align 2                
>     %cmp = icmp eq i16 %0, 140                                                                                    
>     br i1 %cmp, label %if.end, label %if.then                                                                     
>                                                                                                                  
>   if.then:                                                                                                        
>     tail call void @abort()                                                                                       
>     unreachable                                                                                                   
>                                                                                                                  
>   if.end:                                                                                                         
>     ret i32 0                                                                                                     
>   }                                                                                                               
>                                                                                                                  
>   declare void @abort()
>
>
> This generates 3 instructions for the global access when you'd obviously prefer to use two:
>
>   lui     a0, %hi(foo)
>   addi    a0, a0, %lo(foo)
>   lhu     a0, 8(a0)
>   
>
> I proposed one way of adding a peephole, but don't see a way of differentiating between the case above and more complex cases where the peephole is counter-productive (such as in the longer IR sample I shared). Possible the solution would be to have an IR pass perform common subexpression elimination on the getelementptr calculations - possibly such a pass already exists.

Hi alex,

Actually, I took a second look today. Here is what is happening with this example:

- Base and offset are split.
- Your LoadStoreADDI peephole kicks in and it merges the ADDI of the offset in the Immediate field.
- My peephole will not kick is, simply because it has no tail node ADDI (i.e a node that adds an offset to base like I described above).

This case is a bit tricky because there is only one use for the GlobalNode and your peephole can even optimize the non-split (base + offset) in the ADDI by folding it in the immediate of lhu., creating:

  lui     a0, %hi(foo +8)
  lhu     a0, %lo(foo+8)(a0)

I could possibly update my peephole to detect a new kind of "Tail", I can regard an load\store with an immediate field as a tail node and get the offset there and merge it back in the Target lowering of base + offset.

> Anyway, is it really important? Probably not going to register in code profiling and probably a tiny impact on code size. I mainly raise it as I know we will all be poring over assembly output to look for missed optimisation opportunities and cases like the above stick out like a sore thumb. It's also the sort of thing compiler users do, and then confidently conclude that the compiler must be dumb.

I agree!

> I'm happy to land this patch as-is. It generates good code in a wide variety of cases (more so than many other backends). But please add the above example to your test file with a note about the missed optimisation opportunity.

The example is actually already added to the test (define dso_local i32 @load_half() nounwind).

On the other hand, after some chat with @efriedma  I am starting to think that this whole thing could be better handled with a machine function Pass that runs after MachineCSE.  Your point in your original longer example (and the example I have in the test case called control_flow) shows that we are getting different notes for the lowered global address (just run llc with --stop-before=machine--cse), hasUseOnce always considers the uses within the basic block you are on, not the whole function (which makes sense). The reason these two examples worked (and by worked I mean NOT having the offset merged back into the global address lowering) is because the ADDI generated for the offset was folded into the Load\STore and my peephole found no Tail to kick in. If we add a MachineFunctionPasss we we will have:

1. A global view of the whole function so the uses for an MI will return the "True" number of uses over all the blocks.
2. Since we are running after CSE we will have known any potential value of the separating of base and offset. for the test case we have here, we will break even (in terms of Instruction count) if the Base lowering had  two uses at least:

  BB0:
           lui     a0, %hi(foo)
           addi    a0, a0, %lo(foo)
  BB1:
           lhu     a1, 8(a0)
            ...
  BB2:
           lhu     a2, 16(a0)

VS

  BB0:
  BB1:
           lui     a0, %hi(foo+8)
           lhu     a1, %lo(foo+8)(a0)
            ...
  BB2:
           lui     a0, %hi(foo+16)
           lhu     a1, %lo(foo+16)(a0)

And win for cases more than that.

I think we can do these checks with virtual registers in an MF pass.
if you think it is worth it I can give it a shot in an MF Pass :)

https://reviews.llvm.org/D45748