[PATCH] D45748: [RISCV] Separate base from offset in lowerGlobalAddress

Wed May 16 10:31:40 PDT 2018

sabuasal added a comment.

In https://reviews.llvm.org/D45748#1100079, @asb wrote:

> I looked at this in some more detail. I have a few very rough statistics on the effect of the various peepholes on the GCC torture suite. I don't claim this is representative of real code, but of course global accesses are highly codebase dependent anyway. Stats are just on the total static instruction count, i.e. it doesn't take into account if instructions are moved from hot to cold basic blocks or vice versa.
>
> Improvement of your proposed peephole over your patch with the peephole disabled:
>
> - 26 files have reduced number of instructions
> - 10 have an increased number of instructions
> - 6 have changed output but the same static instruction count
>
>   If you then add my proposed peephole on top of that, then you see the following changes (baseline = this patch including your peephole)
> - 23 files have reduced number of instructions
> - 11 have increased number of instructions
> - 2 have changed output but the same static instruction count

So you are saying your addition hurts size, right?

> If it were doable, the ideal starting point would be the following:
> 
> - If the global base has only a single reference across the whole function, or every reference has the same offset then combine the global with offset
> - If the global base has multiple references with different offsets, then never combine the global with the offset

This is exactly what I am doing in my peephole:

- Catch a tail node (RISCV::ADD or RISCV::ADD)
- Look at the operands of the node t detect a global address lowering sequence (LUI  %hifollowed by and ADDI %lo)
- Look at the other operands of the tail node to catch the and reconstruct the offset. (it can be just an operand if the tail is an ADDI, or you might have to reconstruct from an LUI +  ADDI pair if the offset is big or just an LUI)
- Merge the offset back into the GlobalAddress if the global address only had one use. We can trust that at this point because ALL the blocks have RISCV target nodes.

> I think the above two rules would be sensible regardless of whether you are linking statically or e.g. dynamic linking with PIC. Unfortunately we don't have access to that information at the point we'd like to make a decision about combining an offset. A particular lui+addi of a global may have only a single use, but that global may be referenced many times in the function. New TargetGlobalAddress nodes are introduced each time rather than reusing them. In these cases, MachineCSE can remove redundant instructions as long as different offsets aren't folded into the global. Unless we can force the reuse of identical TargetGlobalAddress nodes we can't easily write a peephole pass that respects the two rules proposed above. This input shows the problem well:
> 
>   @a = internal unnamed_addr global [4 x i32] zeroinitializer, align 4
>   
>   ; Function Attrs: noreturn nounwind
>   define dso_local i32 @main() local_unnamed_addr #0 {
>   entry:
>     %0 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 0), align 4
>     %cmp = icmp eq i32 %0, 0
>     br i1 %cmp, label %if.end, label %if.then
>   
>   if.then:                                          ; preds = %entry
>     tail call void @abort() #3
>     unreachable
>   
>   if.end:                                           ; preds = %entry
>     %1 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 1), align 4
>     %cmp1 = icmp eq i32 %1, 3
>     br i1 %cmp1, label %if.end3, label %if.then2
>   
>   if.then2:                                         ; preds = %if.end
>     tail call void @abort() #3
>     unreachable
>   
>   if.end3:                                          ; preds = %if.end
>     %2 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 2), align 4
>     %cmp4 = icmp eq i32 %2, 2
>     br i1 %cmp4, label %if.end6, label %if.then5
>   
>   if.then5:                                         ; preds = %if.end3
>     tail call void @abort() #3
>     unreachable
>   
>   if.end6:                                          ; preds = %if.end3
>     %3 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 3), align 4
>     %cmp7 = icmp eq i32 %3, 1
>     br i1 %cmp7, label %if.end9, label %if.then8
>   
>   if.then8:                                         ; preds = %if.end6
>     tail call void @abort() #3
>     unreachable
>   
>   if.end9:                                          ; preds = %if.end6
>     tail call void @exit(i32 0) #3
>     unreachable
>   }
>   
>   declare void @abort()
>   
>   declare void @exit(i32)
> 
> 
> So, what is the path forwards here? It feels like we should try to do something sensible without adding too much complex logic. Without a whole-function view we're not going to be better in all cases. Having the combines discussed in this thread feels like a reasonable starting point, but it is possible to imaging pathological inputs with uses spread across many basic blocks. Maybe these combines should be enabled only for a static relocation model?
> 
> I'd welcome further thoughts.

https://reviews.llvm.org/D45748