[PATCH] D45748: [RISCV] Separate base from offset in lowerGlobalAddress

Tue May 15 13:34:43 PDT 2018

asb added a comment.

I looked at this in some more detail. I have a few very rough statistics on the effect of the various peepholes on the GCC torture suite. I don't claim this is representative of real code, but of course global accesses are highly codebase dependent anyway. Stats are just on the total static instruction count, i.e. it doesn't take into account if instructions are moved from hot to cold basic blocks or vice versa.

Improvement of your proposed peephole over your patch with the peephole disabled:

- 26 files have reduced number of instructions
- 10 have an increased number of instructions
- 6 have changed output but the same static instruction count

If you then add my proposed peephole on top of that, then you see the following changes (baseline = this patch including your peephole)

- 23 files have reduced number of instructions
- 11 have increased number of instructions
- 2 have changed output but the same static instruction count

If it were doable, the ideal starting point would be the following:

- If the global base has only a single reference across the whole function, or every reference has the same offset then combine the global with offset
- If the global base has multiple references with different offsets, then never combine the global with the offset

I think the above two rules would be sensible regardless of whether you are linking statically or e.g. dynamic linking with PIC. Unfortunately we don't have access to that information at the point we'd like to make a decision about combining an offset. A particular lui+addi of a global may have only a single use, but that global may be referenced many times in the function. New TargetGlobalAddress nodes are introduced each time rather than reusing them. In these cases, MachineCSE can remove redundant instructions as long as different offsets aren't folded into the global. Unless we can force the reuse of identical TargetGlobalAddress nodes we can't easily write a peephole pass that respects the two rules proposed above. This input shows the problem well:

  @a = internal unnamed_addr global [4 x i32] zeroinitializer, align 4

  ; Function Attrs: noreturn nounwind
  define dso_local i32 @main() local_unnamed_addr #0 {
  entry:
    %0 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 0), align 4
    %cmp = icmp eq i32 %0, 0
    br i1 %cmp, label %if.end, label %if.then

  if.then:                                          ; preds = %entry
    tail call void @abort() #3
    unreachable

  if.end:                                           ; preds = %entry
    %1 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 1), align 4
    %cmp1 = icmp eq i32 %1, 3
    br i1 %cmp1, label %if.end3, label %if.then2

  if.then2:                                         ; preds = %if.end
    tail call void @abort() #3
    unreachable

  if.end3:                                          ; preds = %if.end
    %2 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 2), align 4
    %cmp4 = icmp eq i32 %2, 2
    br i1 %cmp4, label %if.end6, label %if.then5

  if.then5:                                         ; preds = %if.end3
    tail call void @abort() #3
    unreachable

  if.end6:                                          ; preds = %if.end3
    %3 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 3), align 4
    %cmp7 = icmp eq i32 %3, 1
    br i1 %cmp7, label %if.end9, label %if.then8

  if.then8:                                         ; preds = %if.end6
    tail call void @abort() #3
    unreachable

  if.end9:                                          ; preds = %if.end6
    tail call void @exit(i32 0) #3
    unreachable
  }

  declare void @abort()

  declare void @exit(i32)

So, what is the path forwards here? It feels like we should try to do something sensible without adding too much complex logic. Without a whole-function view we're not going to be better in all cases. Having the combines discussed in this thread feels like a reasonable starting point, but it is possible to imaging pathological inputs with uses spread across many basic blocks. Maybe these combines should be enabled only for a static relocation model?

I'd welcome further thoughts.

https://reviews.llvm.org/D45748