[PATCH] D123394: [CodeGen] Late cleanup of redundant address/immediate definitions.

Fri Sep 9 05:48:26 PDT 2022

jonpa added reviewers: asb, tstellar, RKSimon, t.p.northover, rengolin, nemanjai, qcolombet, hfinkel, lattner, kparzysz, bjope.
jonpa added a comment.

As I have spent a certain amount of time developing this new target independent pass, I would be very happy to get some feedback.

Short summary of what this patch does:

- Cleaning up of rematerialized immediates: After ISel, an immediate is typically copied to all it's users from the original def, while regalloc then rematerializes it which causes a lot of identical immediate loads which are currently never cleaned away even if they are close in the same MBB.

- Cleaning up of big address offsets forced into registers: SystemZRegisterInfo::eliminateFrameIndex() loads the bigger part of an out of range offset into a register for each FI access, with the idea that these anchors should be reused between accesses. There is however no late "machine cleanup" of identical instructions currently, so there is no reuse at all. This surprised me and was the motivation for this patch, and I think this should be relevant to other targets as well.

It's ~300 LOC, with a small compile time approximately similar to MachineCopyPropagation.

On SystemZ I see ~22k instructions cleaned on SPEC and some slight benchmark improvements which is nice, but perhaps not enough to add a new pass to the backend. I expected this to be done in common code in the first place, and I think this is where it belongs as it is not really target specific.

The question for me now is whether this cleanup should be purposefully omitted (e.g. "doesn't matter enough on OOO CPUs"), or if there is a performance improvement available here. If there is, this patch could be used, or perhaps some other way of achieving this (i.e. earlier in the compilation - any ideas anyone?). If there *isn't*, then that would also be good to know (and perhaps make the comment somewhere in CodeGen)!

I hope you agree that it would be worth at least trying once if this cleanup matters or not. These targets have these number of test files improved: RISCV: 21, AMDGPU: 15, X86: 13, AArch64/ARM/Thumb: 12, PowerPC: 4, Mips: 2, BPF: 1. It would be very nice to learn how much of a cleanup targets achieve in terms of number of instructions eliminated and benchmark improvements. If there is no static improvement on benchmarks (no eliminated instructions) for a target, that would also be interesting to see why this was not even needed.

I am not really strongly arguing for this pass to be added, but I would very much like to come to some consensus on this one way or the other, thanks.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123394/new/

https://reviews.llvm.org/D123394