[PATCH] 'CSE' of ADRP instructions used for load/store addressing

Tim Northover t.p.northover at gmail.com
Mon Mar 31 04:44:12 PDT 2014


t.p.northover added you to the CC list for the revision "'CSE' of ADRP instructions used for load/store addressing".

Hi t.p.northover,

Hi,

The attached patch performs 'CSE' of ADRP instructions used for load/store addressing.

GCC always uses the common attribute to expose global symbols to other modules, but this does not seem to be mandatory for most compilation scenarios (let me know if I'm wrong). With -fno-common enabled, we get a chance to merge the global definitions within the current compilation module into a single monolithic structure. LLVM can then treat every global access as an access to a field of the newly created structure.

For a global symbol access, we usually emit the following instruction sequence:

adrp       ; load the page address (4KB boundary)
add        ; add the offset within the page
load/store ; use the address computed by the add

If the merged structure always fits within a single page, we will see fewer adrp instructions.
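For example, suppose two i32 globals x and y end up adjacent in one merged structure (register allocation and the merged symbol name are illustrative; GlobalMerge conventionally emits a symbol like _MergedGlobals). The second adrp/add pair disappears:

  ; before merging: each access computes its own page address
  adrp x0, x
  add  x0, x0, :lo12:x
  ldr  w1, [x0]
  adrp x2, y
  add  x2, x2, :lo12:y
  ldr  w3, [x2]

  ; after merging x and y into one structure:
  adrp x0, _MergedGlobals
  add  x0, x0, :lo12:_MergedGlobals
  ldr  w1, [x0]      ; x at offset 0
  ldr  w3, [x0, #4]  ; y at offset 4
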

LLVM already has a GlobalMerge pass, but it originally merges only internal (static) variables; this patch extends it to also cover external variables for AArch64.

There are some other considerations,
1) In theory, ADRP reduction may not always be a win, because
* the live range holding the page address becomes much longer, which can increase register pressure accordingly;
* the data section may contain holes, which can change data-cache behavior.
2) The AArch64 ldr/str unsigned-immediate form takes a 12-bit offset scaled by the access size, i.e. it can address 4096 elements (a maximum byte offset of 4095*sizeof(elem_type)), which for multi-byte elements reaches well beyond a 4KB page. In theory we could therefore merge globals into an even larger struct, but a static experiment shows little extra adrp reduction when growing the merged size from 4KB to 8KB.
3) The scenario for applying this optimization differs from the ARM target because of the different addressing modes.
4) Actually, we can eliminate the "add" instruction as well by propagating the in-page address, which is usually a relocation, into the load/store instruction; the original load/store offset can then be folded into the relocation's addend in the relocation section. This would be a separate optimization we could follow up on.
5) I know the ARM64 back-end is in trunk now, and I can also port this to ARM64 if needed.
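
To illustrate point 4, the in-page part of the address could be folded into the memory operand itself, letting the relocation addend carry the offset (a sketch only; register and symbol names illustrative):

  ; current sequence:
  adrp x0, _MergedGlobals
  add  x0, x0, :lo12:_MergedGlobals
  ldr  w1, [x0, #8]

  ; proposed follow-up: drop the add, move the offset into the addend
  adrp x0, _MergedGlobals
  ldr  w1, [x0, :lo12:_MergedGlobals+8]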

Finally, I don't have a real AArch64 machine to test the patch on, so I can only provide static statistics for SPEC2000 below. I would greatly appreciate it if anybody could help validate the performance.

(The "before merge"/"after merge" columns give the number of ADRP instructions.)

|CINT2000|before merge|after merge|decreased|
|256.bzip2|655|367|43.97%|
|186.crafty|3898|3786|2.87%|
|175.vpr|1636|1620|0.98%|
|255.vortex|7333|6950|5.22%|
|252.eon|1660|1658|0.12%|
|181.mcf|58|58|0.00%|
|164.gzip|718|622|13.37%|
|253.perlbmk|9460|9443|0.18%|
|197.parser|1437|1309|8.91%|
|300.twolf|2993|2713|9.36%|
|254.gap|6145|5596|8.93%|
|176.gcc|14650|12908|11.89%|
|183.equake|264|181|31.44%|
|188.ammp|937|862|8.00%|
|179.art|344|162|52.91%|
|177.mesa|4230|4230|0.00%|

Thanks,
-Jiangning

http://llvm-reviews.chandlerc.com/D3223

Files:
  include/llvm/IR/GlobalAlias.h
  lib/CodeGen/AsmPrinter/AsmPrinter.cpp
  lib/IR/Globals.cpp
  lib/Target/AArch64/AArch64ISelLowering.cpp
  lib/Target/AArch64/AArch64ISelLowering.h
  lib/Target/AArch64/AArch64TargetMachine.cpp
  lib/Transforms/Scalar/GlobalMerge.cpp
  test/CodeGen/AArch64/global_merge.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D3223.1.patch
Type: text/x-patch
Size: 10049 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140331/4a677b63/attachment.bin>

