[PATCH] D115690: [SystemZ] Implement orderFrameObjects().

Mon Jan 10 16:49:59 PST 2022

jonpa updated this revision to Diff 398776.
jonpa added a comment.
Herald added subscribers: kerbowa, nhaehnle, jvesely.

I have experimented further with the sorting and improved the patch in several ways.

Common code change: Pass (and run) MachineBlockFrequencyInfo from PrologEpilogInserter to orderFrameObjects(). This helps in prioritizing instructions inside loops.

The "first try", which just did one sort by U12Freq (and secondarily by DPairFreq), gave this improvement on SPEC:

  lay            :                62128                55002    -7126
  la             :               532263               533346    +1083
  lghi           :               446000               445758     -242
  ldy            :                 3416                 3234     -182
  ld             :                98097                98279     +182
  std            :                66248                66412     +164
  stdy           :                 1227                 1063     -164
  ...
  OPCDIFFS: -6333

To improve on this I worked on an iterative sorting approach like:

1. Sort all objects as if they are all in range.
2. Sort the ones within S20 offset by D12Freq, and add D12Freq objects below U12.

2b. Take extra care with an (U12) overlapping D12Freq object and see if it should remain.

3. Sort again within S20 for DPairFreq and add below U12 for shortening.
4. Now sort above S20 without D12Freq, since no short displacements are in range.
5. Finally sort the different ranges by alignment.

After all this looked nice I found to my surprise that it was only a slight improvement
to the first simple version:

  lay            :                62128                54715    -7413
  la             :               532263               533630    +1367
  ...
  OPCDIFFS: -6346

The reason became quite clear: There are no frame objects ever out of range for the long displacement, so there could not be any gain by sorting for this. 14334 (sorted) frames were within U12, and then all of the rest (411) where within S20.

I looked into the cases that actually had still improved, and it turned out that these were cases where a big D12Freq object that was not fully within U12 had been moved to let smaller objects closer to SP. In many cases then the offsets used into that big object were small (like 0) and therefore still in range, even though the object was further away from SP.

So I trimmed down my patch again to do the first initial sorting, and also if not all objects fit within U12, "pack" smaller objects below the last object partially within U12. While doing this the bigger object is still kept in range as its max offset is kept track of.

This version (as posted) now gives a slightly better result yet:

  lay            :                62128                54005    -8123
  la             :               532263               534200    +1937
  lghi           :               446000               445756     -244
  ldy            :                 3416                 3225     -191
  ld             :                98097                98288     +191
  stdy           :                 1227                 1056     -171
  std            :                66248                66419     +171
  ...
  OPCDIFFS: -6487

  Stack size differences:
  Byte diff       Count
  -72             1
  -56             1
  -48             2
  -40             16
  -32             41
  -24             96
  -16             260
   -8             1141
    0             60788
    8             2

This is the best result so far.  The code changes seen above all then come from the bigger functions/stacks since if all objects fit within U12, only the stack space savings after sorting by alignment could be the benefit.

The sorting of objects into U12/S20 ranges depends on predicting those ranges accurately. The exact size of the final stack is not easy to predict as it depends on the final order of objects and their alignment requirements. The PrologEpilogInserter pass adds objects in a descending order (starting at highest address), and adds alignments to each object on the way. However, the sorting of the frame objects naturally begins starting from the lowest address, which in this patch is 'BottomMargin'. It includes the outgoing reg save area, the MaxCallFrameSize, and an extra 8 bytes of alignment margin. In addition, each object has its size rounded up to the next multiple of its alignment as a way to simplify. This leads to all 'EstimatedStackSize':es being slightly bigger than the actual final stack adjustments in emitPrologue():

  Counts Number of bytes extra
     452 2
    2957 4
     240 6
   11057 8
       5 10
      34 12

(E.g. 11057 stacks were overestimated by 8 bytes).

This is not quite optimal, but the important thing is that it is never underestimated, since the reordering for alignments depends on all objects actually ending up being within the same range as they were in their sorted order. In other words, the most dense object may end up near the U12 range if it has a low alignment value, so it better not end up actually going out of range.

This is checked with assertions in emitPrologue() and eliminateFrameIndex(). The SystemZMachineFunctionInfo is used for these values since orderFrameObjects() is 'const'. These checks could be simplified  - for example perhaps a simple thing would be to drop the const from orderFrameObjects() and assert the EstimatedStackSize within SystemZELFFrameLowering.

Typical sizes of the objects (that are actually sorted):

  Num     Bytes
    72620 8
    14974 4
     8281 16
     5529 24
     4336 32
     1853 64
      ...

(E.g. 5529 of the sorted objects were 24 bytes big). There were many more sizes, up to 160k.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115690/new/

https://reviews.llvm.org/D115690

Files:
  llvm/include/llvm/CodeGen/TargetFrameLowering.h
  llvm/lib/CodeGen/PrologEpilogInserter.cpp
  llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
  llvm/lib/Target/AArch64/AArch64FrameLowering.h
  llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
  llvm/lib/Target/SystemZ/SystemZFrameLowering.h
  llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
  llvm/lib/Target/SystemZ/SystemZInstrInfo.h
  llvm/lib/Target/SystemZ/SystemZMachineFunctionInfo.h
  llvm/lib/Target/SystemZ/SystemZRegisterInfo.cpp
  llvm/lib/Target/X86/X86FrameLowering.cpp
  llvm/lib/Target/X86/X86FrameLowering.h
  llvm/test/CodeGen/AArch64/O0-pipeline.ll
  llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
  llvm/test/CodeGen/SystemZ/foldmemop-vec-binops.mir
  llvm/test/CodeGen/SystemZ/foldmemop-vec-cmp.mir
  llvm/test/CodeGen/SystemZ/foldmemop-vec-fusedfp.mir
  llvm/test/CodeGen/SystemZ/frame-27.mir
  llvm/test/CodeGen/SystemZ/stackmap.ll
  llvm/test/CodeGen/X86/O0-pipeline.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D115690.398776.patch
Type: text/x-patch
Size: 48677 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220111/fbb94154/attachment-0001.bin>