[PATCH] D115690: [SystemZ] Implement orderFrameObjects().
Jonas Paulsson via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Jan 10 16:49:59 PST 2022
jonpa updated this revision to Diff 398776.
jonpa added a comment.
Herald added subscribers: kerbowa, nhaehnle, jvesely.
I have experimented further with the sorting and improved the patch in several ways.
Common code change: Pass (and run) MachineBlockFrequencyInfo from PrologEpilogInserter to orderFrameObjects(). This helps in prioritizing instructions inside loops.
The "first try", which just did one sort by U12Freq (and secondarily by DPairFreq), gave this improvement on SPEC:
lay : 62128 55002 -7126
la : 532263 533346 +1083
lghi : 446000 445758 -242
ldy : 3416 3234 -182
ld : 98097 98279 +182
std : 66248 66412 +164
stdy : 1227 1063 -164
...
OPCDIFFS: -6333
To improve on this I worked on an iterative sorting approach like:
1. Sort all objects as if they are all in range.
2. Sort the ones within S20 offset by D12Freq, and add D12Freq objects below U12.
2b. Take extra care with an (U12) overlapping D12Freq object and see if it should remain.
3. Sort again within S20 for DPairFreq and add below U12 for shortening.
4. Now sort above S20 without D12Freq, since no short displacements are in range.
5. Finally sort the different ranges by alignment.
After all this looked nice I found to my surprise that it was only a slight improvement
to the first simple version:
lay : 62128 54715 -7413
la : 532263 533630 +1367
...
OPCDIFFS: -6346
The reason became quite clear: There are no frame objects ever out of range for the long displacement, so there could not be any gain by sorting for this. 14334 (sorted) frames were within U12, and then all of the rest (411) where within S20.
I looked into the cases that actually had still improved, and it turned out that these were cases where a big D12Freq object that was not fully within U12 had been moved to let smaller objects closer to SP. In many cases then the offsets used into that big object were small (like 0) and therefore still in range, even though the object was further away from SP.
So I trimmed down my patch again to do the first initial sorting, and also if not all objects fit within U12, "pack" smaller objects below the last object partially within U12. While doing this the bigger object is still kept in range as its max offset is kept track of.
This version (as posted) now gives a slightly better result yet:
lay : 62128 54005 -8123
la : 532263 534200 +1937
lghi : 446000 445756 -244
ldy : 3416 3225 -191
ld : 98097 98288 +191
stdy : 1227 1056 -171
std : 66248 66419 +171
...
OPCDIFFS: -6487
Stack size differences:
Byte diff Count
-72 1
-56 1
-48 2
-40 16
-32 41
-24 96
-16 260
-8 1141
0 60788
8 2
This is the best result so far. The code changes seen above all then come from the bigger functions/stacks since if all objects fit within U12, only the stack space savings after sorting by alignment could be the benefit.
The sorting of objects into U12/S20 ranges depends on predicting those ranges accurately. The exact size of the final stack is not easy to predict as it depends on the final order of objects and their alignment requirements. The PrologEpilogInserter pass adds objects in a descending order (starting at highest address), and adds alignments to each object on the way. However, the sorting of the frame objects naturally begins starting from the lowest address, which in this patch is 'BottomMargin'. It includes the outgoing reg save area, the MaxCallFrameSize, and an extra 8 bytes of alignment margin. In addition, each object has its size rounded up to the next multiple of its alignment as a way to simplify. This leads to all 'EstimatedStackSize':es being slightly bigger than the actual final stack adjustments in emitPrologue():
Counts Number of bytes extra
452 2
2957 4
240 6
11057 8
5 10
34 12
(E.g. 11057 stacks were overestimated by 8 bytes).
This is not quite optimal, but the important thing is that it is never underestimated, since the reordering for alignments depends on all objects actually ending up being within the same range as they were in their sorted order. In other words, the most dense object may end up near the U12 range if it has a low alignment value, so it better not end up actually going out of range.
This is checked with assertions in emitPrologue() and eliminateFrameIndex(). The SystemZMachineFunctionInfo is used for these values since orderFrameObjects() is 'const'. These checks could be simplified - for example perhaps a simple thing would be to drop the const from orderFrameObjects() and assert the EstimatedStackSize within SystemZELFFrameLowering.
Typical sizes of the objects (that are actually sorted):
Num Bytes
72620 8
14974 4
8281 16
5529 24
4336 32
1853 64
...
(E.g. 5529 of the sorted objects were 24 bytes big). There were many more sizes, up to 160k.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D115690/new/
https://reviews.llvm.org/D115690
Files:
llvm/include/llvm/CodeGen/TargetFrameLowering.h
llvm/lib/CodeGen/PrologEpilogInserter.cpp
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
llvm/lib/Target/AArch64/AArch64FrameLowering.h
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
llvm/lib/Target/SystemZ/SystemZFrameLowering.h
llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
llvm/lib/Target/SystemZ/SystemZInstrInfo.h
llvm/lib/Target/SystemZ/SystemZMachineFunctionInfo.h
llvm/lib/Target/SystemZ/SystemZRegisterInfo.cpp
llvm/lib/Target/X86/X86FrameLowering.cpp
llvm/lib/Target/X86/X86FrameLowering.h
llvm/test/CodeGen/AArch64/O0-pipeline.ll
llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
llvm/test/CodeGen/SystemZ/foldmemop-vec-binops.mir
llvm/test/CodeGen/SystemZ/foldmemop-vec-cmp.mir
llvm/test/CodeGen/SystemZ/foldmemop-vec-fusedfp.mir
llvm/test/CodeGen/SystemZ/frame-27.mir
llvm/test/CodeGen/SystemZ/stackmap.ll
llvm/test/CodeGen/X86/O0-pipeline.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D115690.398776.patch
Type: text/x-patch
Size: 48677 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220111/fbb94154/attachment-0001.bin>
More information about the llvm-commits
mailing list