[PATCH] D73127: AMDGPU/GlobalISel: Widen non-power-of-2 load results
Nicolai Hähnle via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 5 05:50:34 PST 2020
nhaehnle added a comment.
A couple of notes, in addition to the inline comment:
- On hardware that supports unaligned loads (CI+), we should just keep dword loads if the load has align=1 (no known alignment at all). This will execute much faster when the pointer happens to be aligned, and will still be faster in the unaligned case. Unaligned support should be a subtarget feature, because the Windows KMD is apparently unable to program the register setting that enables the hardware feature.
- The same may be true when align=2.
- All of this probably only applies to global (and possibly local) loads, since loads from scratch are more limited because of swizzling.
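
To make the notes above concrete, here is a minimal standalone sketch of the decision they describe. It is not LLVM code: `SubtargetSketch`, `HasUnalignedAccess`, `AddrSpace` and `keepDwordLoad` are hypothetical names invented for illustration, and the treatment of local loads is an assumption (the note only says "possibly").

```cpp
// Hypothetical address-space tag; the real backend uses numeric address spaces.
enum class AddrSpace { Global, Local, Scratch };

// Hypothetical stand-in for a subtarget feature bit that says whether the
// driver has enabled hardware unaligned access (CI+ and not blocked by the KMD).
struct SubtargetSketch {
  bool HasUnalignedAccess;
};

// Returns true if a dword load with the given known alignment (in bytes)
// should be kept as a dword load rather than split into narrower loads.
bool keepDwordLoad(const SubtargetSketch &ST, AddrSpace AS,
                   unsigned AlignInBytes) {
  // Naturally aligned dword loads are always fine.
  if (AlignInBytes >= 4)
    return true;
  // Without the (hypothetical) feature bit, under-aligned loads must be split.
  if (!ST.HasUnalignedAccess)
    return false;
  // Scratch is excluded because of swizzling. Local is left conservative here,
  // as an assumption, since the note only says it "possibly" qualifies.
  return AS == AddrSpace::Global;
}
```

With this sketch, `keepDwordLoad(SubtargetSketch{true}, AddrSpace::Global, 1)` keeps the dword load, while the same query against `AddrSpace::Scratch`, or with the feature bit cleared, does not.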
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:720-722
+ unsigned Align = Query.MMODescrs[0].AlignInBits;
+ unsigned RoundedSize = NextPowerOf2(Size);
+ return (Align >= RoundedSize);
----------------
As long as the alignment and size are both at least 32, I believe we can always support it. E.g., a dwordx4 load from a pointer that's 4-byte aligned is okay.
Though... admittedly in that case you may end up loading an additional cache line which you otherwise wouldn't load, because you cross a cache line boundary. So maybe the right decision is to keep the test as is?
Either way, the decision ought to be documented as a comment.
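
For comparison, the two candidate tests can be sketched side by side as standalone functions. This is only an illustration under stated assumptions: `nextPowerOf2Sketch`, `alignCoversRoundedSize` and `dwordAlignedIsEnough` are hypothetical names, sizes and alignments are in bits, and the rounding helper is assumed to return the next power of two strictly greater than its argument.

```cpp
#include <cstdint>

// Hypothetical stand-in for the rounding helper used in the patch: smallest
// power of two strictly greater than A (returns 0 on overflow).
uint64_t nextPowerOf2Sketch(uint64_t A) {
  uint64_t P = 1;
  while (P != 0 && P <= A)
    P <<= 1;
  return P;
}

// Test as posted: widen only if the alignment covers the rounded-up size,
// so the widened load never touches a cache line the original would not.
bool alignCoversRoundedSize(uint64_t SizeInBits, uint64_t AlignInBits) {
  return AlignInBits >= nextPowerOf2Sketch(SizeInBits);
}

// Relaxed alternative from the comment: dword alignment and at least dword
// size are enough, at the cost of possibly crossing a cache line boundary.
bool dwordAlignedIsEnough(uint64_t SizeInBits, uint64_t AlignInBits) {
  return SizeInBits >= 32 && AlignInBits >= 32;
}
```

For a 96-bit load with 32-bit alignment the two tests disagree: the posted test rejects it (32 < 128), while the relaxed one accepts it. That is exactly the case where the extra cache line may be fetched.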
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D73127/new/
https://reviews.llvm.org/D73127