[PATCH] D73127: AMDGPU/GlobalISel: Widen non-power-of-2 load results

Wed Feb 5 05:50:34 PST 2020

nhaehnle added a comment.

A couple of notes, in addition to the inline comment:

- On hardware that supports unaligned loads (CI+), we should just keep dword loads if the load has align=1 (no known alignment at all). This will execute much faster when the pointer happens to be aligned, and will still be faster than splitting the load in the unaligned case. Unaligned access support should be a subtarget feature, because Windows KMD is apparently unable to set the relevant register setting to enable the hardware feature.
- The same may be true when align=2.
- All of this probably only applies to global (and possibly local) loads, because scratch loads have more limitations due to swizzling.
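
To make the notes above concrete, here is a minimal sketch of the decision they describe. It is not the actual AMDGPU code: `AddrSpace`, `SubtargetSketch`, `HasUnalignedAccess` and `keepDwordLoad` are hypothetical names invented for illustration, and treating align=2 and local loads the same as the align=1 global case is an assumption (the notes only say this "may" or "possibly" holds).

```cpp
// Hypothetical stand-ins; these are not the real AMDGPU enums or subtarget API.
enum class AddrSpace { Global, Local, Scratch };

struct SubtargetSketch {
  // Assumed subtarget feature: set only where the hardware supports unaligned
  // access AND the driver is able to enable it.
  bool HasUnalignedAccess;
};

// Returns true if a 32-bit load with the given known alignment (in bytes)
// should be kept as a single dword load instead of being split.
constexpr bool keepDwordLoad(SubtargetSketch ST, AddrSpace AS,
                             unsigned AlignBytes) {
  // A naturally aligned dword load is always kept. An under-aligned one
  // (align=1, and by assumption align=2) is kept only if unaligned access is
  // available and the load is not from scratch, where swizzling adds limits.
  return AlignBytes >= 4 ||
         (ST.HasUnalignedAccess && AS != AddrSpace::Scratch);
}

// Compile-time usage examples.
static_assert(keepDwordLoad(SubtargetSketch{true}, AddrSpace::Global, 1),
              "align=1 global load is kept when unaligned access is enabled");
static_assert(!keepDwordLoad(SubtargetSketch{true}, AddrSpace::Scratch, 1),
              "align=1 scratch load is still split");
```

Making the flag a property of the subtarget rather than of the hardware generation is what lets a configuration where the driver cannot enable the feature fall back to splitting.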

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:720-722
+    unsigned Align = Query.MMODescrs[0].AlignInBits;
+    unsigned RoundedSize = NextPowerOf2(Size);
+    return (Align >= RoundedSize);
----------------
As long as the alignment and size are both at least 32 bits, I believe we can always support it. E.g., a dwordx4 load from a pointer that's 4-byte aligned is okay.

Though... admittedly in that case you may end up loading an additional cache line which you otherwise wouldn't load, because you cross a cache line boundary. So maybe the right decision is to keep the test as is?

Either way, the decision ought to be documented as a comment.
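
For reference, the two candidate tests and the cache line concern can be sketched side by side. This is only an illustration under stated assumptions: `roundUpToPow2`, `strictTest`, `relaxedTest` and `crossesCacheLine` are hypothetical helpers rather than LLVM functions, `roundUpToPow2` is a local stand-in for the rounding done in the patch, and the 64-byte line size is an assumed value.

```cpp
#include <cstdint>

// Local stand-in for the rounding in the patch: smallest power of two >= V.
constexpr unsigned roundUpToPow2(unsigned V, unsigned P = 1) {
  return P >= V ? P : roundUpToPow2(V, P * 2);
}

// Test as written in the patch: the alignment must cover the rounded-up size.
constexpr bool strictTest(unsigned SizeInBits, unsigned AlignInBits) {
  return AlignInBits >= roundUpToPow2(SizeInBits);
}

// Relaxed alternative from the comment: dword size and dword alignment suffice.
constexpr bool relaxedTest(unsigned SizeInBits, unsigned AlignInBits) {
  return SizeInBits >= 32 && AlignInBits >= 32;
}

// True if a load of SizeBytes starting at Addr touches more than one cache
// line. LineBytes = 64 is an assumption, not a value taken from the review.
constexpr bool crossesCacheLine(std::uint64_t Addr, unsigned SizeBytes,
                                unsigned LineBytes = 64) {
  return Addr / LineBytes != (Addr + SizeBytes - 1) / LineBytes;
}

// A 96-bit load with 32-bit alignment: the two tests disagree.
static_assert(!strictTest(96, 32), "strict test rejects widening");
static_assert(relaxedTest(96, 32), "relaxed test allows widening");

// At byte address 52 the original 12-byte load stays within one line, but the
// widened 16-byte load spills into the next one.
static_assert(!crossesCacheLine(52, 12), "96-bit load fits in one line");
static_assert(crossesCacheLine(52, 16), "128-bit load crosses the boundary");
```

The last two assertions show the trade-off in question: the relaxed test accepts the widened load, at the cost of possibly touching a second cache line.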

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73127/new/

https://reviews.llvm.org/D73127