[PATCH] D123956: [AMDGPU] Refine 64 bit misaligned LDS ops selection

Mon Apr 18 12:35:25 PDT 2022

rampitec created this revision.
rampitec added reviewers: arsenm, foad.
Herald added subscribers: hsmhsm, kerbowa, hiraditya, t-tye, tpr, dstuttard, yaxunl, nhaehnle, jvesely, kzhuravl.
Herald added a project: All.
rampitec requested review of this revision.
Herald added a subscriber: wdng.
Herald added a project: LLVM.

Here is the performance data:

  Using platform: AMD Accelerated Parallel Processing
  Using device: gfx900:xnack-

  ds_write_b64                       aligned by  8:  3.2 sec
  ds_write2_b32                      aligned by  8:  3.2 sec
  ds_write_b16 * 4                   aligned by  8:  7.0 sec
  ds_write_b8 * 8                    aligned by  8: 13.2 sec
  ds_write_b64                       aligned by  1:  7.3 sec
  ds_write2_b32                      aligned by  1:  7.5 sec
  ds_write_b16 * 4                   aligned by  1: 14.0 sec
  ds_write_b8 * 8                    aligned by  1: 13.2 sec
  ds_write_b64                       aligned by  2:  7.3 sec
  ds_write2_b32                      aligned by  2:  7.5 sec
  ds_write_b16 * 4                   aligned by  2:  7.1 sec
  ds_write_b8 * 8                    aligned by  2: 13.3 sec
  ds_write_b64                       aligned by  4:  4.6 sec
  ds_write2_b32                      aligned by  4:  3.2 sec
  ds_write_b16 * 4                   aligned by  4:  7.1 sec
  ds_write_b8 * 8                    aligned by  4: 13.3 sec
  ds_read_b64                        aligned by  8:  2.3 sec
  ds_read2_b32                       aligned by  8:  2.2 sec
  ds_read_u16 * 4                    aligned by  8:  4.8 sec
  ds_read_u8 * 8                     aligned by  8:  8.6 sec
  ds_read_b64                        aligned by  1:  4.4 sec
  ds_read2_b32                       aligned by  1:  7.3 sec
  ds_read_u16 * 4                    aligned by  1: 14.0 sec
  ds_read_u8 * 8                     aligned by  1:  8.7 sec
  ds_read_b64                        aligned by  2:  4.4 sec
  ds_read2_b32                       aligned by  2:  7.3 sec
  ds_read_u16 * 4                    aligned by  2:  4.8 sec
  ds_read_u8 * 8                     aligned by  2:  8.7 sec
  ds_read_b64                        aligned by  4:  4.4 sec
  ds_read2_b32                       aligned by  4:  2.3 sec
  ds_read_u16 * 4                    aligned by  4:  4.8 sec
  ds_read_u8 * 8                     aligned by  4:  8.7 sec

  Using platform: AMD Accelerated Parallel Processing
  Using device: gfx1030

  ds_write_b64                       aligned by  8:  4.4 sec
  ds_write2_b32                      aligned by  8:  4.3 sec
  ds_write_b16 * 4                   aligned by  8:  7.9 sec
  ds_write_b8 * 8                    aligned by  8: 13.0 sec
  ds_write_b64                       aligned by  1: 23.2 sec
  ds_write2_b32                      aligned by  1: 23.1 sec
  ds_write_b16 * 4                   aligned by  1: 44.0 sec
  ds_write_b8 * 8                    aligned by  1: 13.0 sec
  ds_write_b64                       aligned by  2: 23.2 sec
  ds_write2_b32                      aligned by  2: 23.1 sec
  ds_write_b16 * 4                   aligned by  2:  7.9 sec
  ds_write_b8 * 8                    aligned by  2: 13.1 sec
  ds_write_b64                       aligned by  4: 13.5 sec
  ds_write2_b32                      aligned by  4:  4.3 sec
  ds_write_b16 * 4                   aligned by  4:  7.9 sec
  ds_write_b8 * 8                    aligned by  4: 13.1 sec
  ds_read_b64                        aligned by  8:  3.5 sec
  ds_read2_b32                       aligned by  8:  3.4 sec
  ds_read_u16 * 4                    aligned by  8:  5.3 sec
  ds_read_u8 * 8                     aligned by  8:  8.5 sec
  ds_read_b64                        aligned by  1: 13.1 sec
  ds_read2_b32                       aligned by  1: 22.7 sec
  ds_read_u16 * 4                    aligned by  1: 43.9 sec
  ds_read_u8 * 8                     aligned by  1:  7.9 sec
  ds_read_b64                        aligned by  2: 13.1 sec
  ds_read2_b32                       aligned by  2: 22.7 sec
  ds_read_u16 * 4                    aligned by  2:  5.6 sec
  ds_read_u8 * 8                     aligned by  2:  7.9 sec
  ds_read_b64                        aligned by  4: 13.1 sec
  ds_read2_b32                       aligned by  4:  3.4 sec
  ds_read_u16 * 4                    aligned by  4:  5.6 sec
  ds_read_u8 * 8                     aligned by  4:  7.9 sec

GFX10 shows a different pattern for sub-dword load/store performance
than GFX9. On GFX9 a single unaligned load or store is faster than a
fully split b8 access, whereas on GFX10 even a full split is faster.
However, this gain is only theoretical, because splitting an access
down to sub-dword pieces requires more registers and extra
packing/unpacking logic. Ignoring that option, it is better to use a
single 64-bit instruction on misaligned data, except for 4-byte
aligned data, where ds_read2_b32/ds_write2_b32 is better.
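
The selection rule described above can be sketched as a small standalone
function. This is only an illustration of the heuristic, not the code in
the patch: the enum, the function name and the byte-alignment parameter
are all hypothetical, and the real change lives in the selection patterns
and lowering code listed under Files below.

```cpp
// Hypothetical names for illustration; not identifiers from the patch.
enum class Lds64Op {
  SingleB64, // ds_read_b64 / ds_write_b64
  PairedB32, // ds_read2_b32 / ds_write2_b32
};

// Picks the instruction form for a 64-bit LDS access from its known
// alignment in bytes (assumed to be a power of two). Splitting into
// b16/b8 pieces is deliberately not considered, as argued above.
constexpr Lds64Op pickLds64Op(unsigned AlignBytes) {
  // Exactly 4-byte aligned: in the measurements the paired 32-bit form
  // ran as fast as the fully aligned case, while the single 64-bit form
  // was slower.
  if (AlignBytes == 4)
    return Lds64Op::PairedB32;
  // 8-byte aligned: both forms measured about the same, so keep one op.
  // 1- or 2-byte aligned: the single 64-bit form was never slower than
  // the paired form, and faster for reads.
  return Lds64Op::SingleB64;
}
```

For example, pickLds64Op(4) yields Lds64Op::PairedB32, while
pickLds64Op(1), pickLds64Op(2) and pickLds64Op(8) all yield
Lds64Op::SingleB64.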

https://reviews.llvm.org/D123956

Files:
  llvm/lib/Target/AMDGPU/DSInstructions.td
  llvm/lib/Target/AMDGPU/SIISelLowering.cpp
  llvm/test/CodeGen/AMDGPU/ds-alignment.ll
  llvm/test/CodeGen/AMDGPU/ds_read2.ll
  llvm/test/CodeGen/AMDGPU/ds_write2.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D123956.423452.patch
Type: text/x-patch
Size: 7002 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220418/0a0e56f1/attachment.bin>