[PATCH] D123524: [AMDGCN] Split unaligned 3 DWORD DS operations

Mon Apr 11 10:43:26 PDT 2022

rampitec created this revision.
rampitec added reviewers: arsenm, foad.
Herald added subscribers: hsmhsm, kerbowa, hiraditya, nhaehnle, jvesely.
Herald added a project: All.
rampitec requested review of this revision.
Herald added a subscriber: wdng.
Herald added a project: LLVM.

I have written a minitest to check the performance. Overall,
the benefit of using b96 operations on data which is not known
to be aligned but happens to be aligned is small, while the
performance hit of using b96 operations on memory that is
really unaligned is high.

The only exception is data that is not aligned even by 4: in
this case it is better to use b96.
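
The rule described above can be written down as a small decision function. This is only a sketch of one reading of the description, not the code in the patch: the name `choose_ds96_lowering`, the string results, and the exact thresholds (16 bytes for "known aligned", 4 bytes for DWORD alignment) are assumptions made for illustration, and it assumes a target on which unaligned DS access is allowed at all.

```python
def choose_ds96_lowering(known_align_bytes: int) -> str:
    """Pick a lowering for a 12-byte (3 DWORD) DS access.

    Hypothetical helper, not an LLVM API. `known_align_bytes` is the
    alignment the compiler can prove, not the alignment the data
    happens to have at run time.
    """
    if known_align_bytes <= 0 or known_align_bytes & (known_align_bytes - 1):
        raise ValueError("alignment must be a positive power of two")
    if known_align_bytes >= 16:
        # Known to be aligned: a single b96 operation is the fast path.
        return "b96"
    if known_align_bytes >= 4:
        # DWORD aligned, but not known to be 16-byte aligned: split,
        # because b96 is expensive if the data is really unaligned.
        return "split"
    # Not aligned even by 4: the split forms are slower than b96.
    return "b96"
```

For example, `choose_ds96_lowering(8)` returns `"split"`, while `choose_ds96_lowering(1)` and `choose_ds96_lowering(16)` both return `"b96"`.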

Here is the test output on Vega (gfx900) and Navi (gfx1030):

  Using platform: AMD Accelerated Parallel Processing
  Using device: gfx900:xnack-

  ds_write_b96                                  aligned: 3.4 sec
  ds_write_b32 + ds_write_b64                   aligned: 4.5 sec
  ds_write_b32 * 3                              aligned: 4.8 sec
  ds_write_b96                          misaligned by 1: 4.8 sec
  ds_write_b32 + ds_write_b64           misaligned by 1: 7.2 sec
  ds_write_b32 * 3                      misaligned by 1: 10.0 sec
  ds_write_b96                          misaligned by 2: 4.8 sec
  ds_write_b32 + ds_write_b64           misaligned by 2: 7.2 sec
  ds_write_b32 * 3                      misaligned by 2: 10.1 sec
  ds_write_b96                          misaligned by 4: 4.8 sec
  ds_write_b32 + ds_write_b64           misaligned by 4: 4.2 sec
  ds_write_b32 * 3                      misaligned by 4: 4.9 sec
  ds_write_b96                          misaligned by 8: 4.8 sec
  ds_write_b32 + ds_write_b64           misaligned by 8: 4.6 sec
  ds_write_b32 * 3                      misaligned by 8: 4.9 sec
  ds_read_b96                                   aligned: 3.3 sec
  ds_read_b32 + ds_read_b64                     aligned: 4.9 sec
  ds_read_b32 * 3                               aligned: 2.6 sec
  ds_read_b96                           misaligned by 1: 4.1 sec
  ds_read_b32 + ds_read_b64             misaligned by 1: 7.2 sec
  ds_read_b32 * 3                       misaligned by 1: 10.1 sec
  ds_read_b96                           misaligned by 2: 4.1 sec
  ds_read_b32 + ds_read_b64             misaligned by 2: 7.2 sec
  ds_read_b32 * 3                       misaligned by 2: 10.1 sec
  ds_read_b96                           misaligned by 4: 4.1 sec
  ds_read_b32 + ds_read_b64             misaligned by 4: 2.6 sec
  ds_read_b32 * 3                       misaligned by 4: 2.6 sec
  ds_read_b96                           misaligned by 8: 4.1 sec
  ds_read_b32 + ds_read_b64             misaligned by 8: 4.9 sec
  ds_read_b32 * 3                       misaligned by 8: 2.6 sec

  Using platform: AMD Accelerated Parallel Processing
  Using device: gfx1030

  ds_write_b96                                  aligned: 4.1 sec
  ds_write_b32 + ds_write_b64                   aligned: 13.0 sec
  ds_write_b32 * 3                              aligned: 4.5 sec
  ds_write_b96                          misaligned by 1: 12.5 sec
  ds_write_b32 + ds_write_b64           misaligned by 1: 22.0 sec
  ds_write_b32 * 3                      misaligned by 1: 31.5 sec
  ds_write_b96                          misaligned by 2: 12.4 sec
  ds_write_b32 + ds_write_b64           misaligned by 2: 22.0 sec
  ds_write_b32 * 3                      misaligned by 2: 31.5 sec
  ds_write_b96                          misaligned by 4: 12.4 sec
  ds_write_b32 + ds_write_b64           misaligned by 4: 4.0 sec
  ds_write_b32 * 3                      misaligned by 4: 4.5 sec
  ds_write_b96                          misaligned by 8: 12.4 sec
  ds_write_b32 + ds_write_b64           misaligned by 8: 13.0 sec
  ds_write_b32 * 3                      misaligned by 8: 4.5 sec
  ds_read_b96                                   aligned: 3.8 sec
  ds_read_b32 + ds_read_b64                     aligned: 12.8 sec
  ds_read_b32 * 3                               aligned: 4.4 sec
  ds_read_b96                           misaligned by 1: 10.9 sec
  ds_read_b32 + ds_read_b64             misaligned by 1: 21.8 sec
  ds_read_b32 * 3                       misaligned by 1: 31.5 sec
  ds_read_b96                           misaligned by 2: 10.9 sec
  ds_read_b32 + ds_read_b64             misaligned by 2: 21.9 sec
  ds_read_b32 * 3                       misaligned by 2: 31.5 sec
  ds_read_b96                           misaligned by 4: 10.9 sec
  ds_read_b32 + ds_read_b64             misaligned by 4: 3.8 sec
  ds_read_b32 * 3                       misaligned by 4: 4.5 sec
  ds_read_b96                           misaligned by 8: 10.9 sec
  ds_read_b32 + ds_read_b64             misaligned by 8: 12.8 sec
  ds_read_b32 * 3                       misaligned by 8: 4.5 sec
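
To make the size of the effect concrete, the ratios can be computed from a few of the rows above. The sketch below copies six gfx1030 ds_write timings from the output; the dictionary layout and the helper name `slowdown` are assumptions made for illustration and are not part of the minitest.

```python
# Times in seconds, copied from the gfx1030 ds_write rows above.
GFX1030_WRITE = {
    ("b96", "aligned"): 4.1,
    ("b32*3", "aligned"): 4.5,
    ("b96", "misaligned by 4"): 12.4,
    ("b32*3", "misaligned by 4"): 4.5,
    ("b96", "misaligned by 1"): 12.5,
    ("b32*3", "misaligned by 1"): 31.5,
}


def slowdown(times, op, baseline, case):
    """Return how many times longer `op` takes than `baseline` for `case`."""
    return times[(op, case)] / times[(baseline, case)]


if __name__ == "__main__":
    for case in ("aligned", "misaligned by 4", "misaligned by 1"):
        ratio = slowdown(GFX1030_WRITE, "b32*3", "b96", case)
        print(f"{case}: b32*3 takes {ratio:.2f}x the time of b96")
```

With these numbers, three b32 writes take about 10% longer than b96 when the data is in fact aligned, b96 takes roughly 2.8 times as long as three b32 writes in the "misaligned by 4" case, and three b32 writes take about 2.5 times as long as b96 in the "misaligned by 1" case.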

Fixes: SWDEV-330802

https://reviews.llvm.org/D123524

Files:
  llvm/lib/Target/AMDGPU/DSInstructions.td
  llvm/lib/Target/AMDGPU/SIISelLowering.cpp
  llvm/test/CodeGen/AMDGPU/ds-alignment.ll
  llvm/test/CodeGen/AMDGPU/lds-misaligned-bug.ll
