[PATCH] D44210: [AMDGPU] Supported ds_read_b128 generation; Widened vector length for local address-space

Wed Mar 7 11:19:18 PST 2018

arsenm added a comment.

I've implemented this before: https://github.com/arsenm/llvm/tree/ds-128

This looks mostly the same. It's not clear to me it's always better to use this. I don't think this executes any faster, and at least for ds_write_b128, this has an additional constraint that the inputs must now be in a contiguous 128-bit register instead of 2 independent 64-bit pairs, which increases register pressure and may require copies. It might be better to defer forming this until later, like in the LoadStoreOptimizer pass. Jeff had a benchmark he wanted to try with this.

================
Comment at: lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp:246-248
+    if (ST->getGeneration() >= AMDGPUSubtarget::SEA_ISLANDS)
+      return 128;
+    return 64;
----------------
It might be OK to say 128 anyway. You could still do adjacent ds_read2_b64 even when not using ds_read_b128. I don't think we try to do the same trick we do with 4-byte aligned 8 byte reads for the 64-bit equivalent, but you might want to look into that.

Anything you change here would also equally apply to REGION_ADDRESS

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:5427
+    unsigned Alignment = Load->getAlignment();
+    if (Subtarget->getGeneration() >= SISubtarget::SEA_ISLANDS &&
+        isAligned16(Alignment) && MemVT.getStoreSize() == 16)
----------------
This should be hidden inside a Subtarget->hasDS128() check

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:5428
+    if (Subtarget->getGeneration() >= SISubtarget::SEA_ISLANDS &&
+        isAligned16(Alignment) && MemVT.getStoreSize() == 16)
+        return SDValue();
----------------
You don't need the isAligned16 helper. You just need to check that the alignment is >= 16, not % 16

================
Comment at: test/CodeGen/AMDGPU/ds_read2_superreg.ll:100-135
 ; CI-LABEL: {{^}}simple_read2_v4f32_superreg:
-; CI-DAG: ds_read2_b64 [[REG_ZW:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}} offset1:1{{$}}
+; CI-DAG: ds_read_b128 [[REG_ZW:v\[[0-9]+:[0-9]+\]]], v{{[0-9]+}}
 ; CI: buffer_store_dwordx4 [[REG_ZW]]
 ; CI: s_endpgm
 define amdgpu_kernel void @simple_read2_v4f32_superreg(<4 x float> addrspace(1)* %out) #0 {
   %x.i = tail call i32 @llvm.amdgcn.workitem.id.x() #1
   %arrayidx0 = getelementptr inbounds [512 x <4 x float>], [512 x <4 x float>] addrspace(3)* @lds.v4, i32 0, i32 %x.i
----------------
These tests have the unfortunate side effect of breaking what this test intended, which was the pass forming the read2. Maybe change all of these to reduce the alignment so you still get read2?

https://reviews.llvm.org/D44210