[llvm-dev] NVPTX - Reordering load instructions
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Jun 21 10:48:04 PDT 2018
On 06/21/2018 12:18 PM, Tim Besard via llvm-dev wrote:
> Hi all,
>
> I'm looking into the performance difference of a benchmark compiled with
> NVCC vs NVPTX (coming from Julia, not CUDA C) and I'm seeing a
> significant difference due to PTX instruction ordering. The relevant
> source code consists of two nested loops that get fully unrolled, doing
> some basic arithmetic with values loaded from shared memory:
>
>> #define BLOCK_SIZE 16
>>
>> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
>> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
>>
>> int idx = threadIdx.x - BLOCK_SIZE;
>> for (int i = 0; i < BLOCK_SIZE; i++) {
>>     for (int j = 0; j < i; j++)
>>         peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
>>     peri_col[idx][i] /= dia[i][i];
>> }
> NVCC emits PTX instructions where all loads from shared memory are
> packed together:
>
>> ...
>> ld.shared.f32 %f546, [kernel_dia+440];
>> ld.shared.f32 %f545, [%r4+-996];
>> ld.shared.f32 %f544, [kernel_dia+56];
>> ld.shared.f32 %f543, [kernel_dia+88];
>> ld.shared.f32 %f542, [kernel_dia+500];
>> ld.shared.f32 %f541, [kernel_dia+84];
>> ld.shared.f32 %f540, [%r4+-972];
>> ld.shared.f32 %f539, [%r4+-1008];
>> ld.shared.f32 %f538, [kernel_dia+496];
>> ld.shared.f32 %f537, [kernel_dia+136];
>> ld.shared.f32 %f536, [%r4+-976];
>> ld.shared.f32 %f535, [kernel_dia+428];
>> ... # hundreds of these
> Even though this heavily bloats register usage (and NVCC seems to do
> this unconditionally, even with launch configurations where this could
> hurt performance), it allows the CUDA PTX JIT to emit 128-bit loads:
>
>> LDS.128 R76, [0x2f0];
>> LDS.128 R60, [0xa0];
>> LDS.128 R72, [0x130];
>> LDS.128 R96, [0x1b0];
>> LDS.128 R92, [0x30];
>> LDS.128 R116, [0x50];
>> LDS.128 R108, [0x1f0];
> LLVM preserves the operations more or less as they are emitted by the
> front-end, interleaving memory operations with arithmetic. As a result,
> the SASS code contains many more 32-bit loads, which lowers performance
> by ~10% on this specific benchmark.
>
> What would be the best approach to improve the generated code? I can imagine
> a late IR pass shuffling instructions around, but I figured I'd ask to
> see if this is a good approach and whether there's existing work doing
> similar transformations.
You could write a custom pass, either late IR or MI. You might also be able
to use the existing instruction-scheduling infrastructure: implement a
ScheduleDAGMutation that does the clustering you'd like, or, if the existing
ones do what you want, use those. We already have
createLoadClusterDAGMutation and createStoreClusterDAGMutation functions. If
you look at AMDGPU/AMDGPUTargetMachine.cpp, you'll see calls like this:

  DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));

and I think you probably want to do the same.
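An untested sketch of what that could look like for NVPTX, mirroring the
AMDGPU code (NVPTX does not override this hook today, so the
NVPTXPassConfig::createMachineScheduler override below is just an
illustration):

  #include "llvm/CodeGen/MachineScheduler.h"

  // Sketch: declare this override in NVPTXPassConfig (NVPTXTargetMachine.cpp)
  // so the generic machine scheduler picks up the clustering mutations.
  ScheduleDAGInstrs *
  NVPTXPassConfig::createMachineScheduler(MachineSchedContext *C) const {
    ScheduleDAGMILive *DAG = createGenericSchedLive(C);
    // Cluster loads (and stores) with nearby addresses so they are scheduled
    // back-to-back instead of being interleaved with arithmetic.
    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
    DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
    return DAG;
  }

Note that the clustering mutation only does something if the target's
TargetInstrInfo implements the relevant hooks (getMemOpBaseRegImmOfs and
shouldClusterMemOps, if I remember correctly), so NVPTX would need to
provide those as well.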
Also, you might want to override the subtarget's useAA() method to return
true, as this gives the scheduler more freedom to move memory accesses
around and do this kind of clustering.
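For reference, that override is a one-liner (sketch, in NVPTXSubtarget.h):

  // Let the scheduler query alias analysis when building the scheduling DAG.
  bool useAA() const override { return true; }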
-Hal
>
> Thanks,
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory