[llvm-dev] NVPTX - Reordering load instructions

Thu Jun 21 10:18:16 PDT 2018

Hi all,

I'm looking into the performance difference of a benchmark compiled with
NVCC vs NVPTX (coming from Julia, not CUDA C) and I'm seeing a
significant difference due to PTX instruction ordering. The relevant
source code consists of two nested loops that get fully unrolled, doing
some basic arithmetic with values loaded from shared memory:

> #define BLOCK_SIZE 16
>
> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
>
> int idx = threadIdx.x - BLOCK_SIZE;
> for (int i = 0; i < BLOCK_SIZE; i++) {
>  for (int j = 0; j < i; j++)
>      peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
>  peri_col[idx][i] /= dia[i][i];
> }

NVCC emits PTX instructions where all loads from shared memory are
packed together:

> ...
> ld.shared.f32   %f546, [kernel_dia+440];
> ld.shared.f32   %f545, [%r4+-996];
> ld.shared.f32   %f544, [kernel_dia+56];
> ld.shared.f32   %f543, [kernel_dia+88];
> ld.shared.f32   %f542, [kernel_dia+500];
> ld.shared.f32   %f541, [kernel_dia+84];
> ld.shared.f32   %f540, [%r4+-972];
> ld.shared.f32   %f539, [%r4+-1008];
> ld.shared.f32   %f538, [kernel_dia+496];
> ld.shared.f32   %f537, [kernel_dia+136];
> ld.shared.f32   %f536, [%r4+-976];
> ld.shared.f32   %f535, [kernel_dia+428];
> ... # hundreds of these

Even though this heavily bloats register usage (and NVCC seems to do
this unconditionally, even with launch configurations where this could
hurt performance), it allows the CUDA PTX JIT to emit 128-bit loads:

> LDS.128 R76, [0x2f0];
> LDS.128 R60, [0xa0];
> LDS.128 R72, [0x130];
> LDS.128 R96, [0x1b0];
> LDS.128 R92, [0x30];
> LDS.128 R116, [0x50];
> LDS.128 R108, [0x1f0];

LLVM preserves the operations more or less as they are emitted by the
front-end, interleaving memory operations with arithmetic. As a result,
the SASS code contains many more 32-bit loads, which lowers performance
by ~10% on this specific benchmark.

What would be the best approach to improve generated code? I can imagine
a late IR pass shuffling instructions around, but I figured I'd ask to
see if this is a good approach and whether there's existing work doing
similar transformations.

Thanks,
-- 
Tim Besard
Computer Systems Lab
Department of Electronics & Information Systems
Ghent University