[llvm-dev] NVPTX - Reordering load instructions
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Jun 21 12:49:18 PDT 2018
On 06/21/2018 02:32 PM, Justin Lebar wrote:
> We already have a pass that vectorizes loads and stores in nvptx and
> amdgpu. I'm not at my laptop so I forget the exact filename, but it's
> called the load-store vectorizer.
It's here: lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
I agree that, if LLVM can do this explicitly, it seems better.
-Hal
>
> I think the question is, why is LSV not vectorizing this code?
>
> I think the answer is, llvm can't tell that the loads are aligned.
> Ptxas can, but only because it's (apparently) doing vectorization
> *after* it resolves the shmem variables to physical addresses. That
> is a cool trick, and llvm can't do it, because llvm never sees the
> physical shmem addresses.
>
> If you told llvm that the shmem variables were aligned to 16 bytes,
> LSV might do what you want here. llvm and ptxas should be able to
> cooperate to give you the alignment you ask for in the IR.
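>
> For example, something like this in the CUDA version of the kernel
> (just a sketch; the Julia front-end would instead need to put the
> equivalent `align 16` on the shared-memory globals it emits in IR):
>
>     // Hypothetical change: request 16-byte alignment explicitly, so
>     // LSV can prove that four consecutive floats form one aligned
>     // 16-byte access.
>     __shared__ __align__(16) float dia[BLOCK_SIZE][BLOCK_SIZE];
>     __shared__ __align__(16) float peri_col[BLOCK_SIZE][BLOCK_SIZE];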
>
> If that doesn't work I'd recommend trying to debug the LSV code to see
> why it's not vectorizing. You can run `opt -load-store-vectorizer
> -debug` -- or, my favorite way of doing it is to run that command
> under rr (http://rr-project.org).
>
> It's possible that clang should opportunistically mark all shmem
> variables over a certain size as align(16) so that this happens
> automagically. That would kind of be a weird heuristic, but maybe it
> makes sense. I don't think it would make sense for LLVM itself to do
> that, though, so it wouldn't help you.
>
> I think relying on LSV to do its job is better than messing with the
> instruction order because the former is more powerful -- it can
> vectorize in cases where ptxas would have a much harder time.
>
> Justin
>
> On Thu, Jun 21, 2018, 7:48 AM Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> On 06/21/2018 12:18 PM, Tim Besard via llvm-dev wrote:
> > Hi all,
> >
> > I'm looking into the performance difference of a benchmark compiled
> > with NVCC vs NVPTX (coming from Julia, not CUDA C) and I'm seeing a
> > significant difference due to PTX instruction ordering. The relevant
> > source code consists of two nested loops that get fully unrolled,
> > doing some basic arithmetic with values loaded from shared memory:
> >
> >> #define BLOCK_SIZE 16
> >>
> >> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> >> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
> >>
> >> int idx = threadIdx.x - BLOCK_SIZE;
> >> for (int i = 0; i < BLOCK_SIZE; i++) {
> >>     for (int j = 0; j < i; j++)
> >>         peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
> >>     peri_col[idx][i] /= dia[i][i];
> >> }
> > NVCC emits PTX instructions where all loads from shared memory are
> > packed together:
> >
> >> ...
> >> ld.shared.f32 %f546, [kernel_dia+440];
> >> ld.shared.f32 %f545, [%r4+-996];
> >> ld.shared.f32 %f544, [kernel_dia+56];
> >> ld.shared.f32 %f543, [kernel_dia+88];
> >> ld.shared.f32 %f542, [kernel_dia+500];
> >> ld.shared.f32 %f541, [kernel_dia+84];
> >> ld.shared.f32 %f540, [%r4+-972];
> >> ld.shared.f32 %f539, [%r4+-1008];
> >> ld.shared.f32 %f538, [kernel_dia+496];
> >> ld.shared.f32 %f537, [kernel_dia+136];
> >> ld.shared.f32 %f536, [%r4+-976];
> >> ld.shared.f32 %f535, [kernel_dia+428];
> >> ... # hundreds of these
> > Even though this heavily bloats register usage (and NVCC seems to do
> > this unconditionally, even with launch configurations where this
> > could hurt performance), it allows the CUDA PTX JIT to emit 128-bit
> > loads:
> >
> >> LDS.128 R76, [0x2f0];
> >> LDS.128 R60, [0xa0];
> >> LDS.128 R72, [0x130];
> >> LDS.128 R96, [0x1b0];
> >> LDS.128 R92, [0x30];
> >> LDS.128 R116, [0x50];
> >> LDS.128 R108, [0x1f0];
> > LLVM preserves the operations more or less as they are emitted by
> > the front-end, interleaving memory operations with arithmetic. As a
> > result, the SASS code contains many more 32-bit loads, which lowers
> > performance by ~10% on this specific benchmark.
> >
> > What would be the best approach to improve the generated code? I can
> > imagine a late IR pass shuffling instructions around, but I figured
> > I'd ask to see if this is a good approach and whether there's
> > existing work doing similar transformations.
>
> You could make a custom pass, late IR or MI. You might also be able
> to use the existing instruction-scheduling infrastructure: implement a
> ScheduleDAGMutation that does the clustering you'd like, or, if the
> existing ones do what you want, use those. We have preexisting
> createLoadClusterDAGMutation and createStoreClusterDAGMutation
> functions. If you look at AMDGPU/AMDGPUTargetMachine.cpp, you'll see
> calls like this:
>
>     DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
>
> and I think that you probably want to do the same.
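>
> Roughly something like this in NVPTXTargetMachine.cpp (an untested
> sketch modeled on the AMDGPU code, reusing the existing NVPTXPassConfig
> class; createGenericSchedLive and the two clustering mutations come
> from llvm/CodeGen/MachineScheduler.h):
>
>     #include "llvm/CodeGen/MachineScheduler.h"
>
>     // Attach load/store clustering mutations to the generic MI
>     // scheduler, as AMDGPU does. The override also has to be declared
>     // in NVPTXPassConfig, and the MI scheduler only runs when the
>     // subtarget's enableMachineScheduler() returns true.
>     ScheduleDAGInstrs *
>     NVPTXPassConfig::createMachineScheduler(MachineSchedContext *C) const {
>       ScheduleDAGMILive *DAG = createGenericSchedLive(C);
>       DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
>       DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
>       return DAG;
>     }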
>
> Also, you might want to override the subtarget's useAA() method to
> return true (as this gives the scheduler more freedom to move memory
> accesses around to do this kind of clustering).
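>
> That's a one-line override in the subtarget (again just a sketch,
> assuming it goes in NVPTXSubtarget.h):
>
>     // Report that the scheduler may use alias analysis results, so it
>     // can move independent memory accesses past each other.
>     bool useAA() const override { return true; }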
>
> -Hal
>
> >
> > Thanks,
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory