[llvm-dev] NVPTX - Reordering load instructions
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Jun 21 12:49:18 PDT 2018
On 06/21/2018 02:32 PM, Justin Lebar wrote:
> We already have a pass that vectorizes loads and stores in nvptx and
> amdgpu. I'm not at my laptop so I forget the exact filename, but it's
> called the load-store vectorizer.
It's here: lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
I agree that, if LLVM can do this explicitly, it seems better.
-Hal
>
> I think the question is, why is LSV not vectorizing this code?
>
> I think the answer is, llvm can't tell that the loads are aligned.
> Ptxas can, but only because it's (apparently) doing vectorization
> *after* it resolves the shmem variables to physical addresses. That
> is a cool trick, and llvm can't do it, because llvm never sees the
> physical shmem addresses.
>
> If you told llvm that the shmem variables were aligned to 16 bytes,
> LSV might do what you want here. llvm and ptxas should be able to
> cooperate to give you the alignment you ask for in the IR.
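>
> For example, something like this in the CUDA version of the kernel
> (just a sketch; the Julia front-end would instead need to put the
> equivalent `align 16` on the shared-memory globals it emits in IR):
>
>     // Hypothetical change: request 16-byte alignment explicitly, so
>     // LSV can prove that four consecutive floats form one aligned
>     // 16-byte access.
>     __shared__ __align__(16) float dia[BLOCK_SIZE][BLOCK_SIZE];
>     __shared__ __align__(16) float peri_col[BLOCK_SIZE][BLOCK_SIZE];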
>
> If that doesn't work I'd recommend trying to debug the LSV code to see
> why it's not vectorizing. You can run `opt -load-store-vectorizer
> -debug` -- or, my favorite way of doing it is to run that command
> under rr (http://rr-project.org).
>
> It's possible that clang should opportunistically mark all shmem
> variables over a certain size as align(16) so that this happens
> automagically. That would kind of be a weird heuristic, but maybe it
> makes sense. I don't think it would make sense for LLVM itself to do
> that, though, so it wouldn't help you.
>
> I think relying on LSV to do its job is better than messing with the
> instruction order because the former is more powerful -- it can
> vectorize in cases where ptxas would have a much harder time.
>
> Justin
>
> On Thu, Jun 21, 2018, 7:48 AM Hal Finkel via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> On 06/21/2018 12:18 PM, Tim Besard via llvm-dev wrote:
> > Hi all,
> >
> > I'm looking into the performance difference of a benchmark compiled
> > with NVCC vs NVPTX (coming from Julia, not CUDA C) and I'm seeing a
> > significant difference due to PTX instruction ordering. The relevant
> > source code consists of two nested loops that get fully unrolled,
> > doing some basic arithmetic with values loaded from shared memory:
> >
> >> #define BLOCK_SIZE 16
> >>
> >> __shared__ float dia[BLOCK_SIZE][BLOCK_SIZE];
> >> __shared__ float peri_col[BLOCK_SIZE][BLOCK_SIZE];
> >>
> >> int idx = threadIdx.x - BLOCK_SIZE;
> >> for (int i = 0; i < BLOCK_SIZE; i++) {
> >>     for (int j = 0; j < i; j++)
> >>         peri_col[idx][i] -= peri_col[idx][j] * dia[j][i];
> >>     peri_col[idx][i] /= dia[i][i];
> >> }
> > NVCC emits PTX instructions where all loads from shared memory are
> > packed together:
> >
> >> ...
> >> ld.shared.f32 %f546, [kernel_dia+440];
> >> ld.shared.f32 %f545, [%r4+-996];
> >> ld.shared.f32 %f544, [kernel_dia+56];
> >> ld.shared.f32 %f543, [kernel_dia+88];
> >> ld.shared.f32 %f542, [kernel_dia+500];
> >> ld.shared.f32 %f541, [kernel_dia+84];
> >> ld.shared.f32 %f540, [%r4+-972];
> >> ld.shared.f32 %f539, [%r4+-1008];
> >> ld.shared.f32 %f538, [kernel_dia+496];
> >> ld.shared.f32 %f537, [kernel_dia+136];
> >> ld.shared.f32 %f536, [%r4+-976];
> >> ld.shared.f32 %f535, [kernel_dia+428];
> >> ... # hundreds of these
> > Even though this heavily bloats register usage (and NVCC seems to do
> > this unconditionally, even with launch configurations where this
> > could hurt performance), it allows the CUDA PTX JIT to emit 128-bit
> > loads:
> >
> >> LDS.128 R76, [0x2f0];
> >> LDS.128 R60, [0xa0];
> >> LDS.128 R72, [0x130];
> >> LDS.128 R96, [0x1b0];
> >> LDS.128 R92, [0x30];
> >> LDS.128 R116, [0x50];
> >> LDS.128 R108, [0x1f0];
> > LLVM preserves the operations more or less as they are emitted by
> > the front-end, interleaving memory operations with arithmetic. As a
> > result, the SASS code contains many more 32-bit loads, which lowers
> > performance by ~10% on this specific benchmark.
> >
> > What would be the best approach to improve the generated code? I can
> > imagine a late IR pass shuffling instructions around, but I figured
> > I'd ask to see if this is a good approach and whether there's
> > existing work doing similar transformations.
>
> You could make a custom pass, late IR or MI. You might also be able
> to use the existing instruction-scheduling infrastructure: implement a
> ScheduleDAGMutation that does the clustering you'd like, or, if the
> existing ones do what you want, use those. We have preexisting
> createLoadClusterDAGMutation and createStoreClusterDAGMutation
> functions. If you look at AMDGPU/AMDGPUTargetMachine.cpp, you'll see
> calls like this:
>
>     DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
>
> and I think that you probably want to do the same.
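>
> Roughly something like this in NVPTXTargetMachine.cpp (an untested
> sketch modeled on the AMDGPU code, reusing the existing NVPTXPassConfig
> class; createGenericSchedLive and the two clustering mutations come
> from llvm/CodeGen/MachineScheduler.h):
>
>     #include "llvm/CodeGen/MachineScheduler.h"
>
>     // Attach load/store clustering mutations to the generic MI
>     // scheduler, as AMDGPU does. The override also has to be declared
>     // in NVPTXPassConfig, and the MI scheduler only runs when the
>     // subtarget's enableMachineScheduler() returns true.
>     ScheduleDAGInstrs *
>     NVPTXPassConfig::createMachineScheduler(MachineSchedContext *C) const {
>       ScheduleDAGMILive *DAG = createGenericSchedLive(C);
>       DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
>       DAG->addMutation(createStoreClusterDAGMutation(DAG->TII, DAG->TRI));
>       return DAG;
>     }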
>
> Also, you might want to override the subtarget's useAA() method to
> return true (as this gives the scheduler more freedom to move memory
> accesses around to do this kind of clustering).
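>
> That's a one-line override in the subtarget (again just a sketch,
> assuming it goes in NVPTXSubtarget.h):
>
>     // Report that the scheduler may use alias analysis results, so it
>     // can move independent memory accesses past each other.
>     bool useAA() const override { return true; }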
>
> -Hal
>
> >
> > Thanks,
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory