[PATCH] D56274: [OPENMP][NVPTX]Fix incompatibility of __syncthreads with LLVM, NFC.

Tue May 28 12:27:14 PDT 2019

gtbercea added a comment.

In D56274#1519835 <https://reviews.llvm.org/D56274#1519835>, @arsenm wrote:

> In D56274#1519820 <https://reviews.llvm.org/D56274#1519820>, @gtbercea wrote:
>
> > In D56274#1519738 <https://reviews.llvm.org/D56274#1519738>, @ABataev wrote:
> >
> > > In D56274#1519707 <https://reviews.llvm.org/D56274#1519707>, @hfinkel wrote:
> > >
> > > > In D56274#1519687 <https://reviews.llvm.org/D56274#1519687>, @ABataev wrote:
> > > >
> > > > > In D56274#1519316 <https://reviews.llvm.org/D56274#1519316>, @hfinkel wrote:
> > > > >
> > > > > > In D56274#1519296 <https://reviews.llvm.org/D56274#1519296>, @arsenm wrote:
> > > > > >
> > > > > > > In D56274#1519286 <https://reviews.llvm.org/D56274#1519286>, @ABataev wrote:
> > > > > > >
> > > > > > > > In D56274#1519272 <https://reviews.llvm.org/D56274#1519272>, @arsenm wrote:
> > > > > > > >
> > > > > > > > > In D56274#1519176 <https://reviews.llvm.org/D56274#1519176>, @ABataev wrote:
> > > > > > > > >
> > > > > > > > > > In D56274#1519039 <https://reviews.llvm.org/D56274#1519039>, @arsenm wrote:
> > > > > > > > > >
> > > > > > > > > > > In D56274#1517067 <https://reviews.llvm.org/D56274#1517067>, @arsenm wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Is OpenMP not marking all functions as convergent?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > ping
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Marks,but some of the optimizations ignore this attribute. I don't remebet which one exactly, something like critical edge splitting.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think critical edge splitting handles convergent correctly, since it is one of the motivating examples. I just looked at a random example in test/OpenMP,  and this doesn't look correct to me:
> > > > > > > > >
> > > > > > > > > __kmpc_barrier is declared as convergent, but the callers are not:
> > > > > > > > >
> > > > > > > > >   declare void @__kmpc_barrier(%struct.ident_t*, i32) #3
> > > > > > > > >   define internal void @__omp_outlined__78(i32* noalias %.global_tid., i32* noalias %.bound_tid.) #0 {
> > > > > > > > >   attributes #0 = { noinline norecurse nounwind optnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+ptx32,+sm_20" "unsafe-fp-math"="false" "use-soft-float"="false" }
> > > > > > > > >   attributes #1 = { nounwind readnone }
> > > > > > > > >   attributes #2 = { argmemonly nounwind }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *All* functions need to be assumed convergent, not just the convergent barrier leafs.
> > > > > > > >
> > > > > > > >
> > > > > > > > The problem is not in the OpenMP code, it is in Cuda code. It appears only when we inline the runtime written in Cuda, where everything is marked correctly. For OpenMP code it is not necessary to mark all the functions as convergent, all required functions are marked by Cuda.
> > > > > > >
> > > > > > >
> > > > > > > I don't follow how this is unnecessary. This is producing an IR module with a convergent call from a non-convergent function. This is plainly broken, and the verifier should probably reject it. Any transform on the caller of these could violate the convergent rules. The IR should be semantically correct at all times regardless of what is inlined or linked
> > > > > >
> > > > > >
> > > > > > +1 to the verifier check. @jlebar , do you agree?
> > > > >
> > > > >
> > > > > If the verifier is broken, it must be fixed, of course. and kmpc_barrier too. But the problem still remains. One of the functions, at least, that calculates cost of the function in splitting edge, does not take convergent attribute into account and it leads to dangerous optimizations.
> > > >
> > > >
> > > > Is there a public test case? If not, can you share/construct one?
> > >
> > >
> > > Better to ask Doru, he tried to investigate this problem (after my patch, which is just a copy of the named barriers, asm volatile construct does not have this problem) and, if I recall it correctly, reported about this problem. But I'm not sure, to whom he reported, to LLVM or to NVidia.
> >
> >
> > I reported several problems to NVIDIA. Is the problem below the one you're referring to?
> >
> > For the following CUDA code:
> >
> >   if (threadIdx.x == 0) {
> >   // do some initialization (A)
> >   }
> >   __synchtreads();
> >   // some code (B)
> >
> >
> > when I enable optimizations I get the syncthreads being duplicated and the code hangs at runtime:
> >
> >   entry:
> >   %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
> >   %cmp.i2 = icmp eq i32 %0, 0
> >   br i1 %cmp.i2, label %if.then, label %if.end.split
> >  
> >   if.end.split:
> >   tail call void @llvm.nvvm.barrier0()#6
> >   // LLVM IR for B code block
> >   br label %if.end
> >  
> >   if.then:
> >   // LLVM IR for A code block
> >   tail call void @llvm.nvvm.barrier0()#6
> >   // LLVM IR for B code block
> >   br label %if.end
> >  
> >   if.end:
> >
>
>
> Can you post the starting IR for this?

This is the code without optimizations enabled:

If I don't enable optimizations then I get the following code which works correctly and doesn't hang:

  entry:
  %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
  %cmp.i2 = icmp eq i32 %0, 0
  br i1 %cmp.i2, label %if.then, label %if.end

  if.then:
  // LLVM IR for A code block
  br label %if.end

  if.end:
  tail call void @llvm.nvvm.barrier0() #6
  // LLVM IR for B code block

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D56274/new/

https://reviews.llvm.org/D56274