[PATCH] D12246: [NVPTX] change threading intrinsics from noduplicate to convergent

Sun Aug 30 23:40:55 PDT 2015

jingyue added a comment.

A conservative solution to this loop unrolling issue is to disable partial and runtime unrolling whenever they would move a convergent call under a new condition. Full unrolling should be fine, but I can't prove it because the constraints on duplicating a "convergent" instruction are unclear.
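
To make "under a new condition" concrete, here is a rough host-side sketch of what runtime unrolling by 2 can look like. It is only a model: `barrier()` is a hypothetical stand-in for a convergent call such as `__syncthreads`, the function names are made up for illustration, and the unrolled form is hand-written rather than actual LLVM output.

```cpp
// Hypothetical stand-in for a convergent call such as __syncthreads().
// On a GPU, threads must reach the barrier together; this host-side
// model only counts how many times it is called.
static unsigned barrier_calls = 0;
static void barrier() { ++barrier_calls; }

// Original loop: each barrier() is control-dependent only on `i < n`.
unsigned original(unsigned n) {
  barrier_calls = 0;
  for (unsigned i = 0; i < n; ++i)
    barrier();
  return barrier_calls;
}

// Hand-written shape of runtime unrolling by 2 (an assumption about the
// general form, not the exact code the unroller emits). The leftover
// iteration is peeled off in front of the unrolled loop, so that copy of
// barrier() is now also control-dependent on `n % 2 != 0`, a condition
// the original program never tested.
unsigned runtime_unrolled_by_2(unsigned n) {
  barrier_calls = 0;
  unsigned i = 0;
  if (n % 2 != 0) {  // new condition introduced by the transformation
    barrier();
    ++i;
  }
  for (; i < n; i += 2) {
    barrier();
    barrier();
  }
  return barrier_calls;
}
```

Both versions call `barrier()` exactly `n` times, so the rewrite is fine for ordinary calls. The concern is only that the peeled copy now sits behind an extra branch, which is the kind of added control dependence that "convergent" is meant to forbid.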

I suggest we close this patch, and (discuss how to) fix potentially unsafe control-flow transformations (such as loop unrolling and sinking in InstCombine) in other patches. The current way of treating `__syncthreads` as noduplicate slows down SHOC's FFT benchmark <https://github.com/vetter/shoc/tree/3bd13c7982cd8ac85bc4dfa17cbd64bb94ccf0ab/src/cuda/level1/fft> by over 2x, because the `transpose` function <https://github.com/vetter/shoc/blob/3bd13c7982cd8ac85bc4dfa17cbd64bb94ccf0ab/src/cuda/level1/fft/codelets.h#L328> is not inlined.

http://reviews.llvm.org/D12246