[LLVMdev] PTX backend support for atomics

Mon Nov 21 07:41:16 PST 2011

On Fri, Nov 18, 2011 at 8:52 PM, Jonathan Ragan-Kelley <jrk at csail.mit.edu> wrote:

> Looking further during down time at the dev meeting today, it actually
> seems that PTX atom.* and red.* intrinsics map extremely naturally
> onto the LLVM atomicrmw and cmpxchg instructions. The biggest issue is
> that a subset of things expressible with these LLVM instructions do
> not trivially map to PTX, and the range of things naturally supported
> depends on the features of a given target. With sufficient effort, all
> possible uses of these instructions could be emulated on all targets,
> at the cost of efficiency, but this would significantly complicate the
> codegen and probably produce steep performance cliffs.
>
> The basic model:
>
>  %d = cmpxchg {T}* %a, {T} %b, {T} %c
>  --> atom.{space of %a}.cas.{T} d, [a], b, c
>
>  %d = atomicrmw {OP} {T}* %a, {T} %b
>  --> atom.{space of %a}.{OP}.{T} d, [a], b
>  for op in { add, and, or, xor, min, max, xchg }
>
> with the special cases:
>
>  %d is unused --> red.{space of %a}.{OP}.{T} [a], b
>  (i.e. use the reduce instr, which takes no destination operand,
>  instead of the atom instr)
>
>  {OP} in {add, sub} && %b == 1 --> use PTX inc/dec op
>
> I think the right answer for the indefinite future is to map exactly
> those operations and types which trivially map to the given PTX and
> processor versions, leaving other cases as unsupported. (Even on the
> SSE and NEON backends, after all, select with a vector condition has
> barfed for years.) In the longer run, it could be quite useful for
> portability to map the full range of atomicrmw behaviors to all PTX
> targets using emulation when necessary, but relative to the current
> state of the art (manually writing different CUDA code paths with
> different sync strategies for different machine generations), only
> supporting what maps trivially is not a regression.
>
> Thoughts?
>

For the short term, I definitely agree that implementing the trivial maps
is most important.  I'm not too concerned about the corner cases at the
moment.
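
As a rough illustration of what "the trivial maps" could look like, here is
a small Python sketch of the selection logic in the quoted model.  It is not
code from the backend: the function names and the operand placeholders
(d, [a], b, c) are made up for the example, and it does no checking of which
types a given PTX version or processor actually accepts.  It assumes the
PTX spelling "exch" for the LLVM "xchg" operation, and it emits red without
a destination operand, since red does not return a value.

```python
# Hypothetical helpers for illustration; not part of the LLVM PTX backend.

# Operations from the quoted model that map directly onto PTX atom.*.
TRIVIAL_OPS = {"add", "and", "or", "xor", "min", "max", "xchg"}

# LLVM spells the exchange operation "xchg"; PTX spells it "exch".
PTX_OP_NAME = {"xchg": "exch"}


def select_ptx_atomicrmw(op, ty, space, result_used=True):
    """Return a PTX instruction string for an atomicrmw, or None if the
    operation is outside the trivially mappable set."""
    if op not in TRIVIAL_OPS:
        return None  # left unsupported rather than emulated
    ptx_op = PTX_OP_NAME.get(op, op)
    # red has no exchange form, so an unused xchg still goes through atom.
    if not result_used and op != "xchg":
        return f"red.{space}.{ptx_op}.{ty} [a], b"
    return f"atom.{space}.{ptx_op}.{ty} d, [a], b"


def select_ptx_cmpxchg(ty, space):
    """Return a PTX instruction string for a cmpxchg."""
    return f"atom.{space}.cas.{ty} d, [a], b, c"


if __name__ == "__main__":
    print(select_ptx_atomicrmw("add", "u32", "global"))
    print(select_ptx_atomicrmw("add", "u32", "shared", result_used=False))
    print(select_ptx_atomicrmw("nand", "u32", "global"))
    print(select_ptx_cmpxchg("b32", "global"))
```

Anything that comes back as None here is one of the corner cases I would
leave alone for now.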

As for emulating atomics on targets where they are not available, this is
probably just something we have to live with.  To be complete, we should
support all intrinsics on all targets and leave it up to the front-end to
determine how best to implement source-level functionality.  I'm not
particularly troubled by steep performance cliffs for emulated atomics.
It's ultimately the job of the LLVM IR generator to decide how best to map
to target intrinsics given a target.
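
To make "emulation" concrete: the usual approach is a compare-and-swap
retry loop, which is one reason the cost can be steep under contention.
The Python sketch below only models the idea.  The Cell class, its lock
(standing in for hardware atomicity) and the function names are all
assumptions for the example, and nand is used because atomicrmw can
express it while the trivial list above does not include it.

```python
import threading


class Cell:
    """Toy memory word whose only atomic primitive is cmpxchg.
    Hypothetical model; the lock stands in for hardware atomicity."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def cmpxchg(self, expected, new):
        """Store `new` if the cell holds `expected`; always return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def emulated_atomicrmw(cell, fn, operand):
    """Emulate an atomicrmw with a cmpxchg retry loop; returns the old value."""
    old = cell.load()
    while True:
        seen = cell.cmpxchg(old, fn(old, operand))
        if seen == old:
            return old  # the swap took effect
        old = seen      # another thread changed the cell; retry with its value


def nand32(a, b):
    """32-bit nand, an atomicrmw operation with no direct atom.* form."""
    return ~(a & b) & 0xFFFFFFFF


if __name__ == "__main__":
    cell = Cell(0b1100)
    print(emulated_atomicrmw(cell, nand32, 0b1010), hex(cell.load()))
```

Every failed cmpxchg costs another trip through the loop, so throughput
drops as more threads hit the same address.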

-- 

Thanks,

Justin Holewinski