[PATCH] D52944: AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap

Mon Oct 8 14:59:59 PDT 2018

mareko added inline comments.

================
Comment at: include/llvm/IR/IntrinsicsAMDGPU.td:397-399
+  // M0 = {hi16:address, lo16:waveID}. Allow passing M0 as a pointer, so that
+  // the bit packing can be optimized at the IR level.
+  [LLVMQualPointerType<llvm_i32_ty, 2>, // IntToPtr(M0)
----------------
mareko wrote:
> nhaehnle wrote:
> > mareko wrote:
> > > nhaehnle wrote:
> > > > I don't like having this as a pointer. It really isn't, so we're really kidding ourselves here.
> > > > 
> > > > This could just be an i32. Alternatively, and honestly preferably, this should be just the pointer to the honest-to-goodness address, and the waveID a separate operand. It should be possible to add the DAG nodes to build the M0 value in LowerINTRINSIC_W_CHAIN. Unless the DAG combine is unable to optimize that? (Or possibly even then, and we should fix the DAG combines...)
> > > That was what the previous version did and it produced worse code. It selected correct instructions, but we want CSE of the m0 expression across basic blocks and simplifications based on the numbers of known zero bits in the inputs. Doing it in LowerINTRINSIC_W_CHAIN is too late for those.
> > I see, that sucks. We really need better CSE at the MachineInstr level.
> > 
> > Well, the problem is that pretending that this thing is a pointer is pretty far from correct. Since it makes for a cleaner "API", I'd personally prefer to have the real pointer here, and the waveID as a separate argument, even if it means a few additional instructions.
> I wouldn't like more instructions, because I may be instruction-bound. The pointer is made by inttoptr anyway, currently even inttoptr((NULL << 16) | waveID) = inttoptr(waveID). inttoptr is most likely to be used in all cases, because GDS ordered count offsets are constant and have only 14 bits that can change. i32 for m0 would be better.
One optimization for drivers is to pass a non-zero GDS offset in a user SGPR in the high 16bits, so that the shader doesn't have to shift. Then it's just OR'd with WaveID. If the offset is 0, there is no offset to OR.

Repository:
  rL LLVM

https://reviews.llvm.org/D52944