[PATCH] D42079: AMDGPU: Add a function attribute that shrinks buggy s_buffer opcodes on GFX9

Matt Arsenault via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Thu Feb 1 09:20:27 PST 2018


arsenm requested changes to this revision.
arsenm added inline comments.
This revision now requires changes to proceed.


================
Comment at: lib/Target/AMDGPU/SIShrinkInstructions.cpp:310-312
+      if (ST.hasBuggySBufferLoadStoreAtomicxN() &&
+          MFI->shrinkBuggySBufferLoadStoreAtomicxN() &&
+          TII->isSMRD(MI.getOpcode())) {
----------------
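
For reference, the quoted guard sits in the per-instruction loop of SIShrinkInstructions. A minimal sketch of the surrounding shape follows; only the three-line condition comes from the patch itself, while the loop scaffolding and the rewrite helper are illustrative assumptions:

    // Sketch: inside SIShrinkInstructions::runOnMachineFunction(MachineFunction &MF).
    // ST, TII and MFI are the usual per-function objects in this pass.
    const SISubtarget &ST = MF.getSubtarget<SISubtarget>();
    const SIInstrInfo *TII = ST.getInstrInfo();
    SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();

    for (MachineBasicBlock &MBB : MF) {
      for (MachineInstr &MI : MBB) {
        // The bug is present on this subtarget, the function opted in via the
        // new attribute, and this is a scalar memory (SMRD) instruction.
        if (ST.hasBuggySBufferLoadStoreAtomicxN() &&
            MFI->shrinkBuggySBufferLoadStoreAtomicxN() &&
            TII->isSMRD(MI.getOpcode())) {
          // Hypothetical helper: rewrite s_buffer_load_dwordxN (N >= 2) in
          // place so the buggy encoding is never emitted.
          rewriteBuggySBufferXN(MI, *TII);
        }
      }
    }
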
mareko wrote:
> nhaehnle wrote:
> > mareko wrote:
> > > nhaehnle wrote:
> > > > mareko wrote:
> > > > > nhaehnle wrote:
> > > > > > dstuttard wrote:
> > > > > > > mareko wrote:
> > > > > > > > arsenm wrote:
> > > > > > > > > mareko wrote:
> > > > > > > > > > arsenm wrote:
> > > > > > > > > > > We shouldn't be putting bug workarounds in an optimization pass. This should probably be part of the initial selection.
> > > > > > > > > > What initial selection are you talking about? Note that the placement of this pass is perfect, because the new code needs to run after SILoadStoreOptimizer.
> > > > > > > > > One option would be to do this in the DAG and split the intrinsics before they are selected in the first place, or we could do it in AdjustInstrPostInstrSelection or in a new bug-workaround pass.
> > > > > > > > > 
> > > > > > > > > Why does it need to be after SILoadStoreOptimizer? Is it just because it will try to merge and form these? If it's buggy it should just not do that in the first place.
> > > > > > > > The intrinsic is translated into s_buffer_load_dword only. There are no intrinsics for the xN opcodes - this is actually the optimal situation, because we have SILoadStoreOptimizer. SILoadStoreOptimizer merges s_buffer_load_dword into x2 and x4 - it is the only place that generates the xN opcodes. This new code needs to run after it in order to convert s_buffer_load into s_load for the xN opcodes where N >= 2.
> > > > > > > I'm planning to add intrinsics that will allow multi-dword s_buffer_loads as this is considerably easier for our front-end. However, I guess the placement of this work-around means that this won't matter.
> > > > > > > 
> > > > > > > I agree that after SILoadStoreOptimizer looks like the best place for the work-around, but whether it should live in a separate bug-workaround pass, as Matt suggests, is arguable - Matt, do you think it is just cleaner to have it pulled into a separate pass?
> > > > > > Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?
> > > > > > 
> > > > > > By the way, we may want to assume that buffers are aligned to, say, 16 bytes and make this dependent on the alignment: if the low 4 bits of the saddr are known to be zero, we should still be able to use x4.
> > > > > > Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?
> > > > > 
> > > > > That's the current behavior, and it increases code size by 14.4%.
> > > > > 
> > > > > Loads are only guaranteed to be aligned to 4 bytes.
> > > > > Loads are only guaranteed to be aligned to 4 bytes.
> > > > 
> > > > We could easily increase the required alignment on UBOs in radeonsi, let's say to 16 or even 64 bytes. I believe some other drivers set it to 256 bytes, which is the maximum allowed by the OpenGL spec. Then checking alignment for constant offsets becomes trivial, and for non-constant offsets we could still use computeKnownBits.
> > > The problem is, even if we set the required UBO alignment to 16, there is no guarantee that s_buffer_load_dwordx4 will be aligned to 16, because s_buffer_load_dwordx4 is constructed from multiple s_buffer_load_dword instructions that load consecutive dwords, and that doesn't mean the first one is aligned to 16. Of course, we could enforce that the literal offset be a multiple of 16, but I think that would decrease the number of opportunities for load-opcode merging.
> > Yes, that's precisely what I meant. It provides fewer opportunities, but it could be enabled unconditionally, without a special flag, so it's still a useful trade-off.
> I'm a little concerned about app compatibility if we increase the alignment. 256 bytes was for ancient hardware.
This cannot be in this optimization pass. This must be done somewhere else.
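
To make the alignment idea from the thread above concrete, here is a sketch of the kind of check nhaehnle is suggesting, written against the SelectionDAG known-bits API; the function name and the x4/16-byte pairing are illustrative, not part of this patch:

    // Is this s_buffer_load offset provably 16-byte aligned, so that a dwordx4
    // access cannot be misaligned? Constant offsets are a simple mask test;
    // non-constant offsets fall back to known-bits analysis.
    static bool isOffsetDwordX4Aligned(SDValue Offset, SelectionDAG &DAG) {
      if (auto *C = dyn_cast<ConstantSDNode>(Offset))
        return (C->getZExtValue() & 0xf) == 0;

      // Non-constant offset: usable only if the low 4 bits are known zero.
      KnownBits Known;
      DAG.computeKnownBits(Offset, Known);
      return Known.countMinTrailingZeros() >= 4;
    }

For non-constant offsets this only fires if the driver actually raises the base alignment of UBOs, which is exactly the trade-off debated above.
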


Repository:
  rL LLVM

https://reviews.llvm.org/D42079
