[PATCH] D42079: AMDGPU: Add a function attribute that shrinks buggy s_buffer opcodes on GFX9

Wed Jan 24 09:20:24 PST 2018

nhaehnle added inline comments.

================
Comment at: lib/Target/AMDGPU/SIShrinkInstructions.cpp:310-312
+      if (ST.hasBuggySBufferLoadStoreAtomicxN() &&
+          MFI->shrinkBuggySBufferLoadStoreAtomicxN() &&
+          TII->isSMRD(MI.getOpcode())) {
----------------
dstuttard wrote:
> mareko wrote:
> > arsenm wrote:
> > > mareko wrote:
> > > > arsenm wrote:
> > > > > We shouldn't be putting bug workarounds in an optimization pass. This should probably be part of the initial selection
> > > > What initial section are you talking about? Note that the placement of this pass is perfect, because the new code needs to run after SILoadStoreOptimizer.
> > > One option would be to do this in the DAG and split the intrinsics before getting selected in the first place, or we could do this in AdjustInstrPostInstrSelection or in a new bug workaround pass. 
> > > 
> > > Why does it need to be after SILoadStoreOptimizer? Is it just because it will try to merge and form these? If it's buggy it should just not do that in the first place.
> > The intrinsic is translated into s_buffer_load_dword only. There are no intrinsics for the xN opcodes - this is actually the optimal situation because we have SILoadStoreOptimizer. SILoadStoreOptimizer merges s_buffer_load_dword into x2 and x4 - this is the only place that generates the xN opcodes. This new code needs to run after that to convert s_buffer_load into s_load for xN opcodes where N >= 2.
> I'm planning to add intrinsics that will allow multi-dword s_buffer_loads as this is considerably easier for our front-end. However, I guess the placement of this work-around means that this won't matter.
> 
> I agree that the work-around being after SILoadStoreOptimizer looks like the best place, but whether that should be as a separate bug workaround pass as Matt suggests is arguable - Matt, do you think it is just cleaner to have it pulled into a separate pass?
>  
Shouldn't the SILoadStoreOptimizer just not generate the higher xN instructions on affected chips?

By the way, we may want to assume that buffers are aligned to e.g. 16 bytes and make this dependent on the alignment, e.g. if the low 4 bits of the saddr are known to be zero, we should still be able to use x4.

Repository:
  rL LLVM

https://reviews.llvm.org/D42079