[PATCH] D79670: [AMDGPU] Order pos exports before param exports

Mon May 11 07:29:20 PDT 2020

foad added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp:41
+  unsigned InsertionPoint = 0;
+  for (unsigned Idx = 0, End = Chain.size(); Idx < End; ++Idx) {
+    SUnit *SU = Chain[Idx];
----------------
critson wrote:
> foad wrote:
> > It's a shame that this sorting is O(n^2) but I guess it's not a problem because the average chain length will be 2?
> > 
> > You could probably do this sorting in a cute way with std::partition_copy if you felt inclined: https://en.cppreference.com/w/cpp/algorithm/partition_copy
> This sort is O(n), it passes through the list only once moving elements to the top as it goes.
Yeah but each move is done with an erase and an insert, which both have to move the all the remaining items in the vector.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp:82-85
+  // Pass through DAG gathering a list of exports and removing barrier edges
+  // creating dependencies on exports. Freeing exports of successor edges
+  // allows more scheduling freedom, and nothing should be order dependent
+  // on exports.  Edges will be added later to order the exports.
----------------
critson wrote:
> foad wrote:
> > Why are the barrier edges there in the first place? Either the exports can be reordered, so the barrier edges should not be there; or they can't be, so we shouldn't ignore the barrier edges!
> At a high level, what is a barrier dependency on an export?
> They get introduced because of intrinsics which access memory, etc.
> Consider the following:
>   call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float 1.0, float 1.0, float 1.0, float 1.0, i1 false, i1 false)
>   call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float 1.0, float 1.0, float 1.0, float 0.5, i1 false, i1 false)
>   %load = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> undef, i32 %idx, i32 0, i32 0)
>   call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float %load, i1 true, i1 false)
> 
> The load forces ordering between the exports either side of it.  I do not think there is a hardware motivation for this?
EXP and EXP_DONE instructions have mayStore=1 and it seems reasonable that the scheduler doesn't have any information about what they're storing or where, so it adds barriers to stop them from being reordered. In particular they satisfy `isGlobalMemoryObject` in ScheduleDAGInstrs.cpp.

So I think it's fine to remove the existing barriers and add your own, if you know more about it than the scheduler :-)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79670/new/

https://reviews.llvm.org/D79670