[PATCH] D79670: [AMDGPU] Order pos exports before param exports

Mon May 11 05:52:14 PDT 2020

critson marked 2 inline comments as done.
critson added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp:38
+
+  // Move position exports before other exports while preserving
+  // the order within different export types (pos or other).
----------------
foad wrote:
> Can you say why it's beneficial to do position exports first?
I will add a comment.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp:41
+  unsigned InsertionPoint = 0;
+  for (unsigned Idx = 0, End = Chain.size(); Idx < End; ++Idx) {
+    SUnit *SU = Chain[Idx];
----------------
foad wrote:
> It's a shame that this sorting is O(n^2) but I guess it's not a problem because the average chain length will be 2?
> 
> You could probably do this sorting in a cute way with std::partition_copy if you felt inclined: https://en.cppreference.com/w/cpp/algorithm/partition_copy
This sort is O(n): it passes through the list only once, moving elements to the top as it goes.
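
For reference, a minimal sketch of the std::partition_copy approach suggested above. This is not the code in the patch: the Export struct, isPositionExport, and orderPosFirst are hypothetical names, and the sketch assumes targets 12..15 identify position exports.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a scheduling unit; only the export target
// matters for this sketch.
struct Export {
  unsigned Target;
};

// Assumption: targets 12..15 are position exports, everything else is
// treated as "other".
static bool isPositionExport(const Export &E) {
  return E.Target >= 12 && E.Target <= 15;
}

// Returns the chain with position exports first, preserving the relative
// order within each group. Makes a single pass over the input and uses
// O(n) extra space.
std::vector<Export> orderPosFirst(const std::vector<Export> &Chain) {
  std::vector<Export> Pos, Other;
  std::partition_copy(Chain.begin(), Chain.end(), std::back_inserter(Pos),
                      std::back_inserter(Other), isPositionExport);
  Pos.insert(Pos.end(), Other.begin(), Other.end());
  return Pos;
}
```

std::stable_partition would give the same ordering in place, at the cost of possibly allocating a temporary buffer internally.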

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp:82-85
+  // Pass through DAG gathering a list of exports and removing barrier edges
+  // creating dependencies on exports. Freeing exports of successor edges
+  // allows more scheduling freedom, and nothing should be order dependent
+  // on exports.  Edges will be added later to order the exports.
----------------
foad wrote:
> Why are the barrier edges there in the first place? Either the exports can be reordered, so the barrier edges should not be there; or they can't be, so we shouldn't ignore the barrier edges!
At a high level, what is a barrier dependency on an export?
They are introduced by intrinsics that access memory, among other things.
Consider the following:
  call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float 1.0, float 1.0, float 1.0, float 1.0, i1 false, i1 false)
  call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float 1.0, float 1.0, float 1.0, float 0.5, i1 false, i1 false)
  %load = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> undef, i32 %idx, i32 0, i32 0)
  call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float %load, i1 true, i1 false)

The load forces an ordering between the exports on either side of it.  I do not think there is a hardware motivation for this.
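
To make the edge handling concrete, here is a toy model of the idea. It is only a sketch: it does not use the ScheduleDAG API, and ToyDAG, Edge, freeExports, and chainExports are hypothetical names introduced for illustration. Barrier edges leaving an export are dropped, and ordering edges are then re-added between exports in the chosen order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class EdgeKind { Data, Barrier };

struct Edge {
  std::size_t From; // predecessor node index
  std::size_t To;   // successor node index
  EdgeKind Kind;
};

// Toy DAG: nodes are indices, IsExport[i] is true when node i is an export.
struct ToyDAG {
  std::vector<bool> IsExport;
  std::vector<Edge> Edges;
};

// Drops barrier edges whose predecessor is an export, so that nothing is
// ordered after an export by a barrier alone. Returns the export nodes in
// their original order.
std::vector<std::size_t> freeExports(ToyDAG &DAG) {
  auto IsExportBarrier = [&DAG](const Edge &E) {
    return E.Kind == EdgeKind::Barrier && DAG.IsExport[E.From];
  };
  DAG.Edges.erase(
      std::remove_if(DAG.Edges.begin(), DAG.Edges.end(), IsExportBarrier),
      DAG.Edges.end());

  std::vector<std::size_t> Exports;
  for (std::size_t I = 0; I < DAG.IsExport.size(); ++I)
    if (DAG.IsExport[I])
      Exports.push_back(I);
  return Exports;
}

// Re-adds ordering by linking consecutive exports in the chosen chain.
void chainExports(ToyDAG &DAG, const std::vector<std::size_t> &Chain) {
  for (std::size_t I = 1; I < Chain.size(); ++I)
    DAG.Edges.push_back({Chain[I - 1], Chain[I], EdgeKind::Barrier});
}
```

For the IR above, with nodes 0 and 1 as the first two exports, node 2 as the load, and node 3 as the final export, the barrier edges from nodes 0 and 1 into the load are removed, the data edge from the load into node 3 is kept, and a chain such as {3, 0, 1} then places the final export ahead of the other two.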

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D79670/new/

https://reviews.llvm.org/D79670