[PATCH] D28915: [ExecutionDepsFix] Optimize instruction insertion

Thu Feb 9 07:13:08 PST 2017

myatsina added a comment.

First, I'm sorry for the late review :)

Second, I have a general comment:
I know that all the things you've put in this patch are aimed at improving the number of xors and the registers we pick as undefs, but it is much easier to review, check necessity of each change and make sure the feature is properly tested when you break down your commits to small changes/patches.
Removing redundant xors when you have several instructions that use same undef reg is one issue
isDependencyBreak support is another issue
Trying to add only true dependency to the undef regs set is another issue
Trying to choose a better register is another issue (and I would expect it to somehow make "pickBestReg" smarter and not rewrite everything at the end).

Most of these changes where not part of the original patch you've uploaded in your previous patch version that combined the function calls, so I'm wondering how they got themselves into this patch now? Do we really need them? Have you seen performance improvements by these changes too measuring each one separately (even if they look good "on paper" they might not bring the expected result, they might even make it worse)?

I'm sorry I'm being very strict about this again, but I want to make sure we do each such change the best possible way and keep them separate (with dedicated tests of course), so please break your changes as much as possible, upload each change you want in a different review with a different test case.
I would also think about the general approach of trying to rewrite the register again - how does it work together with "pickBestRegister"? Does it undo the work of that function? Or did we miss some cases there in the first place? With this issue i would start from the beginning  - what problem are we trying to solve and what is the best way to solve it?

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:635
+// Set the undef read register to `Reg` for all UndefReads in the range
+// [from,to).
+void ExeDepsFix::collapseUndefReads(unsigned from, unsigned to, unsigned Reg) {
----------------
Maybe it's better to use iterator_range for this purpose?
You can see regIndices example in this file.

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:649
+
+unsigned ExeDepsFix::updateChooseableRegs(SparseSet<unsigned> &ChoosableRegs,
+                                          const TargetRegisterClass *OpRC,
----------------
Please document this function and what's its purpose

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:656
+    if (LiveRegSet.contains(Reg))
+      ChoosableRegs.erase(Reg);
+    else if (add) {
----------------
formatting issue - please surround the if with {} and thus be consistent with the style of the other conditions in this if- if - else if-else

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:661
+        LowestValid = Reg;
+    } else if (ChoosableRegs.count(Reg) == 1) {
+      if (LowestValid == (unsigned)-1)
----------------
why do you need this case for?

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:666
+  }
+  return LowestValid;
+}
----------------
why do we choose lowest register? why not choose a "far away" register (this way we may even be able to save the xor)?

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:690
+  // before every instruction above. However, what we really want is
+  //    vxorps %xmm3, %xmm3, %xmm3
+  //    vrandom %xmm3<undef>, %xmm0<def>
----------------
Shouldn't "pickBestRegister" catch these cases? And if not, maybe we've missed an optimization case there.
I'm not sure we should have this in this form.
What is the scenario you're trying to optimizer and that "pickBestRegister" doesn't catch?

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:711
+  size_t i, LastInit;
+  i = LastInit = UndefReads.size() - 1;
+  MachineInstr *UndefMI = std::get<0>(UndefReads[i]);
----------------
Please try to avoid defining variable's initial values this way. It not very readable.
How about:
size_t LastInit = ...;
size_t i = LastInit;

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:722
+    for (auto LiveReg : LiveRegSet)
+      ChoosableRegs.erase(LiveReg);
+
----------------
Aren't you checking  the liverange set in updateChooseableRegs? Why do it here too?

================
Comment at: lib/CodeGen/ExecutionDepsFix.cpp:729
+          TII->getRegClass(UndefMI->getDesc(), OpIdx, TRI, *MF);
+      if (OpRC != LastOpRC || ChoosableRegs.size() == 0) {
+        if (LastInit != i) {
----------------
Please document the meaning of this code, and the next section as well. What are each of these sections doing?
It's really hard for to follow this.
I read the documentation that you've added earlier on of what is the purpose of your transformation, but I don't really understand how you iterate over the undefs and achieve this, why you've split it to several sections and the cases each section handles.

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:7501
+  unsigned Opc = MI.getOpcode();
+  if (!(Opc == X86::VXORPSrr || Opc == X86::VXORPDrr || Opc == X86::XORPSrr ||
+        Opc == X86::XORPDrr))
----------------
There are additional instructions like KXORBrr, KXORwrr, KXORDrr, KXORQrr(AVX512).
Others might be added in the future, so perhaps we should set the right infrastructure for that.

swithc (op):
default:
 return false;
case XORPS:
case XORPS:
...
<check xor instruction>

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:7504
+    return false;
+  unsigned Reg = 0;
+  for (unsigned i = 0; i < MI.getNumOperands(); ++i) {
----------------
you can use X86::NoRegister instead

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:7505
+  unsigned Reg = 0;
+  for (unsigned i = 0; i < MI.getNumOperands(); ++i) {
+    const MachineOperand &MO = MI.getOperand(i);
----------------
You can iterate directly over operands

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:7508
+    if (!MO.isReg() || (Reg != 0 && MO.getReg() != Reg))
+      return false;
+    Reg = MO.getReg();
----------------
I think this whole section of could would be a bit clearer if you write it in this spirit:

for ()

================
Comment at: lib/Target/X86/X86InstrInfo.cpp:7511
+  }
+  if (OutReg)
+    *OutReg = Reg;
----------------
I see that you write a lot of functions so that they could return 2 arguments and that you've used the same "trick" in other functions as well.
I don't thing it's a good approach in general. This need of a function to return 2 values generally indicates that you need to extract the common logic somehow or define a new type or do some other form of refactoring refactoring.
In this case I think "isDependencyBreak" should only return true/false (as the name indicates). If the module asking wants to know which register it is then it can find the dest reg on its own, or we can provide the module a "getDependencyBreakReg". In general it will behave similar to the already existing "isReg()" and "getReg()".

================
Comment at: test/CodeGen/X86/break-false-dep.ll:236
 top:
-  tail call void asm sideeffect "", "~{xmm6},~{dirflag},~{fpsr},~{flags}"()
-  tail call void asm sideeffect "", "~{xmm0},~{xmm1},~{xmm2},~{xmm3},~{dirflag},~{fpsr},~{flags}"()
-  tail call void asm sideeffect "", "~{xmm4},~{xmm5},~{xmm7},~{dirflag},~{fpsr},~{flags}"()
+;AVX-LABEL:@clearence
+; This is carefully constructed to force LLVM to materialize a vxorps, which
----------------
Each feature should have it's own test. You are not supposed to eliminate the original test (unless the original test's logic is no longer true), you're suppose to a small unit test for each feature.
This test was originally designed to check "pickBestRegister".

================
Comment at: test/CodeGen/X86/break-false-dep.ll:237
+;AVX-LABEL:@clearence
+; This is carefully constructed to force LLVM to materialize a vxorps, which
+; also implicitly breaks the dependency, making it a good candidate for the
----------------
This test is solely for testing the xor combining you've added? Or is it suppose to test something else?
I already see you've added a dedicated test for the "xor", so I'm not sure why you've changed this one.

If it's only for xor combine then you can do a very simple test like this:

define double @clearence(i64 %arg1, i64 %arg2) {
// the inline asm calls that make xmm6 as the best choice
%tmp1 = sitofp i64 %arg1 to double
%tmp2 = sitofp i64 %arg2 to double
%tmp3 = add i64 %tmp1, %tmp2
  ret double %tmp3
}

In this case xmm6 should be chosen as the undef for both convert instructions.
and there should only be one xor

================
Comment at: test/CodeGen/X86/break-false-dep.ll:242
+;AVX: vucomisd  [[XMM1]], %xmm0
+  %0 = fcmp ult double %x, 0.0
+  br i1 %0, label %main, label %fake
----------------
why did you add this part?
What is the logic that is tests?

================
Comment at: test/CodeGen/X86/break-false-dep.ll:367
+  ret double %fadd3
+}
----------------
where are the tests for all the other optimizations you've added?
The "isBreakingDependency", "register re-picking", etc?

================
Comment at: test/CodeGen/X86/vec_int_to_fp.ll:1668
 ; VEX-NEXT:  # BB#7:
-; VEX-NEXT:    vcvtsi2ssq %rax, %xmm2, %xmm1
+; VEX-NEXT:    vcvtsi2ssq %rax, %xmm1, %xmm1
 ; VEX-NEXT:  .LBB39_8:
----------------
Why is using xmm1 ok?
If we fall through from the revius BB where xmm1 is xored, then it's ok but we might jump to this block from another place and if we use xmm1 and won't have a xor, then we will have a stall, no?

https://reviews.llvm.org/D28915