[PATCH] D70210: [MirNamer][Canonicalizer]: Perform instruction semantic based renaming

Thu Nov 14 18:09:21 PST 2019

aditya_nandakumar marked 10 inline comments as done.
aditya_nandakumar added inline comments.

================
Comment at: llvm/lib/CodeGen/MIRCanonicalizerPass.cpp:441
-  }
-
   Changed |= doDefKillClear(MBB);
----------------
plotfi wrote:
> Is there anything specifically that is needed from MIRCanonicalizerPass Given this deletion here? Can you use MIRNamerPass as your base instead and do this deletion in a later NFC commit if necessary? 
The change affects both. This also creates a bunch of new vreg names and changes the names with no benefit - the names have been already created with many useful properties and this undoes that.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.cpp:45

-      MachineOperand &MO = candidate->getOperand(i);
-      if (!(MO.isReg() && Register::isVirtualRegister(MO.getReg())))
-        continue;
+  std::map<std::string, unsigned> VRegNameCollisionMap;

----------------
plotfi wrote:
> Might want to consider a StringMap
Mostly followed the style which is using std containers in the rest of this file.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.cpp:71

-    // Here we walk the root candidates. We start from the 0th operand because
-    // the root is normally a store to a vreg.
-    for (unsigned i = 0; i < candidate->getNumOperands(); i++) {
-
-      if (!candidate->mayStore() && !candidate->isBranch())
-        break;
-
-      MachineOperand &MO = candidate->getOperand(i);
-
-      // TODO: Do we want to only add vregs here?
-      if (!MO.isReg() && !MO.isFI())
-        continue;
-
-      LLVM_DEBUG(dbgs() << "Enqueue Reg/FI"; MO.dump(); dbgs() << "\n";);
-
-      RegQueue.push(MO.isReg() ? TypedVReg(MO.getReg())
-                               : TypedVReg(RSE_FrameIndex));
+std::string NamedVRegCursor::getInstructionOpcodeHash(MachineInstr &MI) {
+  std::string S;
----------------
plotfi wrote:
> Comments please. 
Comments in the header.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.cpp:81
+      return Register::isVirtualRegister(MO.getReg())
+                 ? MRI.getVRegDef(MO.getReg())->getOpcode()
+                 : (unsigned)MO.getReg();
----------------
plotfi wrote:
> plotfi wrote:
> > I think the candidate walk code can be left alone here, and removed in an NFC if really necessary.
> How do you guarantee that two different vregs don't resolve to the same "HashOperand" value? Because can't the opcode of two different def instructions happen to be the same?  Is it just highly unlikely because you are also hashing the operands of the entire instruction together? 
Consider this sequence.
On one file you have

```
%0 = COPY $x0
%1 = LDIMM 25
%2 = ADD %0, %1
%3 = SUB %2, 42
```
On the other side you have
```
%0 = COPY $x1
%1 = LDIMM 26
%2 = ADD %0, %1
%3 = SUB %2, 42
```
Clearly the add is identical on both sides and is not that interesting - essentially it's an add from a LDIMM and a COPY. It's possible that those values might be different, but as far as the ADD is concerned, it's sufficient to say that for both sides ADD is doing something similar and is not the source of a diff. Add's VReg name will show identical on both cases, but sources will appear different (due to different names). However each of %0 and %1 would differ in how they are named and will show up as the diff. Sub will finally look identical. In general we want to capture the diff in the very first places they happen and not beyond that. 
After renaming LHS would be,
```
%hash1 = COPY $x0
%hash2 = LDIMM 25
%hash3 = ADD %hash1, %hash2
%hash7 = SUB %hash3, 0
```
RHS would be
```
%hash4 = COPY $x1
%hash5 = LDIMM 26
%hash3 = ADD %hash4, %hash5
%hash7 = SUB %hash3, 0
```
Now when you diff those, the ADD is not the one that's different, but the first two instructions are. So they will show up nicely in the diff, but the ADD will appear identical with respect to destination and opcode, but will differ from sources. The SUB in both will be identical and will disappear from the context of the diff as it's not interesting. Now even if you have additional instruction somewhere in the instruction stream in either the LHS or the RHS side, the naming of this won't change and will point out the differences nicely.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.cpp:81
-
-void doCandidateWalk(std::vector<TypedVReg> &VRegs,
-                     std::queue<TypedVReg> &RegQueue,
----------------
aditya_nandakumar wrote:
> plotfi wrote:
> > plotfi wrote:
> > > I think the candidate walk code can be left alone here, and removed in an NFC if really necessary.
> > How do you guarantee that two different vregs don't resolve to the same "HashOperand" value? Because can't the opcode of two different def instructions happen to be the same?  Is it just highly unlikely because you are also hashing the operands of the entire instruction together? 
> Consider this sequence.
> On one file you have
> 
> ```
> %0 = COPY $x0
> %1 = LDIMM 25
> %2 = ADD %0, %1
> %3 = SUB %2, 42
> ```
> On the other side you have
> ```
> %0 = COPY $x1
> %1 = LDIMM 26
> %2 = ADD %0, %1
> %3 = SUB %2, 42
> ```
> Clearly the add is identical on both sides and is not that interesting - essentially it's an add from a LDIMM and a COPY. It's possible that those values might be different, but as far as the ADD is concerned, it's sufficient to say that for both sides ADD is doing something similar and is not the source of a diff. Add's VReg name will show identical on both cases, but sources will appear different (due to different names). However each of %0 and %1 would differ in how they are named and will show up as the diff. Sub will finally look identical. In general we want to capture the diff in the very first places they happen and not beyond that. 
> After renaming LHS would be,
> ```
> %hash1 = COPY $x0
> %hash2 = LDIMM 25
> %hash3 = ADD %hash1, %hash2
> %hash7 = SUB %hash3, 0
> ```
> RHS would be
> ```
> %hash4 = COPY $x1
> %hash5 = LDIMM 26
> %hash3 = ADD %hash4, %hash5
> %hash7 = SUB %hash3, 0
> ```
> Now when you diff those, the ADD is not the one that's different, but the first two instructions are. So they will show up nicely in the diff, but the ADD will appear identical with respect to destination and opcode, but will differ from sources. The SUB in both will be identical and will disappear from the context of the diff as it's not interesting. Now even if you have additional instruction somewhere in the instruction stream in either the LHS or the RHS side, the naming of this won't change and will point out the differences nicely.
I removed this as there will be no more users of this method. 

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.cpp:88
+    // okay.
+    return 0; // TODO.
+  };
----------------
plotfi wrote:
> llvm_unreachable please. 
That unfortunately won't work as we are not explicitly handling all variants of MachineOperand Kinds here. There are some challenges that can be solved (bb naming scheme) when dealing with PHIS ie how to hash them without using the name . There maybe other cases where we want to differentiate special operands. Right now we uniquely differentiate immediate and regs, but we can't really assert on any of the others.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.cpp:105
+bool NamedVRegCursor::renameInstsInMBB(MachineBasicBlock *MBB) {
+  std::vector<TypedVReg> VRegs;
+  std::string Prefix = "bb" + std::to_string(getCurrentBBNumber()) + "_";
----------------
plotfi wrote:
> I like this linear approach but I might like to keep the tree based approach as well as a toggle until we can add more tests. In the tree based approach I was trying to do the canonicalization based on the chain of operations that flow into a side effect, where here the side effects are renaming barriers? On second thought, I really like the hashing approach on the VReg-Def opcode and if you are confident it wont result in too may cases where a difference that should have remained is lost, I'd be fine with replacing all of this walking business. 
> 
> Comments (and perhaps some brief MIR snippets) on how this renaming mechanism works would be really nice to have as well.
> 
> From what I understand, you have many test cases downstream. Can these be ported to aarch64 to bolster the testing upstream? Even the tests with 7 and 8 instructions can be useful, and I'd assume shouldn't be too difficult to port to a supported downstream target? Does this sound reasonable to you @aditya_nandakumar @bogner ??
In general hash collisions are resolved similarly on both sides of a diff and the hash collision renaming will also happen similarly on both sides of the diff (just incrementing numbers). This will line up really well for diffs. Additionally with the hashing method, just when you're staring at two equivalent pieces of code, just by looking at the reg name that is a hash, you can just assume that they're likely equivalent instructions and move your focus elsewhere.
In general, due to the disadvantages of the previous algorithm and the advantages of this approach, there should be no need to keep both approaches.
Regarding tests, there's sufficient coverage. The core algorithm is simple - hash instruction oeprands and rename based on the hash(which capture the semantics of the instruction). It's evident that this works correctly (look at MIR/AArch64/mirCanonIdempotent.mir included in this patch where two of the same MOV instructions are renamed correctly)
```
# CHECK-NEXT: %bb0_42274__1:gpr32 = MOVi32imm 408
# CHECK-NEXT: %bb0_42274__2:gpr32 = MOVi32imm 408
```
The only thing adding more tests would help with is we tie that up some diffing tool and make sure that the core strategy still works. Otherwise, it will be quite identical to the ones we already have here. I'll still try to come up with some screenshots of the two approaches and attach it to the review.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.h:35
 class NamedVRegCursor {
+  // TypedVReg and VRType are used to tell the renamer what to do at points in a
+  // sequence of values to be renamed. A TypedVReg can either contain
----------------
plotfi wrote:
> Did this need to me moved from MIRVRegNamerUtils.cpp to MIRVRegNamerUtils.h? Can he be done in a NFC commit? On top of that, this appears to be a copy paste from https://reviews.llvm.org/D70029. Would you like to work with me on massaging these changes along D70029 into place here? 
I initially used that as the base but forgot to remove them as it's not needed due to the new algorithm. As this abstraction is not really needed any more, I'll go ahead and remove this.
For the new algorithm the only thing we need is RSE_Reg and if we have just one variant, there's no need for an enum.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.h:68

-  /// virtualVRegNumber - Book keeping of the last vreg position.
-  unsigned virtualVRegNumber;
+  unsigned CurrentBBNumber = 0;

----------------
plotfi wrote:
> Another copy paste from D70029. CurrentBBNumber is never incremented or really used for anything of import here. Why is it included in this diff? 
It's used for the renaming of the regs - Each instruction is named as "bb<BlockNum>_hash_<collision_counter>". The BlockNum is obtained by just saying getCurrentBlockNumber().
It's incremented in the top level renameRegs for each block that you visit.

================
Comment at: llvm/lib/CodeGen/MIRVRegNamerUtils.h:92

-  /// renameVRegs - For a given MachineBasicBlock, scan for side-effecting
-  /// instructions, walk the def-use from each side-effecting root (in sorted
-  /// root order) and rename the encountered vregs in the def-use graph in a
-  /// canonical ordering. This method maintains book keeping for which vregs
-  /// were already renamed in RenamedInOtherBB.
-  // @return changed
-  bool renameVRegs(MachineBasicBlock *MBB);
+  unsigned createVirtualRegisterWithName(unsigned VReg,
+                                         const std::string &Name);
----------------
plotfi wrote:
> Comment please. It wait for D70029 to solidify and land.
There's not much that's taken from D70029 besides this method name and the CurrentBBNumber. Because of the differences in how instructions are named and the algorithm changes, there's no need to wait for that patch. In fact, I just realized that I can simplify this a little more by removing some of the abstractions that we don't need any more.

I missed the comment on this one. Thanks for catching it.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70210/new/

https://reviews.llvm.org/D70210