[PATCH] [SDAG] Make the DAGCombine worklist not grow endlessly due to duplicate insertions.

Tue Jul 22 15:46:19 PDT 2014

I've made the requested minor changes. Let me know if you'd like a fresh patch to review. See detailed replies to questions and comments below.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:101-107
@@ -99,8 +100,9 @@
     //
     // The set contains the items on the worklist, but does not
     // maintain the order they should be visited.
     //
     // The vector maintains the order nodes should be visited, but may
     // contain duplicate or removed nodes. When choosing a node to
     // visit, we pop off the order stack until we find an item that is
     // also in the contents set. All operations are O(log N).
+    SmallVector<SDNode*, 64> Worklist;
----------------
Tim Northover wrote:
> This comment now seems out of date.
Yea, updated comments here.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:1082
@@ -1074,1 +1081,3 @@
 
+static bool recursivelyDeleteUnusedNodes(SelectionDAG &DAG, DAGCombiner &DC,
+                                         SDNode *N) {
----------------
hfinkel at anl.gov wrote:
> It seems like we now have a number of routines that do something like this. SDAG has an:
>   void RemoveDeadNode(SDNode *N);
> 
> which is also supposed to have behavior close to this (except that it does not updated the DC worklist?). Maybe some refactoring could consolidate things?
The intent was for this to be a DAG-combiner specific deletion mechanism.

Maybe it would be more clear as a method on DAGCombiner? I think I like that better. It also (IMO) removes some of the problems with clarifying that it manages the worklist.

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:1096
@@ +1095,3 @@
+      for (unsigned i = 0, e = N->getNumOperands(); i != e; ++i)
+        if (N != N->getOperand(i).getNode())
+          Nodes.insert(N->getOperand(i).getNode());
----------------
hfinkel at anl.gov wrote:
> Tim Northover wrote:
> > Is this possible? Isn't that an immediate cycle?
> And if it is possible, do you need a isPredecessorOf check instead?
I saw crashes due to this at some point in my testing but I can't reproduce them. I'll remove it for now.

Note that we don't really need to understand the nature of the cycle, just avoid creating an infinite loop in this function because we've already removed N from the set.

================
Comment at: test/CodeGen/ARM/sxt_rot.ll:12-13
@@ -11,3 +11,4 @@
 ; CHECK: test1
-; CHECK: sxtb r0, r0, ror #8
+; CHECK: lsr r0, r0, #8
+; CHECK: sxtb r0, r0
   %B = lshr i32 %A, 8
----------------
Tim Northover wrote:
> This doesn't look ideal, but it's just a deficiency in the ARM patterns. I've got a fix ready to go, but I'll wait until this is in to avoid giving you conflicts to resolve or anything.
Yea, this seemed clearly like a regression. Thanks for prepping the fix.

================
Comment at: test/CodeGen/X86/block-placement.ll:240
@@ -239,3 +239,1 @@
 
-define void @test_loop_rotate_reversed_blocks() {
-; This test case (greatly reduced from an Olden bencmark) ensures that the loop
----------------
Eric Christopher wrote:
> Inquiring minds here? It looks like it was deleted because it's no longer applicable? Does the code look ok?
This test was trying to check for a very precise misbehavior when we formed a particularly convoluted loop structure (see the comment). After this patch, we don't form that loop structure. I have no way of re-reducing a test case that still does, and the value seems very low.

================
Comment at: test/CodeGen/X86/divide-by-constant.ll:60
@@ -58,4 +59,3 @@
 ; CHECK-LABEL: test6:
-; CHECK: imull $26215, %eax, %ecx
-; CHECK: sarl $18, %ecx
-; CHECK: shrl $15, %eax
+; CHECK: imull $26215, %eax
+; CHECK: movl %eax, %ecx
----------------
Eric Christopher wrote:
> Assume the additional check lines aren't because the code generated is actually worse?
Correct. As it happens, this code is quite a bit better (fewer reg-reg dependencies).

================
Comment at: test/CodeGen/X86/narrow-shl-load.ll:34-61
@@ -33,30 @@
-
-; DAGCombiner shouldn't fold the sdiv (ashr) away.
-; rdar://8636812
-; CHECK-LABEL: test2:
-; CHECK:   sarl
-
-define i32 @test2() nounwind {
-entry:
-  %i = alloca i32, align 4
-  %j = alloca i8, align 1
-  store i32 127, i32* %i, align 4
-  store i8 0, i8* %j, align 1
-  %tmp3 = load i32* %i, align 4
-  %mul = mul nsw i32 %tmp3, 2
-  %conv4 = trunc i32 %mul to i8
-  %conv5 = sext i8 %conv4 to i32
-  %div6 = sdiv i32 %conv5, 2
-  %conv7 = trunc i32 %div6 to i8
-  %conv9 = sext i8 %conv7 to i32
-  %cmp = icmp eq i32 %conv9, -1
-  br i1 %cmp, label %if.then, label %if.end
-
-if.then:                                          ; preds = %entry
-  ret i32 0
-
-if.end:                                           ; preds = %entry
-  call void @abort() noreturn
-  unreachable
-}
-
----------------
Note that the code for this test case is actually worse after my patch.

Before:
  test2:                                  # @test2
  # BB#0:                                 # %entry
        pushq   %rax
        movl    $127, 4(%rsp)
        movb    $0, 3(%rsp)
        movl    4(%rsp), %eax
        addl    %eax, %eax
        movsbl  %al, %eax
        sarl    %eax
        cmpl    $-1, %eax
        jne     .LBB1_2

After:
  test2:                                  # @test2
  # BB#0:                                 # %entry
        pushq   %rax
        movl    $127, 4(%rsp)
        movb    $0, 3(%rsp)
        movl    4(%rsp), %eax
        addl    %eax, %eax
        movsbl  %al, %eax
        shrl    %eax
        movzbl  %al, %eax
        cmpl    $255, %eax
        jne     .LBB1_2


I haven't tracked this down, but the test doesn't really give *any* information about what this was actually trying to test. The change it went in with added a single value type test to *avoid* one DAG combine. Without more context for what the desirable code was, it seemed a bad idea to keep the test at all.

We probably should have something which recognizes that we could do "sarl, cmpl -1" rather than "shrl, movzbl, cmpl 255" (or equivalent other constants). But I'm not fussed about letting this regress minorly and letting someone who wants come along and clean it up later.

================
Comment at: test/CodeGen/X86/store-narrow.ll:37
@@ -36,3 +36,3 @@
 ; X32-LABEL: test2:
-; X32: movb	8(%esp), %[[REG:[abcd]l]]
-; X32: movb	%[[REG]], 1(%{{.*}})
+; X32: movzbl	8(%esp), %e[[REG:[abcd]]]x
+; X32: movb	%[[REG]]l, 1(%{{.*}})
----------------
Note that this is probably a (very minor) regression as we don't actually create partial register stalls here.

================
Comment at: test/CodeGen/X86/vec_extract-sse4.ll:7
@@ -6,3 +6,3 @@
 ; CHECK-NEXT:    movl 8(%esp), %[[R1:e[abcd]x]]
-; CHECK-NEXT:    movl 12(%[[R1]]), %[[R2:e[abcd]x]]
-; CHECK-NEXT:    movl %[[R2]], (%[[R0]])
+; CHECK-NEXT:    movss 12(%[[R1]]), %[[R2:xmm.*]]
+; CHECK-NEXT:    movss %[[R2]], (%[[R0]])
----------------
This is also likely a minor regression as we fail to avoid the cross-register-bank copy, although in this case it seems likely to be quite minor.

http://reviews.llvm.org/D4616