[llvm] [VPlan] Model FOR resume value extraction in VPlan. (PR #93396)

Tue Jun 4 06:56:10 PDT 2024

================
@@ -847,6 +847,55 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
     // all users.
     RecurSplice->setOperand(0, FOR);
 
+    // This is the second phase of vectorizing first-order recurrences. An
+    // overview of the transformation is described below. Suppose we have the
+    // following loop.
+    //
+    //   for (int i = 0; i < n; ++i)
+    //     b[i] = a[i] - a[i - 1];
+    //
+    // There is a first-order recurrence on "a". For this loop, the shorthand
+    // scalar IR looks like:
+    //
+    //   scalar.ph:
+    //     s_init = a[-1]
+    //     br scalar.body
+    //
+    //   scalar.body:
+    //     i = phi [0, scalar.ph], [i+1, scalar.body]
+    //     s1 = phi [s_init, scalar.ph], [s2, scalar.body]
+    //     s2 = a[i]
+    //     b[i] = s2 - s1
+    //     br cond, scalar.body, ...
+    //
+    // In this example, s1 is a recurrence because its value depends on the
+    // previous iteration. In the first phase of vectorization, we created a
+    // vector phi v1 for s1. We now complete the vectorization and produce the
+    // shorthand vector IR shown below (for VF = 4, UF = 1).
+    //
+    //   vector.ph:
+    //     v_init = vector(..., ..., ..., a[-1])
+    //     br vector.body
+    //
+    //   vector.body:
+    //     i = phi [0, vector.ph], [i+4, vector.body]
+    //     v1 = phi [v_init, vector.ph], [v2, vector.body]
+    //     v2 = a[i, i+1, i+2, i+3];
+    //     v3 = vector(v1(3), v2(0, 1, 2))
+    //     b[i, i+1, i+2, i+3] = v2 - v3
+    //     br cond, vector.body, middle.block
+    //
+    //   middle.block:
+    //     x = v2(3)
+    //     br scalar.ph
+    //
+    //   scalar.ph:
+    //     s_init = phi [x, middle.block], [a[-1], otherwise]
+    //     br scalar.body
+    //
----------------
ayalz wrote:

It would be good to also show how **live-outs** are extracted and reach the exit block, alongside the **resume values** that reach the scalar loop. Currently, a live-out is extracted as the penultimate element of v2 in the middle block, and the resume value is extracted there as the last element of v2. Note that the live-out could instead be extracted as the last element of v3, since v3(3) = v2(2). Here's a sketch, which also uses the alternative mentioned above of keeping the FOR phi scalar:
```suggestion
    //   Original scalar IR, including live-out in exit block:
    //
    //   scalar.ph:
    //     s_init = a[-1]
    //     br scalar.body
    //
    //   scalar.body:
    //     i = phi [0, scalar.ph], [i+1, scalar.body]
    //     s1 = phi [s_init, scalar.ph], [s2, scalar.body]
    //     s2 = a[i]
    //     b[i] = s2 - s1
    //     br cond, scalar.body, exit.block
    //
    //   exit.block:
    //     lo = lcssa.phi [s1, scalar.body]
    //
    //   Alternative end result:
    //
    //   old.scalar.ph (pre.ph):
    //     s_init = a[-1]
    //     <potential bypass blocks>
    //     br vector.ph
    //
    //   vector.ph:
    //     br vector.body
    //
    //   vector.body:
    //     i = phi [0, vector.ph], [i+4, vector.body]
    //     s1 = phi [s_init, vector.ph], [s_resume, vector.body]
    //     v2 = a[i, i+1, i+2, i+3];
    //     v3 = vector(s1, v2(0, 1, 2))
    //     b[i, i+1, i+2, i+3] = v2 - v3
    //     s_resume = v2(3)
    //     br cond, vector.body, middle.block
    //
    //   middle.block:
    //     s_penultimate = v2(2) = v3(3)
    //     br cond, scalar.ph, exit.block
    //
    //   scalar.ph:
    //     s_init' = phi [s_resume, middle.block], [s_init, otherwise]
    //     br scalar.body
    //
    //   scalar.body:
    //     i = phi [0, scalar.ph], [i+1, scalar.body]
    //     s1 = phi [s_init', scalar.ph], [s2, scalar.body]
    //     s2 = a[i]
    //     b[i] = s2 - s1
    //     br cond, scalar.body, exit.block
    //
    //   exit.block:
    //     lo = lcssa.phi [s1, scalar.body], [s_penultimate, middle.block]
    
```
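
To sanity-check the sketch above, here is a small standalone C++ emulation of the shorthand IR for VF = 4, UF = 1, with the FOR phi kept scalar. It is only an illustration under stated assumptions: `LoopResult`, `runScalar` and `runVectorized` are hypothetical names, not LLVM or VPlan APIs, `s_init` stands in for `a[-1]`, the input is assumed non-empty, and the remainder is simply handed to the scalar loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: models the shorthand IR, not any LLVM API.
struct LoopResult {
  std::vector<int> b; // b[i] = a[i] - a[i-1]
  int lo;             // live-out: value of s1 in the final iteration
};

// Reference: the original scalar loop. s_init stands in for a[-1].
LoopResult runScalar(const std::vector<int> &a, int s_init) {
  LoopResult r{std::vector<int>(a.size()), s_init};
  int s1 = s_init;
  for (std::size_t i = 0; i < a.size(); ++i) {
    int s2 = a[i];
    r.b[i] = s2 - s1;
    r.lo = s1; // exit.block: lo = lcssa.phi [s1, scalar.body]
    s1 = s2;
  }
  return r;
}

// Emulation of the vectorized form with a scalar FOR phi (VF = 4, UF = 1).
LoopResult runVectorized(const std::vector<int> &a, int s_init) {
  constexpr std::size_t VF = 4;
  const std::size_t n = a.size();
  const std::size_t vecTripCount = n - n % VF;
  LoopResult r{std::vector<int>(n), s_init};

  int s1 = s_init;            // s1 = phi [s_init, vector.ph], [s_resume, vector.body]
  int s_penultimate = s_init; // only meaningful if the vector loop ran
  for (std::size_t i = 0; i < vecTripCount; i += VF) {
    std::array<int, VF> v2{a[i], a[i + 1], a[i + 2], a[i + 3]};
    std::array<int, VF> v3{s1, v2[0], v2[1], v2[2]}; // vector(s1, v2(0, 1, 2))
    for (std::size_t lane = 0; lane < VF; ++lane)
      r.b[i + lane] = v2[lane] - v3[lane];
    assert(v2[2] == v3[3]);   // the two candidate live-out extracts agree
    s_penultimate = v2[2];    // conceptually extracted once, in middle.block
    s1 = v2[3];               // s_resume = v2(3)
  }

  // middle.block -> exit.block edge: lo takes s_penultimate.
  if (vecTripCount > 0)
    r.lo = s_penultimate;

  // scalar.ph: s_init' = phi [s_resume, middle.block], [s_init, otherwise].
  // s1 already holds s_resume if the vector loop ran, else s_init.
  for (std::size_t i = vecTripCount; i < n; ++i) {
    int s2 = a[i];
    r.b[i] = s2 - s1;
    r.lo = s1; // lo = lcssa.phi [s1, scalar.body]
    s1 = s2;
  }
  return r;
}
```

Calling both functions on the same input (for example, eight elements with `s_init = 7`) should produce identical `b` and `lo`, with `lo` equal to `a[n - 2]`. When `n` is not a multiple of 4, the remainder loop overwrites `lo`, matching the `[s1, scalar.body]` incoming value of the LCSSA phi.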

https://github.com/llvm/llvm-project/pull/93396