[PATCH] D99723: [ARM] Transforming memcpy to Tail predicated Loop

Mon Apr 5 01:40:57 PDT 2021

malharJ added inline comments.

================
Comment at: llvm/lib/Target/ARM/ARMInstrMVE.td:6873
+let usesCustomInserter = 1, hasNoSchedulingInfo = 1 in {
+  def MVE_MEMCPYLOOPINST : PseudoInst<(outs),
+                                      (ins rGPR
----------------
dmgreen wrote:
> Can you improve this formatting. If you clang-formatted it, it doesn't do well with .td files and manually making it look more like the others will do better.
Had clang formatted it but yeah doesnt look good.
Updated now.

================
Comment at: llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll:21
+; CHECK-NEXT:    lsrs r7, r7, #4
+; CHECK-NEXT:    wlstp.8 lr, r7, .LBB0_3
+; CHECK-NEXT:    b .LBB0_4
----------------
dmgreen wrote:
> Why does this not use r3 directly?
This seems to be an issue with generating a preHeader during the transform .. 

The phi-node-elimination pass is lowering the phi instructions (in the TP loopBody)  as COPY operations (into the PreHeader).
In this instance, the copy/mov can be seen below on line 32:
```
mov r7, r3
```

I've fixed this issue as of now by not generating an extra preHeader during the transform .. 
so the mov ends up above the t2WhileLoopStartLR and overall it seems to work. 

Please see my comment about the latest changes for more details on this.

================
Comment at: llvm/test/CodeGen/Thumb2/mve_tp_loop.ll:35
+; CHECK-NEXT:    push {r7, lr}
+; CHECK-NEXT:    add.w r3, r2, #15
+; CHECK-NEXT:    bic r3, r3, #16
----------------
dmgreen wrote:
> What is this generating r3 for? I thought those should be removed.
Good spot,

I think this is because DCE was not happening for the instructions calculating iterationCount.

I had a quick look at  ARMLowOverheadLoops::IterationCountDCE( ), and it seems
that the expectation from the generated MIR is:

```
//   $lr = big-itercount-expression
//   ..
//   $lr = t2DoLoopStart/t2WhileLoopStartLR renamable $lr
//   vector.body:
```

I've updated the final expression (in the generated MIR) calculating iteration count to now return
the result in LR (earlier it was returning in one of the rGPRs) and these are now getting removed.

================
Comment at: llvm/test/CodeGen/Thumb2/mve_tp_loop.ll:55
+; void test3(int* restrict X, int* restrict Y, int n){
+;     printf("n = %d\n", n);
+;     memcpy(X, Y, n);
----------------
dmgreen wrote:
> Why is this using printf? It looks like an execution test, not a unit test. Is it testing anything specifically? If so it can probably use any call, not a variadic version of printf.
So the intent of the test was just to check whether code surrounding memcpy call site is properly transformed. I simply used printf to prevent the code from getting optimized away (seems like a poor way now that I think about it).

I've removed this test now since the transformation involving nested loops (in memcall.ll) is already testing the mentioned intent.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99723/new/

https://reviews.llvm.org/D99723