[llvm] [MachineCopyPropagation] Detect and fix suboptimal instruction order to enable optimizations (PR #98087)

Tue Jul 9 08:30:43 PDT 2024

spaits wrote:

@qcolombet @s-barannikov I tried out postRA scheduling as you suggested. I could get the pass to run on X86 with the following `llc` flags: `-post-RA-scheduler -break-anti-dependencies=all`.

I used `smuloi8` in `llvm/test/CodeGen/X86/xmulo.ll` for testing, where this patch enables the optimization:
https://github.com/llvm/llvm-project/pull/98087/files#diff-4e4004fcba25c450462eaec7e0624e6fb03322ae07c833916cd1196e68c97ddfL86 (Also explained at the bottom of this message.)

Here is the behavior of post-RA-scheduler on that function on Godbolt (left: with postRA scheduling; right: without it): https://godbolt.org/z/K9z1TrK54

You can also see that the data dependency was not introduced by the scheduler this time.

Here post-RA-scheduler just replaces `cl` with `dl` (see Godbolt or the bottom of this message). It does not resolve the dependency in a way that enables copy propagation.
I think the goal of this pass is to reorder machine instructions to optimize their execution on the target architecture by **considering the latencies** of instructions and the dependencies between them, not by considering how to make copy propagation more effective. So to use this pass for copy propagation we would need extra code, and based on my observations that extra code would be a poor fit for postRA scheduling.

I think this optimization would be pretty useful: judging by the tests, it has improved code on many different targets in many different contexts. There have also been other attempts to do similar things. I didn't check them deeply, but they did not move instructions around: https://github.com/llvm/llvm-project/pull/74239

I would still prefer this optimization to be in machine-cp. Copy propagation already does some very simple data-flow-analysis-like work; I would just make that analysis a bit more advanced. If I extended the responsibilities of post-RA-scheduler instead, I would also have to recognize possible copy propagations in post-RA-scheduler.

Can I ask why you think this optimization belongs in post-RA-scheduler?

(Sorry for the long messages. I am just trying to describe my thought process exactly, and my first language isn't English :) )

---

Explaining the behavior of machine-cp with and without my patch.
Here is the assembly generated without my patch and without postRA scheduling:
```asm
smuloi8:
# %bb.0:
  movl %ecx, %eax
  imulb %dl
  seto %cl
  movb %al, (%r8)
  movl %ecx, %eax
  retq
```

With my patch, machine-cp recognizes that by swapping `movb %al, (%r8)` and `seto %cl` we can do a copy propagation:
```asm
smuloi8:
# %bb.0:
  movl %ecx, %eax
  imulb %dl
  movb %al, (%r8)
  seto %al
  retq
```
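To double-check that the reordered sequence is equivalent, here is a small toy model in Python (not LLVM code). It assumes that only the byte stored through `%r8` and the returned low byte `%al` are observable, and that `imulb` sets the overflow flag exactly when the signed 8-bit product does not fit in 8 bits. The function names are made up for illustration.

```python
def to_signed8(x):
    """Interpret the low 8 bits of x as a signed value."""
    x &= 0xFF
    return x - 256 if x >= 128 else x

def imulb(al, operand):
    """Signed 8-bit multiply: returns (16-bit ax, overflow flag)."""
    product = to_signed8(al) * to_signed8(operand)
    overflow = not (-128 <= product <= 127)
    return product & 0xFFFF, int(overflow)

def without_patch(ecx, dl):
    eax = ecx                               # movl %ecx, %eax
    ax, of = imulb(eax & 0xFF, dl)          # imulb %dl
    eax = (eax & 0xFFFF0000) | ax
    ecx = (ecx & 0xFFFFFF00) | of           # seto %cl
    stored = eax & 0xFF                     # movb %al, (%r8)
    eax = ecx                               # movl %ecx, %eax
    return eax & 0xFF, stored               # (returned al, stored byte)

def with_patch(ecx, dl):
    eax = ecx                               # movl %ecx, %eax
    ax, of = imulb(eax & 0xFF, dl)          # imulb %dl
    eax = (eax & 0xFFFF0000) | ax
    stored = eax & 0xFF                     # movb %al, (%r8)
    eax = (eax & 0xFFFFFF00) | of           # seto %al
    return eax & 0xFF, stored               # (returned al, stored byte)

def sequences_agree():
    """Compare both sequences over every pair of 8-bit inputs."""
    return all(without_patch(a, b) == with_patch(a, b)
               for a in range(256) for b in range(256))
```

Under these assumptions the two sequences agree on every input; the upper bits of `%eax` and the final value of `%ecx` do differ, which is why the model only compares `%al` and the stored byte.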

What post-RA-scheduler with `-break-anti-dependencies=all` gives:
```asm
smuloi8:                                # @smuloi8
        mov     eax, ecx
        imul    dl
        seto    dl
        mov     byte ptr [r8], al
        mov     eax, edx
        ret
```
It just renames the register, so the dependency that makes copy propagation unavailable stays.
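
A rough way to see why renaming alone does not help is the toy check below (Python, not LLVM code, and much simpler than what machine-cp really has to verify). It assumes a copy `dst = src` can only be folded into the instruction that defines `src` if nothing between that instruction and the copy reads or writes `dst`. Sub-registers are named by their full 32-bit register so that `al` and `eax` alias; `blockers` and the three instruction lists are hypothetical names for illustration.

```python
def blockers(block, def_idx, copy_idx, dst):
    """Instructions strictly between the def and the copy that touch dst."""
    return [text for text, defs, uses in block[def_idx + 1:copy_idx]
            if dst in defs or dst in uses]

# Each entry is (text, registers defined, registers used).
prefix = [
    ("movl %ecx, %eax", {"eax"}, {"ecx"}),
    ("imulb %dl", {"eax"}, {"eax", "edx"}),
]
store = ("movb %al, (%r8)", set(), {"eax", "r8"})

# Without the patch: seto %cl, then the store, then the copy.
original = prefix + [("seto %cl", {"ecx"}, set()), store,
                     ("movl %ecx, %eax", {"eax"}, {"ecx"})]

# post-RA-scheduler output: cl renamed to dl, same instruction order.
renamed = prefix + [("seto %dl", {"edx"}, set()), store,
                    ("movl %edx, %eax", {"eax"}, {"edx"})]

# With the patch: the store is moved above seto.
reordered = prefix + [store, ("seto %cl", {"ecx"}, set()),
                      ("movl %ecx, %eax", {"eax"}, {"ecx"})]

print(blockers(original, 2, 4, "eax"))
print(blockers(renamed, 2, 4, "eax"))
print(blockers(reordered, 3, 4, "eax"))
```

In this model the store still reads `al` between `seto` and the copy in both the original and the renamed sequence, so the copy stays. Only the reordered sequence has nothing in between, which is what lets `seto` write `%al` directly.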

(I don't know why there is a difference between the assembly I got from Godbolt and the one I got on my machine from the test generation. The semantics match.)
In all cases the `-disable-peephole -mtriple=x86_64-pc-win32` flags were passed to llc.

https://github.com/llvm/llvm-project/pull/98087