<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Potential missed macro-fusion optimization opportunity with cmp/jcc test/jcc"

   href="https://bugs.llvm.org/show_bug.cgi?id=38452">38452</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Potential missed macro-fusion optimization opportunity with cmp/jcc test/jcc

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>codeman.consulting@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>In <a class="bz_bug_link 

          bz_status_RESOLVED  bz_closed"

   title="RESOLVED INVALID - cmp+store is not optimized to an unconditional store"

   href="show_bug.cgi?id=38450">bug 38450</a> the following was posted:

Gonzalo BG 2018-08-05 06:55:12 PDT

See <a href="https://godbolt.org/g/5W5q2K">https://godbolt.org/g/5W5q2K</a> , the following LLVM-IR:

define void @foo(i32* noalias nocapture dereferenceable(4) %x) {

start:

  %0 = load i32, i32* %x, align 4

  %1 = icmp eq i32 %0, 0

  br i1 %1, label %bb2, label %bb1

bb1:

  store i32 0, i32* %x, align 4

  br label %bb2

bb2:

  ret void

}

does not optimize to anything better with opt -O3. Using llc, it generates the

following x86_64 assembly code (<a href="https://godbolt.org/g/RX81Sn">https://godbolt.org/g/RX81Sn</a>): 

    cmpl    $0, (%rdi)

    je      .LBB0_2

    movl    $0, (%rdi)

.LBB0_2:  

    retq
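
For reference, the IR above corresponds roughly to the C function below. This
is only a sketch: the name foo is taken from the IR, and the source behind the
godbolt link may be written in a different language and carry extra aliasing
guarantees (noalias, dereferenceable) that plain C does not express.

```c
/* Sketch of the source-level behavior of @foo (an assumption, not the
   original godbolt source): store zero only if the value is nonzero. */
void foo(int *x)
{
    if (*x != 0) {   /* load + compare with zero, then conditional branch */
        *x = 0;      /* the conditional store seen in bb1 */
    }
}
```

The branch around the store is what produces the cmp/je pair discussed below.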

Although the original bug was marked invalid because of side effects of the

proposed alternate, this code contains a potential missed macro-fusion

optimization:

In a more general case, would the following be faster because of the

macro-fusion of the cmp/je pair allowed by cmp mem, reg instead of cmp mem,

imm?  

[See Intel Optimization manual 2016 p. 3-13: "CMP and TEST can not be fused when

comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label)."]

Assembly/Compiler Coding Rule 19 states that additional instructions should

not be added just to avoid a mem, imm comparison or test, but that such a

comparison should be avoided when possible.  In this case the zeroed register

can be used in the store as well, and may be reusable from an earlier zero

constant.  We don't care about the flag changes here.

    xor eax, eax

    cmp [rdi], eax

    je .LBB0_2

    mov [rdi], eax

.LBB0_2:

    ret

Code size is a bit smaller as well.  

Alternately: 

    mov eax, [rdi]

    test eax, eax

    je .LBB0_2

    mov DWORD PTR [rdi], 0

.LBB0_2:

    ret

This should also allow macro-fusion, but it was unclear to me from the

optimization manual whether the load prior to the test would introduce a stall

(the manual does recommend this form). 

[3.5.1.9]

Assembly/Compiler Coding Rule 40. (ML impact, M generality) Use the TEST

instruction instead of AND when the result of the logical AND is not used. This

saves μops in execution. Use a TEST of a register with itself instead of a CMP

 of the register to zero, this saves the need to encode the zero and saves

encoding space. Avoid comparing a constant to a memory operand. It is

preferable to load the memory operand and compare the constant to a register.  

Assembly/Compiler Coding Rule 41. (ML impact, M generality) Eliminate

unnecessary compare with zero instructions by using the appropriate conditional

jump instruction when the flags are already set by a preceding arithmetic

instruction. If necessary, use a TEST instruction instead of a compare. Be 

certain that any code transformations made do not introduce problems with

overflow.

Rules 40 and 41 seem to apply only to test and zero comparisons, whereas rule 19

is more general and advises against adding extra instructions to avoid a

mem, imm comparison.  Based on that, the second alternative seems closest to the

guidelines.  The first still seems useful as a code size reduction and

macro-fusion optimization when a register is already set to zero, since no

extra code is generated for the sake of the fusion.  

Note that macro-fusion only appears to be available in 64 bit mode on Nehalem

and up, whereas on the Core series it applies in 32 bit mode.  

In any case, each cmp/jcc or test/jcc pair above should fuse into a single

dispatch with reduced latency.  In the example code this doesn't matter much,

but in a loop it could improve speed a bit. 
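
As an illustration of the loop case, here is a hypothetical variant of the
example that clears every nonzero element of an array. The function name
clear_nonzero and its signature are made up for this sketch. I have not checked
what the backend emits for it, and the optimizer may well vectorize the loop or
drop the branch entirely.

```c
/* Hypothetical loop variant of foo: the compare-and-branch on each element
   is the pattern where a fused cmp/jcc or test/jcc could pay off. */
void clear_nonzero(int *a, unsigned long n)
{
    unsigned long i;
    for (i = 0; i != n; ++i) {
        if (a[i] != 0) {   /* compare element with zero, then branch */
            a[i] = 0;      /* store only when the element was nonzero */
        }
    }
}
```
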

<a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [x86] Esoteric macro-fusion opportunity: converting jumps to fuseable conditional branches"

   href="show_bug.cgi?id=38079">bug 38079</a> mentions a similar enhancement.</pre>

        </div>

      </p>


    </body>

</html>