<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - Loads and Stores are not always coalesced"

   href="https://llvm.org/bugs/show_bug.cgi?id=25899">25899</a>

          </td>

        </tr>


        <tr>

          <th>Summary</th>

          <td>Loads and Stores are not always coalesced

          </td>

        </tr>


        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>


        <tr>

          <th>Version</th>

          <td>3.7

          </td>

        </tr>


        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>


        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>


        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>


        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>


        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>


        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>


        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>


        <tr>

          <th>Reporter</th>

          <td>haneef503@gmail.com

          </td>

        </tr>


        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr>


        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

<pre>Clang (or possibly LLVM itself) sometimes generates inefficient code for a sequence

of byte loads or stores, even though in other contexts it recognizes that *the same

code* can be coalesced into fewer loads/stores. For example, take this simple code:


```

#include <stdint.h>


int l32 (const uint8_t *b) {

    int r = 0;

    r ^= b[0];

    r ^= b[1] << 8;

    r ^= b[2] << 16;

    r ^= b[3] << 24;


    return r;

}


int f (int a) {

    return l32 ((void *) &a);

}

```


`clang -O2` generates (clang 3.7, Intel syntax, extraneous output removed):


l32:

    movzx    eax, byte ptr [rdi]

    movzx    ecx, byte ptr [rdi + 1]

    shl    ecx, 8

    or    ecx, eax

    movzx    edx, byte ptr [rdi + 2]

    shl    edx, 16

    or    edx, ecx

    movzx    eax, byte ptr [rdi + 3]

    shl    eax, 24

    or    eax, edx

    ret


f:

    mov    eax, edi

    ret


Since it was able to optimize f() to a simple register move, it must have

recognized that the four byte loads could be coalesced into a single load (or that a

little-endian load was being compiled for an architecture that happens to be

little-endian). Hence, it is quite odd that it didn't perform the same

optimization on l32() itself and reduce it to something more like:


l32:

    mov    eax, dword ptr [rdi]

    ret</pre>
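        <pre>For illustration, here is a minimal sketch of the equivalence the expected
optimization relies on. The names l32_bytes and l32_memcpy are hypothetical and
not part of the original test case. The sketch assumes a little-endian target
and uses unsigned arithmetic so that the shift by 24 cannot overflow a signed
int. Whether a given compiler version emits a single load for either function
is not claimed here.

```c
#include <stdint.h>
#include <string.h>

/* Byte-wise little-endian load, equivalent in intent to l32() from the
   report, but computed in uint32_t to avoid signed-shift overflow. */
uint32_t l32_bytes(const uint8_t *b) {
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Hypothetical alternative: copy four bytes in one memcpy call.
   This matches l32_bytes() only when the host is little-endian. */
uint32_t l32_memcpy(const uint8_t *b) {
    uint32_t r;
    memcpy(&r, b, sizeof r);
    return r;
}
```

On a little-endian host both functions return the same value for any four-byte
input, which is why a single 32-bit load is a valid lowering of the byte-wise
version there.</pre>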

        </div>

      </p>


    </body>

</html>