<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - ARMv7a: inefficient code generated from memcpy + bswap builtin"
href="https://bugs.llvm.org/show_bug.cgi?id=51621">51621</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>ARMv7a: inefficient code generated from memcpy + bswap builtin
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: ARM
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>sami.liedes@iki.fi
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org, smithp352@googlemail.com, Ties.Stuij@arm.com
</td>
</tr></table>
<p>
<div>
<pre>Also applies to: armv7-a clang 11.0.1
Godbolt: <a href="https://godbolt.org/z/9f1GKf5qe">https://godbolt.org/z/9f1GKf5qe</a>
Consider this code:
----------
#include <stdint.h>

uint32_t read_unaligned_memcpy_bswap_32(const uint8_t *buf, int offset) {
    uint32_t val;
    __builtin_memcpy(&val, buf+offset, 4);
    return __builtin_bswap32(val);
}

uint32_t read_unaligned_shift_add_32(const uint8_t *buf, int offset) {
    return (((uint32_t)buf[offset]) << 24) +
           (((uint32_t)buf[offset+1]) << 16) +
           (((uint32_t)buf[offset+2]) << 8) +
           (((uint32_t)buf[offset+3]) << 0);
}
----------
On many architectures, e.g. ARMv8, these produce identical and efficient code.
On ARMv7a, the __builtin_bswap32 version produces what looks like *worse* code
than the shift+add version (although I admit I don't know the architecture
well enough to be sure, at the least the result has 14 instructions as
opposed to 8):
read_unaligned_memcpy_bswap_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        orr     r1, r1, r2, lsl #8
        orr     r0, r3, r0, lsl #8
        mov     r2, #16711680
        orr     r0, r1, r0, lsl #16
        mov     r1, #65280
        and     r1, r1, r0, lsr #8
        and     r2, r2, r0, lsl #8
        orr     r1, r1, r0, lsr #24
        orr     r0, r2, r0, lsl #24
        orr     r0, r0, r1
        bx      lr

read_unaligned_shift_add_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        lsl     r2, r2, #16
        orr     r1, r2, r1, lsl #24
        orr     r1, r1, r3, lsl #8
        orr     r0, r1, r0
        bx      lr
The same applies to the 16-bit version (see the Godbolt link for the code),
but the difference is much less dramatic (also, for the 16-bit bswap version,
trunk generates one more instruction than 11.0.1; I don't know how significant
that is).</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>