<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Trick for inlining more small constant sized memcmp() and memcpy()"

   href="https://bugs.llvm.org/show_bug.cgi?id=36426">36426</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Trick for inlining more small constant sized memcmp() and memcpy()

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>dave@znu.io

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

<pre>A pair of overlapping loads lets the compiler avoid calling the C runtime in
more cases where the size is constant, small, and awkward (not the width of a
single load). For example, "memcmp(a, b, 15) == 0" could be lowered to the
equivalent of:

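/* a and b are byte pointers; each dereference stands for an unaligned
   8-byte load, as the backend would emit it */
size_t offset = 15 - sizeof(int64_t);  /* 7: second load covers bytes 7..14 */
bool same = 0 == ((*(int64_t*)a ^ *(int64_t*)b) |
                  (*(int64_t*)(a + offset) ^ *(int64_t*)(b + offset)));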

This should scale up to a pair of vector loads too.
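
As a rough sketch of the memcmp() case (not what LLVM emits today, and equal15
is a made-up name), the same comparison can be written as compilable C. It
assumes Clang or GCC, since it uses __builtin_memcpy to express the unaligned
loads without casts or headers, and it assumes that unsigned long long is 64
bits wide:

```c
/* Hypothetical helper, not an LLVM API: memcmp(a, b, 15) == 0 done with two
   overlapping 8-byte loads per buffer. __builtin_memcpy is the Clang/GCC
   builtin; it sidesteps the alignment and aliasing problems of casting the
   pointers to an integer type. */
typedef unsigned long long u64;
_Static_assert(sizeof(u64) == 8, "sketch assumes a 64-bit unsigned long long");

static int equal15(const void *a, const void *b)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    const unsigned offset = 15 - sizeof(u64);  /* 7 */
    u64 w[4];

    __builtin_memcpy(w + 0, pa, sizeof(u64));           /* a[0..7]  */
    __builtin_memcpy(w + 1, pb, sizeof(u64));           /* b[0..7]  */
    __builtin_memcpy(w + 2, pa + offset, sizeof(u64));  /* a[7..14] */
    __builtin_memcpy(w + 3, pb + offset, sizeof(u64));  /* b[7..14] */

    /* Byte 7 is compared twice, which is harmless for an equality test. */
    return ((w[0] ^ w[1]) | (w[2] ^ w[3])) == 0;
}
```

For any two 15-byte buffers, equal15(a, b) should agree with
memcmp(a, b, 15) == 0. Neither load reads past byte 14.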

I haven't benchmarked this with memcpy() yet, but I'd expect avoiding the call
to the runtime to be worth it there too, and for the same reasons: fewer
register spills, no function call overhead, and no dynamic algorithm selection
inside the library routine, which no longer knows that the size is constant.
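
For memcpy() the analogous shape would presumably be two overlapping loads
followed by two overlapping stores. A sketch under the same assumptions as
before (Clang or GCC builtins, 64-bit unsigned long long); copy15 is a
hypothetical name and, as said above, this is unbenchmarked:

```c
/* Hypothetical helper, not an LLVM API: memcpy(dst, src, 15) done as two
   overlapping 8-byte moves instead of a library call. */
typedef unsigned long long u64;
_Static_assert(sizeof(u64) == 8, "sketch assumes a 64-bit unsigned long long");

static void copy15(void *dst, const void *src)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const unsigned offset = 15 - sizeof(u64);  /* 7 */
    u64 w[2];

    /* Both loads are issued before either store. */
    __builtin_memcpy(w + 0, s, sizeof(u64));           /* src[0..7]  */
    __builtin_memcpy(w + 1, s + offset, sizeof(u64));  /* src[7..14] */

    /* Byte 7 is written twice with the same value. */
    __builtin_memcpy(d, w + 0, sizeof(u64));           /* dst[0..7]  */
    __builtin_memcpy(d + offset, w + 1, sizeof(u64));  /* dst[7..14] */
}
```

Exactly 15 bytes of dst are written; the bytes on either side of the
destination range are left untouched.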

NOTE: memcmp() and memcpy() don't guarantee that their pointers are aligned,
so don't assume that the first load is "aligned" and the second load is "not
aligned". The opposite could be true at run time.</pre>

        </div>

      </p>


    </body>

</html>