<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [x86, loop vectorizer] Smaller VF preferred when VFs have the same cost"

   href="https://bugs.llvm.org/show_bug.cgi?id=35687">35687</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[x86, loop vectorizer] Smaller VF preferred when VFs have the same cost

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>dneilson@azul.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=19572" name="attach_19572" title="IR to demonstrate">attachment 19572</a> <a href="attachment.cgi?id=19572&action=edit" title="IR to demonstrate">[details]</a></span>

IR to demonstrate

The attached IR was distilled down from one of our internal tests that degraded

~50% with the landing of <a href="https://reviews.llvm.org/rL317576">https://reviews.llvm.org/rL317576</a> (Fix default cost

model for cast op in X86). That change had the effect of calculating the cost

of a bitcast fed by a load as 0 (due to CodeGen/BasicTTIImpl.h lines 561-568 --

"If this is a zext/sext of a load, return 0 if the corresponding extending load

exists on target"). The result is that the vectorized loops in this IR end up

being 8 elements wide instead of 16, resulting in about half the throughput.

The obvious fix, changing the vectorizer to choose the larger VF when

costs are the same, does fix our issue, but it fails two tests:

 Transforms/LoopVectorize/X86/avx1.ll

 Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll

I'm filing this bug so that someone more knowledgeable about loop vectorization

on x86 can chime in with a suggested way forward.

For avx1.ll, the loop in @read_mod_i64 has the same cost for VFs 2 and 4, so

the change would select a VF of 4 instead of 2. The test seems to indicate

that this is undesirable with slow-unaligned-mem-32.

For fp64_to_uint32-cost-model.ll, the loop again has the same cost at VFs 1, 2,

and 4. However, the test indicates a preference for a scalarized loop in this

case.

I don't know the nuances of x86 vectorization heuristics well enough to know

whether these two failing tests are invariants that should be addressed by the

cost model. It does seem sensible to me to desire the widest possible vector,

so perhaps there are deficiencies in the cost model that would have to be

addressed?</pre>
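        <pre>The tie-breaking behaviour described above can be illustrated with a small
model. This is only a sketch under stated assumptions, not the LoopVectorize
code: select_vf, prefer_wider_on_tie, and every cost number are hypothetical,
and the model assumes the selector compares cost per lane (total cost of one
vector iteration divided by VF) and replaces the current best only when a
candidate is strictly cheaper.

```python
def select_vf(scalar_cost, vector_costs, prefer_wider_on_tie=False):
    """Pick a vectorization factor (VF) from hypothetical cost estimates.

    scalar_cost  -- assumed cost of one scalar iteration (VF = 1)
    vector_costs -- dict mapping VF to the assumed cost of one vector iteration
    """
    best_vf = 1
    best_cost = float(scalar_cost)
    # Visit candidates from narrowest to widest.
    for vf in sorted(vector_costs):
        per_lane = vector_costs[vf] / vf
        strictly_cheaper = best_cost > per_lane
        tie = best_cost == per_lane
        # With the strict comparison alone, a tie keeps the narrower VF.
        if strictly_cheaper or (prefer_wider_on_tie and tie):
            best_vf = vf
            best_cost = per_lane
    return best_vf


# Made-up costs where VF 8 and VF 16 tie per lane, as in the attached IR.
narrow = select_vf(4, {8: 16, 16: 32})
wide = select_vf(4, {8: 16, 16: 32}, prefer_wider_on_tie=True)

# Made-up costs where VFs 1, 2 and 4 all tie, as described for
# fp64_to_uint32-cost-model.ll.
scalar_kept = select_vf(3, {2: 6, 4: 12})
scalar_lost = select_vf(3, {2: 6, 4: 12}, prefer_wider_on_tie=True)
```

With these inputs the strict comparison keeps VF 8 in the first case and the
scalar loop in the second, while the tie-preferring variant picks 16 and 4.
In this model the same one-line change that fixes the regression is also what
flips the result the second test expects.</pre>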

        </div>

      </p>


    </body>

</html>