<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - [x86, loop vectorizer] Smaller VF preferred when VFs have the same cost"
   href="https://bugs.llvm.org/show_bug.cgi?id=35687">35687</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>[x86, loop vectorizer] Smaller VF preferred when VFs have the same cost
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>new-bugs
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>new bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>dneilson@azul.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=19572" name="attach_19572" title="IR to demonstrate">attachment 19572</a> <a href="attachment.cgi?id=19572&action=edit" title="IR to demonstrate">[details]</a></span>
IR to demonstrate

The attached IR was distilled from one of our internal tests, which regressed by
~50% when <a href="https://reviews.llvm.org/rL317576">https://reviews.llvm.org/rL317576</a> (Fix default cost
model for cast op in X86) landed. That change had the effect of calculating the cost
of a bitcast fed by a load as 0 (due to CodeGen/BasicTTIImpl.h lines 561-568 --
"If this is a zext/sext of a load, return 0 if the corresponding extending load
exists on target"). As a result, the vectorized loops in this IR end up
8 elements wide instead of 16, giving about half the throughput.

The obvious fix, changing the vectorizer to choose the larger VF when costs
are equal, resolves our issue but breaks two tests:
 Transforms/LoopVectorize/X86/avx1.ll
 Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll

I'm filing this bug so that someone more knowledgeable about loop vectorization
on x86 can chime in with a suggested way forward.

For avx1.ll, the loop in @read_mod_i64 has the same cost for VFs 2 and 4, so
the change would select VF 4 instead of 2. The test seems to indicate that
this is undesirable with slow-unaligned-mem-32.

For fp64_to_uint32-cost-model.ll, the loop again has the same cost at VFs 1, 2,
and 4. However, the test indicates a preference for a scalar (unvectorized)
loop in this case.
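
To make the tie-breaking concrete, here is a minimal Python sketch (not the
LLVM source) of picking the VF with the lowest cost per lane. It assumes the
selection loop walks candidate VFs in increasing order and keeps the first
minimum, which would explain the current preference for the smaller VF. The
names select_vf and prefer_wider_on_tie, and all cost values, are made up for
illustration; only the strict versus non-strict comparison matters.

```python
from typing import Dict


def select_vf(loop_cost: Dict[int, float], prefer_wider_on_tie: bool) -> int:
    """Return the VF with the lowest cost per lane.

    loop_cost maps each candidate VF to a hypothetical cost of one loop
    iteration at that width. VF 1 (the scalar loop) must be present,
    because it is used as the baseline.
    """
    best_vf = 1
    best_cost = loop_cost[1]
    for vf in sorted(loop_cost):
        if vf == 1:
            continue
        per_lane = loop_cost[vf] / vf
        if prefer_wider_on_tie:
            # Non-strict comparison: an equal cost replaces the current best.
            better = best_cost >= per_lane
        else:
            # Strict comparison: an equal cost keeps the earlier, smaller VF.
            better = best_cost > per_lane
        if better:
            best_vf, best_cost = vf, per_lane
    return best_vf


# Hypothetical costs where VF 8 and VF 16 tie at 2.0 per lane.
wide_tie = {1: 4.0, 8: 16.0, 16: 32.0}
print(select_vf(wide_tie, prefer_wider_on_tie=False))  # 8
print(select_vf(wide_tie, prefer_wider_on_tie=True))   # 16

# Hypothetical costs where VFs 1, 2, and 4 all tie at 4.0 per lane.
scalar_tie = {1: 4.0, 2: 8.0, 4: 16.0}
print(select_vf(scalar_tie, prefer_wider_on_tie=False))  # 1
print(select_vf(scalar_tie, prefer_wider_on_tie=True))   # 4
```

With these made-up numbers, the second table mirrors the shape of the
fp64_to_uint32-cost-model.ll failure: a blanket "wider wins on a tie" rule
also displaces the scalar loop, which that test appears to want kept.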

I don't know the nuances of x86 vectorization heuristics well enough to say
whether the behavior checked by these two tests is an invariant that the cost
model should enforce. Preferring the widest possible vector seems sensible to
me, so perhaps there are deficiencies in the cost model that would have to be
addressed?</pre>
        </div>
      </p>


    </body>
</html>