[llvm-bugs] [Bug 35687] New: [x86, loop vectorizer] Smaller VF preferred when VFs have the same cost

via llvm-bugs llvm-bugs at lists.llvm.org
Mon Dec 18 10:44:45 PST 2017


https://bugs.llvm.org/show_bug.cgi?id=35687

            Bug ID: 35687
           Summary: [x86, loop vectorizer] Smaller VF preferred when VFs
                    have the same cost
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: dneilson at azul.com
                CC: llvm-bugs at lists.llvm.org

Created attachment 19572
  --> https://bugs.llvm.org/attachment.cgi?id=19572&action=edit
IR to demonstrate

The attached IR was distilled from one of our internal tests that degraded by
~50% with the landing of https://reviews.llvm.org/rL317576 (Fix default cost
model for cast op in X86). That change had the effect of calculating the cost
of a cast fed by a load as 0 (due to CodeGen/BasicTTIImpl.h lines 561-568 --
"If this is a zext/sext of a load, return 0 if the corresponding extending load
exists on target"). The result is that the vectorized loops in this IR end up
being 8 elements wide instead of 16, resulting in about half the throughput.

The obvious fix -- changing the vectorizer to choose the larger VF when
costs are the same -- does fix our issue, but causes two tests to fail:
 Transforms/LoopVectorize/X86/avx1.ll
 Transforms/LoopVectorize/X86/fp64_to_uint32-cost-model.ll
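
To make the tie-breaking question concrete, here is a minimal sketch of a
VF selection loop that compares per-lane cost, with the tie policy made
explicit. It is a simplified model for illustration, not the vectorizer's
actual code: Candidate, TiePolicy, and selectVF are hypothetical names, and
the costs used with it below are made-up numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical candidate: a vectorization factor and the estimated cost of
// one loop iteration at that factor. Names are illustrative, not LLVM's.
struct Candidate {
  unsigned VF;
  double LoopCost;
};

enum class TiePolicy { PreferSmaller, PreferLarger };

// Returns the VF with the lowest per-lane cost (LoopCost / VF).
// Candidates are assumed to be sorted by increasing VF, with the scalar
// loop (VF = 1) first.
unsigned selectVF(const std::vector<Candidate> &Candidates, TiePolicy Policy) {
  assert(!Candidates.empty() && "need at least the scalar candidate");
  unsigned BestVF = Candidates.front().VF;
  double BestCost = Candidates.front().LoopCost / Candidates.front().VF;
  for (std::size_t I = 1; I < Candidates.size(); ++I) {
    double PerLane = Candidates[I].LoopCost / Candidates[I].VF;
    // A strict '<' keeps the earlier (smaller) VF on a tie; '<=' lets a
    // later (larger) VF with equal per-lane cost replace it.
    bool Better = (Policy == TiePolicy::PreferLarger) ? PerLane <= BestCost
                                                      : PerLane < BestCost;
    if (Better) {
      BestCost = PerLane;
      BestVF = Candidates[I].VF;
    }
  }
  return BestVF;
}
```

With illustrative costs {VF 1: 8, VF 8: 16, VF 16: 32}, VFs 8 and 16 tie at a
per-lane cost of 2, so PreferSmaller yields 8 and PreferLarger yields 16. With
{VF 1: 4, VF 2: 8, VF 4: 16}, all three tie, so PreferSmaller keeps the scalar
loop while PreferLarger picks VF 4.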

I'm filing this bug so that someone more knowledgeable about loop vectorization
on x86 can chime in with a suggested way forward.

For avx1.ll, the loop in @read_mod_i64 has the same cost for VFs 2 and 4, so
the change would select a VF of 4 instead of 2. The test seems to indicate
that this is undesirable with slow-unaligned-mem-32.

For fp64_to_uint32-cost-model.ll, the loop again has the same cost at VFs 1, 2,
and 4. However, the test indicates a preference for a scalar (VF 1) loop in
this case.

I don't know the nuances of x86 vectorization heuristics well enough to know
whether the expectations in these two failing tests are invariants that the
cost model should capture. It seems sensible to me to prefer the widest
possible vector, so perhaps there are deficiencies in the cost model that
would have to be addressed?
