[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

Mon Feb 8 07:41:20 PST 2016

Folks,

I'm now looking at https://llvm.org/bugs/show_bug.cgi?id=16274, which
seems to have some support in the vectorizer, but not in the way we
need for this particular case. I may have missed something obvious, so
please let me know if there is a better way.

As you already know, ARM has two FP instruction sets: VFP and NEON.
VFP operates on scalar FP registers, while NEON is a full SIMD unit.
The problem is that NEON is not IEEE compliant on FP operations, while
VFP is.

Even if the target has NEON and the user has asked for it to be used,
without -ffast-math and related arguments, we simply can't produce
NEON instructions for FP operations. Different operations may have
different non-compliance (inf, denormals, etc) and I haven't yet
investigated the full support, but it's safe to start from blocking
*all* FP operations on NEON when *any* FP restrictions are in place.
We can expand for better support later, when the infrastructure is in
place.
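
To make the denormal case concrete, here is a small sketch of what
flush-to-zero does to a result that IEEE arithmetic would keep as a
subnormal. It only emulates the behaviour in portable C++ and is not
ARM code; flushToZero, halveIEEE and halveFTZ are names made up for
this illustration.

```cpp
#include <cmath>
#include <limits>

// Emulates flush-to-zero: a subnormal value becomes a zero of the same sign.
// (Illustrative helper, not a real hardware or LLVM interface.)
float flushToZero(float X) {
  return std::fpclassify(X) == FP_SUBNORMAL ? std::copysign(0.0f, X) : X;
}

// IEEE-style halving: gradual underflow keeps a subnormal result.
float halveIEEE(float X) { return X * 0.5f; }

// Flush-to-zero halving: both the input and the result are flushed.
float halveFTZ(float X) { return flushToZero(flushToZero(X) * 0.5f); }
```

Halving the smallest normal float (std::numeric_limits<float>::min())
gives a non-zero subnormal with halveIEEE but exactly zero with
halveFTZ, so the two paths disagree on the same input.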

As far as I could see, -ffast-math is handled in the vectorizer, but
only as an all-or-nothing switch, which is not what we want here. So,
I thought about two ways we could go about doing this:

1.  The pragmatic way

Add a cost "TCC_Impossible = AllOnes" to TCC and, in ARM's cost model,
check whether fast-math is enabled for FP ALU operations, returning
TCC_Impossible if it is not. So, VFP costs would be less than NEON
costs divided by their widths.

This would make any vectorization beyond VFP instructions impossible
if fast-math is not chosen, while still using VFP instructions in the
loop, making it slightly faster.

I'm sceptical about introducing the TCC_Impossible cost, as it seems
like a dirty trick. I'm open to better solutions.
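
A minimal sketch of the cost trick in option 1, written as a
stand-alone model rather than against the real TTI interfaces.
TCC_Basic and TCC_Impossible here are local constants, and fpArithCost
and pickWidth are hypothetical helpers; the assumption that one NEON
operation costs the same as one VFP operation is only for
illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <limits>

// Hypothetical stand-ins for the proposal; not existing LLVM API.
constexpr uint64_t TCC_Basic = 1;
constexpr uint64_t TCC_Impossible =
    std::numeric_limits<uint64_t>::max(); // "AllOnes"

// Cost of one FP ALU operation at vector width Width (1 = scalar VFP).
uint64_t fpArithCost(unsigned Width, bool FastMath) {
  assert(Width >= 1 && "width must be at least one lane");
  if (Width == 1)
    return TCC_Basic;      // VFP is IEEE compliant, always allowed
  if (!FastMath)
    return TCC_Impossible; // NEON FP is off-limits without fast-math
  return TCC_Basic;        // assumed: one NEON op handles all lanes
}

// Picks the width with the lowest cost per lane, skipping impossible ones.
unsigned pickWidth(bool FastMath) {
  unsigned Best = 1;
  double BestPerLane = static_cast<double>(fpArithCost(1, FastMath));
  for (unsigned W : {2u, 4u}) {
    uint64_t Cost = fpArithCost(W, FastMath);
    if (Cost == TCC_Impossible)
      continue; // never divide the sentinel by the width
    double PerLane = static_cast<double>(Cost) / W;
    if (PerLane < BestPerLane) {
      Best = W;
      BestPerLane = PerLane;
    }
  }
  return Best;
}
```

In this model pickWidth(false) stays at width 1 (VFP only), while
pickWidth(true) moves to the widest NEON form. Note that the sentinel
has to be tested explicitly before any per-lane division, which is
part of why it feels like a trick.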

2.  The thorough way

Add a flag in TableGen to vector instructions, describing their level
of IEEE compliance. Add a "fall-back" VFP instruction to each of them
(either in TableGen or TTI).

In the vectorizer, on FP ALU cost, add a check on fast-math && IEEE
conformance. If the check fails, look up the fall-back instruction's
width and add the cost as that * Width/FallBackWidth.
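
The cost rule above could look roughly like this. It is a sketch
under assumptions: FPVectorInfo and fpAluCost are invented for
illustration, "that" is read as the fall-back instruction's cost, and
the check is taken to pass when either fast-math is on or the
instruction is IEEE compliant.

```cpp
#include <cassert>

// Hypothetical per-instruction data, as it might come from TableGen or TTI.
struct FPVectorInfo {
  unsigned Cost;          // cost of the vector instruction itself
  bool IEEECompliant;     // proposed IEEE compliance flag
  unsigned FallBackCost;  // cost of one fall-back (VFP) instruction
  unsigned FallBackWidth; // lanes handled by one fall-back instruction
};

// Cost of an FP ALU operation on Width lanes.
unsigned fpAluCost(const FPVectorInfo &I, unsigned Width, bool FastMath) {
  assert(I.FallBackWidth >= 1 && Width % I.FallBackWidth == 0 &&
         "width must be a multiple of the fall-back width");
  if (FastMath || I.IEEECompliant)
    return I.Cost; // vector form may be used as-is
  // Otherwise charge for the unrolled fall-back sequence.
  return I.FallBackCost * (Width / I.FallBackWidth);
}
```

With a non-compliant 4-lane instruction of cost 1 and a scalar
fall-back of cost 1, this charges 4 without fast-math and 1 with it.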

In the back-end, when emitting vector instructions, add the same check
and emit (unroll) the NEON instructions into similar VFP ones, by
checking its fall-back instruction.
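
The back-end side of this could be modelled as below. This is a toy
sketch only: Inst, FallBack and legalize are made up for the example,
instruction names are plain strings, and it ignores how lanes would
actually be extracted from and inserted into registers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy instruction: a name and the number of lanes it operates on.
struct Inst {
  std::string Name;
  unsigned Width;
};

// Hypothetical fall-back record attached to a vector instruction.
struct FallBack {
  std::string Name;
  unsigned Width;
};

// Returns the instruction unchanged when it may be emitted, otherwise
// the unrolled sequence of fall-back instructions covering all lanes.
std::vector<Inst> legalize(const Inst &I, bool IEEECompliant, bool FastMath,
                           const FallBack &FB) {
  if (FastMath || IEEECompliant)
    return {I};
  assert(FB.Width >= 1 && I.Width % FB.Width == 0 &&
         "fall-back width must divide the vector width");
  return std::vector<Inst>(I.Width / FB.Width, Inst{FB.Name, FB.Width});
}
```

For a 4-lane non-compliant add with a scalar fall-back, this yields
four copies of the fall-back without fast-math, and the original
instruction with it.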

This approach has the benefit of validating IEEE compliance at the
instruction level, thus working for any other "vectorizer" out there,
including out-of-tree ones (though this benefit is very limited).

But it can also change code that it shouldn't, like inline asm or
intrinsics. I have no solution to this particular problem.

Any thoughts?

cheers,
--renato