[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

Mon Feb 8 08:33:11 PST 2016

Hi Renato,

I think it’s important to distinguish between the loop vectorizer and already-existing vector IR + the SLP vectorizer here.

The loop vectorizer does indeed require -ffast-math, but the IEEE-nonconformant transforms it does are far greater than using an ISA which may FTZ. It needs -ffast-math because any FP reductions necessarily have their execution order shuffled, due to executing some of them in parallel and reducing to scalar at the end. Therefore the LV doesn’t need to be changed - it will only work when “fast” is given and will only emit “fast” vector instructions.

The SLP vectoriser however should theoretically take non-fast scalars and produce non-fast vectors. Similarly people will hand-write vector IR, or generate it from other frontends.

Because of this, I think it’s important that we shouldn’t change the semantics of the IR currently. Making vector IR targeting ARM produce scalar instructions unless a modifier is given will undoubtedly cause problems down the line with frontends being out of sync or not being updated. Even worse, the symptom of this would just be “LLVM produces poor code for ARM” / “LLVM’s vector codegen is terrible for ARM” - performance errata and not conformance. That’s why I think changing to a full-strict-by-default approach would be bad for the project. It would also violate the principle of least surprise - I wrote vector instructions and picked a vector ISA… but they’re being scalarized?

My experience is that the number of people who care about pull IEEE compatibility on ARMv7 hardware is limited, and the set of people who care about exact ULP constraints even more limited. I think we absolutely should make a solution that solves PR16274, but I think it would have to be opt-in, not opt-out.

James
> On 8 Feb 2016, at 15:41, Renato Golin <renato.golin at linaro.org> wrote:
> 
> Folks,
> 
> I'm now looking at https://llvm.org/bugs/show_bug.cgi?id=16274, which
> seems to have some support in the vectorizer, but not as we need for
> this particular case. I may have missed something obvious, please let
> me know if there is a better way.
> 
> As you already know, ARM has two FP instruction sets: VFP and NEON.
> VFP applies to single FP registers while NEON is a full SIMD. The
> problem is that NEON is not IEEE compliant on FP operations, while VFP
> is.
> 
> Even if the target has NEON and the user has asked for it to be used,
> without -ffast-math and related arguments, we simply can't produce
> NEON instructions for FP operations. Different operations may have
> different non-compliance (inf, denormals, etc) and I haven't yet
> investigated the full support, but it's safe to start from blocking
> *all* FP operations on NEON when *any* FP restrictions are in place.
> We can expand for better support later, when the infrastructure is in
> place.
> 
> As far as I could see, ffast-math is included in the vectorizer, but
> as an all-or-nothing, which is not what we want to do. So, I thought
> about two ways we could go about doing this:
> 
> 
> 1.  The pragmatic way
> 
> Add a cost "TCC_Impossible = AllOnes" to TCC and on ARM's cost model,
> check if fast-math is checked on FP ALU operations and return that if
> false. So, VFP costs would be less than NEON costs divided by their
> widths.
> 
> This would make any vectorization beyond VFP instructions impossible
> is fast-math is not chosen, while still using VFP instructions in the
> loop, making it slightly faster.
> 
> I'm sceptical to introducing the TCC_Impossible cost, as it seems a
> dirty trick. I'm open to other better solutions.
> 
> 
> 2.  The thorough way
> 
> Add a flag on TableGen on vector instructions meaning IEEE compliance
> for the different levels of support. Add a "fall-back" VFP instruction
> to each of them (either in TableGen or TTI).
> 
> In the vectorizer, on FP ALU cost, add a check on fast-math && IEEE
> conformance. If failed, check the fall-back instruction's width and
> add the cost as that * Width/FallBackWidth.
> 
> In the back-end, when emitting vector instructions, add the same check
> and emit (unroll) the NEON instructions into similar VFP ones, by
> checking it's fall-back instruction.
> 
> This approach has the benefit of validating IEEE compliance at the
> instruction level, thus working for any other "vectorizer" out there,
> including out-of-tree ones (though this benefit is very limited).
> 
> But it also can change code that it shouldn't, like inline asm or
> intrinsics. I have no solution to this particular problem.
> 
> Any thoughts?
> 
> cheers,
> --renato
>