[llvm-dev] Vectorization with fast-math on irregular ISA sub-sets

Wed Feb 10 06:30:50 PST 2016

On 9 February 2016 at 20:29, Hal Finkel <hfinkel at anl.gov> wrote:
>> If the scalarisation is in IR, then any NEON intrinsic in C code will
>> get wrongly scalarised. Builtins can be lowered in either IR
>> operations or builtins, and the back-end has no way of knowing the
>> origin.
>>
>> If the scalarization is lower down, then we risk also changing inline
>> ASM snippets, which is even worse.
>
> Yes, but we don't do that, so that's not a practical concern.

The IR scalarisation is, though.

> To be fair, our IR language reference does not actually say that our floating-point arithmetic is IEEE compliant, but it is implied, and frontends depend on this fact. We really should not change the IR floating-point semantics contract over this. It might require some user education, but that's much better than producing subtly-wrong results.

But we lower a NEON intrinsic into plain IR instructions.

If I got it right, the current "fast" attribute means "may use
non-IEEE-compliant operations", emphasis on the *may*.

As a user, I'd be really angry if I used "float32x4_t vaddq_f32
(float32x4_t, float32x4_t)" and the compiler emitted four scalar
VADD.f32 instructions on S registers.

Right now, Clang lowers:
  vaddq_f32 (a, b);

to:
  %add.i = fadd <4 x float> %a, %b

which lowers (correctly) to:
  vadd.f32 q0, q0, q1
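
For reference, a minimal sketch of the kind of user code in question.
The helper name add4 is hypothetical, and the scalar fallback is only
there so the snippet also builds on non-NEON hosts. On a NEON target
the user has explicitly asked for a single four-lane add:

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: adds four floats lane by lane. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__ARM_NEON)
    /* Explicit intrinsic: the user expects one vadd.f32 on Q registers. */
    float32x4_t va = vld1q_f32(a);
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));
#else
    /* Fallback for hosts without NEON, for illustration only. */
    for (size_t i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Once the intrinsic has become a plain fadd on <4 x float>, nothing in
the IR distinguishes it from an add the vectoriser produced.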

If, OTOH, "fast" means "*must* select the fastest", then we may get
away with using it.

So, your proposal seems to be that, while lowering NEON intrinsics,
Clang *always* emits the "fast" attribute for all FP operations, and
that the scalarisation phase would split *all* non-fast FP vector
operations if the target has non-IEEE-754 compliant SIMD.

James' proposal is to not vectorise loops if an IEEE-754 compliant SIMD
unit is not available, and to only generate VFP instructions in the SLP
vectoriser. If we're not generating the large vector operations in the
first place, why would we need to scalarise them?
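
To make the "non-IEEE-754 compliant SIMD" concern concrete, here is a
rough sketch. It assumes the non-compliance in question is
flush-to-zero handling of subnormals (as ARMv7 NEON does), and it
models that in portable C rather than using real hardware. The names
flush, add_ieee and add_ftz are made up for illustration:

```c
#include <float.h>

/* Hypothetical model of a flush-to-zero lane: any subnormal value is
 * replaced by a zero of the same sign. NaNs and normals pass through. */
static float flush(float x)
{
    if (x != 0.0f && x > -FLT_MIN && x < FLT_MIN)
        return x < 0.0f ? -0.0f : 0.0f;
    return x;
}

/* What an IEEE-754 compliant scalar unit computes. */
float add_ieee(float a, float b)
{
    return a + b;
}

/* What a flush-to-zero vector lane computes under the model above:
 * inputs and the result are flushed. */
float add_ftz(float a, float b)
{
    return flush(flush(a) + flush(b));
}
```

With a = 1.5f * FLT_MIN and b = -FLT_MIN, the exact sum is
0.5f * FLT_MIN, which is subnormal: add_ieee keeps it, add_ftz returns
zero. That is the class of difference a vectorised loop could expose
when the scalar code was IEEE compliant.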

If we do vectorise to SIMD and then later scalarise, wouldn't that
change the cost model? Wouldn't it be harder to predict performance
gains, given that our cost model is only approximate and very
empirical?

Other front-ends should produce "valid" (target-specific) IR in the
first place, no? Hand-generated broken IR is not something we wish to
support either, I believe.

> We have a pass-feedback mechanism, I think it would be very useful if compiling with -Rpass-missed=loop-vectorize and/or -Rpass-analysis=loop-vectorize helpfully informed users that compiling with -ffast-math and/or -ffinite-math-only and -fno-signed-zeros would allow the loop to be vectorized for the targeted hardware.

That works for optimisations, not for intrinsics. Since we use the
same intermediate representation for both, we can't assume anything.

cheers,
--renato