[LLVMdev] NEON vector instructions and the fast math IR flags

Fri Jun 7 00:48:51 PDT 2013

On 06/06/2013 11:58 PM, Renato Golin wrote:
> On 7 June 2013 07:05, Owen Anderson <resistor at mac.com> wrote:

Hi Owen, hi Renato,

thanks for your replies.

>> Darwin uses NEON for floating point, but does *not* (and should not)
>> globally enable fast math flags.  Use of NEON for FP needs to remain
>> achievable without globally setting the fast math flags.  Fast math may
>> reasonably imply NEON, but the opposite direction is not accurate.

Good point. Fast math is probably too strict a requirement. I need to 
look into the ways in which NEON does not comply with IEEE 754. So far, 
the only difference I see is that it may flush denormals to zero.
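
To make the denormal point concrete, here is a small C sketch. It only
models flush-to-zero in software; it is not NEON code. The helper names
(flush_to_zero, mul_ieee, mul_ftz) are made up for this example, and it
assumes the host keeps subnormals in its default floating point mode.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: software model of flush-to-zero. Any subnormal
   value is replaced by a zero of the same sign. */
static float flush_to_zero(float x)
{
    if (fpclassify(x) == FP_SUBNORMAL)
        return (x < 0.0f) ? -0.0f : 0.0f;
    return x;
}

/* Product as computed by the host, assumed here to follow IEEE 754 and
   to keep subnormal results. */
static float mul_ieee(float a, float b)
{
    volatile float r = a * b; /* volatile discourages constant folding */
    return r;
}

/* Modelled flush-to-zero product: inputs and result are flushed. */
static float mul_ftz(float a, float b)
{
    return flush_to_zero(mul_ieee(flush_to_zero(a), flush_to_zero(b)));
}

/* FLT_MIN is the smallest normal float, so half of it is subnormal. */
static void check_denormal_example(void)
{
    assert(fpclassify(mul_ieee(FLT_MIN, 0.5f)) == FP_SUBNORMAL);
    assert(mul_ftz(FLT_MIN, 0.5f) == 0.0f);
}
```

Under these assumptions, the IEEE product of FLT_MIN and 0.5f is a small
nonzero value, while the modelled flush-to-zero product is exactly zero.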

>> That said, I don't think anyone would object to making VFP codegen
>> available under non-Darwin triples.  It's just a matter of making it happen.

I see.

> Tobi,
>
> The march=arm option would default to ARMv4, while mattr=+neon would force
> NEON, but I'm not sure it would default to A8, which would be a weird
> combination of ARM7TDMI+NEON.
>
> There are two things to know at this point:
>
> 1. When the execution gets to resetSubtargetFeatures, what CPU has it
> detected for your arguments? You may also have to look at ARM.td to see if
> the detected CPU has the feature "FeatureNEONForFP" in its
> description.
>
> 2. If the CPU is correct (Cortex-A*), and it's neither A5 nor A8, do we
> still want to generate single-precision float on NEON when non-Darwin and
> safe math? I don't think so. Possibly, that condition should be extended to
> ignore the CPU you're using and *only* emit NEON SP-FP when either Darwin
> or UnsafeMath are on.
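
The condition proposed in point 2 could be sketched roughly as the
predicate below. This is only an illustration in C, not LLVM code: the
struct, its fields and the function name are hypothetical and do not
correspond to actual ARMSubtarget members.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs to the policy; names invented for this sketch. */
struct fp_policy_inputs {
    bool has_neon;       /* target has NEON at all */
    bool is_darwin;      /* Darwin triple */
    bool unsafe_fp_math; /* unsafe/fast math enabled */
};

/* Emit NEON for single-precision FP only when NEON is available and
   either Darwin or unsafe math is on, independent of the CPU. */
static bool use_neon_for_sp_fp(struct fp_policy_inputs in)
{
    return in.has_neon && (in.is_darwin || in.unsafe_fp_math);
}

/* Small self-check of the two cases discussed in this thread. */
static void check_policy_example(void)
{
    struct fp_policy_inputs linux_safe = { true, false, false };
    struct fp_policy_inputs darwin_safe = { true, true, false };
    assert(!use_neon_for_sp_fp(linux_safe));
    assert(use_neon_for_sp_fp(darwin_safe));
}
```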

Renato:

Which subtarget feature to set when is a policy decision, on which I 
honestly don't have an opinion for clang. The best option is probably to 
mirror the gcc behavior on Linux targets. My current goal is to 
understand the implications of certain features and to make sure a tool 
using the LLVM back-ends can actually implement any policy it likes.

I just looked again at the +neonfp flag. Compiling with and without the 
+neonfp flag seems to affect only scalar types in the attached test 
case. If, e.g., the LLVM vectorizer introduces vector instructions at the 
LLVM-IR level, floating point vectors still yield NEON assembly even when 
compiled with "-mattr=+neon,-neonfp". Is this expected?

Cheers,
Tobias

-------------- next part --------------
; RUN: llc -march=arm -mattr=+vfp3,+neon < %s | FileCheck %s

; fooP() performs a vector floating point multiplication with full precision
; requirements. Even if we allow NEON with -mattr=+neon, NEON should not be used
; to implement this function as it does not comply with the full precision
; requirements (NEON e.g. flushes denormals to zero, which reduces precision).
define <4 x float> @fooP(<4 x float> %A, <4 x float> %B)
{
	%C = fmul <4 x float> %A, %B
; CHECK: fooP
; CHECK: vmul.f32	s
; CHECK: vmul.f32	s
; CHECK: vmul.f32	s
; CHECK: vmul.f32	s
	ret <4 x float> %C
}

; fooR() performs a vector floating point multiplication with relaxed precision
; requirements. In this case the precision loss introduced by NEON is acceptable
; and we should generate NEON instructions.
define <4 x float> @fooR(<4 x float> %A, <4 x float> %B)
{
	%C = fmul fast <4 x float> %A, %B
; CHECK: fooR
; CHECK: vmul.f32	q
	ret <4 x float> %C
}

; bar() performs a vector integer multiplication. On an ARM NEON device, this
; code should always be executed as vector code.
define <4 x i32> @bar(<4 x i32> %A, <4 x i32> %B)
{
	%C = mul <4 x i32> %A, %B
; CHECK: bar
; CHECK: vmul.i32	q
	ret <4 x i32> %C
}

define float @fooS(float %A, float %B)
{
        %C = fmul fast float %A, %B
; CHECK: fooS
; CHECK: vmul.f32       q
        ret float %C
}