[PATCH]{AArch64] Implemented half-precision vget/vset_lane_f16 intrinsics
Ana Pazos
apazos at codeaurora.org
Wed Dec 4 23:26:08 PST 2013
Hi Jiangning,
In the example you mentioned the dup instruction is the best pattern because
the test case simply returns the result of vget intrinsic as a float.
But we cannot guarantee that vset/vget will use the preferred mapping from
the ARM manual nor that the preferred mapping is the best code always,
because it depends on how the half-precision FP value is used before or
after the vget/vset intrinsics calls (which may cause other patterns like
UMOV, INS to be applied).
What I decided to do:
I added a couple of additional patterns that include fp16_to_f32 operation
in the vector extract/insert and vector copy expressions. This way I can
optimize the test case you mentioned and a couple more cases I could think
of.
See the updated patches attached.
Thanks,
Ana.
From: Jiangning Liu [mailto:liujiangning1 at gmail.com]
Sent: Wednesday, December 04, 2013 1:21 AM
To: Ana Pazos
Cc: llvm-commits; cfe-commits at cs.uiuc.edu; Tim Northover
Subject: Re: [PATCH]{AArch64] Implemented half-precision vget/vset_lane_f16
intrinsics
Hi Ana,
Can we generate more optimized code by using instruction DUP Hd,Vn.H[lane]
directly.
With your patch we are generating two instructions like,
umov w0, v0.h[3]
fmov s0, w0
Actually they can be combined to a single "dup h0, v0.h[3]". This is what we
want for this intrinsic function.
What you need to do is by adding one more pattern to match this case.
Thanks,
-Jiangning
2013/12/4 Ana Pazos <apazos at codeaurora.org>
Hi Tim and reviewers,
I followed Tim's suggestion to use macros to handle the float16_t storage
type.
Notes:
1) Since the vget/vset_lane_f16 intrinsics read and write 16 bit data
(no FP arithmetic performed), I simply reinterpreted float16_t and the
vector of float16_t as i16 data.
See the operators defined in NeonEmitter.
2) With this change vget/vset_lane_f16 use to vget/vset_lane_i16
implementation.
3) I added f16_to_f32 pattern because in the vset_lane case the i16
data is moved from GPR to a FP register, and hence the need for this
pattern.
4) I added test cases that define float16_t variable in the function
body, but do not return such value type as it is not allowed. Let me know if
these tests are satisfactory.
5) I did not try to enforce the recommended intrinsic->instruction map
from the ARM document.
To force those instructions I would have to use builtins and v1i16 type
casts so I can create the pattern.
Even doing that, in some cases the UMOV, INS patterns defined earlier can
prevail over.
Thanks,
Ana.
_______________________________________________
cfe-commits mailing list
cfe-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
--
Thanks,
-Jiangning
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20131204/c81d24d1/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-clang-Implemented-vget-vset_lane_f16-intrinsics.patch
Type: application/octet-stream
Size: 5720 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20131204/c81d24d1/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-llvm-Implemented-vget-vset_lane_f16-intrinsics.patch
Type: application/octet-stream
Size: 6456 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20131204/c81d24d1/attachment-0001.obj>
More information about the llvm-commits
mailing list