[PATCH] D99324: [AArch64][SVE] Codegen dup_lane for dup(vector_extract)
JunMa via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Mar 29 05:33:27 PDT 2021
junparser added inline comments.
================
Comment at: llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td:624
+ def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
+ (DUP_ZZI_H ZPR:$vec, sve_elm_idx_extdup_h:$index)>;
+ def : Pat<(nxv2f16 (AArch64dup (f16 (vector_extract (nxv2f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
----------------
sdesmalen wrote:
> junparser wrote:
> > junparser wrote:
> > > paulwalker-arm wrote:
> > > > sdesmalen wrote:
> > > > > This isn't entirely correct, because a nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so a nxv4f16 has vscale x 4 elements with gaps in between, e.g. `<elt0, _, elt1, _, .. >`. That means the element index must be multiplied by 2 in this case (and in the nxv2f32 case), and by 4 for the nxv2f16 case.
> > > > While logically true, I think in practice you'd rewrite the pattern so the instruction's element type matched that of the "packed" vector associated with the dag result's element count (i.e. D for nxv2, S for nxv4).
> > > >
> > > > So in this instance something like:
> > > > ```
> > > > def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
> > > > (DUP_ZZI_S ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
> > > > ```
> > > >
> > > > So in essence all `nxv4` results are considered to be duplicating floats, and all `nxv2` results duplicating doubles.
> > > >
> > > > Is it possible to move the patterns into the multiclass for sve_int_perm_dup_i?
> > > > > This isn't entirely correct, because a nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so a nxv4f16 has vscale x 4 elements with gaps in between, e.g. `<elt0, _, elt1, _, .. >`. That means the element index must be multiplied by 2 in this case (and in the nxv2f32 case), and by 4 for the nxv2f16 case.
> > >
> > > This is quite different from what I thought; for nxv4f16, I thought the upper 64 bits should be empty. Where can I find these rules? I haven't seen them documented anywhere.
> > OK, I'll move them to sve_int_perm_dup_i
> We haven't explicitly described these rules anywhere, I believe. This format is required to generate code for scalable vectors because we have no means to generate a predicate for nxv4f16 like `<11110000 | ... | 11110000 >`, where the bit pattern repeats for each 128-bit chunk. We can, however, always use the unpacked format, because an operation on `nxv4f16` can use the predicate that would be used for a `nxv4f32`, which disables every other lane.
================
Comment at: llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td:624
+ def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
+ (DUP_ZZI_H ZPR:$vec, sve_elm_idx_extdup_h:$index)>;
+ def : Pat<(nxv2f16 (AArch64dup (f16 (vector_extract (nxv2f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
----------------
junparser wrote:
> sdesmalen wrote:
> > junparser wrote:
> > > junparser wrote:
> > > > paulwalker-arm wrote:
> > > > > sdesmalen wrote:
> > > > > > This isn't entirely correct, because a nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so a nxv4f16 has vscale x 4 elements with gaps in between, e.g. `<elt0, _, elt1, _, .. >`. That means the element index must be multiplied by 2 in this case (and in the nxv2f32 case), and by 4 for the nxv2f16 case.
> > > > > While logically true, I think in practice you'd rewrite the pattern so the instruction's element type matched that of the "packed" vector associated with the dag result's element count (i.e. D for nxv2, S for nxv4).
> > > > >
> > > > > So in this instance something like:
> > > > > ```
> > > > > def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
> > > > > (DUP_ZZI_S ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
> > > > > ```
> > > > >
> > > > > So in essence all `nxv4` results are considered to be duplicating floats, and all `nxv2` results duplicating doubles.
> > > > >
> > > > > Is it possible to move the patterns into the multiclass for sve_int_perm_dup_i?
> > > > > > This isn't entirely correct, because a nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so a nxv4f16 has vscale x 4 elements with gaps in between, e.g. `<elt0, _, elt1, _, .. >`. That means the element index must be multiplied by 2 in this case (and in the nxv2f32 case), and by 4 for the nxv2f16 case.
> > > >
> > > > This is quite different from what I thought; for nxv4f16, I thought the upper 64 bits should be empty. Where can I find these rules? I haven't seen them documented anywhere.
> > > OK, I'll move them to sve_int_perm_dup_i
> > We haven't explicitly described these rules anywhere, I believe. This format is required to generate code for scalable vectors because we have no means to generate a predicate for nxv4f16 like `<11110000 | ... | 11110000 >`, where the bit pattern repeats for each 128-bit chunk. We can, however, always use the unpacked format, because an operation on `nxv4f16` can use the predicate that would be used for a `nxv4f32`, which disables every other lane.
>
>
Thanks for explaining this!
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D99324/new/
https://reviews.llvm.org/D99324
More information about the llvm-commits
mailing list