[PATCH] D99324: [AArch64][SVE] Codegen dup_lane for dup(vector_extract)

Mon Mar 29 05:27:34 PDT 2021

sdesmalen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td:624
+  def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
+            (DUP_ZZI_H ZPR:$vec, sve_elm_idx_extdup_h:$index)>;
+  def : Pat<(nxv2f16 (AArch64dup (f16 (vector_extract (nxv2f16 ZPR:$vec), sve_elm_idx_extdup_h:$index)))),
----------------
junparser wrote:
> junparser wrote:
> > paulwalker-arm wrote:
> > > sdesmalen wrote:
> > > > This isn't entirely correct, because an nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so an nxv4f16 has vscale x 4 elements with a gap after each one, e.g. `<elt0, _, elt1, _, .. >`. That means the element index must be multiplied by 2 in this case (and the one for nxv2f32), and by 4 for the nxv2f16 case.
> > > While logically true, I think in practice you'd rewrite the pattern so the instruction's element type matches that of the "packed" vector associated with the dag result's element count (i.e. D for nxv2, S for nxv4).
> > > 
> > > So in this instance something like:
> > > ```
> > >   def : Pat<(nxv4f16 (AArch64dup (f16 (vector_extract (nxv4f16 ZPR:$vec), sve_elm_idx_extdup_s:$index)))),
> > >             (DUP_ZZI_S ZPR:$vec, sve_elm_idx_extdup_s:$index)>;
> > > ```
> > > 
> > > So in essence all `nxv4` results are considered to be duplicating floats, and all `nxv2` results to be duplicating doubles.
> > > 
> > > Is it possible to move the patterns into the multiclass for sve_int_perm_dup_i?
> > > This isn't entirely correct, because an nxv4f16 has gaps between the elements. A full nxv8f16 has vscale x 8 elements, so an nxv4f16 has vscale x 4 elements with a gap after each one, e.g. `<elt0, _, elt1, _, .. >`. That means the element index must be multiplied by 2 in this case (and the one for nxv2f32), and by 4 for the nxv2f16 case.
> > 
> > This is quite different from what I thought. For nxv4f16, I thought the upper 64 bits would be empty. Where can I find these rules? I haven't seen them described anywhere.
> OK, I'll move them to sve_int_perm_dup_i
We haven't explicitly described these rules anywhere, I believe. This format is required to generate code for scalable vectors because we have no means to generate a predicate for nxv4f16 that's like `<11110000 | ... | 11110000>`, where the bit pattern repeats for each 128-bit chunk. We can, however, always use the unpacked format, because an operation on `nxv4f16` can use the predicate that would be used for an `nxv4f32`, which disables every other 16-bit lane.
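
To make the layout concrete, here is a rough Python model of the unpacked format. It is only a sketch and not LLVM code: the helper names (`unpacked`, `dup_lane`, `ptrue`, `active_h_lanes`) are made up for illustration, vscale is fixed at 2 just to have concrete sizes, and a register is modelled as a list of 16-bit lanes with `None` marking a gap.

```python
VSCALE = 2  # assumed for illustration; real hardware fixes vscale at run time


def unpacked(elts, stride):
    """Lay out f16 elements one per container of `stride` 16-bit lanes.

    stride=2 models nxv4f16 (32-bit containers), stride=4 models nxv2f16.
    """
    lanes = []
    for e in elts:
        lanes.extend([e] + [None] * (stride - 1))
    return lanes


def dup_lane(lanes, index, width):
    """Model an indexed DUP whose elements are `width` 16-bit lanes wide.

    width=1 corresponds to the .H form, width=2 to .S, width=4 to .D.
    """
    chunk = lanes[index * width:(index + 1) * width]
    return chunk * (len(lanes) // width)


def ptrue(elt_bytes, reg_bytes):
    """All-true predicate for a given element size: one bit per byte,
    with only the lowest bit of each element set."""
    return [1 if b % elt_bytes == 0 else 0 for b in range(reg_bytes)]


def active_h_lanes(pred):
    """A .H operation looks at the lowest predicate bit of each 2-byte lane."""
    return [pred[2 * lane] for lane in range(len(pred) // 2)]


nxv4f16 = unpacked([f"e{i}" for i in range(4 * VSCALE)], stride=2)

# Duplicating element 3 of the nxv4f16:
as_s = dup_lane(nxv4f16, 3, width=2)      # .S form, index used as-is
as_h = dup_lane(nxv4f16, 2 * 3, width=1)  # .H form, index scaled by 2
wrong = dup_lane(nxv4f16, 3, width=1)     # .H form, unscaled index reads a gap

# The predicate for 32-bit elements enables every other 16-bit lane.
s_pred_on_h = active_h_lanes(ptrue(4, 16 * VSCALE))
```

In this model both `as_s` and `as_h` hold `e3` in every occupied (even) lane, while `wrong` broadcasts a gap. That is why the pattern either has to scale the index or, as suggested above, use the S (or D) form of the instruction so that the index can be used unchanged.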

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99324/new/

https://reviews.llvm.org/D99324