[PATCH] D110524: [AArch64ISelLowering] Avoid duplane in some cases when sve enabled

Tue Oct 12 07:07:43 PDT 2021

sdesmalen added a comment.

The work to handle DUP from wider vectors than 128bits would be a possible improvement for a future patch.
To move this patch forward, please address the two other comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:9469
     // Example: dup v2f32 (extract v4f32 X, 2), 1 --> dup v4f32 X, 3
-    Lane += V.getConstantOperandVal(1);
-    V = V.getOperand(0);
+    auto ExtractedValType = V.getOperand(0).getValueType();
+    if (ExtractedValType.isFixedLengthVector() &&
----------------
sdesmalen wrote:
> nit: The name `ExtractedValType` isn't exactly accurate, because this doesn't represent the value type of the extracted value, but rather the (input) value that a subvector is extracted from.
Can you please address this comment? Maybe just `s/ExtractedValType/VecVT/` ?

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:9471
+    if (ExtractedValType.isFixedLengthVector() &&
+        ExtractedValType.getSizeInBits().getValue() <= 128) {
+      Lane += V.getConstantOperandVal(1);
----------------
guopeilin wrote:
> sdesmalen wrote:
> > guopeilin wrote:
> > > sdesmalen wrote:
> > > > david-arm wrote:
> > > > > nit: Before committing could you change this to:
> > > > > 
> > > > >   ExtractedValType.getFixedSizeInBits()
> > > > > 
> > > > > It's a bit shorter and it also asserts that the TypeSize is fixed.
> > > > It should be fine to extract:
> > > > 
> > > >   v2f32 (extract v16f32 X, 2), 1 -> dup X, 3
> > > > 
> > > > The current code seems to check that sizeof(v16f32) <= 128, whereas it should use the index `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` to determine whether the lane can be indexed..
> > > > 
> > > > Additionally, the 128 seems conservative as well, since SVE indexed DUP allows an index < 512bits.
> > > > 
> > > Thanks for reviewing.
> > > And there is something that confused me, I just wonder what exactly is the value of `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` represents for? For your example, 
> > > here I guess the value if `f32 x 3` (Correct me if I am wrong), which represents neither the `v2f32`(return value) nor the `v16f32`(input value). And what should the `f32 x 3` compare with? 
> > AIUI, for:
> > 
> >   v2f32 (extract v16f32 X, 2), 1 -> dup X, 3
> > 
> > The expression:
> > 
> >   ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))
> > 
> > Represents:
> > 
> >   (v2f32.getVectorElementType().getSizeInBits() * (1 + 2)
> >   <=>
> >   32 * 3
> > 
> > Which must not exceed the native (Neon/SVE) vector length or the range of the immediate, otherwise the extract_subvector needs doing first.
> Thanks for the answer, but honestly speaking, I still not get any idea about the meaning of the `32 * 3`, why should we care about the `3` rather than the type of the operand of the DUP instruction.
> 
> I still believe that for this issue, the root cause is that the type of the operands of DUP instruction, which has been specified in the related `.td` file or `.inc` file, should not beyond 128 bits. In other words, we should check the legality of the new instruction, which is `dup X, 3` in your example. And since (v16f32)X is not legal for DUP, so we just ignore this optimization. 
> 
> Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that `v4f32 %res = dup v8f32 %X, 3` is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register.
I think there are two things here:
1. The DUP instruction must be able to handle the index, so the index must fit the immediate.
2. There must be patterns to handle a DUP for some index and I guess these patterns are only implemented for Neon at the moment.

If the type being extracted from is always a Neon register (<= 128bits), then the index always happens to fit the immediate, so by saying "input vector must be less or equal to 128bits", you implicitly satisfy criteria 1 (and I was concerned about it being implicit).

If you'd implement DUP patterns to dup lanes from a vector > 128bits (i.e. from a Z register) then you could allow `v2f32 (extract v16f32 X, 2), 1 -> dup X, 3`, but it would mean adding a condition to check that the new index fits the immediate.

> Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that v4f32 %res = dup v8f32 %X, 3 is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register.
You wouldn't need an extra P register to indicate which lane is valid, the other lanes can just be considered 'undef'. You also don't need an extra mov instruction to move an SVE register to a Neon register, because the two registers overlap (i.e. v0 is a subregister of z0)

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:9471
+    if (ExtractedValType.isFixedLengthVector() &&
+        ExtractedValType.getSizeInBits().getValue() <= 128) {
+      Lane += V.getConstantOperandVal(1);
----------------
sdesmalen wrote:
> guopeilin wrote:
> > sdesmalen wrote:
> > > guopeilin wrote:
> > > > sdesmalen wrote:
> > > > > david-arm wrote:
> > > > > > nit: Before committing could you change this to:
> > > > > > 
> > > > > >   ExtractedValType.getFixedSizeInBits()
> > > > > > 
> > > > > > It's a bit shorter and it also asserts that the TypeSize is fixed.
> > > > > It should be fine to extract:
> > > > > 
> > > > >   v2f32 (extract v16f32 X, 2), 1 -> dup X, 3
> > > > > 
> > > > > The current code seems to check that sizeof(v16f32) <= 128, whereas it should use the index `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` to determine whether the lane can be indexed..
> > > > > 
> > > > > Additionally, the 128 seems conservative as well, since SVE indexed DUP allows an index < 512bits.
> > > > > 
> > > > Thanks for reviewing.
> > > > And there is something that confused me, I just wonder what exactly is the value of `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` represents for? For your example, 
> > > > here I guess the value if `f32 x 3` (Correct me if I am wrong), which represents neither the `v2f32`(return value) nor the `v16f32`(input value). And what should the `f32 x 3` compare with? 
> > > AIUI, for:
> > > 
> > >   v2f32 (extract v16f32 X, 2), 1 -> dup X, 3
> > > 
> > > The expression:
> > > 
> > >   ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))
> > > 
> > > Represents:
> > > 
> > >   (v2f32.getVectorElementType().getSizeInBits() * (1 + 2)
> > >   <=>
> > >   32 * 3
> > > 
> > > Which must not exceed the native (Neon/SVE) vector length or the range of the immediate, otherwise the extract_subvector needs doing first.
> > Thanks for the answer, but honestly speaking, I still not get any idea about the meaning of the `32 * 3`, why should we care about the `3` rather than the type of the operand of the DUP instruction.
> > 
> > I still believe that for this issue, the root cause is that the type of the operands of DUP instruction, which has been specified in the related `.td` file or `.inc` file, should not beyond 128 bits. In other words, we should check the legality of the new instruction, which is `dup X, 3` in your example. And since (v16f32)X is not legal for DUP, so we just ignore this optimization. 
> > 
> > Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that `v4f32 %res = dup v8f32 %X, 3` is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register.
> I think there are two things here:
> 1. The DUP instruction must be able to handle the index, so the index must fit the immediate.
> 2. There must be patterns to handle a DUP for some index and I guess these patterns are only implemented for Neon at the moment.
> 
> If the type being extracted from is always a Neon register (<= 128bits), then the index always happens to fit the immediate, so by saying "input vector must be less or equal to 128bits", you implicitly satisfy criteria 1 (and I was concerned about it being implicit).
> 
> If you'd implement DUP patterns to dup lanes from a vector > 128bits (i.e. from a Z register) then you could allow `v2f32 (extract v16f32 X, 2), 1 -> dup X, 3`, but it would mean adding a condition to check that the new index fits the immediate.
> 
> > Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that v4f32 %res = dup v8f32 %X, 3 is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register.
> You wouldn't need an extra P register to indicate which lane is valid, the other lanes can just be considered 'undef'. You also don't need an extra mov instruction to move an SVE register to a Neon register, because the two registers overlap (i.e. v0 is a subregister of z0)
Please change: s/getValue/getFixedValue/

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D110524/new/

https://reviews.llvm.org/D110524