[PATCH] D136722: [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x i8X>' to use tbl instructions
NILANJANA BASU via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Nov 29 11:13:07 PST 2022
nilanjana_basu marked 2 inline comments as done.
nilanjana_basu added inline comments.
================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:13703
auto *UIToFP = dyn_cast<UIToFPInst>(I);
- if (UIToFP &&
- (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&
- SrcTy->getElementType()->isIntegerTy(8) &&
+ if (UIToFP && SrcTy->getElementType()->isIntegerTy(8) &&
DstTy->getElementType()->isFloatTy()) {
----------------
fhahn wrote:
> nilanjana_basu wrote:
> > This conversion shows a performance regression in some cases where multiple similar zext instructions appear back to back. The code generated by the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I see an increase in instruction count after lowering to tbl when using a loop interleave count of 4, i.e. with 4 back-to-back zext instructions.
> >
> > Is it better to rule out this case in this 'if' block or should we not allow tbl-lowering when there are multiple zext instructions of the same type present back to back?
> > This conversion shows a performance regression in some cases where multiple similar zext instructions appear back to back. The code generated by the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I see an increase in instruction count after lowering to tbl when using a loop interleave count of 4, i.e. with 4 back-to-back zext instructions.
>
> is this covered by one of the unit tests? `zext_v16i8_to_v16i64_in_loop` looks fine to me, at least for little endian
>
I was mistaken about the instruction count earlier, but I have still added a unit test, zext_v16i8_to_v16i64_in_sequence_in_loop, since I see a performance regression in my local setup.
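For reference, the alternative I asked about above (skipping the tbl rewrite when the same zext is repeated back to back) would amount to something like the rough sketch below. The helper name and its placement are hypothetical and only illustrate the idea; this is not code in the patch:

  // Returns true when the block already contains several zext instructions
  // with the same source and destination vector types, as produced e.g. by an
  // interleave count of 4; a caller could use this to skip the tbl rewrite.
  static bool hasBackToBackZExtsOfSameType(const ZExtInst *ZExt) {
    unsigned Count = 0;
    for (const Instruction &I : *ZExt->getParent())
      if (auto *Other = dyn_cast<ZExtInst>(&I))
        if (Other->getSrcTy() == ZExt->getSrcTy() &&
            Other->getDestTy() == ZExt->getDestTy())
          ++Count;
    return Count > 1;
  }

(Assumes the declarations from llvm/IR/Instructions.h and llvm/IR/BasicBlock.h are available, as they are in AArch64ISelLowering.cpp.)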
================
Comment at: llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll:444
+; CHECK-NEXT: add x8, x8, #16
+; CHECK-NEXT: tbl v4.16b, { v4.16b }, v2.16b
+; CHECK-NEXT: tbl v5.16b, { v5.16b }, v2.16b
----------------
fhahn wrote:
> dmgreen wrote:
> > I think this is worse, I'm afraid. We only want to use tbl if it would replace two instructions (it performs two truncate/zext steps). Otherwise we are just adding instructions to the loop header (and using more registers) for no gain.
> >
> >
> yep it looks like we should have a check for that. @nilanjana_basu could you update the patch and make sure this is also tested in `zext-to-tbl.ll`
In the latest patch, I've blocked zext lowering to tbl for destination vectors with i16 element type, since those are the cases that do not benefit from it. For the other destination types, even a single zext/truncate improves the instruction count.
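Roughly, the profitability rule described here is: only rewrite extends from i8 source elements, and skip i16 destination elements, which need just a single ushll and gain nothing from a tbl expansion. A minimal sketch of such a check is below; the helper is hypothetical and only restates the rule, not the exact code in the patch:

  // Hypothetical helper restating the check described above.
  static bool isProfitableZExtForTbl(const ZExtInst *ZExt) {
    auto *SrcTy = dyn_cast<FixedVectorType>(ZExt->getSrcTy());
    auto *DstTy = dyn_cast<FixedVectorType>(ZExt->getDestTy());
    if (!SrcTy || !DstTy)
      return false;
    return SrcTy->getElementType()->isIntegerTy(8) &&
           !DstTy->getElementType()->isIntegerTy(16);
  }

(FixedVectorType is declared in llvm/IR/DerivedTypes.h.)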
================
Comment at: llvm/test/CodeGen/AArch64/zext-to-tbl.ll:1483
+define void @zext_v8i8_to_v8i16_in_loop(i8* %src, i16* %dst) {
+; CHECK-LABEL: _zext_v8i8_to_v8i16_in_loop:
+; CHECK: ; %bb.0: ; %entry
----------------
fhahn wrote:
> could you add those new tests separately?
I added two new tests for two back-to-back zext instructions: zext_v8i8_to_v8i64_with_add_in_sequence_in_loop and zext_v16i8_to_v16i64_in_sequence_in_loop. The pre-patch codegen has been updated in the parent revision [[ https://reviews.llvm.org/D137993 | D137993 ]].
There seems to be a slight increase in instruction count for zext_v8i8_to_v8i64_with_add_in_sequence_in_loop.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D136722/new/
https://reviews.llvm.org/D136722