[PATCH] D136722: [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x i8X>' to use tbl instructions
NILANJANA BASU via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Nov 29 11:13:07 PST 2022
nilanjana_basu marked 2 inline comments as done.
nilanjana_basu added inline comments.
================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:13703
auto *UIToFP = dyn_cast<UIToFPInst>(I);
- if (UIToFP &&
- (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&
- SrcTy->getElementType()->isIntegerTy(8) &&
+ if (UIToFP && SrcTy->getElementType()->isIntegerTy(8) &&
DstTy->getElementType()->isFloatTy()) {
----------------
fhahn wrote:
> nilanjana_basu wrote:
> > This conversion shows a performance regression in some cases where multiple similar zext instructions appear back to back. The code generated by the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I see an increase in instruction count after lowering to tbl when using a loop interleave count of 4, i.e. with 4 back-to-back zext instructions.
> >
> > Is it better to rule out this case in this 'if' block or should we not allow tbl-lowering when there are multiple zext instructions of the same type present back to back?
> > This conversion shows a performance regression in some cases where multiple similar zext instructions appear back to back. The code generated by the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I see an increase in instruction count after lowering to tbl when using a loop interleave count of 4, i.e. with 4 back-to-back zext instructions.
>
> is this covered by one of the unit tests? `zext_v16i8_to_v16i64_in_loop` looks fine to me, at least for little endian
>
I was mistaken about the instruction count earlier, but I have still added a unit test, zext_v16i8_to_v16i64_in_sequence_in_loop, since I see a performance regression in my local setup.
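For reference, the alternative I asked about above (skipping the tbl rewrite when the same zext is repeated back to back) would amount to something like the rough sketch below. The helper name and its placement are hypothetical and only illustrate the idea; this is not code in the patch:

  // Returns true when the block already contains several zext instructions
  // with the same source and destination vector types, as produced e.g. by an
  // interleave count of 4; a caller could use this to skip the tbl rewrite.
  static bool hasBackToBackZExtsOfSameType(const ZExtInst *ZExt) {
    unsigned Count = 0;
    for (const Instruction &I : *ZExt->getParent())
      if (auto *Other = dyn_cast<ZExtInst>(&I))
        if (Other->getSrcTy() == ZExt->getSrcTy() &&
            Other->getDestTy() == ZExt->getDestTy())
          ++Count;
    return Count > 1;
  }

(Assumes the declarations from llvm/IR/Instructions.h and llvm/IR/BasicBlock.h are available, as they are in AArch64ISelLowering.cpp.)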
================
Comment at: llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll:444
+; CHECK-NEXT: add x8, x8, #16
+; CHECK-NEXT: tbl v4.16b, { v4.16b }, v2.16b
+; CHECK-NEXT: tbl v5.16b, { v5.16b }, v2.16b
----------------
fhahn wrote:
> dmgreen wrote:
> > I think this is worse, I'm afraid. We only want to use tbl if it would replace two instructions (it performs two truncate/zext steps). Otherwise we are just adding instructions to the loop header (and using more registers) for no gain.
> >
> >
> yep it looks like we should have a check for that. @nilanjana_basu could you update the patch and make sure this is also tested in `zext-to-tbl.ll`
In the latest patch, I've blocked zext lowering to tbl for destination vectors with i16 element type, since those are the cases that do not benefit from it. For the other destination types, even a single zext/truncate improves the instruction count.
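Roughly, the profitability rule described here is: only rewrite extends from i8 source elements, and skip i16 destination elements, which need just a single ushll and gain nothing from a tbl expansion. A minimal sketch of such a check is below; the helper is hypothetical and only restates the rule, not the exact code in the patch:

  // Hypothetical helper restating the check described above.
  static bool isProfitableZExtForTbl(const ZExtInst *ZExt) {
    auto *SrcTy = dyn_cast<FixedVectorType>(ZExt->getSrcTy());
    auto *DstTy = dyn_cast<FixedVectorType>(ZExt->getDestTy());
    if (!SrcTy || !DstTy)
      return false;
    return SrcTy->getElementType()->isIntegerTy(8) &&
           !DstTy->getElementType()->isIntegerTy(16);
  }

(FixedVectorType is declared in llvm/IR/DerivedTypes.h.)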
================
Comment at: llvm/test/CodeGen/AArch64/zext-to-tbl.ll:1483
+define void @zext_v8i8_to_v8i16_in_loop(i8* %src, i16* %dst) {
+; CHECK-LABEL: _zext_v8i8_to_v8i16_in_loop:
+; CHECK: ; %bb.0: ; %entry
----------------
fhahn wrote:
> could you add those new tests separately?
I added two new tests for two back-to-back zext instructions: zext_v8i8_to_v8i64_with_add_in_sequence_in_loop and zext_v16i8_to_v16i64_in_sequence_in_loop. The pre-patch codegen has been updated in the parent revision [[ https://reviews.llvm.org/D137993 | D137993 ]].
There seems to be a slight increase in instruction count for zext_v8i8_to_v8i64_with_add_in_sequence_in_loop.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D136722/new/
https://reviews.llvm.org/D136722