[PATCH] D136722: [AArch64] Extending lowering of 'zext <Y x i8> %x to <Y x i8X>' to use tbl instructions

Tue Nov 22 03:11:28 PST 2022

fhahn added a comment.

In D136722#3917024 <https://reviews.llvm.org/D136722#3917024>, @dmgreen wrote:

> I know it's not your problem, but the code in optimizeExtendOrTruncateConversion doesn't feel like it is in the best place, to be honest. CGP has always described itself as a hack, but we shouldn't be hacking things that much. There will be some obvious cases where the extend/trunc can be optimized but the tbl blocks it.
> As far as I understand, the code is only in CGP because it is trying limit the transforms to loops. I'm wondering if it would be better to add some sort of flag into ISel so that combines could tell that the current block is a loop, and behave differently because of it.

Yep the only reason for doing it in CGP is to work around SelDAG's limitation.I am not sure about extending SelDAG for this, as we are planning to transition to GIsel at least on Darwin platforms very soon, which won't require doing this in CGP. I think @nilanjana_basu will also look into implementing this in GIsel

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:13703
   auto *UIToFP = dyn_cast<UIToFPInst>(I);
-  if (UIToFP &&
-      (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&
-      SrcTy->getElementType()->isIntegerTy(8) &&
+  if (UIToFP && SrcTy->getElementType()->isIntegerTy(8) &&
       DstTy->getElementType()->isFloatTy()) {
----------------
nilanjana_basu wrote:
> This conversion shows a regression in performance for some cases where there are multiple similar zext instructions present back to back. The generated code with the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I find an increase in the number of instructions after being lowered to tbl on using a loop interleave count of 4, i.e. with 4 back to back zext instructions.
> 
> Is it better to rule out this case in this 'if' block or should we not allow tbl-lowering when there are multiple zext instructions of the same type present back to back?
> This conversion shows a regression in performance for some cases where there are multiple similar zext instructions present back to back. The generated code with the previous implementation could be folded into a more optimized set of instructions, which is not possible with 'tbl' instructions. One example is 16xi8->16xi64, where I find an increase in the number of instructions after being lowered to tbl on using a loop interleave count of 4, i.e. with 4 back to back zext instructions.

is this covered by one of the unit tests? `zext_v16i8_to_v16i64_in_loop` looks fine to me, at least for little endian

================
Comment at: llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll:444
+; CHECK-NEXT: 	add	x8, x8, #16
+; CHECK-NEXT: 	tbl	v4.16b, { v4.16b }, v2.16b
+; CHECK-NEXT: 	tbl	v5.16b, { v5.16b }, v2.16b
----------------
dmgreen wrote:
> I think this is worse, I'm afraid. We only want to use tbl if it would replace two instructions (it performs two truncate/zext steps). Otherwise we are just adding instructions to the loop header (and using more registers) for no gain.
> 
> 
yep it looks like we should have a check for that. @nilanjana_basu could you update the patch and make sure this is also tested in `zext-to-tbl.ll`

================
Comment at: llvm/test/CodeGen/AArch64/zext-to-tbl.ll:1483
+define void @zext_v8i8_to_v8i16_in_loop(i8* %src, i16* %dst) {
+; CHECK-LABEL:  _zext_v8i8_to_v8i16_in_loop:
+; CHECK:  ; %bb.0:                                ; %entry
----------------
could you ad those new tests separately?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136722/new/

https://reviews.llvm.org/D136722