[PATCH] D54392: [DAGCombiner] look through bitcasts when trying to narrow vector binops

Tue Nov 20 08:38:38 PST 2018

spatel marked 2 inline comments as done.
spatel added a comment.

In https://reviews.llvm.org/D54392#1303764, @efriedma wrote:

> It's probably okay to canonicalize the way you are, but you're hitting a missing pattern for AArch64.  Something like the following appears to work:
>
>   def : Pat<(sub (extract_subvector (zext v8i8:$LHS), (i64 0)),
>                  (extract_subvector (zext v8i8:$RHS), (i64 0))),
>             (EXTRACT_SUBREG (USUBLv8i8_v8i16 v8i8:$LHS, v8i8:$RHS), dsub)>;
>
>
> Of course, it needs to be rewritten to match all the relevant types and operations.  x86 doesn't really have those sorts of operations, I guess?

Filed here:
https://bugs.llvm.org/show_bug.cgi?id=39722

Given that it's an existing bug, there's probably not much incentive to make this patch dependent on that bug getting fixed?
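
For reference, here is a minimal Python sketch of the lane-wise equivalence behind the suggested pattern: subtracting the zero-extended vectors at full width and then extracting the low half gives the same lanes as extracting first and subtracting at the narrow width. The helper names are hypothetical and lanes are modeled as plain integers; this only illustrates the arithmetic, not the DAGCombiner or TableGen code.

```python
MASK16 = 0xFFFF  # each result lane is modeled as an i16


def zext_v8i8_to_v8i16(v):
    """Zero-extend 8 unsigned byte lanes to 16-bit lanes (values are unchanged)."""
    assert len(v) == 8 and all(0 <= x <= 0xFF for x in v)
    return list(v)


def sub_lanes(a, b):
    """Lane-wise subtract with 16-bit wraparound."""
    return [(x - y) & MASK16 for x, y in zip(a, b)]


def extract_low_half(v):
    """Model extract_subvector at index 0: keep the low half of the lanes."""
    return v[: len(v) // 2]


def wide_then_extract(lhs, rhs):
    """Full-width subtract, then extract the low 4 lanes."""
    wide = sub_lanes(zext_v8i8_to_v8i16(lhs), zext_v8i8_to_v8i16(rhs))
    return extract_low_half(wide)


def extract_then_narrow(lhs, rhs):
    """Extract the low 4 lanes of each extended operand, then subtract."""
    lo_lhs = extract_low_half(zext_v8i8_to_v8i16(lhs))
    lo_rhs = extract_low_half(zext_v8i8_to_v8i16(rhs))
    return sub_lanes(lo_lhs, lo_rhs)


lhs = [0, 1, 2, 3, 250, 251, 252, 253]
rhs = [7, 6, 5, 4, 3, 2, 1, 0]
assert wide_then_extract(lhs, rhs) == extract_then_narrow(lhs, rhs)
```

Because each lane is computed independently, the two orders always agree; the only open question in the review is which form the backend can match to a single instruction.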

================
Comment at: test/CodeGen/AArch64/arm64-ld1.ll:918-919
 ; CHECK-NEXT: ld1r.2s { [[ARG2:v[0-9]+]] }, [x1]
-; CHECK-NEXT: usubl.8h v[[RESREGNUM:[0-9]+]], [[ARG1]], [[ARG2]]
+; CHECK-NEXT: ushll.8h [[ARG1]], [[ARG1]], #0
+; CHECK-NEXT: ushll.8h [[ARG2]], [[ARG2]], #0
+; CHECK-NEXT: sub.4h v[[RESREGNUM:[0-9]+]], [[ARG1]], [[ARG2]]
----------------
efriedma wrote:
> spatel wrote:
> > Side note for the ARM folks - I think this applies here?
> > 
> > ```
> > UXTL{2} <Vd>.<Ta>, <Vn>.<Tb>
> > is equivalent to
> > USHLL{2} <Vd>.<Ta>, <Vn>.<Tb>, #0
> > and is the preferred disassembly...
> > ```
> Not sure why the alias isn't getting automatically applied; please file a bug.
https://bugs.llvm.org/show_bug.cgi?id=39721

================
Comment at: test/CodeGen/X86/i64-mem-copy.ll:95
 ; X32AVX-NEXT:    vextracti128 $1, %ymm0, %xmm0
+; X32AVX-NEXT:    vpaddw %xmm1, %xmm0, %xmm0
 ; X32AVX-NEXT:    vmovq %xmm0, (%eax)
----------------
efriedma wrote:
> This appears to be one instruction more... but maybe it's worth it to avoid 256-bit operations on x86?
Right - this is the test I mentioned in the initial summary. Given the current HW implementation choices (frequency throttling based on the count of wide vector ops), I think this is the preferred form despite the extra instruction.

https://reviews.llvm.org/D54392