[PATCH] D81343: [AArch64] custom lowering for i128 popcount

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 8 14:25:15 PDT 2020


efriedma accepted this revision.
efriedma added a comment.
This revision is now accepted and ready to land.

LGTM



================
Comment at: llvm/test/CodeGen/AArch64/popcount.ll:10
+; CHECK-NEXT:    add x8, x0, #8 // =8
+; CHECK-NEXT:    ld1 { v0.d }[1], [x8]
+; CHECK-NEXT:    cnt v0.16b, v0.16b
----------------
shawnl wrote:
> efriedma wrote:
> > shawnl wrote:
> > > shawnl wrote:
> > > > shawnl wrote:
> > > > > efriedma wrote:
> > > > > > Why are we generating two loads here?  Something related to the BITCAST legalization?
> > > > > Yes, it should be:
> > > > > 
> > > > > > ldr q0, [x0]
> > > > > 
> > > > > 
> > > > Yes, it is going to AArch64ISelLowering.cpp:14006
> > > > 
> > > > 
> > > > ```
> > > >   case ISD::LOAD: {
> > > >     assert(SDValue(N, 0).getValueType() == MVT::i128 &&
> > > >            "unexpected load's value type");
> > > >     LoadSDNode *LoadNode = cast<LoadSDNode>(N);
> > > >     if (!LoadNode->isVolatile() || LoadNode->getMemoryVT() != MVT::i128) {
> > > >       // Non-volatile loads are optimized later in AArch64's load/store
> > > >       // optimizer.    // <======This is not happening
> > > >       return; 
> > > >     }
> > > > 
> > > >     SDValue Result = DAG.getMemIntrinsicNode(
> > > >         AArch64ISD::LDP, SDLoc(N),
> > > >         DAG.getVTList({MVT::i64, MVT::i64, MVT::Other}),
> > > >         {LoadNode->getChain(), LoadNode->getBasePtr()}, LoadNode->getMemoryVT(),
> > > >         LoadNode->getMemOperand());
> > > > 
> > > >     SDValue Pair = DAG.getNode(ISD::BUILD_PAIR, SDLoc(N), MVT::i128,
> > > >                                Result.getValue(0), Result.getValue(1));
> > > >     Results.append({Pair, Result.getValue(2) /* Chain */});
> > > >     return;
> > > >   }
> > > > ```
> > > With `-O0` it outputs:
> > > 
> > > ```
> > > 	ldr	x8, [x0, #8]
> > > 	ldr	d0, [x0]
> > >                                         // implicit-def: $q1
> > > 	mov	v1.16b, v0.16b
> > > 	mov	v1.d[1], x8
> > > ```
> > Hmm.  Really, I think the problem reduces to something like the following, which generates a similar ldr+add+ld1 sequence:
> > 
> >   define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
> >     %b = load i64, i64* %p
> >     %p2 = getelementptr i64, i64* %p, i64 1
> >     %bb = load i64, i64* %p2
> >     %r1 = insertelement <2 x i64> zeroinitializer, i64 %b, i32 0
> >     %r2 = insertelement <2 x i64> %r1, i64 %bb, i32 1
> >     ret <2 x i64> %r2
> >   }
> > 
> > 
> > X86ISelLowering.cpp has some code specifically to handle this; see EltsFromConsecutiveLoads.  Maybe some of it should be ported to AArch64.
> Why is this not optimized to:
> 
> ```
> define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
>   %pc = bitcast i64* %p to i128*
>   %b = load i128, i128* %pc, align 8
>   %r2 = bitcast i128 %b to <2 x i64>
>   ret <2 x i64> %r2
> }
> ```
IR optimizations of that sort are very limited at the moment.  Maybe something to look into for the vectorcombine pass.

It might be hard for IR-level optimizations to catch all the interesting cases anyway, though, given how much of shuffle lowering happens in SelectionDAG.
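For comparison, a hand-combined form of the reduction above would look something like this (just a sketch; whether such a combine would emit the `<2 x i64>` load directly, or an `i128` load that later gets bitcast, is an open question):

```
define <2 x i64> @z(i64* nocapture nonnull readonly %p) {
  %pc = bitcast i64* %p to <2 x i64>*
  %v = load <2 x i64>, <2 x i64>* %pc, align 8
  ret <2 x i64> %v
}
```

That form should lower to a single `ldr q0, [x0]`.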


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81343/new/

https://reviews.llvm.org/D81343
