[PATCH] D15392: [InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543)

Sanjay Patel via llvm-commits <llvm-commits at lists.llvm.org>
Thu Dec 10 10:51:29 PST 2015


spatel added inline comments.

================
Comment at: test/Transforms/InstCombine/bitcast-bigendian.ll:24
@@ -25,1 +23,3 @@
+; CHECK-NEXT:  extractelement <2 x i32> %B, i32 1
+; CHECK-NEXT:  bitcast i32 {{.*}} to float
 ; CHECK-NEXT:  %add = fadd float %tmp24, %tmp4
----------------
hfinkel wrote:
> spatel wrote:
> > hfinkel wrote:
> > > I'm a bit surprised by this change. We used to prefer the vector bitcast over the scalar one, and with this change, we prefer the scalar bitcast after the extract. I can see benefits to this as a canonical form (even though some backends will need to work somewhat harder to retain good code quality: vector bitcasts are often cheaper than scalar ones). However, what happens when both elements are extracted? Do we end up with multiple scalar bitcasts?
> > > 
> > I didn't think much of that difference for the existing test case, but for the case that I think you're asking about:
> >   define float @test2(<2 x i32> %A) {
> >     %tmp28 = bitcast <2 x i32> %A to i64
> >     %tmp23 = trunc i64 %tmp28 to i32
> >     %tmp24 = bitcast i32 %tmp23 to float
> > 
> >     %tmp = bitcast <2 x i32> %A to i64
> >     %lshr = lshr i64 %tmp, 32
> >     %tmp2 = trunc i64 %lshr to i32
> >     %tmp4 = bitcast i32 %tmp2 to float 
> > 
> >     %add = fadd float %tmp24, %tmp4
> >     ret float %add
> >   }
> > 
> > Yes, we'll get multiple scalar bitcasts:
> >   define float @test2(<2 x i32> %A) {
> >     %tmp23 = extractelement <2 x i32> %A, i32 0
> >     %tmp24 = bitcast i32 %tmp23 to float
> >     %tmp2 = extractelement <2 x i32> %A, i32 1
> >     %tmp4 = bitcast i32 %tmp2 to float
> >     %add = fadd float %tmp24, %tmp4
> >     ret float %add
> >   }
> > 
> > And the codegen for that is decidedly worse on all of PPC64+Altivec, AArch64, and x86-64 than what we used to get:
> >   define float @test2(<2 x i32> %A) {
> >     %1 = bitcast <2 x i32> %A to <2 x float>
> >     %tmp24 = extractelement <2 x float> %1, i32 0
> >     %2 = bitcast <2 x i32> %A to <2 x float>
> >     %tmp4 = extractelement <2 x float> %2, i32 1
> >     %add = fadd float %tmp24, %tmp4
> >     ret float %add
> >   }
> > 
> > Should I create a bitcast canonicalization instcombine to hoist bitcasts ahead of extracts?
> > Should I create a bitcast canonicalization instcombine to hoist bitcasts ahead of extracts?
> 
> I think it is likely better to emulate the current apparent preference for the vector bitcast over the scalar one(s).
> 
> However, given that your code creates a vector bitcast, what code is undoing that to prefer the scalar bitcasts?
> 
I was digging around for that, but it's an illusion. :)
If we have this sequence:
  %bc1 = bitcast <2 x i32> %A to i64
  %trunc = trunc i64 %bc1 to i32
  %bc2 = bitcast i32 %trunc to float

This patch fires at the trunc without ever seeing the 2nd bitcast, says "That's an extract!", and eliminates the need for the first bitcast:
  %ext = extractelement <2 x i32> %A, i32 0
  %bc2 = bitcast i32 %ext to float

So the remaining bitcasts that we're seeing in these cases are just the original instructions. Without this patch, there was no direct transform of the trunc; we only matched a (bc(trunc(bc x))) pattern, so we could hoist the 2nd bitcast.
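For illustration only, here is a rough C++ sketch of the shape of this fold; it is not the code in the patch. The helper name foldTruncOfVectorBitCast is made up, and it only handles the no-shift, little-endian case; the actual change also matches an intervening lshr and accounts for big-endian element numbering:

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/PatternMatch.h"

  using namespace llvm;
  using namespace llvm::PatternMatch;

  // Sketch only: turn trunc (bitcast <N x ElTy> %V to iBig) into
  // extractelement <N x ElTy> %V, i32 0 when the trunc produces exactly
  // one element's worth of bits.
  static Instruction *foldTruncOfVectorBitCast(TruncInst &Trunc) {
    Value *VecOp;
    if (!match(Trunc.getOperand(0), m_BitCast(m_Value(VecOp))))
      return nullptr;

    auto *VecTy = dyn_cast<VectorType>(VecOp->getType());
    if (!VecTy || VecTy->getElementType() != Trunc.getType())
      return nullptr;

    // Element 0 holds the low-order bits on a little-endian target, so
    // truncating the wide scalar is just a read of that element.
    Value *Zero = ConstantInt::get(Type::getInt32Ty(Trunc.getContext()), 0);
    return ExtractElementInst::Create(VecOp, Zero);
  }

On a big-endian target the low-order bits live in the last element, which is why the big-endian test above extracts element 1.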



http://reviews.llvm.org/D15392
