[PATCH] D30296: v2f32 ops

Eli Friedman via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Feb 23 12:18:31 PST 2017


efriedma added a comment.

On platforms where the widened operation is legal or custom-lowered, we want to use that operation.  (For example, if v4f32 uitofp is custom, we also want to use that lowering for v2f32.)
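To illustrate (the line below is an X86-style assumption, not quoted from X86ISelLowering): a target opts in by marking the widened conversion Custom, and that is the lowering we want the narrow type to reuse.

  // Hypothetical target-side setup: mark v4i32 -> v4f32 uint_to_fp as
  // Custom so the widened operation has a lowering worth reusing.
  setOperationAction(ISD::UINT_TO_FP, MVT::v4i32, Custom);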

It looks like x86 in particular is legalizing v2f32 uint_to_fp nodes earlier than it should, so the DAGCombine can't trigger after type legalization.

On ARM, it looks like we can't perform the transform after type legalization: ARM doesn't have 16-bit registers, so the type legalizer promotes the i16 source and we end up with a 32-bit sitofp.  The "SrcVT != InVT" check could be extended to handle this case, as sketched below.
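Hypothetically, the extended check could look something like this (SrcVT/InVT as in the quoted check; the promoted-type test is illustrative, not code from the patch):

  // Accept InVT when it is just the integer-promoted form of SrcVT
  // (same lane count, wider lanes), which is what the type legalizer
  // produces when it promotes e.g. v2i16 to v2i32 on ARM.
  bool InVTIsPromotedSrcVT =
      InVT.isVector() && SrcVT.isVector() &&
      InVT.getVectorNumElements() == SrcVT.getVectorNumElements() &&
      InVT.getScalarSizeInBits() > SrcVT.getScalarSizeInBits();
  if (SrcVT != InVT && !InVTIsPromotedSrcVT)
    return SDValue();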

That said, if you don't want to mess with other targets, maybe you could change the new check in DAGCombine to "if (LegalTypes && !TLI.isOperationLegalOrCustom(Opcode, VT))"?
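For concreteness, a minimal sketch of where that guard would sit, assuming the usual DAGCombiner members (LegalTypes, TLI) and that Opcode/VT are the widened node's opcode and result type:

  // After type legalization, only keep the combine if the target can
  // actually handle the widened operation (Legal or Custom).
  if (LegalTypes && !TLI.isOperationLegalOrCustom(Opcode, VT))
    return SDValue();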



================
Comment at: test/CodeGen/X86/cvtv2f32.ll:21
+; X32-NEXT:    cvtsd2ss %xmm3, %xmm2
+; X32-NEXT:    insertps {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[2,3]
 ; X32-NEXT:    mulps %xmm1, %xmm0
----------------
This looks worse.


================
Comment at: test/CodeGen/X86/cvtv2f32.ll:31
+; X64-NEXT:    cvtsi2ssq %rax, %xmm2
+; X64-NEXT:    insertps {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[2,3]
 ; X64-NEXT:    mulps %xmm1, %xmm0
----------------
This... is arguably better, I guess, but you're not really reaching it in any principled manner.


================
Comment at: test/CodeGen/ARM/vdup.ll:83
++	vdup.32	q9, d0[0]
++	vsub.f32	q8, q9, q8
+ 	vmov	r0, r1, d16
----------------
The old code was loading directly into a vector register and staying on the vector side: vld1 is a splat load, vmovl is a sign-extend, and vcvt is the int->float conversion.  The new code loads into an integer register (ldrsh), moves the value to an FP register (vmov), converts it to float (vcvt), and then splats the result (vdup).  The new code is slightly worse.  (Ultimately, we want to do the splat before the int->fp conversion here, because the splat can then be folded into the load.)


================
Comment at: test/CodeGen/X86/MLICMbug.ll:58
++	insertps	$16, %xmm1, %xmm0 ## xmm0 = xmm0[0],xmm1[0],xmm0[2,3]
++	insertps	$32, %xmm2, %xmm0 ## xmm0 = xmm0[0,1],xmm2[0],xmm0[3]
+ 	movaps	%xmm0, 0
----------------
This looks worse.


https://reviews.llvm.org/D30296