[LLVMdev] X86 - Help on fixing a poor code generation bug

Thu Dec 5 09:58:27 PST 2013

Hi Andrea, 

Thanks for working on this. I can see two approaches to solving this problem. The first one (that you suggested) is to catch this pattern after register allocation. The second approach is to eliminate this redundancy during instruction selection. Can you please look into catching this pattern during iSel? The idea is that ADDSS does an ADD plus BLEND operations, and you can easily catch them. You can add a new target specific DAGCombine or a table-ten pattern. You should also handle mul/add/sub. 

define <4 x float> @foo(<4 x float> %A, <4 x float> %B) nounwind readnone ssp uwtable {
  %1 = extractelement <4 x float> %B, i32 0
  %2 = extractelement <4 x float> %A, i32 0
  %3 = fadd float %2, %1                                                 //  Both the fadd and the insert element below should be matched into 
  %4 = insertelement <4 x float> %A, float %3, i32 0      //   an ADDSS which does an ADD and a BLEND in one instruction. 
  ret <4 x float> %4
}

Thanks,
Nadav

On Dec 5, 2013, at 7:35 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi all,
> 
> I noticed that the x86 backend tends to emit unnecessary vector insert
> instructions immediately after sse scalar fp instructions like
> addss/mulss.
> 
> For example:
> /////////////////////////////////
> __m128 foo(__m128 A, __m128 B) {
>  _mm_add_ss(A, B);
> }
> /////////////////////////////////
> 
> produces the sequence:
>  addss  %xmm0, %xmm1
>  movss  %xmm1, %xmm0
> 
> which could be easily optimized into
>  addss  %xmm1, %xmm0
> 
> The first step is to understand why the compiler produces the
> redundant instructions.
> 
> For the example above, the DAG sequence looks like this:
> 
>  a0 :     f32 = extract_vector_elt ( A, 0)
>  b0 :     f32 = extract_vector_elt ( B, 0)
>  r0 :     f32 = fadd a0, b0
>  result : v4f32 = insert_vector_elt ( A, r0, 0 )
> 
> (with A and B of type v4f32).
> 
> The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
> X86ISD::INSERTPS depending on the target's SSE feature level.
> 
> To start I checked if this bug was caused simply by the lack of
> specific tablegen patterns to match the complex sequence described
> above into a single ADDSS instruction.
> 
> However X86InstrSSE.td already defines an instruction X86::ADDSSrr as
> a commutative SSE scalar fp add instruction (that works on F32
> ValueTypes). Instruction X86::ADDSSrr is used to select 'fadd' nodes
> and it will be translated to 'addss' in assembly.
> 
> At this stage, the MOVSS/INSERTPS is still required since the ADDSS
> alone would not be equivalent to the hardware 'movss' instruction.
> 
> I then started investigating the possibility of adding a pass that
> runs at 'postRegAlloc' stage.
> 
> Before RegAlloc it may not be safe to remove the redundant MOVSSrr
> because of the TwoAddressInstruction Pass; this may decide to commute
> the operands of the ADDSS/MULSS.
> 
> It is possible to write a pass that scans through each basic block in
> a function looking for opportunities to fold instructions based on the
> following patterns:
> 
>  //////
>  B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
>  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>    ==>
>  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
>  /////
> 
>  /////
>  B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
>  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>    ==>
>  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
>  /////
> 
> With 'NAME' in {ADD, SUB, MUL, DIV}.
> 
> I managed to create a very simple prototype pass which simplifies the
> code according to pattern #1 (see the patch in attachment).
> However, identifying patterns at postRegAlloc stage is definitely more
> complicated than matching patterns on dags: instructions are not in
> the SSA form; the pass must keep track of defs and kills and leave the
> liveness info in a consistent state; some instructions may have side
> effects etc.
> Last but not least, the register mapping implemented by the register
> allocator and how instructions are scheduled may strongly affect the
> ability of a pass like this to pattern match specific sequences of
> instructions.
> 
> In conclusion, my questions are:
> Would a patch be acceptable that introduces a new MachineFunctionPass
> that runs at 'postRegAlloc' stage?
> If not, is there a better way to fix this bug?
> 
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group
> <patch.diff>_______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev