[LLVMdev] X86 - Help on fixing a poor code generation bug

Cameron McInally cameron.mcinally at nyu.edu
Thu Dec 5 08:54:32 PST 2013


Hey Andrea,

On Thu, Dec 5, 2013 at 10:35 AM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi all,
>
> I noticed that the x86 backend tends to emit unnecessary vector insert
> instructions immediately after sse scalar fp instructions like
> addss/mulss.
>
> For example:
> /////////////////////////////////
> __m128 foo(__m128 A, __m128 B) {
>   return _mm_add_ss(A, B);
> }
> /////////////////////////////////
>
> produces the sequence:
>   addss  %xmm0, %xmm1
>   movss  %xmm1, %xmm0
>
> which could be easily optimized into
>   addss  %xmm1, %xmm0
>
> The first step is to understand why the compiler produces the
> redundant instructions.
>
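(Worth keeping in mind here: in its register-to-register form, movss only
writes the low 32 bits of the destination and leaves the upper lanes
untouched, so it is implementing the lane-0 insert rather than acting as a
plain register copy:)

  movss %xmm1, %xmm0    # xmm0[31:0]   <- xmm1[31:0]
                        # xmm0[127:32] <- unchanged (upper lanes of A kept)
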
> For the example above, the DAG sequence looks like this:
>
>   a0 :     f32 = extract_vector_elt ( A, 0)
>   b0 :     f32 = extract_vector_elt ( B, 0)
>   r0 :     f32 = fadd a0, b0
>   result : v4f32 = insert_vector_elt ( A, r0, 0 )
>
> (with A and B of type v4f32).
>
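(For reference, clang's xmmintrin.h implements _mm_add_ss with generic
vector element operations rather than a target intrinsic, so the IR the
backend receives looks roughly like the sketch below; value names are
illustrative:)

  define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
    %a0  = extractelement <4 x float> %A, i32 0
    %b0  = extractelement <4 x float> %B, i32 0
    %r0  = fadd float %a0, %b0
    %res = insertelement <4 x float> %A, float %r0, i32 0
    ret <4 x float> %res
  }
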
> The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
> X86ISD::INSERTPS depending on the target's SSE feature level.
>
> To start I checked if this bug was caused simply by the lack of
> specific tablegen patterns to match the complex sequence described
> above into a single ADDSS instruction.
>
> However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
> SSE scalar fp add instruction (operating on f32 ValueTypes); it is
> used to select 'fadd' nodes and is emitted as 'addss' in assembly.
>
> At this stage, the MOVSS/INSERTPS is still required: the ADDSSrr
> selected for the scalar 'fadd' only produces the f32 result, so a
> separate instruction is still needed to insert that result back into
> the low lane of the destination vector.
>
> I then started investigating the possibility of adding a pass that
> runs at 'postRegAlloc' stage.
>
> Before RegAlloc it may not be safe to remove the redundant MOVSSrr
> because of the TwoAddressInstruction pass, which may decide to commute
> the operands of the ADDSS/MULSS.
>
> It is possible to write a pass that scans through each basic block in
> a function looking for opportunities to fold instructions based on the
> following patterns:
>
>   //////
>   B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
>   /////
>
>   /////
>   B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
>   /////
>
> With 'NAME' in {ADD, SUB, MUL, DIV}.
>
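For what it's worth, a skeleton of such a pass might look roughly like the
sketch below. This is only an illustration of pattern #1, restricted to the
commutative ADDSS/MULSS cases (swapping the operands is not valid for
SUBSS/DIVSS); it ignores kill-flag/liveness updates, implicit operands and
pass registration, and the pass name is made up:

  // Rough sketch of a post-RA peephole for pattern #1 (ADDSS/MULSS only).
  #include "X86.h"
  #include "X86InstrInfo.h"
  #include "llvm/CodeGen/MachineFunctionPass.h"
  #include "llvm/CodeGen/MachineInstrBuilder.h"

  using namespace llvm;

  namespace {
  struct FixupScalarMoves : public MachineFunctionPass {
    static char ID;
    FixupScalarMoves() : MachineFunctionPass(ID) {}

    bool runOnMachineFunction(MachineFunction &MF) override {
      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
      bool Changed = false;
      for (MachineBasicBlock &MBB : MF) {
        MachineInstr *Prev = nullptr;
        for (auto I = MBB.begin(), E = MBB.end(); I != E;) {
          MachineInstr &MI = *I++;
          // Look for:  B = ADDSSrr B, A   followed by   A = MOVSSrr A, B<kill>
          if (Prev && MI.getOpcode() == X86::MOVSSrr &&
              (Prev->getOpcode() == X86::ADDSSrr ||
               Prev->getOpcode() == X86::MULSSrr)) {
            unsigned A = MI.getOperand(0).getReg();   // MOVSS def (tied src1)
            unsigned B = MI.getOperand(2).getReg();   // MOVSS src2
            if (Prev->getOperand(0).getReg() == B &&  // ADDSS defines B
                Prev->getOperand(2).getReg() == A &&  // ADDSS reads A
                MI.getOperand(2).isKill()) {          // B dies at the MOVSS
              // Rewrite as:  A = ADDSSrr A, B   and drop both instructions.
              BuildMI(MBB, MI, MI.getDebugLoc(),
                      TII->get(Prev->getOpcode()), A)
                  .addReg(A)
                  .addReg(B, RegState::Kill);
              Prev->eraseFromParent();
              MI.eraseFromParent();
              Prev = nullptr;
              Changed = true;
              continue;
            }
          }
          Prev = &MI;
        }
      }
      return Changed;
    }
  };
  char FixupScalarMoves::ID = 0;
  } // end anonymous namespace
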
> I managed to create a very simple prototype pass which simplifies the
> code according to pattern #1 (see the patch in attachment).
> However, identifying patterns at the postRegAlloc stage is definitely
> more complicated than matching patterns on DAGs: instructions are no
> longer in SSA form; the pass must keep track of defs and kills and
> leave the liveness info in a consistent state; some instructions may
> have side effects; and so on.
> Last but not least, the register mapping implemented by the register
> allocator and how instructions are scheduled may strongly affect the
> ability of a pass like this to pattern match specific sequences of
> instructions.
>
> In conclusion, my questions are:
> Would a patch be acceptable that introduces a new MachineFunctionPass
> that runs at 'postRegAlloc' stage?
> If not, is there a better way to fix this bug?
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group

With my compiler, which has a proprietary frontend and Builtin
lowerings, we do produce the expected assembly:

> # BB#0:                                 # %file test.c, line 3, bb1
>        addss   %xmm1, %xmm0            # test.c:5
>        ret                             # test.c:5

The IR produced is:

>  %r = load <4 x float>* %A, align 16, !dbg !11
>  %r3 = load <4 x float>* %B, align 16, !dbg !11
>  %r4 = call <4 x float> @llvm.x86.sse.add.ss(<4 x float> %r, <4 x float> %r3), !dbg !11
>  ret <4 x float> %r4, !dbg !11

Since the int_x86_sse_add_ss intrinsic operates on vectors, not
scalars, there's no need for the inserts and extracts.
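
(For reference, the intrinsic is declared as a plain vector-to-vector
operation, which the backend can select directly into addss:)

  declare <4 x float> @llvm.x86.sse.add.ss(<4 x float>, <4 x float>)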

Of course, with my compiler, we miss out on any optimization
opportunities that may arise from using the IRBuilder. So, this is not
ideal either. But, we do avoid partial register updates... so we've
got that going for us. ;)

I do realize that this does not solve your problem. Just thought that
it might help design an ideal solution.

HTH,
Cameron
