[PATCH] D127115: [RFC][DAGCombine] Make sure combined nodes are added back to the worklist in topological order.

Sat Jun 11 16:41:40 PDT 2022

deadalnix added inline comments.

================
Comment at: llvm/test/CodeGen/X86/load-partial.ll:117
+; SSE2-NEXT:    shufps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0,2]
+; SSE2-NEXT:    retq
+;
----------------
deadalnix wrote:
> Before:
> ```
> SelectionDAG has 9 nodes:
>   t0: ch = EntryToken
>       t2: i64,ch = CopyFromReg t0, Register:i64 %0
>     t33: v4f32,ch = load<(dereferenceable load (s128) from %ir.2, align 4)> t0, t2, undef:i64
>   t22: ch,glue = CopyToReg t0, Register:v4f32 $xmm0, t33
>   t23: ch = X86ISD::RET_FLAG t22, TargetConstant:i32<0>, Register:v4f32 $xmm0, t22:1
> ```
> 
> After:
> ```
> SelectionDAG has 19 nodes:
>   t0: ch = EntryToken
>   t2: i64,ch = CopyFromReg t0, Register:i64 %0
>       t31: f64,ch = load<(dereferenceable load (s64) from %ir.2, align 4)> t0, t2, undef:i64
>     t32: v2f64 = scalar_to_vector t31
>   t33: v4f32 = bitcast t32
>             t15: i64 = add nuw t2, Constant:i64<8>
>           t16: f32,ch = load<(dereferenceable load (s32) from %ir.8)> t0, t15, undef:i64
>         t35: v4f32 = scalar_to_vector t16
>       t38: v4f32 = X86ISD::SHUFP t35, t33, TargetConstant:i8<48>
>     t40: v4f32 = X86ISD::SHUFP t33, t38, TargetConstant:i8<-124>
>   t22: ch,glue = CopyToReg t0, Register:v4f32 $xmm0, t40
>   t23: ch = X86ISD::RET_FLAG t22, TargetConstant:i32<0>, Register:v4f32 $xmm0, t22:1
> ```
> 
> This is definitively serious.
This turns out to be pretty interesting when you take a step back. Before the diff we had:
```
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t31: f32 = extract_vector_elt t30, Constant:i64<0>
      t32: f32 = extract_vector_elt t30, Constant:i64<1>
        t15: i64 = add nuw t2, Constant:i64<8>
      t16: f32,ch = load<(dereferenceable load (s32) from %ir.8)> t0, t15, undef:i64
    t27: v4f32 = BUILD_VECTOR t31, t32, t16, undef:f32
  t22: ch,glue = CopyToReg t0, Register:v4f32 $xmm0, t27
      t28: f64,ch = load<(dereferenceable load (s64) from %ir.2, align 4)> t0, t2, undef:i64
    t29: v2f64 = scalar_to_vector t28
  t30: v4f32 = bitcast t29
  t23: ch = X86ISD::RET_FLAG t22, TargetConstant:i32<0>, Register:v4f32 $xmm0, t22:1
```

And after:
```
SelectionDAG has 16 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t31: f64,ch = load<(dereferenceable load (s64) from %ir.2, align 4)> t0, t2, undef:i64
        t32: v2f64 = scalar_to_vector t31
      t33: v4f32 = bitcast t32
        t15: i64 = add nuw t2, Constant:i64<8>
      t16: f32,ch = load<(dereferenceable load (s32) from %ir.8)> t0, t15, undef:i64
    t19: v4f32 = insert_vector_elt t33, t16, Constant:i64<2>
  t22: ch,glue = CopyToReg t0, Register:v4f32 $xmm0, t19
  t23: ch = X86ISD::RET_FLAG t22, TargetConstant:i32<0>, Register:v4f32 $xmm0, t22:1
```

The former is crunched down to almost nothing, while the latter gets turned into SHUFP. The magic seems to happen when legalizing the BUILD_VECTOR, which gets turned into an aggregated load, while `v4f32 = insert_vector_elt t33, t16, Constant:i64<2>` gets expanded into a `v4f32 = vector_shuffle<0,1,4,3> t33, t35`, which in turn is legalized into a series of SHUFP.
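To make the mask and the immediates easier to check by hand, here is a small Python model. It is only a sketch, not LLVM code: the helper names are made up for illustration, and `shufps` models only the documented lane selection of the SSE instruction (the two low result lanes come from the first operand, the two high lanes from the second, with two immediate bits per lane). Lanes are symbolic strings, and `"?"` stands for the undefined upper lanes of `scalar_to_vector`.

```python
def insert_elt_shuffle_mask(num_elts, index):
    """Shuffle mask equivalent to inserting a scalar at `index`.

    Operands are assumed to be (vec, scalar_to_vector(elt)), so mask value
    `num_elts` names lane 0 of the second operand. Hypothetical helper.
    """
    return [num_elts if lane == index else lane for lane in range(num_elts)]


def apply_shuffle(mask, a, b):
    """Generic two-operand shuffle: mask indexes the concatenation a + b."""
    both = a + b
    return [both[m] for m in mask]


def shufps(src1, src2, imm):
    """Lane selection of SHUFPS on 4 x f32, using an 8-bit immediate."""
    imm &= 0xFF  # the DAG dump prints the i8 immediate as signed
    return [
        src1[imm & 3],         # result lane 0 from the first operand
        src1[(imm >> 2) & 3],  # result lane 1 from the first operand
        src2[(imm >> 4) & 3],  # result lane 2 from the second operand
        src2[(imm >> 6) & 3],  # result lane 3 from the second operand
    ]


# Symbolic values for the nodes in the dumps above.
t33 = ["a0", "a1", "a2", "a3"]  # bitcast of the 64-bit load
t35 = ["x", "?", "?", "?"]      # scalar_to_vector t16

# insert_vector_elt t33, t16, 2 expressed as a shuffle.
mask = insert_elt_shuffle_mask(4, 2)
via_shuffle = apply_shuffle(mask, t33, t35)

# The two SHUFP nodes from the final DAG, with their immediates.
t38 = shufps(t35, t33, 48)
t40 = shufps(t33, t38, -124)
via_shufp = t40

print(mask, via_shuffle, via_shufp)
```

In this model the mask comes out as `<0,1,4,3>` and both paths produce the same lanes, with none of the undefined lanes of `t35` reaching the result, so the SHUFP pair is a faithful (if expensive) lowering of the insert.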

It seems a bit strange to me that legalization does most of the optimization here, and it turns out to be somewhat fragile.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D127115/new/

https://reviews.llvm.org/D127115